Until now, IT leaders have had to weigh the cyber security risks of allowing users to access large language models (LLMs) such as ChatGPT directly via the cloud. The alternative has been to use open source LLMs that can be hosted on-premise or accessed through a private cloud.
The artificial intelligence (AI) model needs to run in-memory and, when graphics processing units (GPUs) are used for AI acceleration, this means IT leaders need to consider the cost of purchasing banks of GPUs to build up enough memory to hold the entire model.
Nvidia’s high-end AI acceleration GPU, the H100, is configured with 80GBytes of random-access memory (RAM), and its specification shows it is rated at 350W of power consumption.
China’s DeepSeek has been able to demonstrate that its R1 LLM can rival US artificial intelligence without needing to resort to the latest GPU hardware. It does, however, benefit from GPU-based AI acceleration.
However, deploying a private version of DeepSeek still requires significant hardware investment. Running the entire DeepSeek-R1 model, which has 671 billion parameters, in-memory requires 768GBytes of memory. With Nvidia H100 GPUs, which are configured with 80GBytes of video memory each, 10 would be required to ensure the entire DeepSeek-R1 model can run in-memory.
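As a rough sanity check, that GPU count simply follows from dividing the model’s in-memory footprint by the memory on each card and rounding up. A minimal sketch, using only the figures quoted above (real deployments would also need headroom for activations and caching):

```python
import math

MODEL_MEMORY_GB = 768  # in-memory footprint quoted for the 671 billion-parameter DeepSeek-R1
H100_MEMORY_GB = 80    # memory per Nvidia H100 card

gpus_needed = math.ceil(MODEL_MEMORY_GB / H100_MEMORY_GB)
print(f"H100 cards needed to hold the model: {gpus_needed}")  # prints 10
```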
IT leaders may well be able to negotiate volume discounts, but the cost of the AI acceleration hardware alone to run DeepSeek is around $250,000.
Less powerful GPUs could be used, which would help to reduce this figure. But given current GPU prices, a server capable of running the entire 671 billion-parameter DeepSeek-R1 model in-memory is going to cost over $100,000.
The server could instead be run on public cloud infrastructure. Azure, for instance, offers access to the Nvidia H100 with 900GBytes of memory for $27.167 per hour, which, on paper, should comfortably be able to run the 671 billion-parameter DeepSeek-R1 model entirely in-memory.
If this model is used every working day, and assuming a 35-hour week and four weeks a year of holidays and downtime, the annual Azure bill would come to almost $46,000. Again, this figure could be reduced significantly, to $16.63 per hour ($23,000 per year), with a three-year commitment.
Less powerful GPUs will obviously cost less, but it is the memory requirement that makes them prohibitive. For instance, looking at current Google Cloud pricing, the Nvidia T4 GPU is priced at $0.35 per GPU per hour and is available with up to four GPUs, giving a total of 64GBytes of memory for $1.40 per hour. Twelve such configurations would be needed to fit the 671 billion-parameter DeepSeek-R1 model entirely in-memory, which works out at $16.80 per hour. With a three-year commitment, this figure comes down to $7.68 per hour, which works out at just under $13,000 per year.
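The annual figures quoted above are simply the hourly rate multiplied by the usable hours in a working year. A short sketch of that arithmetic, under the article’s assumptions of a 35-hour week and 48 working weeks, using the pay-as-you-go Azure rate and the committed Google Cloud rate quoted above:

```python
HOURS_PER_WEEK = 35
WEEKS_PER_YEAR = 52 - 4  # four weeks of holidays and downtime

def annual_cost(hourly_rate_usd: float) -> float:
    """Annual cloud bill if the model is only served during working hours."""
    return hourly_rate_usd * HOURS_PER_WEEK * WEEKS_PER_YEAR

print(f"Azure H100, pay-as-you-go:      ${annual_cost(27.167):,.0f}")  # almost $46,000
print(f"Google Cloud 12x quad-T4, 3-yr: ${annual_cost(7.68):,.0f}")    # just under $13,000
```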
A cheaper approach
IT leaders can reduce costs further by avoiding costly GPUs altogether and relying entirely on general-purpose central processing units (CPUs). This setup is really only suitable when DeepSeek-R1 is used purely for AI inference.
A recent tweet from Matthew Carrigan, machine learning engineer at Hugging Face, suggests such a system could be built using two AMD Epyc server processors and 768GBytes of fast memory. The system he presented in a series of tweets could be put together for about $6,000.
Responding to comments on the setup, Carrigan said he is able to achieve a processing rate of six to eight tokens per second, depending on the exact processor and memory speed installed. It also depends on the length of the natural language query, but his tweet includes a video showing near-real-time querying of DeepSeek-R1 on the hardware he built, based on the dual AMD Epyc setup and 768GBytes of memory.
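To put that throughput in context, the arithmetic below shows roughly how long a reply would take at the rates Carrigan reports. The response length is an assumed illustrative value, not a figure from his tweets:

```python
RESPONSE_TOKENS = 500  # assumed length of a typical answer, for illustration only

for tokens_per_second in (6, 8):  # the range Carrigan reports for the dual-Epyc build
    seconds = RESPONSE_TOKENS / tokens_per_second
    print(f"{tokens_per_second} tokens/s -> ~{seconds:.0f} seconds for a {RESPONSE_TOKENS}-token answer")
```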
Carrigan acknowledges that GPUs will win on speed, but they are expensive. In his series of tweets, he points out that the amount of memory installed has a direct impact on performance, because of the way DeepSeek “remembers” previous queries to reach answers faster, a technique known as key-value (KV) caching.
“In testing with longer contexts, the KV cache is actually bigger than I realised,” he said, suggesting that the hardware configuration would require 1TByte of memory instead of 768GBytes when large volumes of text or context are pasted into the DeepSeek-R1 query prompt.
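The KV cache grows linearly with the number of tokens in the context, which is why long pasted documents inflate the memory requirement so sharply. Below is a minimal sketch of the generic estimate for a standard multi-head-attention transformer; the layer, head and precision values are illustrative assumptions, not DeepSeek-R1’s actual architecture (which uses a compressed latent-attention scheme that stores far less per token):

```python
# All architecture numbers are illustrative assumptions for a generic transformer decoder.
NUM_LAYERS = 60      # assumed number of transformer layers
NUM_KV_HEADS = 128   # assumed number of key/value heads
HEAD_DIM = 128       # assumed dimension per head
BYTES_PER_VALUE = 2  # fp16/bf16 storage

def kv_cache_gbytes(context_tokens: int) -> float:
    """KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens."""
    total_bytes = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * context_tokens
    return total_bytes / 1e9

for ctx in (4_000, 32_000, 128_000):
    print(f"{ctx:>7} tokens of context -> ~{kv_cache_gbytes(ctx):.0f} GB of KV cache")
```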
Buying a prebuilt Dell, HPE or Lenovo server to do something similar is likely to be considerably more expensive, depending on the processor and memory configurations specified.
A different way to tackle memory costs
Among the approaches that can be taken to reduce memory costs is the use of multiple tiers of memory managed by a custom chip. This is what California startup SambaNova has done with its SN40L Reconfigurable Dataflow Unit (RDU) and a proprietary dataflow architecture for three-tier memory.
“DeepSeek-R1 is one of the most advanced frontier AI models available, but its full potential has been limited by the inefficiency of GPUs,” said Rodrigo Liang, CEO of SambaNova.
The company, which was founded in 2017 by a group of ex-Sun/Oracle engineers and has an ongoing collaboration with Stanford University’s electrical engineering department, claims the RDU chip collapses the hardware requirements for running DeepSeek-R1 efficiently from 40 racks down to one rack configured with 16 RDUs.
Earlier this month at the Leap 2025 conference in Riyadh, SambaNova signed a deal to introduce Saudi Arabia’s first sovereign LLM-as-a-service cloud platform. Saud AlSheraihi, vice-president of digital solutions at Saudi Telecom Company, said: “This collaboration with SambaNova marks a significant milestone in our journey to empower Saudi enterprises with sovereign AI capabilities. By offering a secure and scalable inferencing-as-a-service platform, we are enabling organisations to unlock the full potential of their data while maintaining complete control.”
This deal with the Saudi Arabian telco illustrates how governments need to consider all options when building out sovereign AI capacity. DeepSeek has demonstrated that there are a number of approaches that can be just as effective as the tried and tested route of deploying immense and costly arrays of GPUs.
And while it does indeed run better when GPU-accelerated AI hardware is present, what SambaNova is claiming is that there is also an alternative way to achieve the same performance when running models such as DeepSeek-R1 on-premise, in-memory, without the cost of having to acquire GPUs fitted with the memory the model needs.