DeepSeek has captivated the world’s attention in the last few weeks.
If, by chance, you’ve been living under a rock, the value proposition of the model is:
- Train a model on par with the best in class.
- Produce it at a fraction of the cost (a mere $6 million!)
- Open source the model so it’s free to use.
The reception has been swift and intense. Nvidia’s stock tanked 17%, or $600 billion (a.k.a. 100K DeepSeeks), on fears of decreased chip demand. Marc Andreessen, the famed investor, wrote that DeepSeek is “AI’s Sputnik moment,” as the Chinese company brings fresh competition to a primarily U.S.-dominated space. DeepSeek swiftly became the #1 app on the Apple App Store and the Google Play Store.
Nvidia, in unhappy times.
So let’s talk about the technological innovation here, and whether or not the $6M training cost holds water.
What is the hype around this technology?
Top-level metrics for DeepSeek-R1, its smaller, distilled version DeepSeek-R1-32B, and competitors on various benchmarks.
DeepSeek published their paper “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” As you can see from the chart above, DeepSeek’s latest model performs as well as the best models from competitors, in particular OpenAI. These benchmarks cover math and science, but also the humanities and language.
(Importantly for any engineers in the room, “SWE-bench Verified” is a test of Issue-Pull Request pairs, i.e. it tests bug fixing. Even the best models in the world still utterly flunk this metric, so your job is still safe for now.)
There are two primary ways to think about cost: cost to train and cost to the consumer. We’ll ignore the actual cost to serve for now, since companies typically do not release these figures. The authors of the paper write, “Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M.” By comparison, GPT-4 reportedly cost ~$100M to train.
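For a sense of scale, simple division of the two quoted figures gives the implied amount of rented compute (arithmetic only, using no numbers beyond those in the quote):

```latex
% Implied compute from the quoted figures (arithmetic only)
\frac{\$5{,}576{,}000}{\$2 \ / \ \text{GPU-hour}} \approx 2.788 \times 10^{6} \ \text{H800 GPU-hours}
```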
For consumer costs, this model is quite cheap, too.
The chat model comes in at about a penny per million input tokens. By comparison, OpenAI’s latest GPT-4 model is priced at $2.50 per 1M input tokens. In reality, however, essentially all major players have cheaper versions of their models. Google’s Gemini 1.5 Flash, for example, is only $0.08 per 1M input tokens and $0.28 per 1M output tokens. Meta’s Llama team has, of course, maintained an open source model that’s free to use, excluding vendor costs. Still, for top-of-the-line results and the ability to download the model, it’s cheap.
How did they do it?
There are three key innovations that led to this massively increased efficiency.
First, the researchers started with the DeepSeek-R1-Zero model (the precursor to R1). This model uses a clever, but well-known, architecture called Mixture-of-Experts (MoE). In this architecture, you train several expert networks, plus a router which sends each query to the top k experts best suited for that task. It is computationally efficient because only a few experts need to be “awake” to answer a given query. DeepSeek-R1-Zero tweaks this approach by doubling the number of experts, but making each expert smaller to keep total model size down.
Mixture of Experts model, figure taken from DeepGram.
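To make the routing idea concrete, below is a minimal top-k MoE layer in PyTorch. This is illustrative only, not DeepSeek’s implementation (which adds shared experts, load-balancing losses, and other tricks); all sizes and names are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer (illustrative only)."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # scores every expert for each token
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out                             # only top_k experts ran per token

x = torch.randn(5, 64)
print(TinyMoE()(x).shape)  # torch.Size([5, 64])
```

The key point: only top_k of the n_experts run for any given token, so per-token compute stays roughly constant even as the total parameter count grows.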
Clever architectural choice or no, the question remained of how to actually train this model efficiently. The secret sauce to virtually all LLMs these days is Reinforcement Learning with Human Feedback (RLHF). In this setup, human experts (expensive) create high-quality datasets of prompts and responses, and also rank-order possible model outputs. These data are used to fine-tune the model with supervised learning, as well as to train a reward model. This architecture necessarily means you need an additional (pretty large) reward model, so the total compute needed for traditional RLHF is large.
Typical architecture associated with RLHF, illustration from HuggingFace.
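To see where the extra compute in RLHF comes from, here is a minimal sketch of the standard pairwise (Bradley-Terry style) loss used to train that separate reward model on human rankings. The RewardModel here is a tiny stand-in; in practice it is itself a large transformer, which is exactly why the total training bill grows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Stand-in reward model: maps a response embedding to a scalar score.
    In real RLHF this is a full transformer with a scalar head."""
    def __init__(self, d_model=64):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, x):
        return self.score(x).squeeze(-1)

rm = RewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-4)

# Each training pair: a human-preferred ("chosen") response and a worse ("rejected") one,
# represented here by random embeddings purely for illustration.
chosen, rejected = torch.randn(16, 64), torch.randn(16, 64)

# Pairwise loss: push the chosen score above the rejected score.
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
optimizer.step()
```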
DeepSeek researchers asked: What if we could just use pure RL and bootstrapped scoring rules to cut down costs?
They started by applying some clever math to the problem.
Group Relative Policy Optimization (GRPO)
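The objective doesn’t reproduce well here as an image, so below is a reconstruction of the form given in the DeepSeek papers, written out in LaTeX: π_θ is the new policy, π_θ_old the old one, G the number of outputs sampled per question, ε the clip range, and β the weight on a KL penalty toward a reference model.

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
    \left[
      \frac{1}{G} \sum_{i=1}^{G}
      \left(
        \min\!\left(
          \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\;
          \operatorname{clip}\!\left(
            \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,
            1-\varepsilon,\ 1+\varepsilon
          \right) A_i
        \right)
        - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right)
      \right)
    \right],
\qquad
A_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}
```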
GRPO is a workaround that uses pure reinforcement learning, which is cheap, to update the policy. It is quite similar to PPO, the algorithm used in RLHF. There’s a lot going on in the equation above, but the basics are:
- Take the expectation over questions and a group of G outputs sampled from the old policy; this is the average improvement the new policy can expect to achieve.
- For each output, compare how likely it is under the new policy vs. the old policy (a probability ratio), weighted by the advantage Ai: that output’s reward standardized against the group’s mean and standard deviation.
- Clip the change so the policy doesn’t move too much on any one update (like installing bumpers in a bowling alley so your ball doesn’t careen off the edge).
- Subtract a KL-divergence penalty so the new policy doesn’t drift too far from the reference model.
In order to score the reward, they had two parts (a toy version is sketched in code after the list):
- Accuracy reward
  - Math and logic problems are scored for correct answers.
  - Other questions with unknown answers, such as a new coding problem, are scored by how well the model’s answer follows rules (syntax, grammar, etc.).
- Format reward
  - This was a hilariously simple rule: “Put your thinking process between the tags ‘<think>’ and ‘</think>’.”
  - The intention was to reward “chain of thought” reasoning. Much like in a middle school math class, the model had to show its work to receive credit.
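Here is a toy version of that two-part reward in Python. It is purely illustrative: the paper does not release its scoring code, so the exact-match check below stands in for whatever verifier (rule-based checker, test harness, etc.) was actually used.

```python
import re

def format_reward(response: str) -> float:
    """+1 if the model wraps its reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """+1 if the final answer (everything after the closing </think> tag) matches the reference."""
    final_answer = response.split("</think>")[-1].strip()
    return 1.0 if final_answer == reference.strip() else 0.0

def total_reward(response: str, reference: str) -> float:
    return accuracy_reward(response, reference) + format_reward(response)

# Example:
response = "<think>2 + 2 = 4, because adding two and two gives four.</think> 4"
print(total_reward(response, "4"))  # 2.0
```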
This worked great, for the most part. The model started to perform at or near the level of OpenAI’s model. It even started to verbalize insights the way a student might, correcting its own mistakes.
The aha! moment from the DeepSeek chain of thought reasoning.
However, the model still had poor readability and language mixing (often, English and Chinese in the same response).
Enter Cold Start DeepSeek
To solve these issues, the researchers attempted to nudge the model in the right direction via cold start data and tweaked rewards. Specifically, they gave a language consistency reward (please only speak one language, +1 point). Next, they performed the following:
1) Use thousands of high-quality chain-of-thought data points for initial training (supervised fine-tuning), focusing on well-defined math/science tasks.
2) GRPO-based reinforcement learning (as before). Wait until reasoning converges from pure RL and save a checkpoint.
3) Collect 600k reasoning-related training samples from the model for supervised fine-tuning, keeping only the correct ones (a sketch of this filtering step follows the list).
4) Use curated reward models in place of RLHF for “human-centered” responses and alignment.
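A minimal sketch of step 3’s “keep only the correct ones” filtering (often called rejection sampling) is below. Everything here is hypothetical scaffolding: generate() and is_correct() stand in for the RL checkpoint’s sampler and whatever verifier the team actually used, which the paper does not specify.

```python
from typing import Callable, List, Tuple

def collect_sft_samples(
    prompts: List[str],
    generate: Callable[[str], str],          # sample one reasoning trace + answer from the RL checkpoint
    is_correct: Callable[[str, str], bool],  # verifier: does the response end in the right answer?
    samples_per_prompt: int = 4,
) -> List[Tuple[str, str]]:
    """Keep only (prompt, response) pairs whose responses pass the verifier."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = generate(prompt)
            if is_correct(prompt, response):
                kept.append((prompt, response))  # goes into the supervised fine-tuning set
    return kept
```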
Ok, there is a lot of unexplained magic going on here, and the paper is unfortunately short on details. First, it is unclear how the 600k responses were collected and how the incorrect ones were automatically discarded. Second, they state: “We resort to reward models to capture human preferences in complex and nuanced scenarios.” In other words, we know diddly squat about the bootstrapped reward model that makes delightful, human-centered responses emerge. This is a good time to point out that an open source model != open source code or data.
Shrinking the Model via Distillation
Lastly, they distilled the model into a smaller version. Distillation preserves reasoning, but makes the model smaller. It works by starting with a large teacher model. The teacher generates many training examples (prompt-answer pairs), and the student learns to be almost as good as the teacher with less memory and compute. Intuitively, you can think of the teacher as a professor who spent 6 years getting a PhD, while the student takes a 1-semester course on Medieval Tavern Music and still gets an A on the final exam.
Illustration of model distillation, via https://arxiv.org/abs/2006.05525
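A minimal sketch of this kind of “hard-label” distillation, where the student is simply fine-tuned on teacher-generated (prompt, answer) pairs with the ordinary next-token loss, looks roughly like the following. The model names are placeholders, the loop is drastically simplified (no batching, no masking of prompt tokens), and it assumes the two models share a tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_NAME = "big-teacher-model"    # hypothetical, e.g. an R1-style reasoning model
STUDENT_NAME = "small-student-model"  # hypothetical, e.g. a Qwen- or Llama-family model

tok = AutoTokenizer.from_pretrained(STUDENT_NAME)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_NAME).eval()
student = AutoModelForCausalLM.from_pretrained(STUDENT_NAME)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Prove that the sum of two even numbers is even."]  # toy stand-in for the curated prompt set

for prompt in prompts:
    # 1) The teacher generates a (prompt, answer) training example.
    with torch.no_grad():
        input_ids = tok(prompt, return_tensors="pt").input_ids
        full_sequence = teacher.generate(input_ids, max_new_tokens=256)

    # 2) The student is fine-tuned on that sequence with the standard language-modeling loss.
    loss = student(input_ids=full_sequence, labels=full_sequence).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```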
DeepSeek directly fine-tuned open-source models like Qwen and Llama using the 800k samples curated with DeepSeek-R1. The resulting smaller models performed quite well, matching the performance of much larger GPT and Claude models.
DeepSeek-R1-Distill, the “mini” models, perform similarly to larger models.
But is the $6M cost real?
The authors note, “The aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.” In other words, they exclude from their calculation the majority of the model’s development cost! This is somewhat like a new car company saying, hey, we built a brand new, never-before-seen car for only $1,000. In reality, that figure ignores all the R&D that must go into the car, the equipment needed to build and test it, and so on.
As Andrej Karpathy, cofounder of OpenAI and oft-cited source of pithy AI quotes, tweeted:
In other words, impressive though the research is, it doesn’t tell the full story. Many external sources have tried to verify the cost claims. Stanford researchers, working from the available evidence, suggest a bare minimum of $130M based on purchased compute. Other estimates suggest north of $1B. Still others claim $500M for hardware. It’s impossible to say until a third party attempts to replicate this result.
What is definitely true is that the $6M claim, taken as a headline, is misleading. Nobody can train a model from start to finish for that amount. The paper’s approach, however, does seem reasonable, and anyone can test the model itself on the benchmarks. It is also difficult to compare to others, as there is no standard for reporting cost metrics (e.g. OpenAI’s $100M to train GPT-4 likely also doesn’t include all training and data costs). Based on the Stanford analysis, a reasonable rough estimate is that the model cost about half as much to train as similar models. So, not 1/250th, but still a substantial improvement.
I should add the caveat that OpenAI has come out and claimed that DeepSeek stole its data to train its model. While the research paper certainly leaves many holes about how the data were generated (the details of the 600k data points, and the mystery “human-centered” reward function), these claims cannot be verified.
Race to the Cheap
So, where does this leave us? It is absurd to think that demand for compute is dead. That narrative is almost certainly hype, a point underscored by the massive spending announced this week by Google, Meta, and others. Even efficiently trained models can be enhanced with more compute, as well as more data.
Where it does leave us is that the landscape of ML has changed. Before, companies were mostly competing on model output (i.e. who has the most accurate model). The trend had been toward more data, with gargantuan parameter counts trotted out boastfully. Now, models must compete on efficiency as well, both in terms of cost and as a metric of prestige. We haven’t seen the last of efficiency innovations, as research continues to do more with less.

