DeepSeek had a big release last week with their DeepSeek-V3 671B-parameter MoE model, alongside an excellent paper detailing a wide array of system optimizations. Among these, their efficient training economics, success with FP8, and network-aware MoE design stood out to me most. We will use Meta’s Llama 3 405B as the main reference point for the analysis12, as the most recent and largest other model of comparable quality with an openly documented paper and architecture. Just to be clear, I have no affiliation with either party; a post like this is only possible thanks to the extremely high-quality work and documentation from both. Let’s dive in!
In what follows, “pre-training” refers strictly to the initial pre-training stage: we omit DeepSeek’s two subsequent context-extension stages (to 32K, then to 128K) and Llama’s context-extension (to 128K) and learning-rate-annealing stages; in both cases, these subsequent stages account for roughly 5% of the total token count and training cost. Also, FLOPs should be read as “floating point operations” and FLOP/s as “floating point operations per second”. Lastly, for FLOP calculations, I assume both parties use attention kernels with causal masking, so the number of attention operations is halved compared to full (bidirectional) attention.
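To make the halving concrete, here is a minimal sketch (my own, not from either paper) of the per-token, per-layer attention FLOPs at a Llama-like width and window, counting only the score and value matmuls in the forward pass:

```python
# Illustrative per-token, per-layer attention FLOPs at sequence length S, model width d.
S, d = 8192, 16384

full_attn = 2 * S * d + 2 * S * d            # QK^T scores + scores @ V, every query sees all S keys
causal_attn = full_attn * (S + 1) / (2 * S)  # query at position i sees only i+1 keys, ~S/2 on average

print(f"full: {full_attn / 1e9:.2f} GFLOPs, causal: {causal_attn / 1e9:.2f} GFLOPs")
# causal is ~50% of full, which is why attention FLOPs are halved throughout
```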
Let’s compare pre-training cost in GPU hours. For DeepSeek, this is reported as 2.7M H800 hours. For Llama, we can back it out: at an 8K sequence length, 1 token needs ≈2.54 TFLOPs3, so 15.7T tokens need ≈39.9T TFLOPs (note the double T: that is ≈4.0 × 10^25 FLOPs); estimating their average BF16 MFU at 42%, with an H100 peak of 990 BF16 TFLOP/s, gives ≈416 TFLOP/s per GPU, and hence ≈26.6M H100 hours. Note this may be a slight overestimate, as Llama masks attention between documents in the same 8K window, but even a 1K average document length reduces total cost by only ≈4%, since FFN size > hidden size > 8K window ≫ 1K window for Llama.
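For anyone who wants to poke at the estimate, here is the same back-of-envelope in a few lines of Python. All inputs are the approximate figures above (per-token FLOPs, token count, assumed MFU), and 990 TFLOP/s is the H100’s dense BF16 peak:

```python
# Back-of-envelope reproduction of the Llama 3 405B GPU-hour estimate above.
tokens = 15.7e12            # pre-training tokens
flops_per_token = 2.54e12   # ≈2.54 TFLOPs per token at 8K context (see footnote 3)
mfu = 0.42                  # assumed average BF16 MFU
peak_bf16_flops = 990e12    # H100 dense BF16 peak, FLOP/s

total_flops = tokens * flops_per_token       # ≈4.0e25 FLOPs
achieved_flops = mfu * peak_bf16_flops       # ≈416 TFLOP/s per GPU
gpu_hours = total_flops / achieved_flops / 3600

print(f"total: {total_flops:.3e} FLOPs, per-GPU: {achieved_flops / 1e12:.0f} TFLOP/s, "
      f"GPU-hours: {gpu_hours / 1e6:.1f}M")
# ≈26.6M H100 hours, versus DeepSeek-V3's reported 2.7M H800 hours
```

The result scales inversely with the MFU assumption, so a couple of points of MFU in either direction moves the estimate by roughly a million GPU hours.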