Recently Meta released the LLaMA models, and they surprised many people - they packed a lot of magic into a small size. The 13B version is comparable in quality to OpenAI's largest GPT-3 model at 175B parameters.
BigCode released the StarCoder model, which hits 30.4% pass@1 on HumanEval, and they also released The Stack, a code dataset cleaned of personally identifiable information.
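For readers unfamiliar with the HumanEval numbers quoted here: pass@1 is the probability that a single generated sample passes the unit tests. A minimal sketch of the standard unbiased pass@k estimator (the formula popularized by the HumanEval benchmark; the function name is mine):

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k: probability that at least one of k samples passes,
    # estimated from n generated samples of which c passed the unit tests.
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this reduces to the fraction of passing samples:
pass_at_k(10, 3, 1)  # 3 of 10 samples passed -> pass@1 = 0.3
```

So "30.4% pass@1" means roughly 30 out of 100 problems are solved by the model's first attempt.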
Replit recently released the replit-code-v1-3b model, trained on code and following some of the LLaMA innovations. It shows great metrics, but it has no fill-in-the-middle capability, no diffs, and it has seen no data other than code.
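Fill-in-the-middle, mentioned above, is usually obtained with a simple data transform at training time: split each document into prefix/middle/suffix and reorder the pieces so the model learns to generate the middle given both sides. A minimal sketch, with illustrative sentinel strings rather than any real tokenizer's special tokens:

```python
import random

def fim_transform(text, rng):
    # Pick two random cut points, split into prefix / middle / suffix,
    # then emit prefix and suffix first so the model predicts the middle.
    # "<PRE>", "<SUF>", "<MID>" are placeholder sentinels for illustration.
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    return "<PRE>" + prefix + "<SUF>" + suffix + "<MID>" + middle

rng = random.Random(0)
example = fim_transform("def add(a, b): return a + b", rng)
```

At inference time the editor supplies the real prefix and suffix around the cursor, and the model completes after the middle sentinel.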
The number one thing about LLaMA is that it was trained for 1T tokens (1.4T for the larger models). But that alone is not enough: the transformer architecture and hyperparameters must be right for training to keep improving for that long.
Architecture: LLaMA has no bias terms in self-attention or in the MLP - that probably allows weight decay to work better. Running self-attention and the MLP in parallel rather than sequentially makes the block a bit faster, because the two branches don't have to wait for each other (a design from GPT-J and PaLM; LLaMA itself keeps them sequential). LLaMA also uses RMSNorm instead of LayerNorm, but that shouldn't be important for quality - RMSNorm is just a bit cheaper, since it skips mean centering and the bias.
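To make the normalization and no-bias points concrete, here is a minimal NumPy sketch (an illustration, not LLaMA's actual PyTorch implementation):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm rescales by the root mean square of the activations.
    # Unlike LayerNorm, it does not subtract the mean and has no bias -
    # only a learnable per-channel gain.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def linear_no_bias(x, w):
    # LLaMA-style projection: a plain matmul with no bias term,
    # used in both the attention projections and the MLP.
    return x @ w

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = rms_norm(x, np.ones(4))        # gain initialized to 1
z = linear_no_bias(y, np.eye(4))   # identity weights, just to show shapes
```

After RMSNorm, the activations have a root mean square of about 1, which is all the scale control the later matmuls need; dropping the mean subtraction and bias saves a little compute and leaves fewer parameters for weight decay to fight with.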