
Transformers Dashboard 🤖📈


Since the publication of the now famous 2017 paper Attention is All You Need [1], many large language models based on the transformer architecture have emerged. Fortunately, some studies [2] [3] have compiled extensive data on many published models, including the dimensions of their transformers.

Much like my experience learning about CNNs and their growth in complexity, I wanted to analyze LLM transformers. Which models are the largest? What is the optimal size for the feed-forward layer? Is it better to increase the embedding size or the number of attention heads? Can we easily derive the total number of parameters from the network dimensions?

MHA : each of the $h$ attention heads has query, key and value projections with biases, and the concatenated heads go through an output projection back to $d_{\textrm{model}}$.

$$ \begin{aligned} P_{\textrm{MHA}} &= h (2d_{\textrm{model}}d_k + 2d_k + d_{\textrm{model}}d_v + d_v) + hd_vd_{\textrm{model}} + d_{\textrm{model}} \\ &= 4 (d_{\textrm{model}}^2 + d_{\textrm{model}}) \end{aligned} $$

The second line uses the standard choice $d_k = d_v = d_{\textrm{model}}/h$.
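As a sanity check, here is a minimal Python sketch that counts the MHA parameters from the dimensions and confirms the simplification above (the `mha_params` helper is just for illustration, not code from the dashboard):

```python
from typing import Optional

def mha_params(d_model: int, h: int, d_k: Optional[int] = None, d_v: Optional[int] = None) -> int:
    """Parameters of one multi-head attention block (projections, biases, output projection)."""
    d_k = d_k if d_k is not None else d_model // h
    d_v = d_v if d_v is not None else d_model // h
    per_head = 2 * d_model * d_k + 2 * d_k + d_model * d_v + d_v  # W_Q, W_K, W_V + their biases
    output_proj = h * d_v * d_model + d_model                     # W_O + its bias
    return h * per_head + output_proj

# With d_k = d_v = d_model / h, the count collapses to 4 * (d_model^2 + d_model):
assert mha_params(d_model=512, h=8) == 4 * (512**2 + 512)
```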

Encoder : each of the $N$ encoder layers has one MHA and one FFN, each followed by a norm layer. $$ P_{\textrm{encoder}} = N (P_{\textrm{MHA}} + P_{\textrm{FFN}} + 2P_{\textrm{LN}} )$$
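Assuming the usual forms for the remaining blocks (a two-layer FFN of inner width $d_{\textrm{ff}}$ with biases, and a layer norm with one gain and one bias vector, so $P_{\textrm{LN}} = 2d_{\textrm{model}}$), the encoder count can be sketched as follows, reusing `mha_params` from above; the example numbers are simply the original Transformer base configuration:

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Two-layer feed-forward block: d_model -> d_ff -> d_model, with biases."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

def layernorm_params(d_model: int) -> int:
    """Layer norm: one gain and one bias vector of size d_model."""
    return 2 * d_model

def encoder_params(n_layers: int, d_model: int, h: int, d_ff: int) -> int:
    """P_encoder = N * (P_MHA + P_FFN + 2 * P_LN)."""
    per_layer = mha_params(d_model, h) + ffn_params(d_model, d_ff) + 2 * layernorm_params(d_model)
    return n_layers * per_layer

# Transformer base encoder (N=6, d_model=512, h=8, d_ff=2048): about 18.9M parameters,
# excluding the token and positional embeddings.
print(encoder_params(n_layers=6, d_model=512, h=8, d_ff=2048))  # 18914304
```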
