Yashovardhan Srivastava | A genius, shy and broke bloke.


Foreword: The title is clickbait. I don't actually know how to scale attention to serve a billion users; it's a genuinely complicated problem with a lot of moving parts and optimizations to keep in mind. In this blog I'm going to explore one approach that I find really interesting. I got the idea to write this post after watching Horace He's talk with Jane Street, and I hope I was able to do it justice. I've also linked the resources I referred to while piecing this blog together. Get a cup of coffee, sit somewhere nice, and enjoy.

"Attention Is All You Need" was a pivotal paper that marked a revolution in the AI industry. All of the breakthroughs we see today in the AI space can be traced back to that famous paper. The authors of the paper are really influential too, but that's a story for another blog.

The key idea introduced in the paper, and the heart of the transformer architecture, is scaled dot-product attention and self-attention. For each token in the input sequence, three vectors are computed dynamically: queries (Q), keys (K), and values (V), which allow the model to focus on different parts of the input. Together, these three vectors make up one "head" of attention. The attention scores are calculated as:
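$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors; dividing by $\sqrt{d_k}$ keeps the dot products from growing so large that the softmax saturates.

To make this concrete, here is a minimal NumPy sketch of a single attention head. The function name, the toy shapes, and the random projection matrices are my own illustrative assumptions, not the paper's reference code; they are just meant to show the shape of the computation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention for one head.

    Q, K, V: arrays of shape (seq_len, d_k), one row per token.
    Returns an array of shape (seq_len, d_k): for each query token,
    a weighted average of the value vectors.
    """
    d_k = Q.shape[-1]
    # Raw similarity between every query and every key,
    # scaled by sqrt(d_k) so the softmax doesn't saturate.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the value vectors.
    return weights @ V

# Toy usage: a sequence of 4 tokens with head dimension 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8)
```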
