From Bigram to Infinigram


I have been thinking a lot about how n-grams are great tools for showing some of the fundamentals of language models. With the recent infini-gram paper, I think they are becoming relevant again, so I decided to write an explanation of them.

A statistical language model assigns a probability to a sequence of tokens, and an n-gram model approximates the probability of each token by conditioning only on the previous n − 1 tokens. More formally:
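One common way to write this, with $w_i$ denoting the $i$-th token of a sequence of length $m$, is the chain-rule factorization under a Markov assumption of order $n - 1$:

$$P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$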

A bigram model is the special case of an n-gram model with n = 2: it is constrained to use only the previous token to predict the next one.
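With the same notation as above, the factorization then reduces to conditioning each token on just its predecessor:

$$P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1})$$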

To train a bigram model, you first build a vocabulary of the unique words in your corpus and then count every adjacent pair of words that occurs in it. Finally, you estimate each bigram probability by dividing the count of the pair by the count of its first word, i.e. $P(w_i \mid w_{i-1}) = \text{count}(w_{i-1}, w_i) / \text{count}(w_{i-1})$.
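As a sketch of what that counting looks like in practice (the function name, tokenization, and toy corpus below are my own illustration, not from the post), here is a minimal Python version:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Estimate bigram probabilities P(next | prev) from a list of token lists.

    A minimal sketch of the counting procedure described above; the input
    format (a list of pre-tokenized sentences) is an assumption.
    """
    pair_counts = Counter()
    prev_counts = Counter()
    for tokens in corpus:
        for prev, nxt in zip(tokens, tokens[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1

    # P(next | prev) = count(prev, next) / count(prev)
    probs = defaultdict(dict)
    for (prev, nxt), c in pair_counts.items():
        probs[prev][nxt] = c / prev_counts[prev]
    return probs

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(model["the"])  # {'cat': 0.5, 'dog': 0.5}
```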

If you lay these counts out as a matrix, with one row per first word and one column per second word, most of the entries are zeros. And as you scale n-grams up to higher orders like 5-grams, you get extremely sparse tables that take enormous amounts of space. Let's not forget that the curse of dimensionality in language bites really hard, and pretty fast, for these statistical models.
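To get a feel for how fast that sparsity grows, here is a quick back-of-the-envelope calculation; the vocabulary and corpus sizes are round numbers I picked for illustration, not figures from any particular dataset:

```python
# With a vocabulary of |V| words, a table of all possible n-grams has |V|**n
# cells, but a corpus of T tokens can fill at most T of them, so the table
# gets emptier (sparser) very quickly as n grows.
V = 50_000         # assumed vocabulary size
T = 1_000_000_000  # assumed corpus size in tokens
for n in (2, 3, 5):
    cells = V ** n
    print(f"{n}-gram table: {cells:.1e} cells, "
          f"at most {T / cells:.1e} fraction non-zero")
```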
