Right to Left (R2L) Integer Tokenization

By some twist of fate, this blog has become the chronicle of the evolution of integer tokenization. In an earlier post in February of 2023, it was discussed how older models such as GPT-2 and GPT-3 tokenized integers by naively applying Byte Pair Encoding (BPE) directly to numbers, and how bizarre and arbitrary the results of such a process were. The GPT-2 and GPT-3 tokenizer assigned a large number of integers their own tokens while splitting other numbers arbitrarily: because BPE merges frequently occurring byte sequences, small numbers, recent years (1930-2020), and common longer values such as 10000 each ended up with their own token. An implication of this number partitioning scheme is that the model is forced to learn the associations and rules of arithmetic again and again, with varying amounts of data for each token, and to simply memorize a large number of arithmetic operations involving unique tokens.
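For concreteness, these splits can be inspected directly with OpenAI's tiktoken library. The snippet below is a minimal sketch (assuming tiktoken is installed and using its bundled GPT-2 encoding) that prints how a handful of integers get broken into tokens; the particular numbers chosen are just illustrative.

```python
import tiktoken

# Load the GPT-2 BPE encoding that ships with tiktoken.
enc = tiktoken.get_encoding("gpt2")

# A mix of small numbers, year-like values, and longer integers.
for s in ["7", "42", "1997", "2020", "10000", "123456789"]:
    token_ids = enc.encode(s)
    # Decode each token id individually to see where the splits fall.
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{s:>10} -> {pieces}")
```

Running this makes the arbitrariness visible: some numbers come back as a single piece, while others of similar length are cut into fragments whose boundaries have nothing to do with place value.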

In a more recent post this May, we studied the integer tokenization of newer models and found that it was much less insane. This is because most recent models (GPT-3.5 onwards) have moved away from pure BPE and converged on one of two strategies:
