
ByT5: What It Might Mean For SEO

Submitted by
Style Pass
2021-06-20 23:00:05

I don’t read every paper they put out, obviously, but this one really caught my eye. The title of the paper is “ByT5: Towards a token-free future with pre-trained byte-to-byte models”. It’s the “token-free” that drew me in.

A token in Natural Language Processing is a representation of a word, word segment (subword), or character. When text is processed, a tokenizer breaks it into tokens so the system can handle it with historically higher efficiency than processing the same text character by character.
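To make the granularity difference concrete, here is a minimal sketch comparing a crude word-level split with byte-level tokenization (the kind ByT5 operates on). The whitespace split is a deliberate simplification, not how production tokenizers work:

```python
# Word-level vs byte-level tokenization of the same sentence.
# Byte-level tokenizers produce far more tokens per sentence, which is
# the efficiency trade-off described above.
text = "ByT5 is token-free"

word_tokens = text.split()                 # crude word-level split
byte_tokens = list(text.encode("utf-8"))   # byte-level, one token per byte

print(len(word_tokens))  # 3
print(len(byte_tokens))  # 18
```

The same 18-character sentence costs 3 tokens at the word level but 18 at the byte level, which is why byte-level models historically looked less efficient.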

Some words require multiple tokens. For example, the word “playing” might be split into one token for “play” and one for “ing”, since “ing” carries its own meaning in NLP; splitting words this way also keeps the number of tokens needed for a language under control.
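The “playing” → “play” + “ing” split can be sketched with a toy greedy longest-match subword tokenizer. The vocabulary here is hypothetical and tiny; real tokenizers like BERT’s WordPiece use learned vocabularies of tens of thousands of pieces:

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match subword split (toy illustration, not WordPiece)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest matching piece starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

vocab = {"play", "ing"}  # hypothetical mini-vocabulary
print(subword_tokenize("playing", vocab))  # ['play', 'ing']
```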

To give you an idea of how token count limits what can be done: models like BERT can process at most 512 tokens at once before the compute cost becomes too high to be practical.
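One practical consequence of that 512-token ceiling is that longer inputs must be truncated or split into chunks before the model sees them. A minimal sketch of the chunking approach:

```python
# BERT's documented maximum sequence length.
MAX_TOKENS = 512

def chunk_tokens(tokens, max_len=MAX_TOKENS):
    """Split a token list into consecutive chunks the model can process."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

tokens = [f"tok{i}" for i in range(1200)]  # a 1200-token document
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])  # 3 [512, 512, 176]
```

This is also why byte-level models face a harder efficiency problem: the same text consumes many more of those limited positions when every byte is a token.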
