Tokens for LLMs: Byte Pair Encoding in Go


A basic unit of currency in modern LLMs is the token; exciting new models have long context windows of millions of tokens. API pricing for the large providers is per-token. We're even seeing the invention of new, derived units like TPM (tokens per minute).

This OpenAI help article tells us that tokens are pieces of words, and gives some useful rules of thumb: for English text, a token is equivalent to approximately 4 characters, or about 3/4 of a word.
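
To make these rules of thumb concrete, here's a quick back-of-the-envelope estimate in Go. The sample sentence and the arithmetic are my own illustration, not from the help article; real token counts depend on the actual vocabulary and tokenizer:

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	text := "Byte pair encoding splits text into tokens."

	// Rule of thumb 1: roughly one token per 4 characters of English text.
	fmt.Println("chars/4 estimate:", len(text)/4) // 43 chars -> ~10 tokens

	// Rule of thumb 2: a token is roughly 3/4 of an English word,
	// i.e. about 4 tokens for every 3 words.
	words := len(strings.Fields(text))
	fmt.Println("words*4/3 estimate:", words*4/3) // 7 words -> ~9 tokens
}
```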

In this post I want to review the most commonly used algorithm for splitting text into tokens, provide a complete implementation in Go, and show a playground for experimenting with it. While my implementation isn't tuned for speed, it aims to be complete, readable, and compatible with OpenAI's tiktoken library: it generates identical results and works with the same vocabulary files.

Byte pair encoding (BPE) is an algorithm originally designed for data compression. A 2016 paper (Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units") suggested re-purposing it for "word segmentation" for machine learning tasks. The colloquial term for word segmentation is tokenization.
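
To give a flavor of how BPE works before we get to the full implementation, here's a minimal sketch of the core training loop: count adjacent token pairs, merge the most frequent pair into a new token, and repeat. The helper names and the tiny input are my own toy illustration, not the tiktoken-compatible code developed later in the post:

```go
package main

import "fmt"

// mostFrequentPair finds the adjacent token pair that occurs most often
// in the sequence. Ties are broken arbitrarily in this sketch; ok is
// false if the sequence has fewer than two tokens.
func mostFrequentPair(tokens []string) (a, b string, ok bool) {
	counts := map[[2]string]int{}
	for i := 0; i+1 < len(tokens); i++ {
		counts[[2]string{tokens[i], tokens[i+1]}]++
	}
	best := 0
	for pair, n := range counts {
		if n > best {
			best, a, b = n, pair[0], pair[1]
		}
	}
	return a, b, best > 0
}

// mergePair replaces every adjacent occurrence of (a, b) with the
// concatenated token a+b.
func mergePair(tokens []string, a, b string) []string {
	var out []string
	for i := 0; i < len(tokens); i++ {
		if i+1 < len(tokens) && tokens[i] == a && tokens[i+1] == b {
			out = append(out, a+b)
			i++ // skip the second half of the merged pair
		} else {
			out = append(out, tokens[i])
		}
	}
	return out
}

func main() {
	// Start from individual characters and apply a few merges.
	tokens := []string{"l", "o", "w", "l", "o", "w", "e", "r"}
	for i := 0; i < 3; i++ {
		a, b, ok := mostFrequentPair(tokens)
		if !ok {
			break
		}
		tokens = mergePair(tokens, a, b)
		fmt.Printf("merge %q+%q -> %v\n", a, b, tokens)
	}
}
```

A real byte-level BPE tokenizer records the order of these merges during training and replays them when encoding new text; that's what the vocabulary files mentioned above capture.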
