Tokenization in large language models, explained

In April, paying subscribers of the Counterfactual voted for a post explaining how tokenization works in LLMs. As always, feel free to subscribe if you’d like to vote on future post topics; however, the posts themselves will always be made publicly available.

Large Language Models (LLMs) like ChatGPT are often described as being trained to predict the next word. Indeed, that’s how I described them in my explainer with Timothy Lee.

But that’s not exactly right. Technically, modern LLMs are trained to predict something called tokens. Sometimes “tokens” are the same thing as words, but sometimes they’re not—it depends on the kind of tokenization technique that is used. This distinction between tokens and words is subtle and requires some additional explanation, which is why it’s often elided in descriptions of how LLMs work more generally.
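To make the distinction concrete, here is a minimal sketch (mine, not from the post) of how a subword tokenizer splits words. It assumes the Hugging Face transformers library and the GPT-2 tokenizer, which I've chosen purely for illustration; other LLMs use different tokenizers and will split words differently.

```python
# Sketch: comparing words to the tokens a subword tokenizer produces.
# Assumes the "transformers" library is installed; GPT-2's tokenizer is
# used here only as an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["cat", "cats", "tokenization", "antidisestablishmentarianism"]:
    tokens = tokenizer.tokenize(word)
    print(f"{word!r} -> {tokens}")

# Frequent words like "cat" typically map to a single token, while rarer or
# morphologically complex words get split into several subword pieces
# (e.g., something like "token" + "ization").
```

The upshot is that the model's vocabulary isn't a dictionary of words: it's a set of subword pieces, and whether a given word survives intact or gets broken apart depends on the tokenization technique and the data it was trained on.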

But as I’ve written before, I do think tokenization is an important concept to understand if you want to understand the nuts and bolts of modern LLMs. In this explainer, I’ll start by giving some background on what tokenization is and why people do it. Then, I’ll talk about some of the different tokenization techniques out there, and how they relate to research on morphology in human language. Finally, I’ll discuss some of the recent research that’s been done on tokenization and how that impacts the representations that LLMs learn.
