A Picture is Worth 170 Tokens: How Does GPT-4o Encode Images?

submited by
Style Pass
2024-06-07 13:00:03

Here’s a fact: GPT-4o charges 170 tokens to process each 512x512 tile used in high-res mode. At ~0.75 tokens/word, this suggests a picture is worth about 227 words—only a factor of four off from the traditional saying.

(There’s also an 85 tokens charge for a low-res ‘master thumbnail’ of each picture and higher resolution images are broken into many such 512x512 tiles, but let’s just focus on a single high-res tile.)

OK, but why 170? It’s an oddly specific number, isn’t it? OpenAI uses round numbers like “\$20” or “\$0.50” in their pricing, or powers of 2 and 3 for their internal dimensions. Why choose a numbers like 170 in this instance?

Numbers that are just dropped into a codebase without explanation are called “magic numbers” in programming, and 170 is a pretty glaring magic number.

And why are image costs even being converted to token counts anyway? If it were just for billing purposes, wouldn’t it be less confusing to simply list the cost per tile?

Leave a Comment