
Rooting out Gremlins

submitted by
Style Pass
2024-10-21 03:30:04

I’ve long been fascinated by gremlins in my text. Recently I’ve seen them crop up twice and figured it was about time I learned to explain how we get these garbled junk characters. You’ve seen them before; have you wondered why they exist?

In short, when a system confuses character encodings or applies special rules from the HTML specification, there are 27 characters which cause all the funny business. Read on for the whole story.

This may be old hat to you, so feel free to skip ahead, but a short refresher on character encodings may help here. Computers have different ways of storing text. Almost all of these methods involve a table lookup, where a whole number represents an index into that table for a specific letter. For example, the letter A is found at index 65 in almost all character mapping tables. These tables are called code pages.
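As a rough illustration of the table lookup, the Python sketch below checks where the letter A lands in a few code pages, then decodes one byte value through several tables. The particular encodings (`ascii`, `latin-1`, `cp1252`, `cp437`, `utf-8`) are my own picks for the demonstration, not ones the discussion above singles out.

```python
# Where does the letter "A" sit in a handful of character tables?
# The encodings chosen here are illustrative assumptions.
encodings = ["ascii", "latin-1", "cp1252", "cp437", "utf-8"]
index_of_a = {name: "A".encode(name)[0] for name in encodings}
print(index_of_a)  # every table places "A" at the same index

# The same index can name different characters in different tables.
raw = bytes([0x80])
as_cp1252 = raw.decode("cp1252")    # Windows-1252 table
as_cp437 = raw.decode("cp437")      # original IBM PC table
as_latin1 = raw.decode("latin-1")   # ISO 8859-1: an invisible control character
print(repr(as_cp1252), repr(as_cp437), repr(as_latin1))
```

The tables agree on A, but index 128 already gives three different answers, which is the kind of disagreement that produces gremlins when a system picks the wrong table.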

Now computers also have different ways of storing numbers. A whole number can be represented in many ways too. When the code pages were small, each whole number in the code page fit into a single byte with room to spare. But as we started storing more and more characters we quickly ran out of room. For example, it’s not possible to store the whole number index 3418 in a single byte, because a byte can hold only 256 distinct values (0 through 255).
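To make the size limit concrete, here is a small Python sketch showing that 3418 overflows one byte but fits in two. It also shows that the two bytes can be written in either order, which is one way the same whole number ends up stored differently. The variable names are only for illustration.

```python
index = 3418

# A single byte covers 0..255, so this index is out of range.
fits_in_one_byte = 0 <= index <= 0xFF

# Python refuses to build a one-byte value from it.
try:
    bytes([index])
    one_byte_error = None
except ValueError as exc:
    one_byte_error = str(exc)

# Two bytes are enough (3418 == 0x0D5A), but the byte order is a choice.
big_endian = index.to_bytes(2, byteorder="big")
little_endian = index.to_bytes(2, byteorder="little")

print(fits_in_one_byte, one_byte_error)
print(big_endian.hex(), little_endian.hex())
```

Both byte strings decode back to 3418 only if the reader assumes the same order the writer used.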
