Summoning Cthulhu by Parsing HTML with Regular Expressions

submitted by
Style Pass
2021-05-27 03:00:02

Regular expressions are an incredibly powerful tool. It is no wonder that so many people have flocked to internet forums to ask, “can I use regex to parse HTML?” According to this legendary Stack Overflow post, the answer is a resounding no. In fact, the post implies that attempting to parse HTML with regex will summon a Lovecraftian entity known as Z̷͎̐a̶͉̓l̵̟͛g̷̺̐ȏ̸̙, and should therefore be avoided at all costs.

“A mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming.”

Numerous reasons are given in the Stack Overflow thread for why parsing HTML with regex is a no-go. Intimidating terminology such as “Chomsky Type 2 grammar” is thrown around. Some replies claim that it is “mathematically impossible”, with one comparing it to dividing by zero. Looks like there’s nothing more to see here. After all, if the top replies on Stack Overflow say it’s impossible, then it must be impossible, right?

Don’t get me wrong. Using regex to parse HTML is a very, very bad idea. However, the reasoning in that Stack Overflow thread is not entirely sound in the context of modern programming languages. In this blog post, I’ll explain why, in some programming languages, if you really, really wanted to, you could write a regex to match HTML.
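To make the "modern programming languages" point concrete, here is a minimal sketch, assuming Python and its built-in re module (the choice of language and the names PAIRED_TAG and first_element are made up for illustration, not taken from the Stack Overflow thread). It relies on a backreference, a feature that goes beyond what a textbook regular expression can express, to require that a closing tag repeat the name of the opening tag.

```python
import re

# \1 is a backreference: the closing tag must repeat whatever name
# group 1 captured in the opening tag. Textbook regular expressions
# have no such feature.
# Assumption for this sketch: lowercase tag names, no attributes.
PAIRED_TAG = re.compile(r"<([a-z][a-z0-9]*)>(.*?)</\1>", re.DOTALL)


def first_element(html):
    """Return (tag name, inner text) of the first open/close pair, or None."""
    match = PAIRED_TAG.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(first_element("<p>hello <b>world</b></p>"))
print(first_element("<p>oops</b>"))
```

The first call pairs the opening p tag with the closing p tag and skips over the inner b element; the second finds no pair at all because the names differ. This is still a long way from matching HTML: Python's built-in re has no recursive patterns, so an element nested inside another element of the same name is cut off at the first closing tag.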
