Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

submited by

Style Pass

2021-05-29 14:30:03

This is a tale of two approaches to regular expression matching. One of them is in widespread use in the standard interpreters for many languages, including Perl. The other is used only in a few places, notably most implementations of awk and grep. The two approaches have wildly different performance characteristics:

Let's use superscripts to denote string repetition, so that a?³a³ is shorthand for a?a?a?aaa. The two graphs plot the time required by each approach to match the regular expression a?nan against the string an.

Notice that Perl requires over sixty seconds to match a 29-character string. The other approach, labeled Thompson NFA for reasons that will be explained later, requires twenty microseconds to match the string. That's not a typo. The Perl graph plots time in seconds, while the Thompson NFA graph plots time in microseconds: the Thompson NFA implementation is a million times faster than Perl when running on a miniscule 29-character string. The trends shown in the graph continue: the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 1015 years. (Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the above graph could have been Python, or PHP, or Ruby, or many other languages. A more detailed graph later in this article presents data for other implementations.)

It may be hard to believe the graphs: perhaps you've used Perl, and it never seemed like regular expression matching was particularly slow. Most of the time, in fact, regular expression matching in Perl is fast enough. As the graph shows, though, it is possible to write so-called “pathological” regular expressions that Perl matches very very slowly. In contrast, there are no regular expressions that are pathological for the Thompson NFA implementation. Seeing the two graphs side by side prompts the question, “why doesn't Perl use the Thompson NFA approach?” It can, it should, and that's what the rest of this article is about.

Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

Leave a Comment

Related Posts

Recent Posts

Search code, repositories, users, issues, pull requests...

Tesla’s Autopilot and Full Self-Driving linked to hundreds of crashes, dozens of deaths

Microsoft releases MS-DOS 4 source code on GitHub — 45 year old code now open-source

Worse Than You Can Imagine 61

A Strong ‘Night Effect’ for Bitcoin: All of bitcoin’s gains come from overnight

Wisdom from Marcus Aurelius - by Gurwinder - The Prism

Creatine found to improve cognitive performance during sleep deprivation

Auto Safety Regulator Investigating Tesla Recall of Autopilot

In the face of bans, ByteDance tightens grip over US TikTok operations

You can’t teach caring… - by Jos Visser - Wednesday Wisdom

Why contributing to open source is scary and how to contribute anyway

Announcing two new LMS libraries

adding activitypub to humungus

DirectoryTree Authorization is a Native Role and Permission Management Package for Laravel

UMaine’s new 3D printer smashes former Guinness World Record to advance the next generation of advanced manufacturing - UMaine News - University of Maine

By-Product Valorization as a Means for the Brewing Industry to Move toward a Circular Bioeconomy

Scientists capture X-rays from upward positive lightning

A Chrome feature is creating enormous load on global root DNS servers

Learn one thing at a time | Lawrence Jones

Nightly Postgres Backups via GitHub Actions