Let's run some NFAs — 0xd34df00d.me

submited by
Style Pass
2024-10-12 04:30:33

Lately, I’ve been playing around with memoized NFAs for optimized regular expression matching, with features like lookahead and atomic groups, based on this paper. The original authors have their code in Scala, and I thought it’d be fun to code something in Haskell to see how it stacks up against their new implementation and the prior art.

But before diving into memoization and the more complex features, let’s start with the basics. In this post, we’ll focus on a simple, naive backtracking NFA implementation. We’ll start with the simplest, regexp 101 code and then make it significantly faster, step by step. We’ll also inevitably face some dead ends — that’s part of learning and experimentation, too!

To ground our work in reality, I’ll also implement some of the algorithms in C++, praised for its performance advantages over pretty much everything else. Is the praise deserved here? Let’s find out.

Typically, we’ll have q ~ Word32, but it’s nice to have it abstracted away: it’s easier to change the type this way, and we also won’t mistake q for anything else (like string length or position) even if it’s instantiated to the same type. Yay type safety!

Leave a Comment