Index 1,600,000,000 Keys with Automata and Rust

submited by
Style Pass
2021-08-16 22:30:07

It turns out that finite state machines are useful for things other than expressing computation. Finite state machines can also be used to compactly represent ordered sets or maps of strings that can be searched very quickly.

In this article, I will teach you about finite state machines as a data structure for representing ordered sets and maps. This includes introducing an implementation written in Rust called the fst crate. It comes with complete API documentation. I will also show you how to build them using a simple command line tool. Finally, I will discuss a few experiments culminating in indexing over 1,600,000,000 URLs (134 GB) from the July 2015 Common Crawl Archive.

Along the way, we will talk about memory maps, automaton intersection with regular expressions, fuzzy searching with Levenshtein distance and streaming set operations.

Target audience: Some familiarity with programming and fundamental data structures. No experience with automata theory or Rust is required.

Leave a Comment