Back in January of 2012, Russ Cox posted an excellent blog post detailing how Google Code Search had worked, using a trigram index.
By that point, I’d already implemented early versions of my own livegrep source-code search engine, using a different indexing approach that I developed independently, with input from a few friends. This post is my long-overdue writeup of how it works.
A suffix array is a data structure used for full-text search and other applications, primarily these days in the field of bioinformatics.
A suffix array is pretty simple, conceptually: It’s a sorted array of all of the suffixes of a string. So, given the string “hither and thither”, we could construct a suffix array like so:
Note that there is no need for each element of the array to store the entire suffix: We can just store the indexes (the left-hand column), along with the original string, and look up the suffixes as needed. So the above array would be stored just as
There exist algorithms to efficiently construct suffix arrays (O(n) time), and in-place (constant additional memory outside of the array itself).