"German string" optimizations in Spellbook

submited by
Style Pass
2024-11-05 13:30:02

Spellbook is a Rust spell-checking library I've written the style of Hunspell to bring spell checking to the Helix editor. It's more-or-less a Rust rewrite of Nuspell, which itself is more-or-less a rewrite of Hunspell. Spellbook has a pretty slim interface: you can instantiate a dictionary from Hunspell dictionary files and use it to check words. For a small example of how you might use Spellbook:

How Spellbook works exactly is beyond the scope of this post, so this section gives a simplified overview and deals with simplified types. If you're interested in more details, check out the Spellbook README or @zverok's Rebuilding the Spellchecker blog post and the Spellbook internals document.

A central part of the procedure to check a word is to look up word(s) in a hash table. This lookup table contains an entry for each "stem" in the dictionary. You might imagine that the Dictionary type is a wrapper around a HashSet<String>. This is correct in essence but Hunspell-like checkers don't store every possible word in memory. Instead there is some "compression." For an example from the en_US (American English) dictionary, the lookup table in Spellbook associates a stem "adventure" with a set of flags like 'D', 'R' and 'S'. The flags correspond to rules defined for the dictionary allowing transformations like prefixes and suffixes. 'D' for example allows adding the "d" (or "ed" or "ied", depending on the stem) suffix, producing "adventured." 'R' allows "adventurer" and 'S' allows "adventures." So we can imagine that the lookup table has a type similar to HashMap<String, HashSet<Flag>>.

Despite the "compression" that prefixes and suffixes enable, the lookup table contains many entries. The exact number varies with which dictionary files you use as input but American English contains around 50,000 stems, and it's a relatively slim dictionary. Others contain hundreds of thousands or even millions of stems, so it's worth trying to optimize the space we take for each stem.

Leave a Comment