Why Full Text Search is Hard

submitted by
Style Pass
2024-05-04 14:00:02

It’s easy to find documents containing "large" and "elephant". It’s hard to find documents in German which have "large" and "elephant" together in a sentence, or words with similar meanings to large, and provide only the 10 most relevant documents.

The sense that full-text search should be easy often stems from fixating on the middle part of the problem: "What’s so hard about implementing an inverted index?" And it’s not hard. If the use case is satisfied by a query that is a set of words, with only exact-match documents returned, then that is a very tractable problem domain. It’s all the challenges outside of that which are hard.
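To show how small the tractable part is, here is a minimal sketch of an inverted index with exact-match, all-words-required queries. The function names `build_index` and `search` and the sample documents are made up for illustration, and splitting on whitespace is an assumption that only holds for some languages.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased, whitespace-separated word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query word (exact matches only)."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so intersecting does not mutate the index.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "a large elephant walked by",
    2: "the elephant was small",
    3: "large cars need a large car wash",
}
index = build_index(docs)
print(search(index, "large elephant"))  # {1}
```

Note that `search(index, "washing cars")` returns an empty set even though document 3 is about washing cars: exact matching has no notion of word forms, which is where the difficulty starts.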

When a user searches for "car wash", should documents with "washing cars" in them match? If so, a stemming algorithm is now required to normalize declined and conjugated words to a standard searchable form. Except not all users speak English and not all documents are in English, so a stemmer per language is required. And each language introduces its own language-specific challenges. Chinese, Japanese, and Korean don’t have whitespace for word separation, so one needs a completely different way of identifying words there (a "CJK tokenizer"). Don’t forget that Thai doesn’t use spaces around words, but instead uses them to separate phrases or sentences. German is well known for its compound words, so those need to be split apart. Russian is highly conjugated/declined and highly irregular. Hebrew needs normalization, as letters can change based on position. Indian languages often get written in informal English phonetics rather than in the "proper" script, e.g. Telugu script. Supporting a global service for searches means supporting many languages, and that’s hard.
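As a rough illustration of the stemming step, the sketch below strips a few English suffixes so that "car wash" and "washing cars" normalize to the same terms. `naive_stem` and `normalize` are hypothetical helpers, not a real stemming algorithm: the suffix list is an assumption chosen for this example, and it mishandles plenty of English words (for instance, "glass" becomes "glas").

```python
def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters of stem.

    A toy rule set for illustration only; real stemmers use far more careful rules.
    """
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase, split on whitespace, and stem each word into a set of terms."""
    return {naive_stem(word) for word in text.lower().split()}

# Both phrasings reduce to the same set of terms.
print(normalize("car wash") == normalize("washing cars"))  # True

# Whitespace splitting finds no word boundaries in Chinese text:
# the whole sentence comes back as a single token.
print(len("我喜欢洗车".split()))  # 1
```

The second example is why a separate tokenizer is needed for languages written without spaces between words: the splitting step fails before stemming is even reached.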

Even within one language, should "car washing" return documents with "car cleaning" in them? Now an understanding of synonyms in a language is needed. Should "the red cat" return documents with "the" in them? Now an understanding of which words don’t carry meaning (stop words) is needed. (And again, remember that this is per language!)
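One way to picture stop words and synonyms is as per-language tables consulted while preparing the query. The sketch below is an assumption-laden illustration: the `STOP_WORDS` and `SYNONYMS` tables are tiny hand-written stand-ins, and `expand_query` is a hypothetical helper, not part of any particular search engine.

```python
# Hypothetical, deliberately tiny per-language tables.
STOP_WORDS = {
    "en": {"the", "a", "an", "of"},
    "de": {"der", "die", "das", "ein"},
}
SYNONYMS = {
    "en": {"washing": {"cleaning"}, "cleaning": {"washing"}},
}

def expand_query(query, lang):
    """Drop stop words, then turn each remaining word into a set of acceptable alternatives."""
    stop = STOP_WORDS.get(lang, set())
    synonyms = SYNONYMS.get(lang, {})
    terms = []
    for word in query.lower().split():
        if word in stop:
            continue  # carries no meaning for matching in this language
        terms.append({word} | synonyms.get(word, set()))
    return terms

print(expand_query("the red cat", "en"))  # [{'red'}, {'cat'}]
print(expand_query("car washing", "en") == [{"car"}, {"washing", "cleaning"}])  # True
```

A document would then match if it contains at least one alternative from each set. Passing a different `lang` changes both tables, so "die" is dropped from a German query but kept in an English one.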
