Don't use n-gram in Elasticsearch and OpenSearch

The costs associated with Elasticsearch's n-gram tokenizer are poorly documented, and it is widely used with severe consequences for cluster cost and performance. In this post we go through the use cases where it is useful and suggest alternative, more efficient approaches.

Elasticsearch and OpenSearch are the de facto standard engines for text search, and both are often used as search engines over a corpus of text documents, whether for internal use or for serving external users.

Most searches are done using the default, built-in analyzers and tokenizers, which break the text into tokens. Those tokens are usually just words. For example, with the default analyzer, “The quick brown fox jumps over the lazy dog” becomes the list of words [the, quick, brown, fox, jumps, over, the, lazy, dog], each of which is then searchable. This is what we often call “full-text search”, that is, finding a document by a single word or a phrase that exists in it.
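You can see this for yourself with the _analyze API. Here is a minimal sketch; the endpoint and the standard analyzer are built into Elasticsearch, but the cluster address (localhost:9200) and the use of the Python requests library are assumptions you should adapt to your setup:

```python
# Minimal sketch: ask Elasticsearch how the standard analyzer
# tokenizes a sentence. Assumes a cluster at localhost:9200.
import requests

resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "analyzer": "standard",
        "text": "The quick brown fox jumps over the lazy dog",
    },
)
print([t["token"] for t in resp.json()["tokens"]])
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```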

For some use cases, more specialized analyzers are used which work a little differently: rather than emitting words in a human language, they produce some other output that is useful in other ways. The n-gram tokenizer, the subject of this post, is one of those; a sketch of its output follows.
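Here is a hedged sketch of what the built-in ngram tokenizer emits, using the same _analyze API with an inline tokenizer definition (again, the cluster address is an assumption):

```python
# Sketch: the "ngram" tokenizer emits every substring whose length is
# between min_gram and max_gram, so even a 3-letter word yields 3 tokens.
# Assumes a cluster at localhost:9200; adjust as needed.
import requests

resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "tokenizer": {"type": "ngram", "min_gram": 2, "max_gram": 3},
        "text": "fox",
    },
)
print([t["token"] for t in resp.json()["tokens"]])
# ['fo', 'fox', 'ox']
```

Every additional character in the input multiplies the number of emitted tokens, which is where the index-size and performance costs come from.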
