Nixiesearch: running Lucene over S3, and why we’re building a new serverless search engine

This is a long-read version of a Haystack EU24 conference talk by Roman Grebennikov. Check out the slides or watch the video [TODO] if you’re a Gen Z.

If you’re running a large search engine deployment, you already have a personal list of things that can go wrong on a daily basis. 

In this post we hypothesize about running Lucene-powered search in a stateless mode over an S3 block store, and ask why you would even need stateless search over a block store.

We then introduce Nixiesearch, a new stateless search engine, describe how we struggled to make it work nicely with S3, and explain how it all ended with RAG, ML inference and hybrid search.

Unlike regular back-end applications, search engines nowadays are considered special and require additional “like a database” handling. The prize wheel above summarises the author’s personal incident experience with Elasticsearch, OpenSearch and Solr, but other modern vector search engines such as Weaviate and Qdrant are not immune either.

Each of your search nodes is stateful and contains a tiny part of a distributed shared mutable index. And if you’ve ever read at least one book on systems design and engineering (such as “Designing Data-Intensive Applications” by M. Kleppmann), you know perfectly well that things aren’t going to be smooth sailing when there’s a “distributed shared mutable” in the name:
