State-of-the-art Query2Query Similarity

Submitted by
Style Pass
2022-09-23 21:30:10

TL;DR: We define the problem of web query similarity and train a transformer model to embed queries using web data. Our model significantly beats large OpenAI embedding models while being many orders of magnitude faster and cheaper.
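The core idea of query2query similarity is simple: embed each query as a vector and compare vectors. As a minimal sketch (toy hand-written vectors stand in for the transformer's output; the actual model and dimensions are not described here), cosine similarity is the standard comparison:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for transformer outputs.
q1 = np.array([0.9, 0.1, 0.3, 0.0])  # e.g. "cheap flights to nyc"
q2 = np.array([0.8, 0.2, 0.4, 0.1])  # e.g. "low cost tickets new york"
q3 = np.array([0.0, 0.9, 0.1, 0.8])  # e.g. "python list comprehension"

print(cosine_similarity(q1, q2))  # high: similar intent
print(cosine_similarity(q1, q3))  # low: unrelated intent
```

Queries with similar intent land close together in the embedding space, so their cosine similarity is high even when they share no surface terms.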

Every day, billions of people look for information and answers to questions using search engines in the form of queries. The diversity of these queries is mind-boggling; even for Google, despite its dominant market share, 15% of queries every day have never been seen before, and that fraction is much higher for Neeva.

Unsurprisingly, search engines do a lot of query processing (synonymy, spelling, lemmatization, term weighting, bigrams and compounds, and contextual rewriting) to understand user queries and to generate variant forms of them, which are then used to match relevant web pages. These query operations are crafted by mining signals from a variety of sources such as web page text, anchors, clicks, etc.
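To make these operations concrete, here is a toy sketch of a few of them; the lookup tables, function name, and output format are illustrative inventions, not Neeva's actual pipeline:

```python
# Tiny hand-written tables standing in for signals mined from web text,
# anchors, and clicks in a real search engine.
SPELLING = {"recipies": "recipes"}
SYNONYMS = {"nyc": ["new york"], "cheap": ["low cost"]}

def process_query(query: str) -> dict:
    """Illustrative query processing: normalize, correct spelling,
    expand synonyms, and extract bigrams."""
    terms = [SPELLING.get(t, t) for t in query.lower().split()]
    expansions = {t: SYNONYMS[t] for t in terms if t in SYNONYMS}
    bigrams = [" ".join(pair) for pair in zip(terms, terms[1:])]
    return {"terms": terms, "expansions": expansions, "bigrams": bigrams}

print(process_query("Cheap recipies NYC"))
# → {'terms': ['cheap', 'recipes', 'nyc'],
#    'expansions': {'cheap': ['low cost'], 'nyc': ['new york']},
#    'bigrams': ['cheap recipes', 'recipes nyc']}
```

A production system derives these tables and weights from massive behavioral data rather than hand-writing them, which is exactly the investment the next paragraph describes.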

For existing search engines, extracting these signals has taken billions of queries, millions of dollars in infrastructure, and hundreds of thousands of engineering hours. This is not an option for us at Neeva — we are small, scrappy, and want to make a lot of progress very quickly. Recent advancements in deep learning give us a unique opportunity to leapfrog many of these steps.
