
Turning my laptop into a Search Relevance Judge with local LLMs

Submitted by Style Pass, 2025-01-15 17:30:33

In this article, I’ll walk through how I set up a local LLM as a pairwise labeler that tells us which of two search results is more relevant for a query. This lets us put an LLM ‘clippy’📎 in the loop to compare two search relevance algorithms. I’ll push the labeler to be confident and precise, while letting it say “I don’t know”.

Importantly, I’ll use a local LLM - Qwen 2.5 from Alibaba. My example repo sets all of this up, letting you do it all for free.
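To make the idea concrete, here’s a minimal sketch of such a pairwise judge. It assumes a local OpenAI-compatible server (for example, Ollama serving Qwen 2.5 at `localhost:11434`); the endpoint, model name, and prompt wording are my illustrative assumptions, not the article repo’s exact code.

```python
import json
import urllib.request

# Illustrative prompt: ask for exactly one word so the reply is easy to parse,
# and give the model an explicit way to abstain ("Unsure").
PROMPT = """You are a search relevance judge.
Query: {query}

Result A: {a}
Result B: {b}

Which result is more relevant to the query?
Answer with exactly one word: A, B, or Unsure."""


def build_prompt(query: str, a: str, b: str) -> str:
    return PROMPT.format(query=query, a=a, b=b)


def parse_verdict(text: str):
    """Map the model's reply to 'A', 'B', or None (the "I don't know" case)."""
    tokens = text.strip().split()
    if not tokens:
        return None
    word = tokens[0].strip(".,!:").upper()
    # Anything other than a clear A/B -- including "Unsure" -- is an abstention.
    return word if word in ("A", "B") else None


def judge_pair(query: str, a: str, b: str,
               base_url: str = "http://localhost:11434/v1",  # assumed Ollama default
               model: str = "qwen2.5"):                      # assumed model tag
    """Ask the local LLM which of two results better matches the query."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": build_prompt(query, a, b)}],
        "temperature": 0.0,  # deterministic judging
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return parse_verdict(body["choices"][0]["message"]["content"])
```

Keeping the abstention path (`None`) explicit is the point: a labeler that can decline on close calls gives you higher-precision preference pairs than one forced to pick a side.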

Why use LLMs to label results? Well, our typical sources for labeling search results are pretty annoying. We have two primary sources:

Live humans - Asking our friends, colleagues, or outside firms to label results as relevant / not. Unfortunately, humans must be heavily coached to produce consistent ratings. It’s a lot of tedious work, and not always consistent.

Clickstream data - Mining clicks and conversions for labels using statistical models. To do this, we need a lot of traffic on one query. We further need to account for the most pernicious problem of all - presentation bias - that humans only click on what they can see.
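One common correction for presentation bias is “clicks over expected clicks” (COEC): divide a document’s clicks by the clicks you’d expect given the ranks it was shown at. This is a hedged sketch of that idea; the per-rank baseline CTRs below are made-up numbers for illustration, not real traffic data.

```python
# Hypothetical global click-through rates by display rank, estimated from
# aggregate traffic. Real systems would fit these from logs.
RANK_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}


def coec(impression_ranks, clicks):
    """Clicks over expected clicks for one (query, document) pair.

    impression_ranks: ranks at which the document was displayed, one per
                      impression. clicks: total clicks it received.
    A score > 1.0 means the document drew more clicks than its positions
    alone predict, i.e. it is likely genuinely relevant, not just visible.
    """
    # Expected clicks = sum of the baseline CTR at each rank shown.
    expected = sum(RANK_CTR.get(rank, 0.02) for rank in impression_ranks)
    return clicks / expected if expected > 0 else 0.0
```

The key design point is that a document shown twice at rank 1 must earn far more clicks than one shown twice at rank 5 to get the same score, which is exactly the “humans only click what they can see” correction.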
