
Coping with dumb LLMs using classic ML

submitted by
Style Pass
2025-01-22 09:30:03

In previous posts I used a local LLM to choose which of two products was more relevant for a search query (see this GitHub repo). Using an open e-commerce search dataset as a baseline (WANDS from Wayfair), I measured the LLM’s preference for a product and checked whether it matched human raters. If this works, I could use my laptop as a search relevance judge, guiding search quality tuning and iterations, without an expensive OpenAI bill.

My goal is not so much ‘perfect evaluation’ as a flag reliable enough to tell me when something looks amiss or promising. Given I’m using smaller LLMs, I want dumb decisions, not big complex ones.

Then, with all these responses, I could look at the human labels for these products and see whether the agent is any good at predicting relevance. If it is, I gain confidence to unleash it on any product it comes across, trusting the results based on how well the strategy does against human labels.
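To make that comparison concrete, here is a minimal sketch of scoring LLM preferences against human labels. The names (`preferred_by_humans`, `precision_and_coverage`), the tuple layout, and the numeric grades are my own assumptions for illustration, not the code from the repo. "Coverage" here is the share of pairs where the LLM committed to a side, which is what this post loosely calls recall.

```python
from typing import List, Optional, Tuple

def preferred_by_humans(lhs_grade: float, rhs_grade: float) -> Optional[str]:
    """Return which side the human raters prefer, or None on a tie."""
    if lhs_grade > rhs_grade:
        return "LHS"
    if rhs_grade > lhs_grade:
        return "RHS"
    return None

def precision_and_coverage(
    pairs: List[Tuple[str, float, float]]
) -> Tuple[float, float]:
    """pairs holds (llm_choice, lhs_grade, rhs_grade) tuples.

    llm_choice is "LHS", "RHS", or anything else (e.g. "Neither")
    when the LLM declined to pick a side.
    """
    decided = 0
    correct = 0
    for llm_choice, lhs_grade, rhs_grade in pairs:
        if llm_choice not in ("LHS", "RHS"):
            continue  # no decision: hurts coverage, not precision
        decided += 1
        if llm_choice == preferred_by_humans(lhs_grade, rhs_grade):
            correct += 1
    coverage = decided / len(pairs) if pairs else 0.0
    precision = correct / decided if decided else 0.0
    return precision, coverage
```

A decision on a pair the humans rated as a tie counts as wrong in this sketch. Whether to drop such pairs instead is a design choice I have not tried to match to the original experiments.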

Each of these tradeoffs (forcing a decision or not, double checking or not) improves precision but loses recall. So in the bottom right of the table for product name below, double checking AND allowing the LLM to say “Neither” leaves us with 11.9% of 1000 product pairs (119 pairs) where the LLM prefers one product over another for the query. Within that 11.9%, we correctly predicted whether LHS was more relevant 90.76% of the time. That’s up considerably from the upper left, where we get a decision on every pair but precision drops to 75.08%.
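As a rough illustration of the strictest setting, the sketch below assumes "double checking" means asking the same question twice with the two products swapped, and keeping the answer only when both passes agree. That reading, the `Judge` signature, and the function name `double_checked` are assumptions of mine, not the post's actual implementation.

```python
from typing import Callable

# Hypothetical judge signature: (query, lhs_product, rhs_product) -> answer,
# where answer is "LHS", "RHS", or "Neither".
Judge = Callable[[str, str, str], str]

def double_checked(judge: Judge, query: str, product_a: str, product_b: str) -> str:
    """Ask twice with the products swapped; keep only consistent answers.

    Returns "LHS" if product_a wins both passes, "RHS" if product_b wins
    both passes, and "Neither" otherwise.
    """
    first = judge(query, product_a, product_b)
    second = judge(query, product_b, product_a)
    # Translate the swapped answer back into the original orientation.
    second_unswapped = {"LHS": "RHS", "RHS": "LHS"}.get(second, "Neither")
    if first in ("LHS", "RHS") and first == second_unswapped:
        return first
    return "Neither"
```

A judge that always answers "LHS" regardless of content (pure position bias) gets filtered down to "Neither" on every pair. That is the mechanism by which this kind of check would trade recall for precision.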
