Earlier this month, in a pivotal shift that signals the evolving landscape of AI, Intercom bet big on Anthropic, replacing OpenAI in their newly launched Fin 2 chatbot. Fin 2 is powered by Anthropic’s Claude 3.5 Sonnet, a notable change, as the original Fin was powered by GPT-4.
An obvious question is: how big an impact could they expect to see in the quality of their product? We can’t know exactly what system Intercom has built, so when they say “With Claude, Fin answers more questions, more accurately, with more depth, and more speed”, we have no real technical context for that claim from the outside. Focusing on accuracy, can we find a way to empirically compare the models they likely used (GPT-4o or GPT-4o-mini) with the model they’ve moved to (Claude 3.5 Sonnet)? These tests were conducted prior to the October 22nd update of Claude 3.5 Sonnet, so they reflect the information Intercom had available when making the change.
To compare the potential performance of models, many teams go on vibes: having ad-hoc conversations or trying one or two ‘eval’ questions and ‘feeling’ good or bad about the model, prompt, and configuration they have chosen. We’ve been thinking about the problem a little differently. For any task, we want to prove that a model is both suitable and reliable. To show both, we need data, not vibes, at a scale that proves what we see isn’t a fluke.
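To make “data, not vibes” concrete, here is a minimal sketch of a repeated-trial comparison, not our actual harness. It runs the same labeled questions many times against each provider’s official Python SDK and tallies accuracy; the question set, trial count, and naive grading function are placeholders you would swap for your own eval data and a task-specific grader.

```python
# Minimal repeated-trial eval sketch (illustrative only, not a production harness).
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set in the environment.
from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

# Placeholder eval set: replace with real labeled questions for your task.
EVAL_SET = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "How many days are in a leap year?", "expected": "366"},
]

N_TRIALS = 20  # repeat each question to measure reliability, not one lucky answer


def ask_gpt4o(question: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


def ask_claude(question: str) -> str:
    msg = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20240620",  # the pre-October-22nd snapshot
        max_tokens=256,
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text


def graded(answer: str, expected: str) -> bool:
    # Naive containment check; a real harness needs a task-specific grader.
    return expected.lower() in answer.lower()


def accuracy(ask_fn) -> float:
    correct, total = 0, 0
    for item in EVAL_SET:
        for _ in range(N_TRIALS):
            total += 1
            if graded(ask_fn(item["question"]), item["expected"]):
                correct += 1
    return correct / total


if __name__ == "__main__":
    print(f"gpt-4o accuracy:            {accuracy(ask_gpt4o):.2%}")
    print(f"claude-3-5-sonnet accuracy: {accuracy(ask_claude):.2%}")
```

The point of the repeated trials is the scale: a single answer can look good or bad by chance, but accuracy measured over many runs of many questions is what separates a suitable, reliable model from a fluke.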