LLM agent evaluation — FUTURESEARCH

We evaluated, line by line, the traces of openai-o1, gpt-4o, claude-sonnet-3.5, and llama-405b, each paired with several agent architectures, on 8 real-world, white-collar tasks where we knew all the details of what it takes to succeed.

The result? In the words of one of our research scientists: “o1 is very impressive, but pretty moody… sometimes it works great and (almost) completely aces something all other agents struggle with, but often it's distinctly average.”

tl;dr: Take o1 for a small chance of going all the way, and Sonnet for more consistent performance. gpt-4o is not quite there, while llama-405b and gpt-4o-mini are barely useful as agents on tasks that require tool use and 10+ prompted decision points for updating their plans.

A score of 0.0 means no progress; 1.0 means the full set of necessary steps was completed. Each task involves an agent using ~30 prompts as well as tool use.
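
As an illustration only, here is a minimal sketch of how a 0.0–1.0 partial-progress score could be aggregated from per-step checks. The step names, weights, and the `partial_score` helper are hypothetical; the post does not publish FutureSearch's actual rubric.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    weight: float    # relative importance of this step in the task
    completed: bool  # did the agent accomplish it?

def partial_score(steps: list[Step]) -> float:
    """Weighted fraction of necessary steps completed, in [0.0, 1.0]."""
    total = sum(s.weight for s in steps)
    done = sum(s.weight for s in steps if s.completed)
    return done / total if total else 0.0

# Hypothetical rubric for a single task.
steps = [
    Step("found the right data source", 0.3, True),
    Step("extracted the relevant figures", 0.4, True),
    Step("synthesized a correct final answer", 0.3, False),
]
print(partial_score(steps))  # 0.7
```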

Our eval is unique in that it is done on "messy" white-collar tasks inspired by work for our customers, and in that we compute sophisticated partial scores for progress across the many things an agent can get right or wrong. The tasks were of the following nature:
