Don't Worry About the Vase

Submitted by Style Pass, 2024-10-24 15:30:05

Anthropic claims across-the-board improvements to Sonnet, and the model has a rather huge new ability accessible via the API: computer use. Nothing could possibly go wrong.

Claude Haiku 3.5 is also billed as a major step forward for smaller models. Anthropic says that on many evaluations it has now caught up to Opus 3.

Missing from this chart is o1. In some ways that would not be a fair comparison, since o1 uses so much inference compute, but it does greatly outperform everything here on the AIME and some other tasks.

METR: We conducted an independent pre-deployment assessment of the updated Claude 3.5 Sonnet model and will share our report soon.

We only have very early feedback so far, so it’s hard to tell how much the updated model, which I will be calling Claude 3.5.1, improves on Claude 3.5 in practice. It does seem to be a clear improvement. We also don’t know how far along they are with the new killer app: computer use, also known as handing your computer over to an AI agent.

Letting an LLM use a computer is super exciting. By which I mean both that the value proposition here is obvious, and also that it is terrifying and should scare the hell out of you on both the mundane level and the existential one. It’s weird for Anthropic to be the ones doing it first.
