
OK, I can partly explain the LLM chess weirdness now


We recently talked about a mystery: All large language models (LLMs) are terrible at chess. All, that is, except for gpt-3.5-turbo-instruct, which for some reason can play at an advanced amateur level. This is despite the fact that this model is more than a year old and much smaller than recent models. What’s going on?

Theory 4: There’s “competition” between different types of data, so for an LLM to play chess well, you need a large fraction of the data to be chess games.

Theory 7: Large enough base models are good at chess, but this doesn’t persist through instruction tuning to chat models. Dynomight, you are so bad for not suggesting this, how are you so dumb and bad?

Here, I’ll show that recent chat models can play chess quite well, as long as you’re willing to go through sufficiently extreme contortions to figure out how to prompt them. Then I’ll give my theory for what’s happening.
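To make the prompting idea concrete, here’s a minimal sketch of the simplest version of that kind of experiment, assuming the OpenAI Python client: give a chat model the game so far in algebraic notation and ask for just the next move. The model name and prompt wording are placeholders for illustration, not the exact contortions described in this post.

```python
# Minimal sketch: ask a chat model to continue a chess game given in
# standard algebraic notation and reply with only the next move.
# The model name and prompt wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MOVES_SO_FAR = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder chat model
    temperature=0,
    messages=[
        {
            "role": "system",
            "content": (
                "You will be given the moves of a chess game in standard "
                "algebraic notation. Reply with Black's next move only, "
                "in standard algebraic notation, with no commentary."
            ),
        },
        {"role": "user", "content": MOVES_SO_FAR},
    ],
)

print(response.choices[0].message.content.strip())  # e.g. "Be7"
```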

I was astonished to find that half the internet is convinced that OpenAI is cheating. Many, many people suggested that there must be some special case in gpt-3.5-turbo-instruct that recognizes chess notation and calls out to an external chess engine.
