Every other day a new language model drops and I give it this chess puzzle to see if it spits out the correct answer, just for lulz 🤣 I have yet to receive a correct answer in one shot. Maybe that also has something to do with my terrible prompt-engineering skills.
Today I tried guiding DeepSeek-R1 to the eventual result: from wrong FEN parsing, to illegal moves, and eventually to just handing it the answer and asking for the reasoning. A lot of ink has already been spilled about LLMs not being able to play chess properly. That did not stop folks from organizing an LLM chess championship this month, where bots ate their own pieces, revived dead knights, and forgot the board position entirely.
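(If you want to sanity-check the model's replies rather than eyeball them, a few lines with the python-chess library are enough to catch both failure modes: a mangled FEN and an outright illegal move. This is just a sketch; the FEN and the suggested move below are placeholders, not the actual puzzle.)

```python
import chess

# Placeholder inputs, not the actual puzzle: a trivial K+P vs K position
# and a move the model might have suggested (in SAN).
fen = "8/8/8/8/8/4k3/4P3/4K3 w - - 0 1"
llm_move = "Kf1"

try:
    board = chess.Board(fen)  # raises ValueError if the FEN itself is malformed
except ValueError as err:
    print(f"Bad FEN: {err}")
else:
    try:
        move = board.parse_san(llm_move)  # raises if the move is illegal in this position
        print(f"{llm_move} is legal here ({move.uci()})")
    except ValueError as err:
        print(f"Illegal move: {err}")
```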
But why this specific puzzle? First of all, the next move seems intuitively easy for an average chess player. It also has only five pieces: chess endgames with five pieces are solved, and assuming perfect play, the results can be stored for quick lookup (also called a tablebase). The optimal data structure for up to 5 pieces takes less than 1GB of space; it shoots up to ~16TB for 7 pieces.
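(For a sense of what that lookup looks like in practice, here is a rough sketch using python-chess's Syzygy probing. It assumes you have already downloaded the tablebase files into a local ./syzygy directory, and the position is again a placeholder, not the puzzle.)

```python
import chess
import chess.syzygy

# Placeholder 3-piece position (covered by the 5-piece tables), not the puzzle from this post.
board = chess.Board("8/8/8/8/8/2k5/2P5/2K5 w - - 0 1")

# Assumes the Syzygy files were downloaded into ./syzygy beforehand.
with chess.syzygy.open_tablebase("./syzygy") as tablebase:
    wdl = tablebase.probe_wdl(board)  # result for the side to move: 2 win, 0 draw, -2 loss (±1 = cursed/blessed)
    dtz = tablebase.probe_dtz(board)  # plies until the next capture or pawn move on the winning line
    print(f"WDL: {wdl}, DTZ: {dtz}")
```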
This also removes the need for an actual tree search with pruning, à la Stockfish, which would be ridiculous to ask of a mere LLM, though maybe not for the current position. The puzzle also features the uncommon motif of under-promotion: you lose by promoting the pawn to the most powerful piece, the queen. Even with knowledge of under-promotion and tablebases, the LLM would still need to know the 50-move draw rule of practical chess play to decide on the result. I have personally been irritated by opponents who refuse a draw and insist on playing out the 50 moves in an even position with no progress.
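(Putting the two ideas together, with the same caveats as before: python-chess, a made-up position with a pawn on the seventh rank, and Syzygy files assumed to live in ./syzygy. You can list the four promotion choices and let the tablebase say which ones actually win, with a WDL of ±1 flagging "wins" that evaporate under the 50-move rule.)

```python
import chess
import chess.syzygy

# Placeholder position with a white pawn about to promote, NOT the puzzle from this post.
board = chess.Board("8/2P5/8/8/8/8/2k5/K7 w - - 0 1")

# Assumes the Syzygy files are in ./syzygy, as in the earlier sketch.
with chess.syzygy.open_tablebase("./syzygy") as tablebase:
    for move in board.legal_moves:
        if move.promotion is None:
            continue  # only look at the four promotion choices: queen, rook, bishop, knight
        board.push(move)
        # probe_wdl reports from the side to move (now the opponent), so negate it.
        wdl = -tablebase.probe_wdl(board)
        board.pop()
        # 2 = win, 1 = "cursed" win (forfeited by the 50-move rule), 0 = draw, negative = losing.
        print(f"promote to {chess.piece_name(move.promotion)}: wdl={wdl}")
```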