‘The Password Game’ is a Solid Benchmark for Multimodal Agents

submitted by
Style Pass
2025-01-22 23:00:03

The Password Game is a Kafkaesque masterpiece of absurdity. The goal is simple: create a password that satisfies all the rules, which get increasingly obtuse.

The best way to get a feel for it is simply to try it yourself. There are insane videos of people getting through all 35 rules. Some of the absurd tasks include solving captchas, playing chess, geoguessing, and my favourite: keeping Paul 🐔 alive.

2025 seems to be the year of Agents™️, even though no one really agrees on what they are. With all the hype around agents, it’s hard to know what these things can and can’t actually do.

The Password Game makes a good test here: you can observe in real time where multimodal agents stumble and what reasoning led them there. And since the rules increase incrementally in difficulty, we can simply keep track of how far an agent gets in the game to measure performance.
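To make the "how far did it get" metric concrete, here is a minimal sketch of how runs could be scored and ranked. The names RunResult, score, and rank are hypothetical helpers invented for illustration, and the agent names in the usage lines are placeholders, not real results. The only figure taken from the post is the total of 35 rules.

```python
from dataclasses import dataclass

TOTAL_RULES = 35  # the post mentions 35 rules in total


@dataclass
class RunResult:
    """Outcome of one agent run (hypothetical structure, for illustration)."""
    agent_name: str
    rules_passed: int  # highest rule number satisfied before the agent stalled

    def __post_init__(self) -> None:
        if not 0 <= self.rules_passed <= TOTAL_RULES:
            raise ValueError(f"rules_passed must be between 0 and {TOTAL_RULES}")


def score(result: RunResult) -> float:
    """Fraction of the game completed, from 0.0 to 1.0."""
    return result.rules_passed / TOTAL_RULES


def rank(results: list[RunResult]) -> list[RunResult]:
    """Order runs from furthest progress to least."""
    return sorted(results, key=lambda r: r.rules_passed, reverse=True)


if __name__ == "__main__":
    # Placeholder numbers only; these are not measurements.
    runs = [RunResult("agent-a", 5), RunResult("agent-b", 12)]
    for r in rank(runs):
        print(f"{r.agent_name}: rule {r.rules_passed}/{TOTAL_RULES} ({score(r):.0%})")
```

A single number like this hides why an agent failed, so it is probably best read alongside the reasoning traces rather than instead of them.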

A recent library, Browser Use, has made setting this all up pretty easy. It takes screenshots of the browser and highlights the on-screen elements that the model can then interact with via tool use. It takes care of all the heavy lifting; you just have to define a task for the LLM. Here’s what the LLMs “see”:
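As for defining the task itself, a rough sketch might look like the following. It assumes the Agent(task=..., llm=...) interface and the LangChain ChatOpenAI wrapper shown in the library's quickstart around the time of the post; the API may have changed since, so treat this as illustrative rather than as the author's actual setup. The task wording, the game URL, the model name, and the run_password_game helper are all assumptions made for this example.

```python
import asyncio

# Assumed task prompt; the wording is illustrative, not taken from the post.
TASK = (
    "Go to https://neal.fun/password-game/ and play The Password Game. "
    "Type a password that satisfies every rule shown on the page, "
    "updating it as new rules appear. Get as far as you can."
)


async def run_password_game(model_name: str = "gpt-4o"):
    """Run one attempt at the game and return the agent's history."""
    # Imported here so the module still loads when the optional
    # dependencies (browser-use, langchain-openai) are not installed.
    from browser_use import Agent
    from langchain_openai import ChatOpenAI

    # Assumed interface: Agent takes a natural-language task and an LLM.
    agent = Agent(task=TASK, llm=ChatOpenAI(model=model_name))
    return await agent.run()


if __name__ == "__main__":
    # Requires an OpenAI API key in the environment and a local browser.
    asyncio.run(run_password_game())
```

Keeping the task as a plain string makes it easy to swap models while holding the instructions fixed, which matters if the goal is to compare agents on the same footing.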
