
Submitted by Style Pass, 2024-10-27 23:00:04

Set your GITHUB_TOKEN environment variable to a GitHub PAT (it doesn't need any permissions). If you have gh installed, you can use the token from gh auth token.
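For example, in a shell session you might set it up like this (assuming the gh CLI is already installed and authenticated via gh auth login; the placeholder PAT is hypothetical):

```shell
# Reuse the token from an authenticated gh CLI session;
# fall back to an empty value if gh isn't available.
export GITHUB_TOKEN="$(gh auth token 2>/dev/null || true)"

# Alternatively, paste a personal access token directly
# (no scopes/permissions are required):
# export GITHUB_TOKEN="<your-PAT>"
```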

AI models are often evaluated against benchmarks or with direct human voting (e.g. LMSYS/Chatbot Arena). Benchmarks have many known issues (leakage into training data, rewarding mostly-right answers, etc.), and human voting is biased toward longer and more impressive-sounding answers. Many of the most informed people judge models on vibes, or "big model smell". There's been some recent work on putting models in a simulated space (e.g. a Minecraft build-off here) to get a sense of their creativity and ability to construct a large or complex project, but that work is still very early.

I thought it'd be interesting to evaluate models based on their competition with each other in a simulated space: purely adversarial.

Disclaimer: I work on GitHub Models at GitHub, but this isn't a formal GitHub project and isn't affiliated with GitHub in any way. I built it over a weekend because I thought it was a neat idea.
