OpenBench provides standardized, reproducible benchmarking for LLMs across 20+ evaluation suites spanning knowledge, reasoning, coding, and mathematics. Works with any model provider - Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, and more.
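For example, the same benchmark can typically be pointed at different providers just by changing the model identifier. The commands below are a sketch of a typical invocation; the benchmark names, model identifiers, placeholder API keys, and exact flag syntax are assumptions about the CLI, so consult the project documentation for the version you have installed.

```bash
# Provider API keys are read from the environment (placeholder values shown).
export GROQ_API_KEY=gsk_...
export OPENAI_API_KEY=sk-...

# Assumed syntax: bench eval <benchmark> --model <provider>/<model>
bench eval mmlu --model groq/llama-3.3-70b-versatile
bench eval mmlu --model openai/gpt-4o-mini
```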

We built OpenBench because we needed evaluation code that was easy to understand, modify, and trust. It's a curated set of benchmarks built on Inspect AI's excellent foundation.

Some evaluations may require logging into HuggingFace to download the dataset. If bench prompts you to do so, or throws "gated" errors, defining the relevant HuggingFace token environment variable with a valid access token should resolve the issue.
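A minimal sketch, assuming the token is picked up from the standard HF_TOKEN environment variable used by the HuggingFace client libraries (the token value and benchmark name below are placeholders):

```bash
# Authenticate to HuggingFace so gated datasets can be downloaded.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Then re-run the evaluation that raised the "gated" error.
bench eval <gated-benchmark> --model <provider>/<model>
```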

As the authors of OpenBench, we strive to implement this tool's evaluations as faithfully as possible to the original benchmarks. Even so, you may observe numerical discrepancies between OpenBench's scores and scores reported from other sources.
