OpenBench provides standardized, reproducible benchmarking for LLMs across 20+ evaluation suites spanning knowledge, reasoning, coding, and mathematics. Works with any model provider - Groq, OpenAI, Anthropic, Cohere, Google, AWS Bedrock, Azure, local models via Ollama, and more.
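For example, the same benchmark can typically be pointed at different providers just by changing the model identifier. The commands below are a sketch of a typical invocation; the benchmark names, model identifiers, placeholder API keys, and exact flag syntax are assumptions about the CLI, so consult the project documentation for the version you have installed.

```bash
# Provider API keys are read from the environment (placeholder values shown).
export GROQ_API_KEY=gsk_...
export OPENAI_API_KEY=sk-...

# Assumed syntax: bench eval <benchmark> --model <provider>/<model>
bench eval mmlu --model groq/llama-3.3-70b-versatile
bench eval mmlu --model openai/gpt-4o-mini
```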

We built OpenBench because we needed evaluation code that was easy to understand, modify, and trust. It's a curated set of benchmarks built on Inspect AI's excellent foundation.

Some evaluations may require logging into HuggingFace to download the dataset. If bench prompts you to do so, or throws "gated" errors, defining the relevant HuggingFace token environment variable with a valid access token should resolve the issue.
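A minimal sketch, assuming the token is picked up from the standard HF_TOKEN environment variable used by the HuggingFace client libraries (the token value and benchmark name below are placeholders):

```bash
# Authenticate to HuggingFace so gated datasets can be downloaded.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Then re-run the evaluation that raised the "gated" error.
bench eval <gated-benchmark> --model <provider>/<model>
```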

As the authors of OpenBench, we strive to implement this tool's evaluations as faithfully as possible to the original benchmarks. Even so, you may observe numerical discrepancies between OpenBench's scores and scores reported from other sources.
