OpenAI Evals

open-source Free Star18k

Framework for evaluating LLMs and LLM systems with an open-source registry of benchmarks

python

18.6K GitHub Stars

3K Forks

100% Open Source

Overview

OpenAI Evals is a framework for evaluating large language models (LLMs) or systems built using LLMs. It provides an existing registry of evals to test different dimensions of OpenAI models and the ability to write custom evals for specific use cases. You can also use your data to build private evals representing common LLM patterns in your workflow without exposing any data publicly. As OpenAI's President Greg Brockman noted, "evals is surprisingly often the bottleneck on progress" - making high-quality evals one of the most impactful investments for teams building with LLMs.

The Verdict

Who Should Use OpenAI Evals?

Best For

Teams heavily invested in OpenAI models
Researchers comparing model versions
Building reproducible evaluation benchmarks
Contributing evals to OpenAI's public registry
Academic-style evaluation workflows

Not Ideal For

Multi-provider LLM evaluation (try Promptfoo)
Production observability needs (try LangSmith)
Teams wanting a UI-first experience
Real-time monitoring and alerting
RAG-specific evaluation (try RAGAS)

What's Great

Official OpenAI framework with active maintenance
Extensive pre-built eval registry (safety, reasoning, math)
Model-graded evaluations for open-ended questions
No coding required for basic evals (YAML + JSON)
Completion Functions protocol for custom pipelines
Snowflake integration for logging results
Dashboard integration via OpenAI Platform

GitHub README · OpenAI Docs

Watch Out For

Primarily designed for OpenAI models
Limited UI - mostly CLI-based workflow
Not accepting custom code contributions currently
Requires Git-LFS for downloading full eval registry
API costs apply when running evaluations
Less active development than some alternatives

GitHub Issues · Build Eval Docs

Pricing

Open Source

Free

Full framework, all features included

OpenAI Dashboard

API Costs

Run evals directly in OpenAI Platform

Weights & Biases

W&B Pricing

Alternative integration for running evals

View all features & details

Eval Templates

Match - Exact string matching
Includes - Substring matching
FuzzyMatch - Flexible matching
JsonMatch - JSON structure comparison
ModelBasedClassify - LLM-graded evals
Fact - Factual consistency checking
ClosedQA - Question answering rubrics
Battle - Head-to-head comparisons

Key Features

YAML-based eval configuration
JSONL dataset format support
Completion Functions protocol
Chain-of-thought evaluation modes
Meta-evals for quality assurance
Registry system for sharing evals
Private eval support
Snowflake database logging

Integrations

OpenAI Platform Dashboard
Weights & Biases
LangChain LLMs
Snowflake Database
HuggingFace Hub
Git-LFS for data storage
Pre-commit hooks

Requirements

Python 3.9+
OpenAI API key
Git-LFS (for full registry)
pip install evals

CLI Commands

oaieval - Run evaluations
oaieval gpt-4 [eval_name]
Custom registry paths
Completion function selection

How It Compares

Feature	OpenAI Evals	Promptfoo	RAGAS	DeepEval
Primary Focus	OpenAI models	Multi-provider	RAG systems	General LLM
Interface	CLI + Dashboard	CLI + Web UI	Python API	Python API
Eval Registry	Large public registry	Custom only	RAG-focused	Built-in metrics
Model-Graded	Yes, extensive	Yes	Yes	Yes
Open Source	Yes	Yes	Yes	Yes
No-Code Setup	YAML + JSON	YAML config	Code required	Code required
Production Use	Research-focused	Production-ready	Production-ready	Production-ready
Learning Curve	Moderate	Low	Low	Low

Getting Started

# Install via pip
pip install evals

# Or clone for development
git clone https://github.com/openai/evals
cd evals
pip install -e .

# Download eval registry data
git lfs fetch --all
git lfs pull

# Run an eval
export OPENAI_API_KEY=your-key
oaieval gpt-4 test-match

User Reviews

Loading reviews...