Inspect AI


Open-source framework for large language model evaluation developed by the UK AI Safety Institute


2.5K+ GitHub Stars

50+ Built-in Benchmarks

UK AISI Government-Backed

Overview

Inspect AI is an open-source framework for evaluating large language models, developed and maintained by the UK AI Safety Institute (AISI). It provides a Python toolkit for creating, running, and analyzing evaluations, with built-in support for popular benchmarks such as MMLU, GSM8K, HellaSwag, and ARC. The framework emphasizes reproducibility, extensibility, and safety-focused evaluation, with features such as sandboxed code execution, multi-turn agent tasks, and detailed scoring mechanisms. Inspect supports the major model providers, including OpenAI, Anthropic, Google, and Mistral, and runs local models via Ollama or HuggingFace.

The Verdict

Who Should Use Inspect AI?

Best For

AI safety researchers and red teamers
Teams needing reproducible evaluations
Organizations building custom benchmarks
Multi-model comparison studies
Government and compliance-focused teams

Not Ideal For

Simple prompt testing (use Promptfoo)
RAG-specific evals (try RAGAS)
Non-Python teams (Python-only)
CI/CD-first workflows (less integrated)

What's Great

Government-backed with strong safety focus
50+ built-in benchmark implementations
Sandboxed code execution for agent evals
Multi-turn conversation support
Extensible solver and scorer architecture
Provider-agnostic model support
Detailed logging and analysis tools


Watch Out For

Python-only (no JavaScript/TypeScript SDK)
Steeper learning curve than simpler tools
Less CI/CD integration out of the box
Smaller community than commercial tools
Documentation still maturing


Pricing

Open Source: MIT License, full features, unlimited use
Community (Free): GitHub support, documentation


Core Features

Declarative task definitions
Built-in benchmark suite (MMLU, GSM8K, etc.)
Multi-turn agent evaluations
Sandboxed code execution
Custom solver chains
Model-graded scoring
Detailed logging and replay
Parallel evaluation runs
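
To make the task, solver chain, and scorer ideas above concrete, here is a minimal plain-Python sketch of that flow. It deliberately does not use the Inspect API: every name in it (`EvalSample`, `EvalState`, `add_instruction`, `generate_with`, `exact_match`, `run_task`, `toy_model`) is a hypothetical stand-in invented for illustration, and the "model" is a hard-coded function rather than a real provider call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalSample:
    """One dataset row: the prompt and the expected answer (hypothetical type)."""
    input: str
    target: str


@dataclass
class EvalState:
    """What flows through the solver chain (hypothetical type)."""
    prompt: str
    output: str = ""


Solver = Callable[[EvalState], EvalState]
Scorer = Callable[[EvalState, str], float]


def add_instruction(text: str) -> Solver:
    """Solver that prepends an instruction to the prompt."""
    def solve(state: EvalState) -> EvalState:
        return EvalState(prompt=f"{text}\n{state.prompt}", output=state.output)
    return solve


def generate_with(model: Callable[[str], str]) -> Solver:
    """Solver that calls the model and stores its answer."""
    def solve(state: EvalState) -> EvalState:
        return EvalState(prompt=state.prompt, output=model(state.prompt))
    return solve


def exact_match(state: EvalState, target: str) -> float:
    """Scorer: 1.0 if the trimmed output equals the target, else 0.0."""
    return 1.0 if state.output.strip() == target else 0.0


def run_task(dataset: List[EvalSample], solvers: List[Solver], scorer: Scorer) -> float:
    """Run every sample through the solver chain and return mean accuracy."""
    scores = []
    for sample in dataset:
        state = EvalState(prompt=sample.input)
        for solver in solvers:
            state = solver(state)
        scores.append(scorer(state, sample.target))
    return sum(scores) / len(scores) if scores else 0.0


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: it only knows one answer."""
    return "4" if "2 + 2" in prompt else "unknown"


dataset = [EvalSample("What is 2 + 2?", "4"), EvalSample("What is 3 + 5?", "8")]
solvers = [add_instruction("Answer with just the number."), generate_with(toy_model)]
accuracy = run_task(dataset, solvers, exact_match)
print(f"accuracy: {accuracy}")
```

Keeping the dataset, the solver chain, and the scorer as separate pieces is what lets one of them be swapped (for example, exact match replaced by a model-graded scorer) without touching the rest of the task. Consult the official Inspect documentation for the framework's actual classes and signatures.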

Model Providers

OpenAI (GPT-4, GPT-4o)
Anthropic (Claude 3.5, Claude 4)
Google (Gemini Pro, Ultra)
Mistral AI
Azure OpenAI
AWS Bedrock
Ollama (local models)
HuggingFace Transformers

Built-in Benchmarks

MMLU (Massive Multitask)
GSM8K (Math reasoning)
HellaSwag (Commonsense)
ARC (Reasoning)
TruthfulQA
HumanEval (Code)
GPQA (Graduate-level QA)
SWE-bench (Software engineering)

Platforms

Python 3.10+
pip / conda install
Docker support
VS Code extension
CLI interface
Jupyter notebooks

How It Compares

Feature	Inspect AI	Promptfoo	DeepEval	OpenAI Evals
Open Source	Yes (MIT)	Yes (MIT)	Yes	Yes
Language	Python	JS/TS	Python	Python
Built-in Benchmarks	50+	10+	15+	20+
Agent Evals	Multi-turn + sandbox	Basic	Basic	Limited
Safety Focus	UK AISI backed	General	General	OpenAI focus
CI/CD Integration	Manual	Native	Pytest	Basic
Model Providers	All major	All major	All major	OpenAI-first
Best For	Safety research	CI/CD testing	Unit tests	OpenAI apps
