Promptfoo
Open-source CLI and library for LLM evaluation and red teaming. Enables systematic prompt testing, model comparison, vulnerability scanning, and automated security assessments with CI/CD integration.
Overview
Promptfoo is an open-source CLI and library for evaluating, testing, and red teaming LLM applications. It lets developers systematically test prompts against datasets, compare model outputs side by side, and run automated security assessments, including vulnerability scanning and adversarial attack simulations. Built with a developer-first approach, it supports declarative YAML configs and concurrent evaluation, and it integrates directly into CI/CD pipelines. Promptfoo is now part of OpenAI but remains open source under the MIT license, and it is used by engineering teams at major tech companies, including OpenAI and Anthropic.
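As a rough illustration of the declarative approach, the sketch below mirrors the top-level shape of an evaluation config (prompts, providers, tests) as a TypeScript object and expands it into the grid of cases that a side-by-side comparison covers. The `EvalConfig` type, the `render` and `expandMatrix` helpers, and the provider IDs are illustrative assumptions, not Promptfoo's own API; the real tool reads a comparable structure from a YAML file.

```typescript
// Hypothetical types mirroring the top-level keys of a declarative eval config.
interface TestCase {
  vars: Record<string, string>;
}

interface EvalConfig {
  prompts: string[];
  providers: string[];
  tests: TestCase[];
}

interface EvalCell {
  provider: string;
  renderedPrompt: string;
}

// Fill {{name}} placeholders in a prompt template with the test's variables.
// Unknown placeholders are left untouched.
function render(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) => {
    const value = vars[name];
    return value !== undefined ? value : match;
  });
}

// Expand the config into one cell per (prompt, provider, test) combination.
function expandMatrix(config: EvalConfig): EvalCell[] {
  const cells: EvalCell[] = [];
  for (const prompt of config.prompts) {
    for (const provider of config.providers) {
      for (const test of config.tests) {
        cells.push({ provider, renderedPrompt: render(prompt, test.vars) });
      }
    }
  }
  return cells;
}

// Example provider IDs are placeholders chosen for illustration.
const config: EvalConfig = {
  prompts: ["Summarize in one sentence: {{text}}", "TL;DR: {{text}}"],
  providers: ["openai:gpt-4o-mini", "ollama:llama3"],
  tests: [
    { vars: { text: "Promptfoo is a CLI for LLM evals." } },
    { vars: { text: "YAML is a data format." } },
  ],
};

const cells = expandMatrix(config);
console.log(`${cells.length} cases to evaluate`);
```

Two prompts, two providers, and two test cases produce eight cases, which is why adding a model or a prompt variant to the config multiplies the comparison rather than requiring new test code.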
The Verdict
Who Should Use Promptfoo?
Best For
- Teams needing systematic prompt evaluation workflows
- Security-conscious organizations requiring red teaming
- CI/CD-driven development with automated LLM testing
- Multi-model comparison and selection
- Enterprise security teams and CISOs
Not Ideal For
- Simple single-prompt applications
- Teams without Node.js in their stack
- Teams needing production observability (use a dedicated tool instead)
- Non-technical users wanting GUI-only workflows
What's Great
- Comprehensive security testing: 50+ vulnerability types covered
- Works with any LLM provider: OpenAI, Anthropic, Azure, local models
- Declarative YAML configs for reproducible evaluations
- CI/CD native: GitHub Actions and CLI integration
- Local execution: evaluations run on your machine rather than on a hosted service
- Side-by-side model comparison with matrix views
- MIT licensed, fully open source
- Backed by OpenAI with active development
Watch Out For
- Requires a Node.js 20.20+ or 22.22+ environment
- CLI-focused: the web UI is secondary
- Learning curve for YAML config syntax
- Enterprise features require sales contact
Pricing
Evaluation Features
- Declarative YAML test configs
- Matrix view comparisons
- Custom scoring metrics
- Concurrent execution
- Live reload and caching
- Web viewer for results
- Team sharing capabilities
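A custom scoring metric can be as small as a function that maps a model output to a score. The sketch below is a self-contained keyword-coverage scorer. The result shape (`pass`, `score`, `reason`) is modeled on what Promptfoo's JavaScript assertions are understood to return, but the `GradingResult` type and the `keywordCoverage` function are declared locally as assumptions, so check the current documentation before wiring anything like this into a config.

```typescript
// Locally declared result shape (an assumption modeled on pass/score/reason).
interface GradingResult {
  pass: boolean;
  score: number;
  reason: string;
}

// Score an output by the fraction of required keywords it mentions
// (case-insensitive substring match). An empty keyword list scores 1.
function keywordCoverage(
  output: string,
  keywords: string[],
  threshold = 0.5,
): GradingResult {
  const text = output.toLowerCase();
  const hits = keywords.filter((k) => text.includes(k.toLowerCase()));
  const score = keywords.length === 0 ? 1 : hits.length / keywords.length;
  return {
    pass: score >= threshold,
    score,
    reason: `Matched ${hits.length} of ${keywords.length} keywords`,
  };
}

const result = keywordCoverage(
  "Promptfoo runs evals from a YAML config.",
  ["yaml", "evals", "dashboard", "tracing"],
);
console.log(result);
```

Returning a numeric score alongside the pass/fail flag keeps the metric usable both as a hard gate in CI and as a column for ranking models against each other.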
Security & Red Teaming
- Prompt injection detection
- Jailbreak testing
- Data leak scanning
- Business rule violations
- Compliance risk assessment
- Real-time guardrails
- Code scanning in IDEs/CI
Supported Providers
- OpenAI / Azure OpenAI
- Anthropic Claude
- Google (Gemini)
- Amazon Bedrock
- HuggingFace
- Ollama (local models)
- Custom API endpoints
Platform & Installation
- npm / npx (primary)
- Homebrew (brew install promptfoo)
- pip (pip install promptfoo)
- TypeScript core (about 97% of the codebase)
- Node.js 20.20+ or 22.22+
- MIT License
How It Compares
| Feature | Promptfoo | Langfuse Evals | Braintrust | Weights & Biases |
|---|---|---|---|---|
| Open Source | MIT License | MIT License (core) | No | No |
| Red Teaming | 50+ vulnerability types | Basic | Limited | No |
| CI/CD Native | Yes | Via API | Via API | Via API |
| Local Execution | Yes | Self-host | No | No |
| Free Tier | Full features | 50K observations | Limited | Limited |
| Multi-Provider | All major + custom | All major | All major | All major |
| Primary Focus | Eval + Security | Observability | Evals | ML Ops |
| Best For | Security-first teams | Full observability | Data teams | ML workflows |