W&B Weave
Open source LLM observability and evaluation toolkit from Weights & Biases. Trace, evaluate, and monitor AI applications from experimentation to production with a single line of code.
Overview
W&B Weave is an open source observability and evaluation toolkit that helps developers trace, evaluate, and monitor LLM applications from experimentation to production. Adding a single line of code, the @weave.op decorator, to a function automatically logs its inputs, outputs, and metadata on every call, and Weave organizes those calls into navigable trace trees for debugging complex agentic workflows. Weave reached General Availability in December 2024 and is now part of CoreWeave, following CoreWeave's acquisition of Weights & Biases in March 2025 at a $1.7B valuation. It supports both Python (3.10+) and TypeScript/JavaScript environments, with native integrations for 15+ LLM providers including OpenAI, Anthropic, and Google, plus 19+ frameworks including LangChain, LlamaIndex, Claude Agent SDK, and CrewAI.
The Verdict
Who Should Use W&B Weave?
Best For
- Teams already using W&B for ML experiment tracking
- Production agent debugging and root cause analysis
- Multi-agent system observability
- Teams wanting open source with enterprise backing
- Evaluation-heavy workflows with LLM-as-judge
Not Ideal For
- Teams wanting a fully self-hosted setup (a W&B account is required)
- LangChain-only shops (LangSmith more native)
- Simple single-model apps (overkill)
- Teams avoiding vendor ecosystems
What's Great
- Open source (Apache 2.0) with enterprise support
- Single-line integration via @weave.op decorator
- Native multi-agent trace trees with session/turn organization
- Built-in scorers for safety (toxicity, PII, hallucinations)
- Run evaluations on live production traces
- Automatic code/dataset/scorer versioning
- 35+ integrations including Claude Agent SDK
Watch Out For
- Requires a W&B account; can't run fully standalone
- Younger than competitors (GA December 2024)
- Less community content than Langfuse/LangSmith
- Python 3.10+ required (no 3.9 support)
- Advanced features tied to W&B enterprise tiers
Pricing
Tracing
- @weave.op decorator for automatic logging
- Nested trace trees with session organization
- Multi-agent turn tracking
- Input/output/metadata capture
- Cost and latency tracking
- Code versioning per trace
Evaluation
- LLM-as-judge scorers
- Safety scorers (toxicity, bias, PII, hallucinations)
- Quality scorers (coherence, fluency, relevance)
- Custom scorer support
- Human/expert feedback collection
- Production trace evaluation
- Side-by-side comparison
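The evaluation workflow listed above (custom scorers, safety checks, side-by-side comparison) can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not Weave's evaluation API: the dataset, both "models", and the scorer names are made up, and the rule-based scorers stand in for what would typically be an LLM-as-judge call or a built-in scorer.

```python
import re
from statistics import mean

# Made-up evaluation rows; a real dataset would be far larger.
DATASET = [
    {"question": "Capital of France?", "expected": "Paris"},
    {"question": "2 + 2?", "expected": "4"},
    {"question": "Support contact?", "expected": "See the help page"},
]

# Deliberately simple email pattern, used as a toy PII check.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def exact_match(output: str, row: dict) -> float:
    """Quality scorer: 1.0 if the output equals the expected answer (case-insensitive)."""
    return float(output.strip().lower() == row["expected"].strip().lower())


def no_email_leak(output: str, row: dict) -> float:
    """Safety scorer: 1.0 if the output contains no email address."""
    return float(EMAIL_RE.search(output) is None)


SCORERS = {"exact_match": exact_match, "no_email_leak": no_email_leak}


def model_a(question: str) -> str:
    # Hypothetical model: accurate on facts, but leaks an email address.
    canned = {"Capital of France?": "Paris", "2 + 2?": "4"}
    return canned.get(question, "Email jane.doe@example.com")


def model_b(question: str) -> str:
    # Hypothetical model: one arithmetic mistake, no PII leak.
    canned = {"Capital of France?": "paris", "2 + 2?": "5"}
    return canned.get(question, "See the help page")


def evaluate(model) -> dict:
    """Run every scorer on every row and return the mean score per scorer."""
    results = {name: [] for name in SCORERS}
    for row in DATASET:
        output = model(row["question"])
        for name, scorer in SCORERS.items():
            results[name].append(scorer(output, row))
    return {name: mean(values) for name, values in results.items()}


# Side-by-side comparison of the two models on the same dataset and scorers.
report = {m.__name__: evaluate(m) for m in (model_a, model_b)}
for model_name, scores in report.items():
    print(model_name, scores)
```

Keeping the scorers independent of the model is what allows the same checks to be reused for offline comparisons and for scoring traces captured in production.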
LLM Providers
- OpenAI
- Anthropic
- Google AI
- Amazon Bedrock
- Azure OpenAI
- Cohere, Groq, Mistral
- LiteLLM (unified interface)
- Local models
Framework Integrations
- LangChain / LangGraph
- LlamaIndex
- Claude Agent SDK
- OpenAI Agents SDK
- CrewAI / AutoGen
- DSPy
- Haystack
- PydanticAI / Instructor
- Vercel AI SDK
How It Compares
| Feature | W&B Weave | Langfuse | LangSmith | Arize Phoenix |
|---|---|---|---|---|
| Open Source | Apache 2.0 | MIT | No | Apache 2.0 |
| Self-Hosted | No (needs W&B) | Yes | No | Yes |
| Agent Tracing | Native multi-agent | Via SDK | Native | Via SDK |
| Built-in Scorers | Safety + Quality | LLM-as-judge | Online evals | LLM-as-judge |
| Prod Evaluation | Live traces | Manual | Manual | Manual |
| ML Integration | Full W&B platform | None | None | Limited |
| Free Tier | Limited | 50K obs/mo | 5K traces/mo | Unlimited local |
| Best For | W&B users, agents | Full control | LangChain users | Local dev |