TruLens


Open-source library for evaluating and tracking LLM applications using feedback functions for groundedness, relevance, and safety

observability

3.4K+ GitHub Stars

30+ Feedback Functions

100% Open Source

Overview

TruLens is an open-source Python library developed by TruEra for evaluating, tracking, and iterating on LLM applications. It provides a suite of "feedback functions" that score quality metrics such as groundedness, answer relevance, context relevance, and harmlessness. TruLens integrates with LangChain, LlamaIndex, and custom RAG pipelines, and offers both programmatic evaluation and a local dashboard for visualizing results. TruLens originally focused on model interpretability; TruEra later repositioned it to address the growing need for LLM observability and evaluation in production systems.
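To make the idea of a feedback function concrete, here is a minimal sketch in plain Python. It is not TruLens code: `toy_groundedness` is a hypothetical name, and word overlap is a crude stand-in for the NLI- or LLM-based scoring a real evaluator would use. It only illustrates the shape: text in, a score between 0 and 1 out.

```python
import re


def _tokens(text: str) -> set:
    """Lowercase word tokens; a deliberately crude normalizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_groundedness(context: str, answer: str, threshold: float = 0.5) -> float:
    """Fraction of answer sentences whose words mostly appear in the context.

    Hypothetical stand-in for a real groundedness check: a sentence counts as
    supported when at least `threshold` of its tokens occur in the context.
    """
    context_tokens = _tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        sent_tokens = _tokens(sentence)
        if not sent_tokens:
            continue
        overlap = len(sent_tokens & context_tokens) / len(sent_tokens)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)


context = "RAG retrieves documents and passes them to the model as context."
grounded = "RAG retrieves documents. The model receives them as context."
ungrounded = "RAG was invented on Mars in 1850."

print(toy_groundedness(context, grounded))
print(toy_groundedness(context, ungrounded))
```

The grounded answer scores higher than the unsupported one. A production evaluator would replace the overlap test with an entailment model or an LLM judge, which is where the library's built-in functions come in.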

The Verdict

Who Should Use TruLens?

Best For

RAG application developers needing hallucination detection
Teams requiring detailed groundedness scoring
LangChain/LlamaIndex users wanting native integration
Researchers needing customizable feedback functions
Projects requiring self-hosted evaluation infrastructure

Not Ideal For

Teams needing enterprise SaaS with support (consider Confident AI)
Simple unit testing workflows (DeepEval is simpler)
Teams that need only standard RAG metrics, with no custom evaluators (Ragas is more focused)
Production monitoring at scale (consider Langfuse or Arize)

What's Great

Rich library of pre-built feedback functions for RAG evaluation
Excellent groundedness and hallucination detection metrics
Native integration with LangChain and LlamaIndex
Local dashboard for visualizing evaluation results
Supports multiple LLM providers as evaluators
Completely open-source with active community
Modular design allows custom feedback function creation


Watch Out For

Steeper learning curve than simpler alternatives
Dashboard is local-only without cloud hosting option
Documentation can be fragmented across versions
Evaluation costs add up when using LLM-based feedback
Less focused than Ragas for pure RAG evaluation
Requires more setup than managed platforms
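The cost point above can be estimated with simple arithmetic. The sketch below is illustrative only: `estimate_eval_cost` is a hypothetical helper, the token count and per-token price are made-up placeholders rather than published rates, and it assumes one LLM call per feedback function per record (some evaluators make several).

```python
def estimate_eval_cost(
    records: int,
    feedbacks_per_record: int,
    tokens_per_call: int,
    usd_per_1k_tokens: float,
) -> float:
    """Rough cost of LLM-based evaluation.

    Simplifying assumption: one LLM call per feedback function per record.
    """
    total_calls = records * feedbacks_per_record
    total_tokens = total_calls * tokens_per_call
    return total_tokens / 1000 * usd_per_1k_tokens


# Placeholder inputs: 10,000 records, 3 feedback functions,
# 1,500 tokens per evaluation call, $0.01 per 1K tokens.
cost = estimate_eval_cost(10_000, 3, 1_500, 0.01)
print(f"${cost:,.2f}")
```

Because cost scales linearly with both record count and the number of feedback functions, evaluating a sample of records or dropping low-value metrics reduces spend proportionally.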


Pricing

Open Source

Free

Full library, self-hosted dashboard, all feedback functions

TruEra Enterprise

Contact

Enterprise ML observability platform with TruLens integration


Core Features

Groundedness evaluation for RAG
Answer relevance scoring
Context relevance assessment
Harmlessness/safety checks
Custom feedback function creation
Chain-of-thought tracing
Cost and latency tracking
Local evaluation dashboard
Experiment comparison tools
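Tracing and latency tracking come down to wrapping the app so each call is captured as a record. The sketch below shows that idea in plain Python. `ToyRecorder` and `Record` are hypothetical names for illustration, not the TruLens API, and `echo_app` is a placeholder for a real LLM application.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Record:
    """One traced call: what went in, what came out, how long it took."""
    app_input: Any
    app_output: Any
    latency_s: float


@dataclass
class ToyRecorder:
    """Hypothetical stand-in for an app wrapper; not the TruLens API."""
    app: Callable[[str], str]
    records: List[Record] = field(default_factory=list)

    def __call__(self, prompt: str) -> str:
        start = time.perf_counter()
        output = self.app(prompt)
        elapsed = time.perf_counter() - start
        # Store the call so feedback functions can score it later
        self.records.append(Record(prompt, output, elapsed))
        return output


def echo_app(prompt: str) -> str:
    """Placeholder for an LLM app."""
    return f"You asked: {prompt}"


recorder = ToyRecorder(echo_app)
recorder("What is RAG?")
print(recorder.records[0].app_output)
print(recorder.records[0].latency_s >= 0)
```

Keeping records separate from scoring means evaluators can run after the fact, over stored inputs and outputs, without calling the app again.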

Feedback Functions

Groundedness (NLI-based)
Answer relevance
Context relevance
Coherence scoring
Conciseness check
Harmlessness detection
Sentiment analysis
Moderation (toxicity, bias)
Custom LLM-based evaluators
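A custom evaluator can be as simple as a Python function that returns a score between 0 and 1. The sketch below is a heuristic conciseness check written for illustration. The `conciseness` function and its `max_words` default are assumptions, not the library's built-in implementation, and the commented-out wrapping lines reflect my understanding of how a plain callable is registered, so check them against the current docs.

```python
def conciseness(response: str, max_words: int = 50) -> float:
    """Score 1.0 at or under `max_words`, decaying linearly to 0.0 at twice that."""
    n_words = len(response.split())
    if n_words <= max_words:
        return 1.0
    overshoot = (n_words - max_words) / max_words
    return max(0.0, 1.0 - overshoot)


print(conciseness("RAG combines retrieval with generation."))
print(conciseness("word " * 75))
print(conciseness("word " * 200))

# Assumption: with TruLens installed, a callable like this can be wrapped the
# same way as a built-in (commented out so this block runs without TruLens):
# from trulens.core import Feedback
# f_concise = Feedback(conciseness, name="Conciseness").on_output()
```

A deterministic heuristic like this costs nothing per call, which makes it a useful complement to LLM-based evaluators for properties that are easy to measure directly.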

Integrations

LangChain (TruChain)
LlamaIndex (TruLlama)
Custom Python apps (TruBasicApp)
OpenAI, Anthropic, Bedrock
HuggingFace models
Snowflake Cortex
Azure OpenAI
Local LLMs via Ollama

Platforms & Requirements

Python 3.8+
pip install trulens
Local SQLite or PostgreSQL
Streamlit dashboard
Jupyter notebook support
Docker deployment option

How It Compares

Feature	TruLens	Ragas	DeepEval
Focus	General LLM + RAG	RAG-specific	General LLM
Groundedness	Excellent NLI-based	Good	Good
Feedback Functions	30+ built-in	10+ RAG metrics	14+ metrics
LangChain Integration	Native (TruChain)	Yes	Yes
LlamaIndex Integration	Native (TruLlama)	Limited	Yes
Dashboard	Local Streamlit	None (code-only)	Confident AI Cloud
Custom Evaluators	Very flexible	Template-based	Class-based
Learning Curve	Medium-High	Low-Medium	Low
Best For	Detailed RAG analysis	Quick RAG metrics	CI/CD testing
GitHub Stars	~3.4K	~7K	~3.5K

Real-World Usage

Key Use Cases

RAG pipeline evaluation and debugging
Hallucination detection in production
A/B testing prompt variations
Comparing retrieval strategies
Safety and moderation checks
Model comparison studies
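For A/B tests and retrieval comparisons, the core step is aggregating feedback scores per app version and comparing the averages. The sketch below does this in plain Python. The scores are made-up placeholders, and `summarize` and `best_version` are hypothetical helpers rather than TruLens functions.

```python
from statistics import mean

# Made-up feedback scores (0-1) for two prompt variants; placeholders only.
scores = {
    "prompt_v1": {"groundedness": [0.8, 0.6, 0.7], "answer_relevance": [0.9, 0.8, 1.0]},
    "prompt_v2": {"groundedness": [0.9, 0.9, 0.6], "answer_relevance": [0.7, 0.8, 0.9]},
}


def summarize(results):
    """Average each metric per app version."""
    return {
        version: {metric: round(mean(vals), 3) for metric, vals in metrics.items()}
        for version, metrics in results.items()
    }


def best_version(summary, metric):
    """Version with the highest mean for one metric."""
    return max(summary, key=lambda version: summary[version][metric])


summary = summarize(scores)
for version, metrics in summary.items():
    print(version, metrics)
print("best groundedness:", best_version(summary, "groundedness"))
print("best answer relevance:", best_version(summary, "answer_relevance"))
```

In this toy data the two variants trade off: one wins on groundedness, the other on answer relevance. That is the typical reason to track several feedback functions side by side instead of a single score.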

Evaluation Providers

OpenAI GPT-4/3.5
Anthropic Claude
Azure OpenAI
AWS Bedrock
HuggingFace models
Local models via Ollama

Getting Started

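# Requires: pip install trulens trulens-apps-langchain trulens-providers-openai
# Assumes OPENAI_API_KEY is set and `chain` is your existing LangChain RAG
# chain (it must contain a retriever for the context selector below).
from trulens.apps.langchain import TruChain
from trulens.core import Feedback
from trulens.dashboard import run_dashboard
from trulens.providers.openai import OpenAI

# Initialize the LLM provider that scores each record
provider = OpenAI()

# Groundedness compares the retrieved context with the answer,
# so it needs a selector pointing at the chain's retrieval step
context = TruChain.select_context(chain)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())
    .on_output()
)

# Answer relevance compares the user input with the final output
f_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

# Wrap your LangChain app
tru_recorder = TruChain(
    chain,
    app_name="RAG_App",
    feedbacks=[f_groundedness, f_relevance],
)

# Run with recording; each call inside the block is traced and scored
with tru_recorder as recording:
    response = chain.invoke({"question": "What is RAG?"})

# Launch the local dashboard
run_dashboard()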
