TruLens

Open-source library for evaluating and tracking LLM applications using feedback functions for groundedness, relevance, and safety

3.4K+ GitHub Stars
30+ Feedback Functions
100% Open Source

Overview

TruLens is an open-source Python library for evaluating, tracking, and iterating on LLM applications. It was developed by TruEra, which Snowflake acquired in 2024. It ships more than 30 built-in "feedback functions" that score quality metrics such as groundedness, answer relevance, context relevance, and harmlessness. TruLens integrates with LangChain, LlamaIndex, and custom RAG pipelines, and offers both programmatic evaluation and a local dashboard for visualizing results. The library originally focused on model interpretability; TruEra later repositioned it to address the growing need for LLM observability and evaluation in production systems.
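
To make the idea of a feedback function concrete, here is a minimal sketch of a groundedness-style score in plain Python. It is not how TruLens computes groundedness (the library relies on LLM or NLI judges); it swaps in a simple word-overlap heuristic, and the names `toy_groundedness` and `support_threshold` are hypothetical.

```python
import re


def _words(text: str) -> set:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_groundedness(context: str, response: str, support_threshold: float = 0.6) -> float:
    """Return the fraction of response sentences whose words mostly appear in the context.

    A toy stand-in for a real groundedness judge: a sentence counts as supported
    when at least `support_threshold` of its words occur in the context.
    """
    context_words = _words(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _words(sentence)
        if words and len(words & context_words) / len(words) >= support_threshold:
            supported += 1
    return supported / len(sentences)


context = "RAG combines a retriever with a generator. The retriever fetches documents."
grounded = toy_groundedness(context, "RAG combines a retriever with a generator.")
ungrounded = toy_groundedness(context, "Bananas are grown in tropical climates.")
```

The shape is the useful part: a function takes pieces of a record (here, retrieved context and the response) and returns a score between 0 and 1. A production judge would replace the overlap check with a model call.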

The Verdict

Who Should Use TruLens?

Best For

  • RAG application developers needing hallucination detection
  • Teams requiring detailed groundedness scoring
  • LangChain/LlamaIndex users wanting native integration
  • Researchers needing customizable feedback functions
  • Projects requiring self-hosted evaluation infrastructure

Not Ideal For

  • Teams needing enterprise SaaS with support (consider Confident AI)
  • Simple unit testing workflows (DeepEval is simpler)
  • Teams that need only standard RAG metrics, with no custom evaluators (Ragas is more focused)
  • Production monitoring at scale (consider Langfuse or Arize)

What's Great

  • Rich library of pre-built feedback functions for RAG evaluation
  • Excellent groundedness and hallucination detection metrics
  • Native integration with LangChain and LlamaIndex
  • Local dashboard for visualizing evaluation results
  • Supports multiple LLM providers as evaluators
  • Completely open-source with active community
  • Modular design allows custom feedback function creation

Watch Out For

  • Steeper learning curve than simpler alternatives
  • Dashboard runs locally only; there is no hosted cloud option
  • Documentation can be fragmented across versions
  • Evaluation costs add up when using LLM-based feedback
  • Less focused than Ragas for pure RAG evaluation
  • Requires more setup than managed platforms
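
To see how LLM-based feedback costs accumulate, a back-of-the-envelope estimate helps. The sketch below assumes one judge call per feedback function per record, which may understate the cost of feedback functions that make several calls. Every number in it, including the per-token price, is hypothetical; substitute your own provider's rates.

```python
def estimate_eval_cost(
    num_records: int,
    feedbacks_per_record: int,
    tokens_per_feedback_call: int,
    usd_per_1k_tokens: float,
) -> float:
    """Estimate LLM-judge spend, assuming one LLM call per feedback function per record."""
    total_tokens = num_records * feedbacks_per_record * tokens_per_feedback_call
    return total_tokens / 1000 * usd_per_1k_tokens


# Hypothetical workload: 10,000 records, 3 feedback functions,
# about 1,500 tokens per judge call, at an assumed $0.01 per 1K tokens.
cost = estimate_eval_cost(10_000, 3, 1_500, 0.01)
print(f"Estimated evaluation cost: ${cost:,.2f}")
```

With these assumed inputs the estimate comes to $450, so sampling a subset of records or using a cheaper judge model can matter at volume.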

Core Features

  • Groundedness evaluation for RAG
  • Answer relevance scoring
  • Context relevance assessment
  • Harmlessness/safety checks
  • Custom feedback function creation
  • Chain-of-thought tracing
  • Cost and latency tracking
  • Local evaluation dashboard
  • Experiment comparison tools

Feedback Functions

  • Groundedness (NLI-based)
  • Answer relevance
  • Context relevance
  • Coherence scoring
  • Conciseness check
  • Harmlessness detection
  • Sentiment analysis
  • Moderation (toxicity, bias)
  • Custom LLM-based evaluators
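
A custom feedback function is, at its core, a Python callable that returns a score, conventionally between 0 and 1. The following is a minimal sketch using a hypothetical length-based conciseness heuristic; `conciseness_score` and the `TARGET_WORDS` cutoff are made up for illustration. The TruLens wiring appears only as a comment and follows the `Feedback` pattern from Getting Started, so treat it as an assumption to check against your installed version.

```python
TARGET_WORDS = 50  # hypothetical cutoff: responses at or under this length score 1.0


def conciseness_score(response: str) -> float:
    """Score 1.0 for short responses, decaying toward 0.0 as length grows past the target."""
    word_count = len(response.split())
    if word_count <= TARGET_WORDS:
        return 1.0
    return TARGET_WORDS / word_count


# Assumed wiring, following the Feedback pattern shown in Getting Started:
# from trulens.core import Feedback
# f_concise = Feedback(conciseness_score, name="Conciseness").on_output()

short_score = conciseness_score("RAG pairs retrieval with generation.")
long_score = conciseness_score("word " * 200)
```

A deterministic heuristic like this costs nothing to run, which makes it a reasonable complement to LLM-based evaluators.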

Integrations

  • LangChain (TruChain)
  • LlamaIndex (TruLlama)
  • Custom Python apps (TruBasicApp)
  • OpenAI, Anthropic, Bedrock
  • HuggingFace models
  • Snowflake Cortex
  • Azure OpenAI
  • Local LLMs via Ollama

Platforms & Requirements

  • Python 3.8+
  • pip install trulens
  • Local SQLite or PostgreSQL
  • Streamlit dashboard
  • Jupyter notebook support
  • Docker deployment option
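
The choice between local SQLite and PostgreSQL usually comes down to a connection URL. Here is a small sketch, assuming a hypothetical `TRULENS_DATABASE_URL` environment variable and an assumed default SQLite file name; the `TruSession(database_url=...)` call in the trailing comment should be verified against your installed TruLens version.

```python
import os

DEFAULT_SQLITE_URL = "sqlite:///default.sqlite"  # assumed local default file name


def resolve_database_url(env=None) -> str:
    """Pick a SQLAlchemy-style URL: the configured one if present, otherwise local SQLite.

    TRULENS_DATABASE_URL is a hypothetical variable name used for this sketch.
    """
    env = os.environ if env is None else env
    return env.get("TRULENS_DATABASE_URL", DEFAULT_SQLITE_URL)


local_url = resolve_database_url({})
shared_url = resolve_database_url(
    {"TRULENS_DATABASE_URL": "postgresql://user:password@localhost:5432/trulens"}
)

# Assumed usage; check the parameter name against your installed TruLens version:
# from trulens.core import TruSession
# session = TruSession(database_url=shared_url)
```

SQLite suits a single developer's machine; a shared PostgreSQL instance lets several people write to and read from the same evaluation history.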

How It Compares

Feature | TruLens | Ragas | DeepEval
Focus | General LLM + RAG | RAG-specific | General LLM
Groundedness | Excellent (NLI-based) | Good | Good
Feedback Functions | 30+ built-in | 10+ RAG metrics | 14+ metrics
LangChain Integration | Native (TruChain) | Yes | Yes
LlamaIndex Integration | Native (TruLlama) | Limited | Yes
Dashboard | Local Streamlit | None (code-only) | Confident AI Cloud
Custom Evaluators | Very flexible | Template-based | Class-based
Learning Curve | Medium-High | Low-Medium | Low
Best For | Detailed RAG analysis | Quick RAG metrics | CI/CD testing
GitHub Stars | ~2.1K | ~7K | ~3.5K

Real-World Usage

Key Use Cases

  • RAG pipeline evaluation and debugging
  • Hallucination detection in production
  • A/B testing prompt variations
  • Comparing retrieval strategies
  • Safety and moderation checks
  • Model comparison studies
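
Use cases such as A/B testing prompts or comparing retrieval strategies reduce to the same operation: group records by app version and compare average feedback scores. The sketch below does this in plain Python over hypothetical scores. The record layout and version names are assumptions for illustration, not a TruLens data format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-record feedback scores for two prompt variants.
records = [
    {"app_version": "prompt_v1", "groundedness": 0.6, "answer_relevance": 0.9},
    {"app_version": "prompt_v1", "groundedness": 0.8, "answer_relevance": 0.7},
    {"app_version": "prompt_v2", "groundedness": 0.9, "answer_relevance": 0.8},
    {"app_version": "prompt_v2", "groundedness": 0.7, "answer_relevance": 1.0},
]


def mean_scores_by_version(rows, metrics):
    """Group rows by app version and average each metric."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["app_version"]].append(row)
    return {
        version: {m: mean(r[m] for r in group) for m in metrics}
        for version, group in grouped.items()
    }


summary = mean_scores_by_version(records, ["groundedness", "answer_relevance"])
for version, scores in sorted(summary.items()):
    print(version, {m: round(s, 2) for m, s in scores.items()})
```

With only a handful of records per variant the averages are noisy, so a real comparison would use a larger, shared set of test questions for every variant.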

Evaluation Providers

  • OpenAI (GPT-4, GPT-3.5)
  • Anthropic Claude
  • Azure OpenAI
  • AWS Bedrock
  • HuggingFace models
  • Local models via Ollama

Getting Started

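from trulens.apps.langchain import TruChain
from trulens.core import Feedback, TruSession
from trulens.dashboard import run_dashboard
from trulens.providers.openai import OpenAI

# `chain` is assumed to be an existing LangChain RAG chain defined elsewhere.
# Start a session; results are stored in a local database by default.
session = TruSession()

# Initialize provider (expects OPENAI_API_KEY in the environment)
provider = OpenAI()

# Select the retrieved context so groundedness is judged against it
context = TruChain.select_context(chain)

# Define feedback functions and tell each one which parts of a record to read
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())
    .on_output()
)
f_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

# Wrap your LangChain app
tru_recorder = TruChain(
    chain,
    app_name="RAG_App",
    feedbacks=[f_groundedness, f_relevance]
)

# Run with recording (adjust the input to match your chain's schema)
with tru_recorder as recording:
    response = chain.invoke({"question": "What is RAG?"})

# Launch dashboard
run_dashboard(session)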
