LangExtract

LangExtract is Google's open-source Python library that extracts structured information from unstructured text using LLMs, grounding every extraction to its exact character position in the source document for verifiable, hallucination-filtered output.

36.9K GitHub Stars
90+ Languages
20 Parallel Workers
Free Apache 2.0

Overview

LangExtract is Google's open-source Python library for LLM-based structured extraction from unstructured text. Unlike document parsers that convert one format into another, it specializes in semantic extraction: it turns a clinical report, a legal contract, or a batch of customer reviews into structured JSON in which every extracted entity is tied to its exact character position in the source text. Built-in hallucination filtering removes model-generated content that does not appear in the source. Since its launch in July 2025 it has gained traction quickly, with users citing it as a free replacement for enterprise extraction tools that historically cost $50K+.
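
To make the grounding idea concrete, here is a minimal sketch in plain Python. It does not use LangExtract's API and is not how the library is implemented: the `GroundedExtraction` class and the `ground` and `filter_ungrounded` helpers are hypothetical names, and exact substring search stands in for whatever alignment the library actually performs. The sketch attaches character offsets to candidate extractions and drops any candidate that cannot be found in the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GroundedExtraction:
    """Hypothetical record type, not LangExtract's own data class."""
    label: str              # e.g. "medication"
    text: str               # the extracted surface text
    start: Optional[int]    # character offset where the span begins
    end: Optional[int]      # character offset one past the end of the span


def ground(source: str, label: str, text: str) -> GroundedExtraction:
    """Attach character offsets using exact substring search (illustrative only)."""
    start = source.find(text)
    if start == -1:
        # The candidate text does not occur in the source at all.
        return GroundedExtraction(label, text, None, None)
    return GroundedExtraction(label, text, start, start + len(text))


def filter_ungrounded(items: List[GroundedExtraction]) -> List[GroundedExtraction]:
    """Keep only extractions that were located in the source."""
    return [e for e in items if e.start is not None]


source = "Patient was given 250 mg amoxicillin twice daily."
# The third candidate simulates a fabricated entity that is absent from the source.
candidates = [
    ("dosage", "250 mg"),
    ("medication", "amoxicillin"),
    ("medication", "ibuprofen"),
]
grounded = filter_ungrounded([ground(source, label, text) for label, text in candidates])
for e in grounded:
    print(e.label, e.text, e.start, e.end)
```

Because each surviving extraction carries its offsets, a reviewer can compare `source[e.start:e.end]` against the extracted text to confirm it.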

The Verdict

Who Should Use LangExtract?

Best For

  • Healthcare and legal teams extracting structured entities from narrative text with full audit trails
  • Developers needing schema-based JSON extraction from long documents without fine-tuning
  • Compliance workflows requiring traceable, verifiable extraction (every output cites its source character span)
  • Multilingual extraction pipelines — supports 90+ languages via Gemini models
  • Teams using Gemini, OpenAI, or Ollama that want a unified extraction framework

Not Ideal For

  • PDF or image parsing — LangExtract is text-in/JSON-out; use Docling or LlamaParse upstream for OCR
  • Teams needing vendor support or SLAs — this is a Google research project, not a supported product
  • Workflows requiring built-in document classification, chunking UI, or workflow automation
  • Users wanting zero-prompt setup — extraction quality depends on well-designed schemas and few-shot examples

What's Great

  • Character-level source grounding — every extracted entity maps back to exact positions in source text, enabling audit trails
  • Built-in hallucination filtering removes model-fabricated content not present in the source
  • Handles long documents (147,000+ characters) via automatic chunking with up to 20 parallel workers
  • Interactive HTML visualization shows extracted entities highlighted in context — great for review workflows
  • Replaces enterprise tools historically costing $50K+ — completely free under Apache 2.0
  • Flexible deployment: Gemini API, Vertex AI batch processing, OpenAI, or fully offline via Ollama
  • No fine-tuning required — works with prompts and few-shot examples
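
The chunking and parallel-worker behavior listed above can be illustrated with a small standard-library sketch. This is an illustration under stated assumptions, not LangExtract's implementation: `chunk_text`, `find_in_chunk`, and `extract_parallel` are hypothetical helpers, a keyword search stands in for the LLM call, and fixed-size chunks are used for simplicity. The point it shows is that each chunk keeps its starting offset, so spans found in parallel can still be reported as positions in the full document.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars):
    """Split text into (offset, chunk) pairs of at most max_chars characters."""
    return [(i, text[i:i + max_chars]) for i in range(0, len(text), max_chars)]


def find_in_chunk(offset, chunk, keyword):
    """Stand-in for an LLM call: return document-level spans of keyword in one chunk."""
    spans = []
    pos = chunk.find(keyword)
    while pos != -1:
        # Shift the chunk-local position by the chunk's offset in the document.
        spans.append((offset + pos, offset + pos + len(keyword)))
        pos = chunk.find(keyword, pos + 1)
    return spans


def extract_parallel(text, keyword, max_chars=1000, max_workers=20):
    """Process chunks concurrently and merge their spans in document order."""
    chunks = chunk_text(text, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(lambda c: find_in_chunk(c[0], c[1], keyword), chunks))
    return sorted(span for spans in per_chunk for span in spans)


# Each 20-character unit holds one "ROMEO", so no match straddles a chunk boundary.
document = ("x" * 10 + "ROMEO" + "x" * 5) * 3
spans = extract_parallel(document, "ROMEO", max_chars=20, max_workers=4)
print(spans)
```

Naive fixed-size chunks like these would miss a match that straddles a boundary, which is one reason a real pipeline needs a more careful splitting strategy.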

Watch Out For

  • Text-only input — requires upstream OCR tools (Docling, LlamaParse) for PDFs, scans, or images
  • Not an officially supported Google product — no vendor SLA, support tickets, or guaranteed maintenance
  • No built-in OCR, document classification, chunking UI, or workflow automation
  • Extraction quality depends heavily on prompt and schema design — requires developer investment upfront
  • Can supplement extractions with model knowledge, risking hallucination if grounding filters miss edge cases
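
Since extraction quality depends on the task description and the few-shot examples, it can help to see how those pieces fit together in a prompt. The sketch below is a generic illustration, not LangExtract's internal prompt format: `build_prompt` and the `"class"`/`"text"` keys are hypothetical choices made for this example.

```python
import json


def build_prompt(task, examples, document):
    """Assemble a few-shot extraction prompt as plain text (hypothetical layout)."""
    parts = [task.strip(), ""]
    for ex in examples:
        # Each example pairs an input text with the extractions expected from it.
        parts.append("Text: " + ex["text"])
        parts.append("Extractions: " + json.dumps(ex["extractions"], ensure_ascii=False))
        parts.append("")
    # The document to process goes last, leaving the answer slot open for the model.
    parts.append("Text: " + document)
    parts.append("Extractions:")
    return "\n".join(parts)


task = "Extract medication names using exact wording from the input."
examples = [
    {
        "text": "Take 250 mg amoxicillin.",
        "extractions": [{"class": "medication", "text": "amoxicillin"}],
    }
]
prompt = build_prompt(task, examples, "Patient received ibuprofen.")
print(prompt)
```

Examples whose extraction text is copied verbatim from the example input, as here, model the behavior that grounding relies on: outputs that can be found in the source.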

Pricing

LangExtract is free and open source under the Apache 2.0 license.

Core Capabilities

  • Source Grounding — maps every extraction to exact character offsets in original text
  • Hallucination Filtering — detects and removes fabricated content not in source
  • Long Document Processing — handles 147,000+ character documents via chunking
  • 20 parallel workers for high-throughput extraction
  • Interactive HTML visualization — extracted entities highlighted in context
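
Character offsets are what make an in-context highlight view possible. As a rough sketch of that idea (not LangExtract's own visualization code; `highlight` and the `<mark>` layout are assumptions made for this example), grounded spans can be turned into highlighted HTML with the standard library alone:

```python
import html


def highlight(source, spans):
    """Wrap each (start, end, label) span in a <mark> tag; spans must not overlap."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        # Escape the untouched text between the previous span and this one.
        out.append(html.escape(source[cursor:start]))
        out.append(
            '<mark title="{}">{}</mark>'.format(
                html.escape(label), html.escape(source[start:end])
            )
        )
        cursor = end
    # Escape whatever follows the last span.
    out.append(html.escape(source[cursor:]))
    return "".join(out)


source = "Take 250 mg <daily>."
page = highlight(source, [(5, 11, "dosage")])
print(page)
```

Escaping the source text before wrapping it keeps characters such as `<` from being interpreted as markup in the review page.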

Supported LLM Providers

  • Gemini 2.5 Flash / Pro (primary, official)
  • OpenAI GPT models
  • Ollama (local / offline inference)
  • Vertex AI Batch Processing (enterprise scale)
  • Custom model providers via plugin interface

Use Cases

  • Healthcare: clinical narratives, radiology reports, medication extraction
  • Legal/Finance: contract terms, entity relationships, compliance extraction
  • Customer feedback: categorize reviews into bugs, features, complaints
  • Technical docs: part numbers, specs from dense documentation
  • RAG enrichment: generate structured metadata for retrieval pipelines

Integrations

  • Microsoft Presidio (PII/PHI detection)
  • Elasticsearch (community integration)
  • Works downstream of Docling, LlamaParse, or any OCR pipeline
  • PyPI: pip install langextract

Languages & Scale

  • 90+ languages via Gemini models
  • Schema-based extraction — no fine-tuning required
  • Few-shot examples supported for domain adaptation
  • Latest release: v1.5.0 (May 2026)

How It Compares

Feature              | LangExtract                 | Docling                 | LlamaParse
Primary Purpose      | Structured extraction       | Document parsing        | Document parsing
PDF/Image Input      | No (text only)              | Yes                     | Yes
Source Grounding     | Character-level             | Bounding boxes          | Bounding boxes
Hallucination Filter | Built-in                    | N/A                     | N/A
Self-hosted          | Yes (Ollama)                | Yes                     | No
Cost                 | Free (OSS)                  | Free (OSS)              | $0.0013–$0.056/page
Multilingual         | 90+ languages               | Experimental            | 100+ languages
LLM Providers        | Gemini, OpenAI, Ollama      | N/A                     | LlamaIndex cloud
Best For             | Semantic extraction + audit | Layout parsing + tables | Speed + complex PDFs
