LangExtract


LangExtract is Google's open-source Python library that extracts structured information from unstructured text using LLMs, grounding every extraction to its exact character position in the source document for verifiable, hallucination-filtered output.


36.9K GitHub Stars

90+ Languages

20 Parallel Workers

Free (Apache 2.0)

Overview

LangExtract is a Python library from Google that uses LLMs to extract structured information from unstructured text and ties every extracted entity to its exact character position in the source document. Unlike document parsers, which convert file formats, LangExtract does semantic extraction only: it turns a clinical report, a legal contract, or a batch of customer reviews into structured JSON that can be traced back to the source text. Built-in hallucination filtering removes model output that does not appear in the source. Launched in July 2025, it gained traction within months, and users cite it as a free replacement for enterprise extraction tools that have historically cost $50K or more.
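The grounding and filtering idea described above can be sketched in plain Python. This is not LangExtract's implementation or API: the `GroundedExtraction` class and the `ground` and `filter_grounded` helpers are hypothetical names, and exact substring matching stands in for whatever alignment the library actually performs. It only illustrates what "grounded to a character span" means and why an ungrounded value can be discarded.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GroundedExtraction:
    # Hypothetical record type for illustration only.
    label: str   # entity class, e.g. "medication"
    text: str    # the span the model claims to have extracted
    start: int   # offset of the first character in the source
    end: int     # offset one past the last character


def ground(source: str, label: str, candidate: str) -> Optional[GroundedExtraction]:
    """Return a grounded extraction, or None if the candidate is not in the source."""
    if not candidate:
        return None
    start = source.find(candidate)  # exact match only: an assumption of this sketch
    if start == -1:
        return None
    return GroundedExtraction(label, candidate, start, start + len(candidate))


def filter_grounded(source: str, candidates: List[Tuple[str, str]]) -> List[GroundedExtraction]:
    """Keep only candidates that can be located verbatim in the source text."""
    grounded = (ground(source, label, text) for label, text in candidates)
    return [g for g in grounded if g is not None]


source = "Patient was prescribed 500 mg amoxicillin twice daily for 7 days."
candidates = [
    ("dosage", "500 mg"),
    ("medication", "amoxicillin"),
    ("medication", "ibuprofen"),  # not in the source: simulates a fabricated value
]
results = filter_grounded(source, candidates)
for r in results:
    # Each surviving extraction can be re-read from the source by its offsets.
    print(r.label, repr(source[r.start:r.end]), r.start, r.end)
```

Because every kept record carries `start` and `end`, a reviewer can slice the source and confirm the value, which is the property an audit trail relies on.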

The Verdict

Who Should Use LangExtract?

Best For

Healthcare and legal teams extracting structured entities from narrative text with full audit trails
Developers needing schema-based JSON extraction from long documents without fine-tuning
Compliance workflows requiring traceable, verifiable extraction (every output cites its source character span)
Multilingual extraction pipelines — supports 90+ languages via Gemini models
Teams using Gemini, OpenAI, or Ollama that want a unified extraction framework

Not Ideal For

PDF or image parsing — LangExtract is text-in/JSON-out; use Docling or LlamaParse upstream for OCR
Teams needing vendor support or SLAs — this is a Google research project, not a supported product
Workflows requiring built-in document classification, chunking UI, or workflow automation
Users wanting zero-prompt setup — extraction quality depends on well-designed schemas and few-shot examples

What's Great

Character-level source grounding — every extracted entity maps back to exact positions in source text, enabling audit trails
Built-in hallucination filtering removes model-fabricated content not present in the source
Handles documents of 147,000+ characters via automatic chunking, with up to 20 parallel workers
Interactive HTML visualization shows extracted entities highlighted in context — great for review workflows
Cited by users as a replacement for enterprise tools that have historically cost $50K+, and free under Apache 2.0
Flexible deployment: Gemini API, Vertex AI batch processing, OpenAI, or fully offline via Ollama
No fine-tuning required — works with prompts and few-shot examples

Sources: GitHub (official) · TechStartups · IDP Software

Watch Out For

Text-only input — requires upstream OCR tools (Docling, LlamaParse) for PDFs, scans, or images
Not an officially supported Google product — no vendor SLA, support tickets, or guaranteed maintenance
No built-in OCR, document classification, chunking UI, or workflow automation
Extraction quality depends heavily on prompt and schema design — requires developer investment upfront
The model can supplement extractions with its own knowledge, so fabricated values can slip through when the grounding filter misses an edge case

Sources: IDP Software Review · MarkTechPost

Pricing

Open Source

Free

Apache 2.0 License. No usage fees. LLM API costs apply if using Gemini or OpenAI.

With Ollama

Fully Free

Run local models via Ollama for zero LLM API costs — fully air-gapped extraction pipeline.


Core Capabilities

Source Grounding — maps every extraction to exact character offsets in original text
Hallucination Filtering — detects and removes fabricated content not in source
Long Document Processing — handles 147,000+ character documents via chunking
Up to 20 parallel workers for high-throughput extraction
Interactive HTML visualization — extracted entities highlighted in context
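The combination of chunking, parallel workers, and document-level offsets listed above can be illustrated with a small standard-library sketch. It makes several assumptions: `chunk_text`, `find_terms`, and `extract_parallel` are hypothetical names, a simple term search stands in for the LLM call, and fixed-size chunks are used for brevity (a term that straddles a chunk boundary would be missed, which a real chunker has to handle).

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, chunk_size):
    """Split text into (offset, chunk) pairs of at most chunk_size characters."""
    return [(i, text[i:i + chunk_size]) for i in range(0, len(text), chunk_size)]


def find_terms(chunk_with_offset, terms):
    """Stand-in for a per-chunk LLM call (assumption of this sketch).

    Returns (term, start, end) tuples whose offsets refer to the full document,
    not to the chunk, by adding the chunk's starting offset.
    """
    offset, chunk = chunk_with_offset
    hits = []
    for term in terms:
        pos = chunk.find(term)
        while pos != -1:
            hits.append((term, offset + pos, offset + pos + len(term)))
            pos = chunk.find(term, pos + 1)
    return hits


def extract_parallel(text, terms, chunk_size=1000, max_workers=20):
    """Process chunks concurrently and merge results in document order."""
    chunks = chunk_text(text, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(lambda c: find_terms(c, terms), chunks))
    return sorted(hit for hits in per_chunk for hit in hits)


document = "aspirin 81 mg daily. " * 100
hits = extract_parallel(document, ["aspirin"], chunk_size=210, max_workers=20)
print(len(hits), hits[:2])
```

Carrying each chunk's starting offset through the worker is what keeps the merged results grounded against the original document rather than against the individual chunks.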

Supported LLM Providers

Gemini 2.5 Flash / Pro (primary, official)
OpenAI GPT models
Ollama (local / offline inference)
Vertex AI Batch Processing (enterprise scale)
Custom model providers via plugin interface

Use Cases

Healthcare: clinical narratives, radiology reports, medication extraction
Legal/Finance: contract terms, entity relationships, compliance extraction
Customer feedback: categorize reviews into bugs, features, complaints
Technical docs: part numbers, specs from dense documentation
RAG enrichment: generate structured metadata for retrieval pipelines

Integrations

Microsoft Presidio (PII/PHI detection)
Elasticsearch (community integration)
Works downstream of Docling, LlamaParse, or any OCR pipeline
PyPI: pip install langextract

Languages & Scale

90+ languages via Gemini models
Schema-based extraction — no fine-tuning required
Few-shot examples supported for domain adaptation
Latest release: v1.5.0 (May 2026)
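To make "schema-based extraction with few-shot examples" concrete, here is a minimal sketch of how a task description and worked examples might be assembled into a single prompt. It does not show LangExtract's own prompt format or API: `build_prompt`, the `class` and `text` keys, and the prompt layout are all assumptions chosen for illustration.

```python
import json


def build_prompt(description, examples, text):
    """Assemble a few-shot extraction prompt (hypothetical layout).

    Each example pairs an input text with the extractions expected from it,
    so the examples double as an informal schema for the output.
    """
    parts = [description.strip(), ""]
    for ex in examples:
        parts.append("Text: " + ex["text"])
        parts.append("Extractions: " + json.dumps(ex["extractions"], ensure_ascii=False))
        parts.append("")
    parts.append("Text: " + text)
    parts.append("Extractions:")
    return "\n".join(parts)


description = "Extract medications and dosages. Use exact text from the input."
examples = [
    {
        "text": "Take aspirin 81 mg once daily.",
        "extractions": [
            {"class": "medication", "text": "aspirin"},
            {"class": "dosage", "text": "81 mg"},
        ],
    }
]
new_text = "Patient was prescribed 500 mg amoxicillin twice daily."
prompt = build_prompt(description, examples, new_text)
print(prompt)
```

Adapting to a new domain then means editing the description and swapping the examples, with no model training involved.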

How It Compares

Feature	LangExtract	Docling	LlamaParse
Primary Purpose	Structured extraction	Document parsing	Document parsing
PDF/Image Input	No (text only)	Yes	Yes
Source Grounding	Character-level	Bounding boxes	Bounding boxes
Hallucination Filter	Built-in	N/A	N/A
Self-hosted	Yes (Ollama)	Yes	No
Cost	Free (OSS)	Free (OSS)	$0.0013–$0.056/page
Multilingual	90+ languages	Experimental	100+ languages
LLM Providers	Gemini, OpenAI, Ollama	N/A	LlamaIndex cloud
Best For	Semantic extraction + audit	Layout parsing + tables	Speed + complex PDFs
