PageIndex

PageIndex is a vectorless document retrieval and AI platform by VectifyAI that uses hierarchical tree indexing and LLM reasoning instead of embeddings. Mafin 2.5, the financial agent built on it, reports 98.7% accuracy on FinanceBench, well above the figures listed here for standard vector RAG approaches.

Tags: RAG · reasoning · API available · MCP server

98.7% FinanceBench Accuracy

26K+ GitHub Stars

23K+ Cloud Users

<1 min 740-page SEC filing

Overview

PageIndex (by VectifyAI) takes a different approach to document retrieval. Instead of chunking documents into fragments and searching by embedding similarity, as standard RAG does, PageIndex builds a hierarchical semantic tree of each document that preserves its full structure, then uses LLM reasoning to navigate that tree the way a human expert would. The reported result is 98.7% accuracy on FinanceBench, compared to ~31% for GPT-4o direct Q&A and ~45% for Perplexity. Every answer cites specific page and section references. Built on an open-source framework launched in September 2025 and now serving 23,000+ cloud users, it is best suited to long, structured professional documents: SEC filings, legal contracts, clinical documentation, and financial reports.

The Verdict

Who Should Use PageIndex?

Best For

Financial analysts working with SEC filings (10-K, 10-Q, 8-K) and earnings transcripts
Legal professionals extracting from long, structured contracts and regulatory filings
Enterprise teams needing auditable, citeable document retrieval without vector database infrastructure
Developers building RAG pipelines who want the reported 20+ percentage point accuracy gains over standard vector search on multi-hop finance queries
Researchers and analysts working with long scientific or technical documents

Not Ideal For

Simple document Q&A where standard RAG accuracy is sufficient
Workloads requiring the lowest possible latency — tree navigation has higher latency than vector lookup
Teams needing transparent, publicly listed pricing without a sales conversation
Multi-hop queries at massive scale where per-node LLM calls create significant API cost

What's Great

98.7% accuracy on FinanceBench — vs. ~31% GPT-4o and ~45% Perplexity on the same benchmark
No vector database infrastructure required — lower operational overhead
Every answer includes precise page and section citations — fully auditable
Document structure fully preserved — no chunk-boundary context loss
Processes a 740-page SEC filing in under 1 minute using ~70 LLM API calls
Scales to millions of documents in a single index (PageIndex File System)
Open-source framework with API and MCP access for developers

Sources: MarkTechPost · PageIndex Blog · Official Site

Watch Out For

Higher latency than vector search — acceptable for analyst workflows but not for sub-second retrieval
Each tree node requires an LLM call during indexing — costs scale with document complexity and size
Hybrid semantic search fallback is a future milestone, not yet available
Built-in retrieval and answer synthesis layer is limited — may require custom implementation
Pricing for cloud and enterprise tiers not publicly listed — requires account or sales contact
PDF parsing edge cases are acknowledged; the hosted OCR API is recommended for mission-critical work

Source: Independent Review (Medium)

Pricing

Open Source

Free

Self-hosted framework. Requires your own LLM API key. Full control over data and deployment.

Cloud

Contact

Managed platform. 23K+ users in production. Sign up at dash.pageindex.ai — pricing not publicly listed.

Enterprise

Custom

Dedicated/VPC deployment. PageIndex File System (millions of docs). Dedicated technical support.

Core Technology

Vectorless retrieval: no embeddings, no vector database, no chunking
Hierarchical tree index: each document is organized into a semantic tree that preserves headers, tables, and footnotes
LLM reasoning navigation: the model reasons through the tree structure rather than running a similarity search
Vision-native: can retrieve from page images with layout awareness
Explainable results: every answer cites specific page and section references
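
The tree-navigation idea in the list above can be illustrated with a minimal sketch. This is not PageIndex's API or schema: the Node class, the navigate and cite functions, and the sample report tree are all hypothetical, and a simple keyword-overlap score stands in for the LLM relevance judgement a real system would make at each level.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One section of a document tree (hypothetical, not PageIndex's schema)."""
    title: str
    summary: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


def words(text: str) -> set:
    """Lowercase alphanumeric tokens, with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(question: str, node: Node) -> float:
    """Stand-in for an LLM relevance judgement: count shared words."""
    return float(len(words(question) & words(node.title + " " + node.summary)))


def navigate(root: Node, question: str,
             score: Callable[[str, Node], float] = keyword_overlap) -> List[Node]:
    """Walk from the root to a leaf, taking the best-scoring child at each level.

    Returns the full path so the caller can cite sections and page ranges.
    """
    path = [root]
    current = root
    while current.children:
        current = max(current.children, key=lambda child: score(question, child))
        path.append(current)
    return path


def cite(path: List[Node]) -> str:
    """Format the section trail and the leaf's page range as a citation."""
    leaf = path[-1]
    trail = " > ".join(node.title for node in path)
    return f"{trail} (pp. {leaf.start_page}-{leaf.end_page})"


# Made-up sample tree for illustration only.
report = Node("Annual Report", "complete filing", 1, 740, [
    Node("Business", "products markets and competition", 1, 40),
    Node("Financial Statements",
         "audited statements including net income and balance sheet", 300, 520, [
             Node("Income Statement",
                  "revenue expenses and net income for the year", 300, 310),
             Node("Balance Sheet",
                  "assets liabilities and equity at year end", 311, 320),
         ]),
])

path = navigate(report, "What was net income for the year?")
print(cite(path))
```

Because the walk returns the whole path rather than just the final node, the citation can name both the section trail and the page range, which is the property the list describes as explainable results. Swapping keyword_overlap for a function that asks an LLM to rate each child would keep the same control flow.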

Key Products

PageIndex Core: open-source tree indexing framework
PageIndex File System: massive-scale multi-document search (millions of docs)
Mafin 2.5: financial agent built on PageIndex (98.7% FinanceBench)
MCP Server: Model Context Protocol integration
Developer API: REST API for integration

Use Cases

SEC filings (10-K, 10-Q, 8-K) analysis and KPI extraction
Earnings call transcript analysis and period comparison
Legal contract review and term extraction
Pharmaceutical / clinical documentation
Investment memo drafting from source documents
Academic paper and technical manual retrieval

Benchmark Performance

FinanceBench: PageIndex 98.7% vs GPT-4o ~31% vs Perplexity ~45%
20+ percentage point accuracy gains over standard vector RAG on multi-hop finance queries
740-page SEC filing indexed in under 1 minute (~70 LLM calls)
Scales to millions of documents in a single index
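
The indexing figures above imply a rough pages-per-call ratio. The sketch below works through that arithmetic and extrapolates to other document sizes, assuming the call count grows linearly with page count. That linearity is an assumption made here for illustration only, since this page also notes that cost depends on document complexity. The function names and the per-call price parameter are hypothetical.

```python
PAGES = 740       # page count of the SEC filing cited in the listing
LLM_CALLS = 70    # approximate number of LLM calls cited in the listing

# Average pages covered per LLM call for that filing.
pages_per_call = PAGES / LLM_CALLS


def estimate_calls(pages: int) -> int:
    """Scale the cited ratio to another page count, rounding up.

    Assumes calls grow linearly with pages (an illustrative assumption).
    """
    return -(-pages * LLM_CALLS // PAGES)  # integer ceiling division


def estimate_cost(pages: int, usd_per_call: float) -> float:
    """Indexing cost for a hypothetical per-call price supplied by the caller."""
    return estimate_calls(pages) * usd_per_call


print(f"{pages_per_call:.1f} pages per call")
print(estimate_calls(100), "calls for a 100-page document")
```

The cited filing works out to roughly 10.6 pages per call. Any real budget would need measured call counts and token usage for the documents in question, not this average.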

How It Compares

Feature	PageIndex	Standard Vector RAG	LlamaParse + RAG
FinanceBench Accuracy	98.7%	~50-60%	~70-80%
Retrieval Method	Tree + LLM reasoning	Embedding similarity	Embedding similarity
Vector DB Required	No	Yes	Yes
Document Structure Preserved	Fully	Partially (chunks)	Partially
Answer Citations	Page + section	Chunk-level	Chunk-level
Latency	Moderate	Fast	Fast
Self-hosted	Yes (OSS)	Varies	No
Pricing	OSS free / Cloud custom	Infrastructure cost	$0.0013–$0.056/page
Best For	Complex structured docs	General purpose	Speed + complex PDFs
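
To put the per-page price range in the Pricing row on the same footing as the 740-page filing used elsewhere on this page, the arithmetic can be worked out as below. This is only a sketch of the parsing charge implied by the table's figures; it does not include embedding, vector database, or LLM costs, and the variable names are illustrative.

```python
from decimal import Decimal

PAGES = 740  # page count of the SEC filing used as the listing's example

# Per-page price range for LlamaParse as given in the comparison table.
LOW_USD_PER_PAGE = Decimal("0.0013")
HIGH_USD_PER_PAGE = Decimal("0.056")

# Decimal avoids binary floating-point rounding in currency arithmetic.
low_total = LOW_USD_PER_PAGE * PAGES
high_total = HIGH_USD_PER_PAGE * PAGES

print(f"Parsing {PAGES} pages: ${low_total:.2f} to ${high_total:.2f}")
```

At the table's rates, one such filing would cost roughly $0.96 to $41.44 to parse, depending on which per-page price applies.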
