PageIndex


PageIndex is a vectorless document retrieval and AI platform by VectifyAI that uses hierarchical tree indexing and LLM reasoning instead of embeddings, achieving 98.7% accuracy on FinanceBench — far surpassing standard vector RAG approaches.

98.7% FinanceBench accuracy
26K+ GitHub stars
23K+ cloud users
Under 1 minute to index a 740-page SEC filing

Overview

PageIndex (by VectifyAI) takes a fundamentally different approach to document retrieval. Standard RAG chunks documents into fragments and searches them by embedding similarity. PageIndex instead builds a hierarchical semantic tree of each document that preserves its full structure, then uses LLM reasoning to navigate that tree the way a human expert would. The reported result is 98.7% accuracy on FinanceBench (achieved by Mafin 2.5, the financial agent built on PageIndex), compared to ~31% for GPT-4o direct Q&A and ~45% for Perplexity. Every answer cites specific page and section references. Built on an open-source framework launched in September 2025 and now serving 23,000+ cloud users, it is especially suited to long, structured professional documents: SEC filings, legal contracts, clinical documentation, and financial reports.
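To make the idea of navigating a document tree concrete, here is a minimal Python sketch. It is not PageIndex's code or API: the `Node` structure, the `navigate` function, and the `keyword_chooser` stand-in (which replaces the LLM call with simple word overlap so the example runs offline) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One section of a document tree (hypothetical, not PageIndex's schema)."""
    title: str
    summary: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


def keyword_chooser(question: str, candidates: List[Node]) -> Optional[Node]:
    """Stand-in for an LLM call: pick the child sharing the most words with the question."""
    q_words = set(question.lower().split())

    def overlap(node: Node) -> int:
        text = f"{node.title} {node.summary}".lower()
        return len(q_words & set(text.split()))

    best = max(candidates, key=overlap)
    return best if overlap(best) > 0 else None


def navigate(root: Node, question: str,
             choose: Callable[[str, List[Node]], Optional[Node]] = keyword_chooser) -> List[Node]:
    """Walk from the root toward the most relevant section and return the path taken."""
    path = [root]
    node = root
    while node.children:
        nxt = choose(question, node.children)
        if nxt is None:  # no child looks relevant, so stop at the current section
            break
        path.append(nxt)
        node = nxt
    return path


def cite(path: List[Node]) -> str:
    """Format the section path and the page range of the final section."""
    leaf = path[-1]
    section = " > ".join(n.title for n in path)
    return f"{section} (pp. {leaf.start_page}-{leaf.end_page})"


# Toy tree with made-up titles and page numbers.
filing = Node("10-K", "annual report", 1, 120, [
    Node("Business", "products markets competition", 3, 20),
    Node("Financial Statements", "balance sheet income statement cash flow notes", 60, 120, [
        Node("Balance Sheet", "assets liabilities equity", 62, 63),
        Node("Income Statement", "revenue expenses net income", 64, 65),
    ]),
])

path = navigate(filing, "what was net income and revenue")
print(cite(path))  # 10-K > Financial Statements > Income Statement (pp. 64-65)
```

In a real system the `choose` callback would presumably send the question and the child titles and summaries to an LLM. Because the path is recorded at every step, the page and section citation falls out of the traversal itself.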

The Verdict

Who Should Use PageIndex?

Best For

  • Financial analysts working with SEC filings (10-K, 10-Q, 8-K) and earnings transcripts
  • Legal professionals extracting from long, structured contracts and regulatory filings
  • Enterprise teams needing auditable, citeable document retrieval without vector database infrastructure
  • Developers building RAG pipelines who want 20+ percentage point accuracy gains over standard vector search
  • Researchers and analysts working with long scientific or technical documents

Not Ideal For

  • Simple document Q&A where standard RAG accuracy is sufficient
  • Workloads requiring the lowest possible latency — tree navigation has higher latency than vector lookup
  • Teams needing transparent, publicly listed pricing without a sales conversation
  • Multi-hop queries at massive scale where per-node LLM calls create significant API cost

What's Great

  • 98.7% accuracy on FinanceBench — vs. ~31% GPT-4o and ~45% Perplexity on the same benchmark
  • No vector database infrastructure required — lower operational overhead
  • Every answer includes precise page and section citations — fully auditable
  • Document structure fully preserved — no chunk-boundary context loss
  • Processes a 740-page SEC filing in under 1 minute using ~70 LLM API calls
  • Scales to millions of documents in a single index (PageIndex File System)
  • Open-source framework with API and MCP access for developers

Watch Out For

  • Higher latency than vector search — acceptable for analyst workflows but not for sub-second retrieval
  • Each tree node requires an LLM call during indexing — costs scale with document complexity and size
  • Hybrid semantic search fallback is a future milestone, not yet available
  • Built-in retrieval and answer synthesis layer is limited — may require custom implementation
  • Pricing for cloud and enterprise tiers not publicly listed — requires account or sales contact
  • Known PDF parsing edge cases; the hosted OCR API is recommended for mission-critical work

Pricing

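The open-source framework is free to self-host. Cloud and enterprise pricing is not publicly listed and requires an account or a sales conversation.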

Core Technology

  • Vectorless retrieval — no embeddings, no vector database, no chunking
  • Hierarchical tree index — documents organized into semantic tree preserving headers, tables, footnotes
  • LLM reasoning navigation — reasons through tree structure rather than similarity search
  • Vision-native — can retrieve from page images with layout awareness
  • Explainable results — every answer cites specific page and section references
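As a rough illustration of what a hierarchical tree index can look like, the Python sketch below nests a flat outline of headings into sections with page ranges. It is a deliberate simplification: the headings are given as input rather than extracted by an LLM, and `Section`, `build_tree`, and `render` are hypothetical names, not part of PageIndex.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    title: str
    level: int
    start_page: int
    end_page: int = 0
    children: List["Section"] = field(default_factory=list)


def build_tree(outline: List[Tuple[int, str, int]], total_pages: int) -> Section:
    """Nest (level, title, start_page) entries under a synthetic root using a stack."""
    root = Section("Document", 0, 1, total_pages)
    stack = [root]
    flat: List[Section] = []
    for level, title, page in outline:
        node = Section(title, level, page)
        # Pop until the top of the stack is a shallower heading, i.e. the parent.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
        flat.append(node)
    # A section ends on the page before the next heading at the same or a shallower level.
    for i, node in enumerate(flat):
        node.end_page = total_pages
        for later in flat[i + 1:]:
            if later.level <= node.level:
                node.end_page = max(node.start_page, later.start_page - 1)
                break
    return root


def render(node: Section, depth: int = 0) -> List[str]:
    """Return the tree as indented lines with page ranges."""
    lines = [f"{'  ' * depth}{node.title} (pp. {node.start_page}-{node.end_page})"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Made-up outline: (heading level, title, first page).
outline = [
    (1, "Part I", 1),
    (2, "Item 1. Business", 3),
    (2, "Item 1A. Risk Factors", 15),
    (1, "Part II", 40),
    (2, "Item 8. Financial Statements", 60),
]
tree = build_tree(outline, total_pages=120)
print("\n".join(render(tree)))
```

Keeping page ranges on every node is what lets a retrieval step cite a specific section and page span instead of an anonymous chunk.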

Key Products

  • PageIndex Core: open-source tree indexing framework
  • PageIndex File System: massive-scale multi-document search (millions of docs)
  • Mafin 2.5: financial agent built on PageIndex (98.7% FinanceBench)
  • MCP Server: Model Context Protocol integration
  • Developer API: REST API for integration

Use Cases

  • SEC filings (10-K, 10-Q, 8-K) analysis and KPI extraction
  • Earnings call transcript analysis and period comparison
  • Legal contract review and term extraction
  • Pharmaceutical / clinical documentation
  • Investment memo drafting from source documents
  • Academic paper and technical manual retrieval

Benchmark Performance

  • FinanceBench: PageIndex 98.7% vs GPT-4o ~31% vs Perplexity ~45%
  • 20+ percentage point accuracy gains over standard vector RAG on multi-hop finance queries
  • 740-page SEC filing indexed in under 1 minute (~70 LLM calls)
  • Scales to millions of documents in a single index
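The indexing figures above imply roughly 10.6 pages per LLM call (740 pages / ~70 calls). The Python sketch below turns that into a back-of-envelope estimator. It assumes call counts scale linearly with page count, which the source does not claim, and the per-call price is a placeholder to replace with your own model pricing.

```python
# Figures quoted in the benchmark list: a 740-page filing indexed with ~70 LLM calls.
REFERENCE_PAGES = 740
REFERENCE_CALLS = 70


def estimate_calls(pages: int) -> int:
    """Estimate indexing calls, ASSUMING calls scale linearly with pages (rounded up)."""
    return (pages * REFERENCE_CALLS + REFERENCE_PAGES - 1) // REFERENCE_PAGES


def estimate_cost(pages: int, cost_per_call: float) -> float:
    """Estimate indexing cost given a hypothetical average price per LLM call."""
    return estimate_calls(pages) * cost_per_call


pages_per_call = REFERENCE_PAGES / REFERENCE_CALLS
print(f"pages per call: {pages_per_call:.1f}")                  # 10.6
print(f"calls for a 200-page document: {estimate_calls(200)}")  # 19
# 0.01 is a placeholder price per call, not a real quote.
print(f"estimated cost at $0.01 per call: ${estimate_cost(200, 0.01):.2f}")
```

Real call counts depend on document structure rather than page count alone, so treat the output as an order-of-magnitude guide.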

How It Compares

Feature                      | PageIndex               | Standard Vector RAG  | LlamaParse + RAG
FinanceBench Accuracy        | 98.7%                   | ~50-60%              | ~70-80%
Retrieval Method             | Tree + LLM reasoning    | Embedding similarity | Embedding similarity
Vector DB Required           | No                      | Yes                  | Yes
Document Structure Preserved | Fully                   | Partially (chunks)   | Partially
Answer Citations             | Page + section          | Chunk-level          | Chunk-level
Latency                      | Moderate                | Fast                 | Fast
Self-hosted                  | Yes (OSS)               | Varies               | No
Pricing                      | OSS free / Cloud custom | Infrastructure cost  | $0.0013–$0.056/page
Best For                     | Complex structured docs | General purpose      | Speed + complex PDFs
