PageIndex
PageIndex is a vectorless document retrieval and AI platform by VectifyAI that uses hierarchical tree indexing and LLM reasoning instead of embeddings. Mafin 2.5, the financial agent built on it, reports 98.7% accuracy on FinanceBench, well above the figures listed here for standard vector RAG approaches.
Overview
PageIndex (by VectifyAI) replaces the standard RAG pipeline of chunking documents into fragments and searching by embedding similarity. It builds a hierarchical semantic tree of each document that preserves the document's full structure, then uses LLM reasoning to navigate that tree the way a human expert would. The reported result is 98.7% accuracy on FinanceBench, compared to ~31% for GPT-4o direct Q&A and ~45% for Perplexity. Every answer cites specific page and section references. PageIndex is built on an open-source framework launched in September 2025 and now serves 23,000+ cloud users. It is best suited to long, structured professional documents: SEC filings, legal contracts, clinical documentation, and financial reports.
The Verdict
Who Should Use PageIndex?
Best For
- Financial analysts working with SEC filings (10-K, 10-Q, 8-K) and earnings transcripts
- Legal professionals extracting from long, structured contracts and regulatory filings
- Enterprise teams needing auditable, citeable document retrieval without vector database infrastructure
- Developers building RAG pipelines who want 20+ percentage point accuracy gains over standard vector search
- Researchers and analysts working with long scientific or technical documents
Not Ideal For
- Simple document Q&A where standard RAG accuracy is sufficient
- Workloads requiring the lowest possible latency — tree navigation has higher latency than vector lookup
- Teams needing transparent, publicly listed pricing without a sales conversation
- Multi-hop queries at massive scale where per-node LLM calls create significant API cost
What's Great
- 98.7% accuracy on FinanceBench — vs. ~31% GPT-4o and ~45% Perplexity on the same benchmark
- No vector database infrastructure required — lower operational overhead
- Every answer includes precise page and section citations — fully auditable
- Document structure fully preserved — no chunk-boundary context loss
- Indexes a 740-page SEC filing in under 1 minute using ~70 LLM API calls
- Scales to millions of documents in a single index (PageIndex File System)
- Open-source framework with API and MCP access for developers
Watch Out For
- Higher latency than vector search — acceptable for analyst workflows but not for sub-second retrieval
- Each tree node requires an LLM call during indexing — costs scale with document complexity and size
- Hybrid semantic search fallback is a future milestone, not yet available
- Built-in retrieval and answer synthesis layer is limited — may require custom implementation
- Pricing for cloud and enterprise tiers not publicly listed — requires account or sales contact
- PDF parsing edge cases acknowledged — hosted OCR API recommended for mission-critical work
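To get a feel for how indexing cost grows, the figure quoted in this review (roughly 70 LLM calls for a 740-page filing) can be turned into a back-of-the-envelope estimate. The sketch below assumes calls scale linearly with page count, which is a simplification: the real number depends on document structure. The function names and the per-call price are hypothetical placeholders, not PageIndex pricing.

```python
import math

# Reference point quoted in this review: one 740-page filing, ~70 LLM calls.
REFERENCE_PAGES = 740
REFERENCE_CALLS = 70


def estimate_indexing_calls(pages: int) -> int:
    """Estimate LLM calls for indexing, ASSUMING linear scaling with pages."""
    return math.ceil(pages * REFERENCE_CALLS / REFERENCE_PAGES)


def estimate_indexing_cost(pages: int, cost_per_call_usd: float) -> float:
    """Estimate indexing cost; cost_per_call_usd is a value you must supply."""
    return estimate_indexing_calls(pages) * cost_per_call_usd


if __name__ == "__main__":
    # 0.01 USD per call is an arbitrary illustration, not a real price.
    for pages in (740, 10_000):
        calls = estimate_indexing_calls(pages)
        cost = estimate_indexing_cost(pages, cost_per_call_usd=0.01)
        print(f"{pages} pages: about {calls} calls, about ${cost:.2f}")
```

Substituting your own model's average cost per call gives a rough budget before committing a large corpus to indexing.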
Core Technology
- Vectorless retrieval — no embeddings, no vector database, no chunking
- Hierarchical tree index — documents organized into semantic tree preserving headers, tables, footnotes
- LLM reasoning navigation — reasons through tree structure rather than similarity search
- Vision-native — can retrieve from page images with layout awareness
- Explainable results — every answer cites specific page and section references
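The tree-plus-reasoning idea in the list above can be illustrated with a short Python sketch. This is not PageIndex's actual API or data model: the `Node` fields, `navigate`, and `cite` are hypothetical names, and a simple keyword-overlap score stands in for the LLM call that would judge which branch to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One section of a document tree (field names are illustrative)."""
    title: str
    summary: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


def keyword_overlap(query: str, node: Node) -> float:
    """Stand-in for an LLM relevance judgment: counts shared words."""
    query_words = set(query.lower().split())
    node_words = set((node.title + " " + node.summary).lower().split())
    return float(len(query_words & node_words))


def navigate(root: Node, query: str,
             score: Callable[[str, Node], float] = keyword_overlap) -> Node:
    """Descend from the root, following the best-scoring child at each level."""
    current = root
    while current.children:
        current = max(current.children, key=lambda child: score(query, child))
    return current


def cite(node: Node) -> str:
    """Format a page-and-section citation for the selected node."""
    return f"{node.title} (pp. {node.start_page}-{node.end_page})"


# A toy tree; a real index would be built from the document itself.
report = Node("Annual Report", "full filing", 1, 120, [
    Node("Business", "company overview products markets", 1, 30),
    Node("Financial Statements", "income statement balance sheet cash flow",
         31, 120, [
             Node("Income Statement", "revenue net income earnings per share",
                  31, 45),
             Node("Balance Sheet", "assets liabilities equity", 46, 60),
         ]),
])

hit = navigate(report, "what was net income and revenue")
print(cite(hit))  # Income Statement (pp. 31-45)
```

Because the search walks section by section rather than ranking isolated chunks, the selected node already carries the page range needed for a citation. Swapping `keyword_overlap` for a function that asks an LLM to rate each child would bring the sketch closer to the reasoning-based navigation described above.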
Key Products
- PageIndex Core — open-source tree indexing framework
- PageIndex File System — massive-scale multi-document search (millions of docs)
- Mafin 2.5 — financial agent built on PageIndex (98.7% FinanceBench)
- MCP Server — model context protocol integration
- Developer API — REST API for integration
Use Cases
- SEC filings (10-K, 10-Q, 8-K) analysis and KPI extraction
- Earnings call transcript analysis and period comparison
- Legal contract review and term extraction
- Pharmaceutical / clinical documentation
- Investment memo drafting from source documents
- Academic paper and technical manual retrieval
Benchmark Performance
- FinanceBench: PageIndex 98.7% vs GPT-4o ~31% vs Perplexity ~45%
- 20+ percentage point accuracy gains over standard vector RAG on multi-hop finance queries
- 740-page SEC filing indexed in under 1 minute (~70 LLM calls)
- Scales to millions of documents in a single index
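The accuracy figures above are easier to compare as percentage-point gaps and as remaining error rates. The short calculation below uses only the numbers quoted in this review, treating the approximate values (~31%, ~45%) as exact for illustration; the helper names are made up for this example.

```python
# FinanceBench accuracy figures as quoted in this review (percent).
finance_bench = {"PageIndex": 98.7, "GPT-4o": 31.0, "Perplexity": 45.0}


def point_gap(higher: float, lower: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(higher - lower, 1)


def error_rate(accuracy: float) -> float:
    """Share of questions answered incorrectly, in percent."""
    return round(100.0 - accuracy, 1)


baseline = finance_bench["PageIndex"]
for name, accuracy in finance_bench.items():
    if name == "PageIndex":
        continue
    gap = point_gap(baseline, accuracy)
    print(f"PageIndex vs {name}: +{gap} points "
          f"(error rate {error_rate(baseline)}% vs {error_rate(accuracy)}%)")
```

On these figures the gap is 67.7 points over GPT-4o direct Q&A and 53.7 points over Perplexity.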
How It Compares
| Feature | PageIndex | Standard Vector RAG | LlamaParse + RAG |
|---|---|---|---|
| FinanceBench Accuracy | 98.7% | ~50–60% | ~70–80% |
| Retrieval Method | Tree + LLM reasoning | Embedding similarity | Embedding similarity |
| Vector DB Required | No | Yes | Yes |
| Document Structure Preserved | Fully | Partially (chunks) | Partially |
| Answer Citations | Page + section | Chunk-level | Chunk-level |
| Latency | Moderate | Fast | Fast |
| Self-hosted | Yes (OSS) | Varies | No |
| Pricing | OSS free / Cloud custom | Infrastructure cost | $0.0013–$0.056/page |
| Best For | Complex structured docs | General purpose | Speed + complex PDFs |