20 table-lookup questions. 14 Treasury Bulletins (~2,150 pages). Three chunking pipelines through identical embeddings. One metric: the minimum context window to recover all evidence.
AI can't reliably remember facts. And that's a problem for the multibillion-dollar industry built on top of it.
Consider a simple query: “What was the average US banknote in circulation worth in 2011?” A large language model might confidently say $10 — or $20, or $50. The uncertainty isn't a bug; it's how these models work. The correct answer — $32.70 — sits in a denomination table buried inside a 150-page Treasury Bulletin.
For any task requiring factual accuracy, one must provide the facts at query time. This is the job of Retrieval-Augmented Generation (RAG). But standard RAG pipelines chop documents into flat, fixed-length “chunks” and hope for the best:
- Orphaned headings arrive in prompts without their corresponding body text
- Split tables lose their meaning when cut mid-row or mid-column
- Broken context separates values from the row and column headers that define them
- Hallucinatory gaps emerge when models receive partial context and fill the voids with statistically likely — but factually incorrect — information
POMA's answer is deceptively simple: respect a document's native structure instead of imposing arbitrary boundaries. Like a librarian who understands that a table's meaning lives in its headers, POMA preserves the document's hierarchy before breaking it apart.
| Method | 100% Recall At | vs POMA |
|---|---|---|
| Naive Chunking (500/100) | 1.45M tokens | POMA uses 77% less |
| Unstructured.io | 1.48M tokens | POMA uses 77% less |
| POMA Chunksets | 340K tokens | baseline |
“Worst-case budget” = the context window needed to cover the evidence for the hardest question. With a ~340K-token context limit, POMA can cover the evidence for every question; the baselines need ~1.45–1.48M tokens.
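The two budget metrics used in this README (worst case here, aggregate total further below) are simple reductions over the per-question minimum budgets. A minimal sketch with made-up numbers, not the benchmark's actual values:

```python
# Worst-case vs. total budget, derived from per-question minimum budgets.
# The per-question token counts below are illustrative only.

def worst_case_budget(min_budgets):
    """Context window needed to cover the hardest single question."""
    return max(min_budgets)

def total_budget(min_budgets):
    """Aggregate context 'bought' across all questions."""
    return sum(min_budgets)

per_question = [20_000, 55_000, 340_000, 15_000]  # tokens per question
print(worst_case_budget(per_question))  # 340000
print(total_budget(per_question))       # 430000
```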
Treasury Bulletins are structured objects: multi-column tables with spanning headers, footnotes defining abbreviations, section titles that disambiguate which statistic you're reading.
When you run `RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)`, you're hoping the retriever finds your needle. But what if the chunk containing "5.47%" got separated from the row header saying "1996"?
This is context fragmentation: not failure to retrieve something, but failure to retrieve the meaning with the number.
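The failure mode is easy to reproduce. A stdlib-only sketch of fixed-size splitting with overlap (character-based here for brevity; the benchmark's naive baseline splits on tokens):

```python
# Illustrative fixed-size chunker with overlap. Character-based for
# simplicity; the actual baseline counts tokens, not characters.
def split_fixed(text: str, size: int, overlap: int) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

table = "Year | Avg yield\n1995 | 5.12%\n1996 | 5.47%\n1997 | 5.63%"
chunks = split_fixed(table, size=30, overlap=5)
# The last chunk is just "5.63%": the value survives, but its "1997"
# row label and the "Avg yield" column header are in other chunks.
for c in chunks:
    print(repr(c))
```

With these sizes the trailing chunk is the bare string `"5.63%"` — the number is retrievable, its meaning is not.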
Not all questions are equally hard. This heatmap shows the context window needed for each question (log scale):
Green = small context needed. Red = large context needed.
POMA (top row) is consistently greener, with far fewer outliers. Between the two baselines, very coarse (naive) chunking and very fine-grained element extraction either struggle on the same questions or trade off against each other, one excelling exactly where the other fails.
POMA's worst case (340K, dashed line) is lower than any baseline's worst case by a factor of ~4×.
Sum of per-question minimum budgets (how much context you "buy" in aggregate). This table includes POMA Mixed, which re-ranks chunksets using a weighted blend of chunkset and chunk similarity (see Methods Compared for details):
| Method | Total Tokens (all 20 questions) |
|---|---|
| Unstructured.io | 6.55M |
| Naive Chunking | 5.78M |
| POMA Chunksets | 1.35M |
| POMA Mixed | 1.18M |
- Documents: 14 U.S. Treasury Bulletins (1939-2020), ~2,150 pages
- Questions: 20 table-lookup questions from OfficeQA
- Source: FRASER Digital Library
The benchmark compares three chunking strategies head-to-head, plus an optimized POMA variant:
| Method | Approach | Retrieval Units |
|---|---|---|
| Naive Chunking | Token-based `RecursiveCharacterTextSplitter(500, 100)` | ~9K chunks |
| Unstructured.io | Element-based extraction (by_title strategy) | ~26K elements |
| POMA Chunksets | Hierarchical chunks grouped into semantic bundles | ~21K chunksets |
| POMA Mixed | Chunksets re-ranked by a weighted blend of chunkset + chunk similarity | same |
- POMA Chunksets is the apples-to-apples baseline — same retrieval logic as the others, no tuning.
- POMA Mixed exploits POMA's dual output (chunks and chunksets) to re-rank results. It uses a tunable weight (`w=0.84` here) and is dataset-specific, so the headline comparison uses Chunksets.
All methods use identical embeddings (text-embedding-3-large) and identical evaluation logic. The only variable is how documents are represented/chunked.
We do not claim that 500/100 are LangChain defaults; they are a common token-based baseline choice used in many RAG implementations.
- Naive baseline: 500-token chunks with 100-token overlap (20%), token-counted with `cl100k_base`
For reference, managed retrievers use larger chunks and/or higher overlap:
- OpenAI File Search: 800-token chunks, 400-token overlap (docs)
- Google Vertex AI RAG Engine: 1024-token chunks, 256-token overlap (docs)
- Tokenizer: `tiktoken` with `cl100k_base`
- Counted text: raw chunk text only (no metadata, no prompt templates)
- Deduplication: Identical text blocks counted once
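The counting rules above can be sketched in a few lines. The tokenizer here is a deliberate stand-in (whitespace split); the benchmark itself counts with `tiktoken`'s `cl100k_base`:

```python
# Sketch of the counting rules: raw chunk text only, identical text
# blocks counted once. count_tokens is a placeholder — the benchmark
# uses tiktoken's cl100k_base encoding, not a whitespace split.
def count_tokens(text: str) -> int:
    return len(text.split())  # placeholder tokenizer

def budget_used(chunks: list[str]) -> int:
    seen, total = set(), 0
    for chunk in chunks:
        if chunk in seen:      # dedup: identical text counted once
            continue
        seen.add(chunk)
        total += count_tokens(chunk)
    return total

print(budget_used(["a b c", "d e", "a b c"]))  # 5 — duplicate skipped
```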
1. Embed the question using `text-embedding-3-large`
2. Rank ALL chunks by cosine similarity (no early stopping)
3. Accumulate chunks in rank order, deduplicating identical content
4. Binary search to find the minimum budget at which all gold indices are present (5K-token precision)
We use exact chunk indices as ground truth, not text patterns:
```json
{
  "UID0184": {
    "found_in": {"treasury_bulletin_1948_03": [524]},
    "needle_values": ["1.25"]
  }
}
```

This eliminates false positives (e.g., "5.47%" appearing in the wrong year's row). Gold sets live in `data/evidence_requirements.json`.
- Same embedding model for all methods
- Same retrieval logic (cosine similarity ranking)
- Pre-registered inclusion rule: questions are excluded if any method cannot locate the evidence
We excluded questions where baselines had extraction failures (OCR errors, missing values). The 20 included questions are answerable by all methods.
```bash
git clone https://github.com/poma-ai/poma-officeqa.git
cd poma-officeqa
pip install -r requirements.txt
python benchmark.py
```

Embeddings are pre-computed in `vectors/`. Set `OPENAI_API_KEY` only if you want to recompute.

```bash
python benchmark.py --method poma_openai_chunksets
python benchmark.py --method poma_openai_mixed
python benchmark.py --method databricks_rcs_openai
python benchmark.py --method unstructured_openai
python visualize_results.py
```

POMA produces both chunks and chunksets. Mixed re-ranks chunksets by blending both similarity signals:
```
score = w * chunkset_sim + (1 - w) * max(chunk_sims_in_chunkset)
```
With w=0.84, POMA Mixed achieves 100% recall at 250K tokens — an 83% reduction over baselines and 26% over Chunksets alone:
```bash
python benchmark.py --method poma_openai_mixed --weight 0.84 --agg max
```

The weight is tunable per use case; 0.84 was optimized for this dataset. For a parameter-free comparison, use POMA Chunksets.
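The blend itself is one line. A sketch with made-up similarity values, using `max` aggregation over the chunks inside a chunkset:

```python
# Mixed score: blend chunkset-level similarity with the best chunk-level
# similarity inside that chunkset. Similarity values below are made up.
def mixed_score(chunkset_sim: float, chunk_sims: list[float],
                w: float = 0.84) -> float:
    return w * chunkset_sim + (1 - w) * max(chunk_sims)

# A chunkset that is mediocre overall but contains one highly relevant
# chunk gets pulled up by the (1 - w) term.
print(round(mixed_score(0.50, [0.20, 0.95]), 3))  # 0.572
```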
```
poma-officeqa/
├── benchmark.py                     # Main benchmark runner
├── embed_all.py                     # Embedding script
├── visualize_results.py             # Generate charts
├── bench/                           # Core benchmark modules
├── data/
│   ├── poma/                        # 14 .poma files (POMA output)
│   ├── databricks/                  # 14 .txt files (naive chunking input)
│   ├── unstructured/                # 14 .json files (Unstructured.io output)
│   ├── officeqa.csv                 # 20 questions
│   └── evidence_requirements.json   # Ground truth indices
├── vectors/                         # Pre-computed embeddings (~1 GB)
└── results/
    ├── bench_results/               # JSON outputs
    └── visualizations/              # Charts (used in this README)
```
- Product: https://www.poma-ai.com
- Docs: https://www.poma-ai.com/docs/
- API: https://api.poma-ai.com/api/v1/docs
- Pricing: https://www.poma-ai.com/pricing
- Chunksets & Cheatsheets: How POMA Solves AI's Chunking Puzzle
- RAG Chunking Strategies & Text Splitters
- Document Ingestion & Chunking for RAG
- X: POMA_AI
- LinkedIn: POMA Science
- GitHub: poma-ai
```bibtex
@software{poma_officeqa_2026,
  title  = {POMA-OfficeQA: Benchmarking POMA AI RAG Chunking against Conventional Chunking and Unstructured.io},
  author = {POMA AI GmbH, Berlin},
  year   = {2026},
  url    = {https://github.com/poma-ai/poma-officeqa},
  note   = {Benchmark showing 77\% token reduction with semantic hierarchical chunking}
}
```

MIT License. See LICENSE.






