Benchmarking POMA AI RAG Chunking against Conventional Chunking and Unstructured.io

Python 3.10+ · License: MIT · Dataset · Docs

POMA uses 77% fewer tokens to achieve 100% context recall.

20 table-lookup questions. 14 Treasury Bulletins (~2,150 pages). Three chunking pipelines through identical embeddings. One metric: the minimum context window to recover all evidence.

Token Comparison


Why This Matters

AI can't remember facts. And that's a problem for the multibillion-dollar industry built on top of it.

Consider a simple query: “What was the average US banknote in circulation worth in 2011?” A large language model might confidently say $10 — or $20, or $50. The uncertainty isn't a bug; it's how these models work. The correct answer — $32.70 — sits in a denomination table buried inside a 150-page Treasury Bulletin.

For any task requiring factual accuracy, one must provide the facts at query time. This is the job of Retrieval-Augmented Generation (RAG). But standard RAG pipelines chop documents into flat, fixed-length “chunks” and hope for the best:

  • Orphaned headings arrive in prompts without their corresponding text
  • Split tables lose their integrity when cut in half
  • Broken context occurs when values get separated from their defining row and column headers
  • Hallucinatory gaps emerge when models receive partial context and fill voids with statistically likely — but factually incorrect — information

POMA's answer is deceptively simple: respect a document's native structure instead of imposing arbitrary boundaries. Like a librarian who understands that a table's meaning lives in its headers, POMA preserves the document's hierarchy before breaking it apart.


Key Result

Accuracy Curves

Method                      100% Recall At   vs POMA
Naive Chunking (500/100)    1.45M tokens     POMA uses 77% less
Unstructured.io             1.48M tokens     POMA uses 77% less
POMA Chunksets              340K tokens      baseline

“Worst-case budget” = the context window needed to cover evidence for the hardest question. At a ~340K maximum context limit, POMA can cover evidence for every question; baselines need ~1.45–1.48M.

Token Comparison


The Problem: Complex Information Doesn't Survive Naive Chunking

Treasury Bulletins are structured objects: multi-column tables with spanning headers, footnotes defining abbreviations, section titles that disambiguate which statistic you're reading.

When you run RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100), you're hoping the retriever finds your needle. But what if the chunk containing "5.47%" got separated from the row header saying "1996"?

This is context fragmentation: not failure to retrieve something, but failure to retrieve the meaning with the number.
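The fragmentation above can be reproduced in a few lines of Python. This is a minimal stand-in for a fixed-length splitter, not the actual LangChain class (overlap is omitted for clarity); the toy table and cut point are illustrative:

```python
# Minimal stand-in for a fixed-length splitter (not the real
# RecursiveCharacterTextSplitter): cut every `chunk_size` characters,
# ignoring document structure entirely.
def naive_split(text: str, chunk_size: int) -> list[str]:
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

table = (
    "Average Yield of Long-Term Treasury Bonds\n"
    "Year    Yield\n"
    "1995    6.93%\n"
    "1996    5.47%\n"
)

chunks = naive_split(table, chunk_size=60)
# chunk 0 ends with the orphaned row label "1995"; chunk 1 holds "6.93%"
# with no year and no table title to give it meaning.
for i, c in enumerate(chunks):
    print(f"--- chunk {i} ---\n{c!r}")
```

Overlap (the 100 tokens in the real baseline) shifts where the cut falls, but it cannot guarantee that a value, its row label, and the table title ever share a chunk.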


Results in Detail

Per-Question Token Requirements

Not all questions are equally hard. This heatmap shows the context window needed for each question (log scale):

Per-Question Heatmap

Green = small context needed. Red = large context needed.

POMA (top row) is consistently greener, with far fewer outliers. The two baselines show that very coarse (naive) chunking and very fine-grained extraction either face the same challenges or trade off against each other, one excelling exactly where the other fails.

Distribution of Token Requirements

Token Distribution

POMA's worst case (340K, dashed line) is roughly 4× lower than either baseline's worst case.

Total Context Cost

Sum of per-question minimum budgets (how much context you "buy" in aggregate). This table includes POMA Mixed, which re-ranks chunksets using a weighted blend of chunkset and chunk similarity (see Methods Compared for details):

Method            Total Tokens (all 20 questions)
Unstructured.io   6.55M
Naive Chunking    5.78M
POMA Chunksets    1.35M
POMA Mixed        1.18M

Cumulative Context Growth

Cumulative Context

Token Heaps (area proportional to tokens)

Token Heaps


Dataset

Methods Compared

The benchmark compares three chunking strategies head-to-head, plus an optimized POMA variant:

Method            Approach                                                                  Retrieval Units
Naive Chunking    Token-based RecursiveCharacterTextSplitter(500, 100)                      ~9K chunks
Unstructured.io   Element-based extraction (by_title strategy)                              ~26K elements
POMA Chunksets    Hierarchical chunks grouped into semantic bundles                         ~21K chunksets
POMA Mixed        Chunksets re-ranked by a weighted blend of chunkset + chunk similarity    same
  • POMA Chunksets is the apples-to-apples baseline — same retrieval logic as the others, no tuning.
  • POMA Mixed exploits POMA's dual output (chunks and chunksets) to re-rank results. It uses a tunable weight (w=0.84 here) and is dataset-specific, so the headline comparison uses Chunksets.

All methods use identical embeddings (text-embedding-3-large) and identical evaluation logic. The only variable is how documents are represented/chunked.


Baseline Chunking Parameters

We do not claim that 500/100 are LangChain defaults; they are a common token-based baseline choice used in many RAG implementations.

  • Naive baseline: 500-token chunks with 100-token overlap (20%), token-counted with cl100k_base

For reference, managed retrievers use larger chunks and/or higher overlap:

  • OpenAI File Search: 800-token chunks, 400-token overlap (docs)
  • Google Vertex AI RAG Engine: 1024-token chunks, 256-token overlap (docs)

Methodology

Token Counting

  • Tokenizer: tiktoken with cl100k_base
  • Counted text: Raw chunk text only (no metadata, no prompt templates)
  • Deduplication: Identical text blocks counted once
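The counting rule can be sketched as follows. This is an illustrative sketch, not the benchmark's code: a whitespace tokenizer stands in for tiktoken's cl100k_base, and the function names are assumptions.

```python
def count_tokens(text: str) -> int:
    # Whitespace stand-in for tiktoken: len(enc.encode(text)) in the real setup.
    return len(text.split())

def budget_for(chunks: list[str]) -> int:
    """Total tokens across retrieved chunks, counting identical text once."""
    seen, total = set(), 0
    for chunk in chunks:
        if chunk not in seen:       # deduplication: identical blocks counted once
            seen.add(chunk)
            total += count_tokens(chunk)
    return total

retrieved = ["Year Yield", "1996 5.47%", "Year Yield"]  # one duplicate header
print(budget_for(retrieved))  # the duplicate is charged only once
```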

Evaluation Algorithm

  1. Embed the question using text-embedding-3-large
  2. Rank ALL chunks by cosine similarity (no early stopping)
  3. Accumulate chunks in rank order, deduplicating identical content
  4. Binary search to find the minimum budget at which all gold indices are present (5K-token precision)
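The steps above can be sketched in pure Python. This is a simplified sketch, not the benchmark code: the binary search of step 4 is replaced by a linear sweep that rounds the spend up to the 5K grid, deduplication is omitted, and all names are illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def min_budget(q_vec, chunk_vecs, chunk_tokens, gold, step=5_000):
    # Step 2: rank ALL chunks by similarity to the question (no early stopping).
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(q_vec, chunk_vecs[i]),
                   reverse=True)
    # Steps 3-4: accumulate in rank order until every gold index is covered,
    # then round the spend up to the 5K-token grid.
    spent, found = 0, set()
    for i in order:
        spent += chunk_tokens[i]
        found.add(i)
        if gold <= found:
            return math.ceil(spent / step) * step
    return None  # evidence not reachable by this method
```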

Why Index-Based Evaluation?

We use exact chunk indices as ground truth, not text patterns:

{
  "UID0184": {
    "found_in": {"treasury_bulletin_1948_03": [524]},
    "needle_values": ["1.25"]
  }
}

This eliminates false positives (e.g., "5.47%" appearing in the wrong year's row). Gold sets live in data/evidence_requirements.json.
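A coverage check against such a gold entry might look like the sketch below (the helper name is illustrative, not from the repo). The point is that "1.25" appearing in some other chunk never counts; only the exact chunk index does.

```python
import json

# Gold entry in the shape shown above.
gold = json.loads("""{
  "UID0184": {
    "found_in": {"treasury_bulletin_1948_03": [524]},
    "needle_values": ["1.25"]
  }
}""")

def evidence_covered(uid: str, retrieved_by_doc: dict, gold: dict) -> bool:
    """True only if every gold chunk index was retrieved from its document."""
    return all(set(idxs) <= set(retrieved_by_doc.get(doc, []))
               for doc, idxs in gold[uid]["found_in"].items())

# Retrieving chunk 524 of this bulletin satisfies the question;
# a chunk elsewhere that merely contains "1.25" does not.
print(evidence_covered("UID0184", {"treasury_bulletin_1948_03": [523, 524]}, gold))
```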

Fairness Guarantees

  • Same embedding model for all methods
  • Same retrieval logic (cosine similarity ranking)
  • Pre-registered inclusion rule: a question is excluded if any method cannot find its evidence

We excluded questions where baselines had extraction failures (OCR errors, missing values). The 20 included questions are answerable by all methods.


Quick Start

git clone https://github.com/poma-ai/poma-officeqa.git
cd poma-officeqa
pip install -r requirements.txt
python benchmark.py

Embeddings are pre-computed in vectors/. Set OPENAI_API_KEY only if you want to recompute.

Run Specific Methods

python benchmark.py --method poma_openai_chunksets
python benchmark.py --method poma_openai_mixed
python benchmark.py --method databricks_rcs_openai
python benchmark.py --method unstructured_openai

Regenerate Visualizations

python visualize_results.py

Going Further: POMA Mixed

POMA produces both chunks and chunksets. Mixed re-ranks chunksets by blending both similarity signals:

score = w * chunkset_sim + (1-w) * max(chunk_sims_in_chunkset)

With w=0.84, POMA Mixed achieves 100% recall at 250K tokens — an 83% reduction over baselines and 26% over Chunksets alone:

python benchmark.py --method poma_openai_mixed --weight 0.84 --agg max

The weight is tunable per use case; 0.84 was optimized for this dataset. For a parameter-free comparison, use POMA Chunksets.
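The scoring line above translates directly into a small helper; this is a sketch of the formula, with illustrative names rather than the repo's actual code:

```python
def mixed_score(chunkset_sim: float, chunk_sims: list[float], w: float = 0.84) -> float:
    """Blend chunkset-level similarity with the best chunk inside it (--agg max)."""
    return w * chunkset_sim + (1 - w) * max(chunk_sims)

# A chunkset that is mediocre overall but contains one highly relevant
# chunk gets pulled up by the (1 - w) * max(...) term.
print(mixed_score(0.50, [0.20, 0.95]))
```

With `w=1.0` this degenerates to plain chunkset ranking; with `w=0.0` it ranks chunksets purely by their single best chunk.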


Repository Structure

poma-officeqa/
├── benchmark.py              # Main benchmark runner
├── embed_all.py              # Embedding script
├── visualize_results.py      # Generate charts
├── bench/                    # Core benchmark modules
├── data/
│   ├── poma/                 # 14 .poma files (POMA output)
│   ├── databricks/           # 14 .txt files (naive chunking input)
│   ├── unstructured/         # 14 .json files (Unstructured.io output)
│   ├── officeqa.csv          # 20 questions
│   └── evidence_requirements.json  # Ground truth indices
├── vectors/                  # Pre-computed embeddings (~1 GB)
└── results/
    ├── bench_results/        # JSON outputs
    └── visualizations/       # Charts (used in this README)

Try POMA

Background Reading


Citation

@software{poma_officeqa_2026,
  title   = {POMA-OfficeQA: Benchmarking POMA AI RAG Chunking against Conventional Chunking and Unstructured.io},
  author  = {POMA AI GmbH, Berlin},
  year    = {2026},
  url     = {https://github.com/poma-ai/poma-officeqa},
  note    = {Benchmark showing 77\% token reduction with semantic hierarchical chunking}
}

License

MIT License — see LICENSE
