Skip to content

helix-ai-tech/akasha

Repository files navigation

akasha

high-performance pdf intelligence engine

ci crates.io docs license

problem

current pdf extraction requires stitching together multiple libraries with heavy dependencies. you need pdfplumber for text, camelot for tables (with opencv), tesseract for ocr. the result: 500mb+ of dependencies, version conflicts, and mediocre performance.

solution

akasha: single rust library, zero external dependencies, understands document structure.

# before: dependency hell
import pdfplumber  
import camelot     # needs opencv
import pytesseract # needs tesseract binary
import fitz        # needs mupdf

# after: one import
from akasha import Akasha
doc = Akasha().extract_all("document.pdf")

benchmarks

100-page pdf extraction:

akasha:     1.8s   45mb ram   98.5% accuracy   0 dependencies
pdfplumber: 12.3s  450mb ram  92.1% accuracy   15+ dependencies  
camelot:    18.7s  380mb ram  94.3% accuracy   20+ dependencies
llamaparse: 25.4s  890mb ram  96.2% accuracy   10+ dependencies

core features

  • unified extraction - text, tables, images, structure in one pass
  • structure-aware chunking - respects semantic boundaries for rag
  • source tracking - every extraction traced to exact coordinates
  • confidence scoring - reliability metrics on all outputs
  • parallel by default - uses all cores via rayon
  • streaming support - process while downloading

installation

rust (primary)

# Cargo.toml
[dependencies]
akasha = "0.1"

# or with cargo-edit
cargo add akasha

# with specific features
akasha = { version = "0.1", features = ["ocr", "ml-models"] }

python bindings

pip install akasha

node bindings

npm install akasha

docker

docker run -p 8080:8080 sybilstudio/akasha

quickstart

from akasha import Akasha

# extract everything
doc = Akasha().extract_all("document.pdf")

# tables with confidence
for table in doc.tables:
    if table.confidence > 0.9:
        df = table.to_pandas()
        
# smart chunking for rag
chunks = doc.chunk(strategy="semantic", max_tokens=512)

# export formats
markdown = doc.to_markdown()
json_data = doc.to_json()

rust usage

use akasha::{Akasha, Config};

let config = Config::builder()
    .parallel(true)
    .confidence_threshold(0.9)
    .build();

let akasha = Akasha::with_config(config);
let doc = akasha.extract_file("document.pdf")?;

// process tables
for table in &doc.tables {
    println!("extracted {} cells at {:.1}% confidence", 
             table.cells.len(), 
             table.confidence * 100.0);
}

architecture

three design principles:

  1. unified intelligence - single api understands document structure
  2. zero-cost abstractions - rust performance with high-level api
  3. production first - confidence scores, error handling, observability

extraction pipeline

pdf β†’ parse β†’ detect regions β†’ extract parallel {
    β†’ text (with positions)
    β†’ tables (lattice + stream)
    β†’ images (with ocr fallback)
    β†’ structure (semantic tree)
} β†’ merge β†’ validate β†’ output

key modules

  • akasha-core - rust extraction engine
  • akasha-py - python bindings via pyo3
  • akasha-js - node bindings via napi-rs
  • akasha-server - http api for scaling

advanced

structure-aware chunking

# never splits tables or paragraphs mid-content
chunks = doc.chunk(
    strategy="semantic",  # respects document structure
    max_tokens=512,      # embedding model limit
    overlap=50,          # context preservation
    preserve_tables=True # keep tables intact
)

# each chunk includes metadata
for chunk in chunks:
    print(f"content: {chunk.content}")
    print(f"location: page {chunk.page}, bbox {chunk.bbox}")
    print(f"context: {' > '.join(chunk.breadcrumb)}")
    print(f"confidence: {chunk.confidence}")

custom models

# bring your own onnx models
akasha = Akasha(
    table_model="models/custom_table.onnx",
    ocr_model="models/custom_ocr.onnx"
)

self-hosting

# docker-compose.yml
version: '3.8'
services:
  akasha:
    image: sybilstudio/akasha:latest
    environment:
      - WORKERS=auto
      - CACHE_SIZE=1000
    volumes:
      - ./models:/models
    ports:
      - "8080:8080"

performance details

optimizations that matter:

  • memory pooling - reuse allocations across pages
  • simd text processing - vectorized string operations
  • parallel extraction - rayon splits work across cores
  • lazy evaluation - compute only requested fields
  • zero-copy parsing - minimal allocations

profiled with 1000+ real pdfs:

p50 latency: 18ms per page
p99 latency: 45ms per page  
memory: 0.45mb per page
cpu: scales linearly with cores

limitations

  • scanned pdfs require ocr (slower path)
  • complex layouts may need manual hints
  • formula extraction in beta
  • rtl languages experimental

roadmap

  • core extraction engine
  • table detection (lattice + stream)
  • structure-aware chunking
  • python/js bindings
  • formula extraction (latex output)
  • chart data extraction
  • incremental processing
  • distributed mode

contributing

# setup
git clone https://github.com/sybil-studio/akasha
cd akasha
cargo build --release

# test
cargo test
cargo bench

# submit pr
git checkout -b feature
git commit -m "feat: description"
git push origin feature

see contributing.md for details.

license

apache 2.0

acknowledgments

built on solid foundations:

  • lopdf for pdf parsing
  • rayon for parallelism
  • pyo3 for python bindings

built by

sybil studio

About

πŸ¦€ a high-performance, AI-first PDF intelligence engine in Rust. powers structure-aware chunking for RAG, advanced table extraction, and universal language bindings.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors