akasha

high-performance pdf intelligence engine

problem

current pdf extraction requires stitching together multiple libraries with heavy dependencies. you need pdfplumber for text, camelot for tables (with opencv), tesseract for ocr. the result: 500mb+ of dependencies, version conflicts, and mediocre performance.

solution

akasha: single rust library, zero external dependencies, understands document structure.

# before: dependency hell
import pdfplumber  
import camelot     # needs opencv
import pytesseract # needs tesseract binary
import fitz        # needs mupdf

# after: one import
from akasha import Akasha
doc = Akasha().extract_all("document.pdf")

benchmarks

100-page pdf extraction:

akasha:     1.8s   45mb ram   98.5% accuracy   0 dependencies
pdfplumber: 12.3s  450mb ram  92.1% accuracy   15+ dependencies  
camelot:    18.7s  380mb ram  94.3% accuracy   20+ dependencies
llamaparse: 25.4s  890mb ram  96.2% accuracy   10+ dependencies

core features

unified extraction - text, tables, images, structure in one pass
structure-aware chunking - respects semantic boundaries for rag
source tracking - every extraction traced to exact coordinates
confidence scoring - reliability metrics on all outputs
parallel by default - uses all cores via rayon
streaming support - process while downloading

installation

rust (primary)

# Cargo.toml
[dependencies]
akasha = "0.1"

# or with cargo-edit
cargo add akasha

# with specific features
akasha = { version = "0.1", features = ["ocr", "ml-models"] }

python bindings

pip install akasha

node bindings

npm install akasha

docker

docker run -p 8080:8080 sybilstudio/akasha

quickstart

from akasha import Akasha

# extract everything
doc = Akasha().extract_all("document.pdf")

# tables with confidence
for table in doc.tables:
    if table.confidence > 0.9:
        df = table.to_pandas()
        
# smart chunking for rag
chunks = doc.chunk(strategy="semantic", max_tokens=512)

# export formats
markdown = doc.to_markdown()
json_data = doc.to_json()

rust usage

use akasha::{Akasha, Config};

let config = Config::builder()
    .parallel(true)
    .confidence_threshold(0.9)
    .build();

let akasha = Akasha::with_config(config);
let doc = akasha.extract_file("document.pdf")?;

// process tables
for table in &doc.tables {
    println!("extracted {} cells at {:.1}% confidence", 
             table.cells.len(), 
             table.confidence * 100.0);
}

architecture

three design principles:

unified intelligence - single api understands document structure
zero-cost abstractions - rust performance with high-level api
production first - confidence scores, error handling, observability

extraction pipeline

pdf → parse → detect regions → extract parallel {
    → text (with positions)
    → tables (lattice + stream)
    → images (with ocr fallback)
    → structure (semantic tree)
} → merge → validate → output

key modules

akasha-core - rust extraction engine
akasha-py - python bindings via pyo3
akasha-js - node bindings via napi-rs
akasha-server - http api for scaling

advanced

structure-aware chunking

# never splits tables or paragraphs mid-content
chunks = doc.chunk(
    strategy="semantic",  # respects document structure
    max_tokens=512,      # embedding model limit
    overlap=50,          # context preservation
    preserve_tables=True # keep tables intact
)

# each chunk includes metadata
for chunk in chunks:
    print(f"content: {chunk.content}")
    print(f"location: page {chunk.page}, bbox {chunk.bbox}")
    print(f"context: {' > '.join(chunk.breadcrumb)}")
    print(f"confidence: {chunk.confidence}")

custom models

# bring your own onnx models
akasha = Akasha(
    table_model="models/custom_table.onnx",
    ocr_model="models/custom_ocr.onnx"
)

self-hosting

# docker-compose.yml
version: '3.8'
services:
  akasha:
    image: sybilstudio/akasha:latest
    environment:
      - WORKERS=auto
      - CACHE_SIZE=1000
    volumes:
      - ./models:/models
    ports:
      - "8080:8080"

performance details

optimizations that matter:

memory pooling - reuse allocations across pages
simd text processing - vectorized string operations
parallel extraction - rayon splits work across cores
lazy evaluation - compute only requested fields
zero-copy parsing - minimal allocations

profiled with 1000+ real pdfs:

p50 latency: 18ms per page
p99 latency: 45ms per page  
memory: 0.45mb per page
cpu: scales linearly with cores

limitations

scanned pdfs require ocr (slower path)
complex layouts may need manual hints
formula extraction in beta
rtl languages experimental

roadmap

contributing

# setup
git clone https://github.com/sybil-studio/akasha
cd akasha
cargo build --release

# test
cargo test
cargo bench

# submit pr
git checkout -b feature
git commit -m "feat: description"
git push origin feature

see contributing.md for details.

license

apache 2.0

acknowledgments

built on solid foundations:

lopdf for pdf parsing
rayon for parallelism
pyo3 for python bindings

built by

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.ebss		.ebss
.github/workflows		.github/workflows
EBSS/docs		EBSS/docs
akasha-core		akasha-core
akasha-js		akasha-js
akasha-py		akasha-py
akasha-server		akasha-server
akasha-wasm		akasha-wasm
ebss-portable		ebss-portable
references		references
scripts		scripts
to-do/2025/august		to-do/2025/august
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.toml		Cargo.toml
EBSS-PORTABLE-SETUP.md		EBSS-PORTABLE-SETUP.md
LICENSE		LICENSE
README.md		README.md
ebss-portable-20250821.tar.gz		ebss-portable-20250821.tar.gz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

akasha

problem

solution

benchmarks

core features

installation

rust (primary)

python bindings

node bindings

docker

quickstart

rust usage

architecture

extraction pipeline

key modules

advanced

structure-aware chunking

custom models

self-hosting

performance details

limitations

roadmap

contributing

license

acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

akasha

problem

solution

benchmarks

core features

installation

rust (primary)

python bindings

node bindings

docker

quickstart

rust usage

architecture

extraction pipeline

key modules

advanced

structure-aware chunking

custom models

self-hosting

performance details

limitations

roadmap

contributing

license

acknowledgments

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages