Tiny ICF: Compressed Character-Level Word Frequency Estimation

A tiny neural network (<50k parameters) that predicts a word's ICF score, a normalized measure of rarity (0.0 = common, 1.0 = rare), from its character patterns. This enables zero-shot token filtering and weighting without large frequency dictionaries.
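
As an illustration of what filtering and weighting by ICF could look like downstream, the sketch below uses a hand-written score table in place of model output. The scores, the `filter_tokens` and `weight_tokens` helpers, and the 0.3 threshold are all hypothetical and are not part of the tiny_icf package; the only convention taken from this README is that 0.0 means common and 1.0 means rare.

```python
from typing import Dict, List, Tuple

# Hypothetical ICF scores standing in for model predictions
# (0.0 = common, 1.0 = rare). The values are made up for illustration.
EXAMPLE_SCORES: Dict[str, float] = {
    "the": 0.02,
    "of": 0.03,
    "sound": 0.45,
    "xylophone": 0.85,
}


def filter_tokens(tokens: List[str], scores: Dict[str, float],
                  threshold: float = 0.3) -> List[str]:
    """Keep tokens whose score is at least `threshold`, dropping common words."""
    # Tokens without a score are treated as rare and kept (arbitrary choice).
    return [t for t in tokens if scores.get(t, 1.0) >= threshold]


def weight_tokens(tokens: List[str],
                  scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Pair each token with its score, rescaled so the weights sum to 1."""
    raw = [scores.get(t, 1.0) for t in tokens]
    total = sum(raw)
    if total == 0:
        return [(t, 0.0) for t in tokens]
    return [(t, w / total) for t, w in zip(tokens, raw)]


tokens = ["the", "sound", "of", "the", "xylophone"]
kept = filter_tokens(tokens, EXAMPLE_SCORES)
weights = weight_tokens(tokens, EXAMPLE_SCORES)
print(kept)
print(weights)
```

With real predictions, the score table would be replaced by the model's output for each token; the threshold would need tuning per application.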

Quick Start

# Install dependencies
uv sync

# Train a model
uv run python -m tiny_icf.train --data data/word_frequency.csv

# Predict ICF scores
uv run python -m tiny_icf.predict --model models/model.pt --words "the xylophone qzxbjk"

Features

Tiny: <50k parameters (~160 KB)
Fast: <1 ms inference per word
Universal: Works with any UTF-8 language
Generalizes: Handles unseen words, typos, neologisms
Multi-task: ICF prediction, language detection, temporal analysis, text reduction
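
The size figure can be sanity-checked with simple arithmetic. The sketch assumes the weights are stored as 32-bit floats (4 bytes each), which this README does not state; under that assumption, ~160 KB corresponds to roughly 40,000 parameters, and the 50k upper bound to about 200 KB.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats


def model_size_kb(num_params: int,
                  bytes_per_param: int = BYTES_PER_FLOAT32) -> float:
    """Return the raw weight size in kilobytes (1 KB = 1000 bytes)."""
    return num_params * bytes_per_param / 1000


size_at_40k = model_size_kb(40_000)  # parameter count matching ~160 KB
size_at_50k = model_size_kb(50_000)  # the stated upper bound
print(size_at_40k, size_at_50k)
```

File-format overhead and any lower-precision storage would shift these numbers, so treat them as a rough estimate only.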

Documentation

See docs/ for detailed documentation:

PROJECT_OVERVIEW.md - What we're building and why
TECHNICAL_MATHEMATICAL_DESCRIPTION.md - Mathematical formulation
INFORMATION_THEORETIC_CONSTRAINTS.md - Kolmogorov complexity analysis
ALL_TASKS_STRUCTURE_ANALYSIS.md - Analysis of all tasks

Tasks

ICF Prediction: word → ICF score (0.0=common, 1.0=rare)
Text Reduction: Optimize word dropping to minimize embedding regret
Temporal ICF: Predict ICF across decades (1800s, 1900s, 2000s)
Language Detection: Detect language from character patterns
Era Classification: Classify historical era (archaic, modern, contemporary)
Multi-Task: Unified model for all tasks with AMOO
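
One way to obtain an ICF target in the 0.0 to 1.0 range is to take the log of the inverse relative frequency and divide it by its largest possible value. This is an assumed formulation for illustration only; the project's own definition is in TECHNICAL_MATHEMATICAL_DESCRIPTION.md and may differ. The function name, the counts, and the corpus size below are all invented.

```python
import math
from typing import Dict


def normalized_icf(count: int, total: int) -> float:
    """Map a word count to [0, 1].

    Returns 0.0 when the word accounts for the whole corpus and 1.0 when
    it occurs exactly once. Assumed formula: log(total / count) / log(total).
    """
    if count < 1 or count > total:
        raise ValueError("count must satisfy 1 <= count <= total")
    if total == 1:
        return 0.0  # single-token corpus: avoid dividing by log(1) == 0
    return math.log(total / count) / math.log(total)


# Made-up counts for a hypothetical corpus of one million tokens.
total = 1_000_000
counts: Dict[str, int] = {"the": 500_000, "xylophone": 40, "qzxbjk": 1}

scores = {word: normalized_icf(c, total) for word, c in counts.items()}
for word, score in scores.items():
    print(f"{word}: {score:.3f}")
```

Words that never occur in the corpus have no count, so this sketch rejects them; estimating scores for such words from character patterns is the gap the model is meant to fill.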

License

MIT

