A tiny neural network (<50k parameters) that predicts word frequency (ICF) from character patterns, enabling zero-shot token filtering and weighting without massive dictionaries.
```sh
# Install dependencies
uv sync

# Train a model
uv run python -m tiny_icf.train --data data/word_frequency.csv

# Predict ICF scores
uv run python -m tiny_icf.predict --model models/model.pt --words "the xylophone qzxbjk"
```

- Tiny: <50k parameters (~160 KB)
- Fast: <1ms inference per word
- Universal: Works with any UTF-8 language
- Generalizes: Handles unseen words, typos, neologisms
- Multi-task: ICF prediction, language detection, temporal analysis, text reduction
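As a rough illustration of the ICF scale (a sketch, not the project's exact formula): a target like this can be derived from corpus counts by normalizing log frequency, so frequent words map toward 0.0 and rare words toward 1.0. The `corpus_icf` helper below is hypothetical; the trained network predicts such scores directly from character patterns, with no dictionary at inference time.

```python
import math

def corpus_icf(counts):
    """Map raw corpus counts to [0, 1]: frequent -> ~0.0, rare -> ~1.0.

    Hypothetical normalization for illustration only; the model learns
    to predict these targets from a word's characters alone.
    """
    log_max = math.log(1 + max(counts.values()))
    return {w: 1.0 - math.log(1 + c) / log_max for w, c in counts.items()}

counts = {"the": 1_000_000, "xylophone": 120, "qzxbjk": 1}
icf = corpus_icf(counts)
# "the" is common (ICF near 0.0); "qzxbjk" is rare (ICF near 1.0)
```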
See docs/ for detailed documentation:
- PROJECT_OVERVIEW.md - What we're building and why
- TECHNICAL_MATHEMATICAL_DESCRIPTION.md - Mathematical formulation
- INFORMATION_THEORETIC_CONSTRAINTS.md - Kolmogorov complexity analysis
- ALL_TASKS_STRUCTURE_ANALYSIS.md - Analysis of all tasks
Supported tasks:
- ICF Prediction: word → ICF score (0.0=common, 1.0=rare)
- Text Reduction: Optimize word dropping to minimize embedding regret
- Temporal ICF: Predict ICF across decades (1800s, 1900s, 2000s)
- Language Detection: Detect language from character patterns
- Era Classification: Classify historical era (archaic, modern, contemporary)
- Multi-Task: Unified model for all tasks with AMOO
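The Text Reduction task above amounts to dropping the most predictable (lowest-ICF) tokens first. A minimal sketch, using precomputed scores in place of the network; `reduce_text` and the example scores are hypothetical, not the project's API:

```python
def reduce_text(tokens, icf_scores, keep_ratio=0.5):
    """Keep the highest-ICF (rarest, most informative) tokens,
    preserving their original order.

    icf_scores maps token -> [0, 1]; unseen tokens default to 1.0
    (treated as rare, so they are never dropped first).
    """
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: icf_scores.get(tokens[i], 1.0),
                    reverse=True)
    keep = sorted(ranked[:k])  # restore original word order
    return [tokens[i] for i in keep]

scores = {"the": 0.02, "a": 0.03, "plays": 0.35,
          "musician": 0.5, "xylophone": 0.65}
print(reduce_text(["the", "musician", "plays", "a", "xylophone"], scores, 0.6))
# → ['musician', 'plays', 'xylophone']
```

A real pipeline would swap the lookup for the model's per-word predictions; minimizing embedding regret would further compare the reduced text's embedding to the original's rather than ranking by ICF alone.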
License: MIT