"We've been training bigger models on more data. What if we've had it backwards?"
The AI industry spends billions training larger models on massive datasets. But here's what nobody talks about:
Most training data is garbage.
- AI-generated slop polluting the web
- Repetitive, low-quality content drowning out signal
- Spam, ads, and noise masquerading as knowledge
- Models need billions of parameters just to wade through the mess
We compress models after training. Why? To hide the fact that they memorized terabytes of junk.
What if the problem isn't model size—it's data quality?
Clean data → smaller, smarter models
Instead of throwing more compute at the problem, we address it at the source. We don't just theorize about data quality: we've demonstrated its impact on training efficiency.
Phase 1 (Complete): We can measure quality. Reliably. At scale.
Phase 2 (Complete): We trained models on filtered vs. raw data. The results shocked us.
Phase 3 (In Progress): Domain-specific auditors for code, math, and science.
We trained two identical 100M parameter models to answer one question:
Can quality-filtered data match raw data performance with fewer tokens?
| Model | Training Data | Tokens | Time | Cost |
|---|---|---|---|---|
| Model A (Baseline) | Raw Common Crawl | 700M | 3.8 hrs | $2.95 |
| Model B (Filtered) | Quality ≥ 0.7 | 500M | 2.7 hrs | $2.10 |
Identical architecture: d10 (10 layers, 640 dim, ~100M parameters). The only difference: data quality.
Model A (Raw Data): Final Loss: 4.3831
Model B (Filtered Data): Final Loss: 4.4375
Difference: +1.2% (negligible)
Model B achieved comparable performance using:
- 29% less data (500M vs 700M tokens)
- 29% less training time (2.7 vs 3.8 hours)
- 29% lower cost ($2.10 vs $2.95)
Translation: Quality filtering lets you train with roughly 30% less compute while maintaining performance.
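For reference, building the "Quality ≥ 0.7" subset is just a threshold pass over scored samples. A minimal sketch, assuming the audit output is JSONL with `text` and `quality_score` fields (the field names and the scored-file name are assumptions, not the project's documented schema):

```python
import json

THRESHOLD = 0.7  # same cutoff used to build Model B's training set

def filter_by_quality(scored_path: str, out_path: str, threshold: float = THRESHOLD) -> None:
    """Keep only samples whose quality score meets the threshold (hypothetical schema)."""
    kept = total = 0
    with open(scored_path) as src, open(out_path, "w") as dst:
        for line in src:
            sample = json.loads(line)
            total += 1
            if sample["quality_score"] >= threshold:
                dst.write(json.dumps({"text": sample["text"]}) + "\n")
                kept += 1
    print(f"kept {kept}/{total} samples ({kept / total:.1%})")

filter_by_quality("audit_results_scored.jsonl", "audit_results_filtered.jsonl")
```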
Before training models, we audited 50,000+ samples across 5 popular LLM training datasets using nanochat-d32 as our quality baseline.
| Dataset | Quality Score | Keep Rate | Perplexity |
|---|---|---|---|
| 🥇 FineWeb-Edu | 0.987 | 100% | 15.55 |
| 🥈 The Pile | 0.956 | 98.2% | 19.78 |
| 🥉 Common Crawl | 0.945 | 97.6% | 29.49 |
| C4 | 0.905 | 93.0% | 39.88 |
| 📉 Social Media | 0.385 | 33.3% | 1008.55 |
- "Unfiltered" Common Crawl beats "cleaned" C4 - Preprocessing ≠ Quality
- Social media scores 60% lower - Not all text is training data
- Massive quality variance exists - Even in "curated" datasets
- Our auditor successfully distinguishes tiers - Validation works
The implication: Most datasets have room for 30-50% improvement through quality-aware filtering.
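As an illustration of the general technique (not the project's actual scoring code), here is a minimal sketch of perplexity-based quality scoring with a reference language model via Hugging Face transformers. The `gpt2` stand-in model and the log-scaled mapping from perplexity to a 0-1 score are assumptions; the project itself uses nanochat-d32 and its own scoring rules.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small causal LM used only as a stand-in reference model for this sketch.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**ids, labels=ids["input_ids"])
    return math.exp(out.loss.item())

def quality_score(text: str, ppl_floor: float = 10.0, ppl_ceiling: float = 1000.0) -> float:
    """Map perplexity to a rough 0-1 score (hypothetical log-scaled mapping)."""
    ppl = min(max(perplexity(text), ppl_floor), ppl_ceiling)
    return 1.0 - (math.log(ppl) - math.log(ppl_floor)) / (math.log(ppl_ceiling) - math.log(ppl_floor))

print(quality_score("Photosynthesis converts light energy into chemical energy stored in glucose."))
print(quality_score("click here click here free free free win now!!!"))
```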
The problem with one-size-fits-all:
A model trained on general text can't accurately judge code quality. Or math proofs. Or scientific rigor.
The solution:
Domain-specific auditors trained on high-quality data from their domain.
- 💻 Coding Auditor - Scores based on syntax, structure, documentation, algorithmic complexity
- 🔢 Math Auditor - Evaluates problem clarity, solution steps, proof correctness
- 🔬 Science Auditor - Measures citation density, technical vocabulary, experimental rigor
- 🌍 General Auditor - Optimized for broad English text (current approach)
Each auditor uses domain-specific metrics to identify genuinely high-quality data, not just data that looks like its training set.
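One possible way to structure this (a design sketch under assumed names, not the project's actual API) is a shared auditor interface with per-domain scoring logic behind it:

```python
from abc import ABC, abstractmethod

class Auditor(ABC):
    """Common interface every domain auditor would implement (hypothetical design)."""

    @abstractmethod
    def score(self, sample: str) -> float:
        """Return a quality score in [0, 1] for one sample."""

class CodeAuditor(Auditor):
    def score(self, sample: str) -> float:
        # Stand-in heuristic: reward documentation and penalize very short snippets.
        has_docs = '"""' in sample or "#" in sample
        length_ok = len(sample.split("\n")) >= 5
        return 0.5 * has_docs + 0.5 * length_ok

class MathAuditor(Auditor):
    def score(self, sample: str) -> float:
        # Stand-in heuristic: look for explicit solution steps and a conclusion.
        markers = ("therefore", "proof", "=", "step")
        return min(1.0, sum(m in sample.lower() for m in markers) / len(markers) + 0.25)

def route(sample: str, domain: str, auditors: dict[str, Auditor]) -> float:
    """Dispatch a sample to the auditor registered for its domain."""
    return auditors[domain].score(sample)

auditors = {"code": CodeAuditor(), "math": MathAuditor()}
print(route('def add(a, b):\n    """Return a + b."""\n    return a + b\n', "code", auditors))
```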
```bash
# Clone and setup
git clone https://github.com/vitalune/metagnosis.git
cd metagnosis

# Install dependencies
pip install -r requirements.txt

# Download the auditor model
python scripts/setup.py

# Audit your dataset
python scripts/audit_dataset.py \
  --input your_data.jsonl \
  --output audit_results.json \
  --threshold 0.7

# Results include:
# - Quality scores for each sample
# - Filtered dataset (high-quality only)
# - Statistics and visualizations
```
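The audit script reads JSONL. A minimal sketch of the assumed input format, one JSON object per line (the `text` field name is an assumption, not documented here):

```python
import json

# Hypothetical input: one JSON object per line holding the raw text to be scored.
samples = [
    {"text": "The Krebs cycle oxidizes acetyl-CoA to produce ATP, NADH, and FADH2."},
    {"text": "FREE FOLLOWERS!!! click here click here click here"},
]
with open("your_data.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```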
```bash
# Use your filtered data for training
python scripts/train_model.py \
  --data audit_results_filtered.jsonl \
  --model_size d10 \
  --output runs/my_model

# Compare against baseline (raw data)
python scripts/compare_models.py \
  --baseline runs/baseline \
  --filtered runs/my_model
```

📖 Full documentation → (coming soon)
- Auditor Model: nanochat-d32 (1.9B params)
- Training Framework: PyTorch + nanochat architecture
- Infrastructure: 2× NVIDIA RTX 6000 Ada
- Datasets: FineWeb-Edu, The Pile, C4, Common Crawl, Social Media
- Phase 1: Dataset quality analysis
- Phase 2: Training experiment validation
- Open source trained models
- Phase 3A: Coding auditor development
- Phase 3B: Math auditor development
- Benchmark evaluations (HumanEval, GSM8K, HellaSwag)
- Research paper submission
- Phase 4: Open source Python package (`pip install metagnosis`)
- Pre-trained auditor models for all domains
- Documentation site and tutorials
- Community beta testing
- Phase 5: Product discovery and user research
- Case studies with partner organizations
- Scale experiments to larger models (d20+)
- Explore commercialization options
- Found an issue? Open an issue
- Have an idea? Start a discussion
- Want to collaborate? Let's talk
If you use this work in your research, please cite:
```bibtex
@software{oren2025,
  title={Oren: Quality Auditing for LLM Training Data},
  author={Valizadeh, Amir},
  year={2025},
  url={https://github.com/vitalune/Oren},
  note={Phase 2: Validated 29\% training efficiency improvement through quality filtering}
}
```

Acknowledging nanochat:
```bibtex
@misc{karpathy2025nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The Best ChatGPT That \$100 Can Buy},
  year = {2025},
  howpublished = {\url{https://github.com/karpathy/nanochat}},
  note = {Thanks for open-sourcing the nanochat-d32 model}
}
```

- 📧 Email: amirvalizadeh161@email.com
- 💼 LinkedIn: Amir Valizadeh
- 🐦 X: @vitalune
If this project has been useful to you, consider giving it a ⭐!


