"We've been training bigger models on more data. What if we've had it backwards?"
The AI industry spends billions training larger models on massive datasets. But here's what nobody talks about:
Most training data is garbage.
- AI-generated slop polluting the web
- Repetitive, low-quality content drowning out signal
- Spam, ads, and noise masquerading as knowledge
- Models need billions of parameters just to wade through the mess
We compress models after training. Why? To hide the fact that they memorized terabytes of junk.
What if the problem isn't model size—it's data quality?
Clean data → smaller, smarter models
Instead of throwing more compute at the problem, we address it at the source. We don't just theorize about data quality: we've demonstrated its impact on training efficiency.
Phase 1 (Complete): We can measure quality. Reliably. At scale.
Phase 2 (Complete): We trained models on filtered vs. raw data. The results shocked us.
Phase 3 (In Progress): Domain-specific auditors for code, math, and science.
We trained two identical 100M parameter models to answer one question:
Can quality-filtered data match raw data performance with fewer tokens?
| Model | Training Data | Tokens | Time | Cost |
|---|---|---|---|---|
| Model A (Baseline) | Raw Common Crawl | 700M | 3.8 hrs | $2.95 |
| Model B (Filtered) | Quality ≥ 0.7 | 500M | 2.7 hrs | $2.10 |
Identical architecture: d10 (10 layers, 640 dim, ~100M parameters). The only difference: data quality.
Model A (Raw Data): Final Loss: 4.3831
Model B (Filtered Data): Final Loss: 4.4375
Difference: +1.2% (negligible)
Model B achieved comparable performance using:
- 29% less data (500M vs 700M tokens)
- 29% less training time (2.7 vs 3.8 hours)
- 29% lower cost ($2.10 vs $2.95)
Translation: Quality filtering lets you train with roughly 30% less compute while maintaining performance.
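For reference, building the "Quality ≥ 0.7" subset is just a threshold pass over scored samples. A minimal sketch, assuming the audit output is JSONL with `text` and `quality_score` fields (the field names and the scored-file name are assumptions, not the project's documented schema):

```python
import json

THRESHOLD = 0.7  # same cutoff used to build Model B's training set

def filter_by_quality(scored_path: str, out_path: str, threshold: float = THRESHOLD) -> None:
    """Keep only samples whose quality score meets the threshold (hypothetical schema)."""
    kept = total = 0
    with open(scored_path) as src, open(out_path, "w") as dst:
        for line in src:
            sample = json.loads(line)
            total += 1
            if sample["quality_score"] >= threshold:
                dst.write(json.dumps({"text": sample["text"]}) + "\n")
                kept += 1
    print(f"kept {kept}/{total} samples ({kept / total:.1%})")

filter_by_quality("audit_results_scored.jsonl", "audit_results_filtered.jsonl")
```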
Before training models, we audited 50,000+ samples across 5 popular LLM training datasets using nanochat-d32 as our quality baseline.
| Dataset | Quality Score | Keep Rate | Perplexity |
|---|---|---|---|
| 🥇 FineWeb-Edu | 0.987 | 100% | 15.55 |
| 🥈 The Pile | 0.956 | 98.2% | 19.78 |
| 🥉 Common Crawl | 0.945 | 97.6% | 29.49 |
| C4 | 0.905 | 93.0% | 39.88 |
| 📉 Social Media | 0.385 | 33.3% | 1008.55 |
- "Unfiltered" Common Crawl beats "cleaned" C4 - Preprocessing ≠ Quality
- Social media scores 60% lower - Not all text is training data
- Massive quality variance exists - Even in "curated" datasets
- Our auditor successfully distinguishes tiers - Validation works
The implication: Most datasets have room for 30-50% improvement through quality-aware filtering.
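As an illustration of the general technique (not the project's actual scoring code), here is a minimal sketch of perplexity-based quality scoring with a reference language model via Hugging Face transformers. The `gpt2` stand-in model and the log-scaled mapping from perplexity to a 0-1 score are assumptions; the project itself uses nanochat-d32 and its own scoring rules.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small causal LM used only as a stand-in reference model for this sketch.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**ids, labels=ids["input_ids"])
    return math.exp(out.loss.item())

def quality_score(text: str, ppl_floor: float = 10.0, ppl_ceiling: float = 1000.0) -> float:
    """Map perplexity to a rough 0-1 score (hypothetical log-scaled mapping)."""
    ppl = min(max(perplexity(text), ppl_floor), ppl_ceiling)
    return 1.0 - (math.log(ppl) - math.log(ppl_floor)) / (math.log(ppl_ceiling) - math.log(ppl_floor))

print(quality_score("Photosynthesis converts light energy into chemical energy stored in glucose."))
print(quality_score("click here click here free free free win now!!!"))
```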
The problem with one-size-fits-all:
A model trained on general text can't accurately judge code quality. Or math proofs. Or scientific rigor.
The solution:
Domain-specific auditors trained on high-quality data from their domain.
- 💻 Coding Auditor - Scores based on syntax, structure, documentation, algorithmic complexity
- 🔢 Math Auditor - Evaluates problem clarity, solution steps, proof correctness
- 🔬 Science Auditor - Measures citation density, technical vocabulary, experimental rigor
- 🌍 General Auditor - Optimized for broad English text (current approach)
Each auditor uses domain-specific metrics to identify genuinely high-quality data, not just data that looks like its training set.
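One possible way to structure this (a design sketch under assumed names, not the project's actual API) is a shared auditor interface with per-domain scoring logic behind it:

```python
from abc import ABC, abstractmethod

class Auditor(ABC):
    """Common interface every domain auditor would implement (hypothetical design)."""

    @abstractmethod
    def score(self, sample: str) -> float:
        """Return a quality score in [0, 1] for one sample."""

class CodeAuditor(Auditor):
    def score(self, sample: str) -> float:
        # Stand-in heuristic: reward documentation and penalize very short snippets.
        has_docs = '"""' in sample or "#" in sample
        length_ok = len(sample.split("\n")) >= 5
        return 0.5 * has_docs + 0.5 * length_ok

class MathAuditor(Auditor):
    def score(self, sample: str) -> float:
        # Stand-in heuristic: look for explicit solution steps and a conclusion.
        markers = ("therefore", "proof", "=", "step")
        return min(1.0, sum(m in sample.lower() for m in markers) / len(markers) + 0.25)

def route(sample: str, domain: str, auditors: dict[str, Auditor]) -> float:
    """Dispatch a sample to the auditor registered for its domain."""
    return auditors[domain].score(sample)

auditors = {"code": CodeAuditor(), "math": MathAuditor()}
print(route('def add(a, b):\n    """Return a + b."""\n    return a + b\n', "code", auditors))
```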
```bash
# Clone and setup
git clone https://github.com/vitalune/metagnosis.git
cd metagnosis

# Install dependencies
pip install -r requirements.txt

# Download the auditor model
python scripts/setup.py

# Audit your dataset
python scripts/audit_dataset.py \
  --input your_data.jsonl \
  --output audit_results.json \
  --threshold 0.7

# Results include:
# - Quality scores for each sample
# - Filtered dataset (high-quality only)
# - Statistics and visualizations
```
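The audit script reads JSONL. A minimal sketch of the assumed input format, one JSON object per line (the `text` field name is an assumption, not documented here):

```python
import json

# Hypothetical input: one JSON object per line holding the raw text to be scored.
samples = [
    {"text": "The Krebs cycle oxidizes acetyl-CoA to produce ATP, NADH, and FADH2."},
    {"text": "FREE FOLLOWERS!!! click here click here click here"},
]
with open("your_data.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```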
```bash
# Use your filtered data for training
python scripts/train_model.py \
  --data audit_results_filtered.jsonl \
  --model_size d10 \
  --output runs/my_model

# Compare against baseline (raw data)
python scripts/compare_models.py \
  --baseline runs/baseline \
  --filtered runs/my_model
```

📖 Full documentation → (coming soon)
- Auditor Model: nanochat-d32 (1.9B params)
- Training Framework: PyTorch + nanochat architecture
- Infrastructure: 2× NVIDIA RTX 6000 Ada
- Datasets: FineWeb-Edu, The Pile, C4, Common Crawl, Social Media
- Phase 1: Dataset quality analysis
- Phase 2: Training experiment validation
- Open source trained models
- Phase 3A: Coding auditor development
- Phase 3B: Math auditor development
- Benchmark evaluations (HumanEval, GSM8K, HellaSwag)
- Research paper submission
- Phase 4: Open source Python package (`pip install metagnosis`)
- Pre-trained auditor models for all domains
- Documentation site and tutorials
- Community beta testing
- Phase 5: Product discovery and user research
- Case studies with partner organizations
- Scale experiments to larger models (d20+)
- Explore commercialization options
- Found an issue? Open an issue
- Have an idea? Start a discussion
- Want to collaborate? Let's talk
If you use this work in your research, please cite:
```bibtex
@software{oren2025,
  title={Oren: Quality Auditing for LLM Training Data},
  author={Valizadeh, Amir},
  year={2025},
  url={https://github.com/vitalune/Oren},
  note={Phase 2: Validated 29\% training efficiency improvement through quality filtering}
}
```

Acknowledging nanochat:
```bibtex
@misc{karpathy2025nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The Best ChatGPT That \$100 Can Buy},
  year = {2025},
  howpublished = {\url{https://github.com/karpathy/nanochat}},
  note = {Thanks for open-sourcing the nanochat-d32 model}
}
```

- 📧 Email: amirvalizadeh161@email.com
- 💼 LinkedIn: Amir Valizadeh
- 🐦 X: @vitalune
If this project has been useful to you, consider giving it a ⭐!


