Stillwater is a distributed data curation toolkit for LLM pretraining, built on smallpond.
It covers the full pipeline from filtering and dedup through tokenization and packing to train-ready output, runs locally or on Ray, and inherits smallpond's deterministic, incremental runs and rich observability, which keeps iteration fast.
The goal is to curate data quickly at scale by leaning on smallpond's engine (Arrow/DuckDB) and its distributed file system, 3FS.
It compares well to text-dedup, Dolma, and Datatrove on scalable ingestion and preprocessing; researchers may prefer it for quick iteration on curation stages and for producing train-ready shards directly.
Python 3.10+ is recommended.

```bash
pip install -e .
stillwater --help
```

- CLI: `stillwater --help`
- Config examples: `configs/`
- Benchmarks: `stillwater/benchmarks/`
- pipeline: normalization, language ID, PII, harmful-content, and quality filters; near-duplicate removal; tokenization; packing into training shards; and a PyTorch IterableDataset loader.
- dedup stack: exact, Bloom, MinHash+LSH (vectorized hashing, optimal banding, heavy-hitter caps), and SimHash, plus optional second-stage rechecks (edit distance and tf-idf); see the sketch below.
- runtime: smallpond (Arrow/DuckDB/3FS) with Ray parallelism.
- tables/ops: Parquet tables; compaction/vacuum; directory snapshots (Iceberg/Delta scaffolding in config).
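To make the MinHash+LSH stage concrete, here is a minimal single-process sketch of signature banding. The hashing scheme and constants are illustrative only (Stillwater's implementation is vectorized and runs distributed); `N_HASHES` and `N_BANDS` mirror the config keys shown later.

```python
# Illustrative MinHash + LSH banding; NOT Stillwater's implementation.
# Documents whose signatures agree in any one band become candidate pairs.
import hashlib
from collections import defaultdict

N_HASHES, N_BANDS = 128, 32        # rows per band = 128 // 32 = 4
ROWS = N_HASHES // N_BANDS

def shingles(text: str, n: int = 5) -> set[str]:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(max(1, len(toks) - n + 1))}

def minhash(text: str) -> list[int]:
    # naive per-shingle hashing; a vectorized version hashes all shingles at once
    sig = [2**64 - 1] * N_HASHES
    for sh in shingles(text):
        for i in range(N_HASHES):
            h = int.from_bytes(hashlib.sha1(f"{i}:{sh}".encode()).digest()[:8], "big")
            sig[i] = min(sig[i], h)
    return sig

def candidate_pairs(docs: dict[str, str]) -> set[tuple[str, str]]:
    buckets = defaultdict(list)    # (band index, band values) -> doc ids
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(N_BANDS):
            buckets[(b, tuple(sig[b * ROWS:(b + 1) * ROWS]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():   # a heavy-hitter cap would truncate huge buckets here
        pairs.update((a, c) for i, a in enumerate(ids) for c in ids[i + 1:])
    return pairs
```

With `r` rows per band and `b` bands, two documents with Jaccard similarity `s` collide in at least one band with probability `1 - (1 - s^r)^b`; choosing `b` and `r` well is what `optimal_param` automates (sketched after the config below).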
Quick start on synthetic data:

```bash
python -m stillwater.examples.make_synth --out /tmp/synth.jsonl --n 200
stillwater all --input /tmp/synth.jsonl --output /tmp/stw_out --config configs/cc_en.yaml
```

Or run the stages individually:

```bash
# extract and normalize
stillwater extract --input /path/input.jsonl --output /path/out/01_extract --config configs/cc_en.yaml
stillwater normalize --input /path/out/01_extract --output /path/out/02_normalize --config configs/cc_en.yaml
# filters
stillwater lid --input /path/out/02_normalize --output /path/out/03_lid --config configs/cc_en.yaml
stillwater pii --input /path/out/03_lid --output /path/out/04_pii --config configs/cc_en.yaml
stillwater harmful --input /path/out/04_pii --output /path/out/05_harmful --config configs/cc_en.yaml
stillwater quality_rules --input /path/out/05_harmful --output /path/out/06_quality_rules --config configs/cc_en.yaml
stillwater quality_clf --input /path/out/06_quality_rules --output /path/out/07_quality_clf --config configs/cc_en.yaml
# dedup (exact, MinHash, SimHash)
stillwater dedup_exact --input /path/out/07_quality_clf --output /path/out/08_exact --config configs/cc_en.yaml
stillwater dedup_fuzzy --input /path/out/08_exact --output /path/out/08b_minhash --config configs/cc_en.yaml
stillwater dedup_simhash --input /path/out/08_exact --output /path/out/08c_simhash --config configs/cc_en.yaml
# edges to clusters
stillwater dedup_edges --input /path/out/08_exact --output /path/out/08d_edges --config configs/cc_en.yaml
stillwater clusters --input /path/out/08d_edges --output /path/out/08e_clusters --config configs/cc_en.yaml
# tokenize, pack and export
stillwater tokenize --input /path/out/08b_minhash --output /path/out/09_tok --config configs/cc_en.yaml
stillwater pack --input /path/out/09_tok --output /path/out/10_pack --config configs/cc_en.yaml
stillwater mixture --input /path/out/10_pack --output /path/out/11_mix --config configs/cc_en.yaml
stillwater export --input /path/out/10_pack --output /path/out/12_export --config configs/cc_en.yaml
```
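The `dedup_edges` and `clusters` stages above turn pairwise duplicate edges into connected components, so one representative per component can be kept. A minimal union-find sketch of that step (illustrative only, not the distributed implementation):

```python
# Toy edges -> clusters step via union-find; illustrative only.
def connected_components(edges: list[tuple[str, str]]) -> dict[str, str]:
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest id wins as representative

    return {x: find(x) for x in parent}        # doc id -> cluster representative

edges = [("d1", "d2"), ("d2", "d3"), ("d4", "d5")]
assert connected_components(edges) == {
    "d1": "d1", "d2": "d1", "d3": "d1", "d4": "d4", "d5": "d4"
}
```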
Dedup configuration:

```yaml
dedup:
  minhash:
    n_hashes: 128
    n_bands: 32          # overridden by optimal_param if 0
    ngram: 5
    jaccard_min: 0.8
    min_token_len: 1
    use_vectorized: true
    token_hash: sha1     # or xxh3
    prune_type: tfidf    # none|edit|jaccard|tfidf
    prune_threshold: 0.7
    emit_all: false
    emit_edges: false
    max_band_group: 2000
    granularity: doc     # line|doc|domain
    incremental: false
platform:
  backend: ray           # ray|spark
  num_executors: 4
tables:
  format: parquet        # parquet|iceberg|delta
  dedup_edges: tables/dedup_edges
  clusters: tables/clusters
```
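When `n_bands` is 0, the comment says banding is picked by `optimal_param`. Below is a sketch of the standard search (in the style popularized by datasketch): enumerate feasible `(bands, rows)` splits of `n_hashes` and pick the one minimizing weighted false-positive plus false-negative area under the LSH collision curve. Stillwater's exact weights and integration may differ.

```python
# Standard optimal-banding search; a sketch, not Stillwater's exact code.
def _integrate(f, lo: float, hi: float, steps: int = 100) -> float:
    dx = (hi - lo) / steps
    return sum(f(lo + dx * (i + 0.5)) for i in range(steps)) * dx  # midpoint rule

def optimal_param(threshold: float, num_perm: int,
                  fp_weight: float = 0.5, fn_weight: float = 0.5) -> tuple[int, int]:
    best_err, best = float("inf"), (0, 0)
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            collide = lambda s: 1.0 - (1.0 - s ** r) ** b  # P(share >= 1 band)
            fp = _integrate(collide, 0.0, threshold)                     # dissimilar pairs paired
            fn = _integrate(lambda s: 1.0 - collide(s), threshold, 1.0)  # similar pairs missed
            err = fp_weight * fp + fn_weight * fn
            if err < best_err:
                best_err, best = err, (b, r)
    return best  # (n_bands, rows_per_band)

print(optimal_param(0.8, 128))  # for the jaccard_min/n_hashes values above
```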
Table maintenance from Python:

```python
from stillwater.ops.tables import compact_parquet, vacuum_dir, snapshot_dir

compacted = compact_parquet('/path/out/08b_minhash', target_mb=512)
snapshot = snapshot_dir(compacted)
vacuum_dir('/path/out/08b_minhash')
```

Stream packed shards into PyTorch:

```python
from stillwater.loaders.torch_iterable import PackedIterableDataset
ds = PackedIterableDataset(manifest_dir='/path/out/10_pack', shuffle=True, shuffle_buffer=20000)
for row in ds:
    ids = row["packed_tokens"]  # list[int]
    # feed ids to your collator/model
```
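A hypothetical training-side wiring of the loader; the batching and collate details below are assumptions, not part of Stillwater's documented API:

```python
# Assumed usage sketch: batch packed sequences with a DataLoader.
import torch
from torch.utils.data import DataLoader
from stillwater.loaders.torch_iterable import PackedIterableDataset

def collate(rows):
    # assumes packing yields fixed-length sequences; pad here if it does not
    return torch.tensor([r["packed_tokens"] for r in rows], dtype=torch.long)

ds = PackedIterableDataset(manifest_dir='/path/out/10_pack',
                           shuffle=True, shuffle_buffer=20000)
loader = DataLoader(ds, batch_size=8, collate_fn=collate)

for batch in loader:   # LongTensor of shape (batch_size, seq_len)
    ...                # forward/backward pass goes here
```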
The same table ops are available from the CLI:

```bash
# compact small parquet files to ~512MB parts
stillwater ops compact --input /path/out/08b_minhash --target-mb 512 --output /path/out/08b_minhash_compact
# remove the temp files and empty dirs
stillwater ops vacuum --input /path/out/08b_minhash
# make a directory snapshot copy
stillwater ops snapshot --input /path/out/08b_minhash_compact --target-root /path/_snapshots
```

As an alternative to `dedup_exact`, a Bloom-filter pass:

```bash
stillwater dedup_bloom \
--input /path/out/07_quality_clf \
--output /path/out/08_exact_bloom \
--config configs/cc_en.yaml
```
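For intuition, here is a toy version of the Bloom-filter dedup idea (illustrative only): a fixed-size bit array answers "probably seen" or "definitely new", keeping memory constant at the cost of a small false-positive rate, i.e. a few unique docs dropped.

```python
# Toy Bloom-filter exact dedup; not Stillwater's implementation.
import hashlib

class BloomDedup:
    def __init__(self, n_bits: int = 1 << 24, n_hashes: int = 7):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, text: str):
        d = hashlib.sha1(text.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        for i in range(self.n_hashes):          # double hashing: h1 + i*h2
            yield (h1 + i * h2) % self.n_bits

    def seen_or_add(self, text: str) -> bool:
        pos = list(self._positions(text))
        seen = all((self.bits[p // 8] >> (p % 8)) & 1 for p in pos)
        for p in pos:
            self.bits[p // 8] |= 1 << (p % 8)
        return seen

bloom = BloomDedup()
kept = [d for d in ["a", "b", "a"] if not bloom.seen_or_add(d)]
assert kept == ["a", "b"]
```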