GitHub - kushalthaman/stillwater: End-to-end pre-training data curation toolkit built on Smallpond (DuckDB, 3FS, Arrow) and Ray

stillwater

Stillwater is an end-to-end, distributed data curation toolkit for LLM pretraining, built on smallpond.

It covers the full pipeline, from filtering and deduplication through packing into train-ready shards, and runs either locally or on Ray. Runs are deterministic and incremental, with rich observability, so curation stages can be iterated on quickly.

The goal of stillwater is to make data curation fast by building on smallpond and its distributed file system (3FS).

It compares well to text-dedup, Dolma, and Datatrove for scalable ingestion and preprocessing. Researchers may prefer it when they want to iterate quickly on curation stages and produce train-ready shards.

installation

Python 3.10+ recommended.

pip install -e .
stillwater --help

documentation

CLI: stillwater --help
Config examples: configs/
Benchmarks: stillwater/benchmarks/

features

pipeline: normalization, LID, PII, harmful-content, and quality filters; near-duplicate removal; tokenization; packing into training shards; and a PyTorch IterableDataset loader.
dedup stack: exact, Bloom, MinHash+LSH (vectorized hashing, optimal banding, heavy-hitter caps), and SimHash, with optional second-stage rechecks (edit distance and TF-IDF).
runtime: smallpond (Arrow/DuckDB/3FS) with Ray parallelism.
tables/ops: Parquet tables, compaction/vacuum, and directory snapshots (Iceberg/Delta scaffolding in config).
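
For readers new to the MinHash part of the dedup stack, here is a minimal pure-Python sketch of the general technique: word n-gram shingles, a family of hash permutations, and a signature whose per-slot agreement rate estimates Jaccard similarity. This is not stillwater's vectorized implementation; the function names (`shingles`, `make_perms`, `minhash_signature`, `estimate_jaccard`) and the choice of SHA-1 plus a modular permutation are assumptions made for illustration.

```python
import hashlib
import random

MERSENNE = (1 << 61) - 1  # prime modulus for the (a*x + b) mod p hash family


def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-grams; texts shorter than n tokens yield a single shingle."""
    tokens = text.split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _h64(s: str) -> int:
    """Stable 64-bit hash of a shingle (hypothetical choice: truncated SHA-1)."""
    return int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")


def make_perms(n_hashes: int, seed: int = 0) -> list[tuple[int, int]]:
    """Seeded (a, b) pairs so signatures are reproducible across runs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, MERSENNE), rng.randrange(0, MERSENNE))
            for _ in range(n_hashes)]


def minhash_signature(text: str, perms: list[tuple[int, int]], n: int = 5) -> list[int]:
    """For each permutation, keep the minimum permuted shingle hash."""
    base = [_h64(s) % MERSENNE for s in shingles(text, n)]
    if not base:
        return [MERSENNE] * len(perms)  # sentinel signature for empty text
    return [min((a * x + b) % MERSENNE for x in base) for a, b in perms]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature slots that agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

With 128 hashes the estimate is noisy by a few percentage points, which is one reason a second-stage recheck on candidate pairs can be useful.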

Quickstart

python -m stillwater.examples.make_synth --out /tmp/synth.jsonl --n 200
stillwater all --input /tmp/synth.jsonl --output /tmp/stw_out --config configs/cc_en.yaml

CLI Examples

# extract and normalize
stillwater extract   --input /path/input.jsonl  --output /path/out/01_extract   --config configs/cc_en.yaml
stillwater normalize --input /path/out/01_extract --output /path/out/02_normalize --config configs/cc_en.yaml

# filters
stillwater lid          --input /path/out/02_normalize --output /path/out/03_lid          --config configs/cc_en.yaml
stillwater pii          --input /path/out/03_lid       --output /path/out/04_pii          --config configs/cc_en.yaml
stillwater harmful      --input /path/out/04_pii       --output /path/out/05_harmful      --config configs/cc_en.yaml
stillwater quality_rules --input /path/out/05_harmful  --output /path/out/06_quality_rules --config configs/cc_en.yaml
stillwater quality_clf   --input /path/out/06_quality_rules --output /path/out/07_quality_clf --config configs/cc_en.yaml

# dedup (exact, MinHash, SimHash)
stillwater dedup_exact  --input /path/out/07_quality_clf --output /path/out/08_exact --config configs/cc_en.yaml
stillwater dedup_fuzzy  --input /path/out/08_exact       --output /path/out/08b_minhash --config configs/cc_en.yaml
stillwater dedup_simhash --input /path/out/08_exact      --output /path/out/08c_simhash --config configs/cc_en.yaml

# edges to clusters
stillwater dedup_edges  --input /path/out/08_exact       --output /path/out/08d_edges --config configs/cc_en.yaml
stillwater clusters     --input /path/out/08d_edges      --output /path/out/08e_clusters --config configs/cc_en.yaml

# tokenize, pack and export
stillwater tokenize --input /path/out/08b_minhash --output /path/out/09_tok  --config configs/cc_en.yaml
stillwater pack     --input /path/out/09_tok      --output /path/out/10_pack --config configs/cc_en.yaml
stillwater mixture  --input /path/out/10_pack     --output /path/out/11_mix  --config configs/cc_en.yaml
stillwater export   --input /path/out/10_pack     --output /path/out/12_export --config configs/cc_en.yaml
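
The "edges to clusters" step above turns pairwise duplicate edges into groups of documents. A common way to do this is connected components via union-find; the sketch below shows that general technique and is not taken from stillwater's source. The edge format (pairs of document ids) and the names `UnionFind` and `clusters_from_edges` are assumptions.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set forest with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Keep the smaller id as root so representatives are deterministic.
            if rb < ra:
                ra, rb = rb, ra
            self.parent[rb] = ra


def clusters_from_edges(edges):
    """Map each cluster representative to the sorted list of its members."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    groups = defaultdict(list)
    for node in list(uf.parent):
        groups[uf.find(node)].append(node)
    return {root: sorted(members) for root, members in groups.items()}
```

Keeping one representative per cluster (here, the smallest id) and dropping the rest is one simple dedup policy; other policies, such as keeping the longest document, would need extra per-document metadata.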

config

dedup:
  minhash:
    n_hashes: 128
    n_bands: 32     # overridden by optimal_param if 0
    ngram: 5
    jaccard_min: 0.8
    min_token_len: 1
    use_vectorized: true
    token_hash: sha1   # or xxh3
    prune_type: tfidf  # none|edit|jaccard|tfidf
    prune_threshold: 0.7
    emit_all: false
    emit_edges: false
    max_band_group: 2000
    granularity: doc    # line|doc|domain
    incremental: false
platform:
  backend: ray         # ray|spark
  num_executors: 4
tables:
  format: parquet      # parquet|iceberg|delta
  dedup_edges: tables/dedup_edges
  clusters: tables/clusters
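
To get a feel for how `n_hashes` and `n_bands` interact, the sketch below evaluates the standard LSH banding formula: with b bands of r rows each, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b. It assumes rows per band is `n_hashes // n_bands`, which is the usual convention but is not confirmed by this README; the function names are hypothetical.

```python
def candidate_probability(s: float, n_bands: int, rows: int) -> float:
    """Probability that two docs with Jaccard s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** n_bands


def approx_threshold(n_bands: int, rows: int) -> float:
    """Similarity near which the S-curve rises steeply: (1/b)^(1/r)."""
    return (1.0 / n_bands) ** (1.0 / rows)


# Values from the config above; rows-per-band derivation is an assumption.
n_hashes, n_bands = 128, 32
rows = n_hashes // n_bands  # 4 rows per band

if __name__ == "__main__":
    print(f"approx threshold: {approx_threshold(n_bands, rows):.3f}")
    for s in (0.3, 0.5, 0.8, 0.9):
        print(f"s={s:.1f} -> P(candidate)={candidate_probability(s, n_bands, rows):.4f}")
```

Under these assumptions the curve rises at a similarity of around 0.42, well below `jaccard_min: 0.8`, so banding alone would admit many pairs that are not true near-duplicates. That is consistent with having a second-stage check (`prune_type`, `prune_threshold`) and a cap on band group size (`max_band_group`).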

ops helpers

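from stillwater.ops.tables import compact_parquet, vacuum_dir, snapshot_dir

# merge small parquet files into parts of roughly 512 MB
compacted = compact_parquet('/path/out/08b_minhash', target_mb=512)
# take a directory snapshot of the compacted output
snapshot = snapshot_dir(compacted)
# remove temp files and empty dirs from the original directory
vacuum_dir('/path/out/08b_minhash')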

PyTorch loader

from stillwater.loaders.torch_iterable import PackedIterableDataset
ds = PackedIterableDataset(manifest_dir='/path/out/10_pack', shuffle=True, shuffle_buffer=20000)
for row in ds:
    ids = row["packed_tokens"]  # list[int]
    # feed to your collator/model
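
For context on what a `packed_tokens` row might contain, here is a minimal sketch of one common packing scheme: concatenate tokenized documents with an end-of-sequence separator and cut the stream into fixed-length rows. Whether stillwater's pack stage works this way is an assumption; `pack_tokens`, `shift_for_next_token`, `seq_len`, and `eos_id` are hypothetical names.

```python
from typing import Iterable, Iterator


def pack_tokens(docs: Iterable[list[int]], seq_len: int, eos_id: int) -> Iterator[list[int]]:
    """Concatenate docs (each followed by eos_id) and emit rows of seq_len tokens."""
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # A trailing partial row is dropped in this sketch.


def shift_for_next_token(row: list[int]) -> tuple[list[int], list[int]]:
    """Split a packed row into (inputs, labels) for next-token prediction."""
    return row[:-1], row[1:]
```

A collator would typically stack such rows into a batch tensor and apply the shift there; dropping the trailing partial row keeps every row the same length at the cost of a few tokens per shard.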

ops CLI

# compact small parquet files to ~512MB parts
stillwater ops compact --input /path/out/08b_minhash --target-mb 512 --output /path/out/08b_minhash_compact

# remove the temp files and empty dirs
stillwater ops vacuum --input /path/out/08b_minhash

# make a directory snapshot copy
stillwater ops snapshot --input /path/out/08b_minhash_compact --target-root /path/_snapshots
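
As a rough illustration of what a vacuum pass does, the sketch below walks a directory bottom-up, deletes files that look temporary, and removes directories left empty. It is not stillwater's implementation: the function name `vacuum` and the suffixes treated as temporary are assumptions.

```python
import os
from pathlib import Path


def vacuum(root, temp_suffixes=(".tmp", ".crc")):
    """Delete temp-looking files and then-empty subdirectories under root.

    Returns (removed_files, removed_dirs). The root itself is never removed.
    """
    root = Path(root)
    removed_files = removed_dirs = 0
    # Bottom-up walk so children are cleaned before their parent is checked.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        current = Path(dirpath)
        for name in filenames:
            if name.endswith(temp_suffixes):
                (current / name).unlink()
                removed_files += 1
        if current != root and not any(current.iterdir()):
            current.rmdir()
            removed_dirs += 1
    return removed_files, removed_dirs
```

Running a pass like this after compaction keeps output directories tidy without touching data files.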

bloom filter exact dedup

stillwater dedup_bloom \
  --input /path/out/07_quality_clf \
  --output /path/out/08_exact_bloom \
  --config configs/cc_en.yaml
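
For readers unfamiliar with Bloom-filter dedup, here is a minimal sketch of the general idea: hash each document to k bit positions, and treat a document as a duplicate if all k bits are already set. This is a generic illustration, not stillwater's code; the class and function names, the SHA-256 double-hashing scheme, and the sizing defaults are assumptions.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter sized from expected capacity and error rate."""

    def __init__(self, capacity: int, error_rate: float = 1e-3):
        # Standard sizing: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2.
        self.n_bits = max(8, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.n_hashes = max(1, round(self.n_bits / capacity * math.log(2)))
        self.bits = bytearray((self.n_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add_if_new(self, item: bytes) -> bool:
        """Record item; return True if it was (probably) not seen before."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


def dedup_exact(texts, capacity: int = 1_000_000, error_rate: float = 1e-3):
    """Yield each text the first time it appears, using bounded memory."""
    seen = BloomFilter(capacity, error_rate)
    for text in texts:
        if seen.add_if_new(text.encode("utf-8")):
            yield text
```

The trade-off versus a plain hash set is memory for accuracy: a Bloom filter never misses a true duplicate, but it can occasionally flag a unique document as seen, at roughly the configured error rate.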
