kushalthaman/stillwater

stillwater

Stillwater is an end-to-end distributed data curation toolkit built on smallpond for LLM pretraining.

The pipeline runs from filtering and deduplication through packing to train-ready outputs, either locally or on Ray. It builds on smallpond and provides deterministic, incremental runs and rich observability, which keeps iteration fast.

The goal of stillwater is to make data curation fast by building on smallpond and its distributed file system (3FS).

For scalable ingestion and preprocessing it compares well to text-dedup, Dolma, and Datatrove. Researchers may prefer it for quick iteration on curation stages and for producing train-ready shards quickly.

installation

Python 3.10+ recommended.

pip install -e .
stillwater --help

documentation

  • CLI: stillwater --help
  • Config examples: configs/
  • Benchmarks: stillwater/benchmarks/

features

  • pipeline: normalization, LID, PII, harmful-content, and quality filters; near-duplicate removal; tokenization; packing into training shards; a PyTorch IterableDataset loader.
  • dedup stack: exact, Bloom, MinHash+LSH (vectorized hashing, optimal banding, heavy-hitter caps), and SimHash, with optional second-stage rechecks (edit distance and tf-idf).
  • runtime: smallpond (Arrow/DuckDB/3FS) with Ray parallelism.
  • tables/ops: Parquet tables, compaction/vacuum, directory snapshots, and Iceberg/Delta scaffolding in config.
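For readers new to the MinHash part of the dedup stack, the idea can be sketched in a few lines of plain Python. This is an illustration of the general technique, not stillwater's vectorized implementation: the helper names (`shingles`, `minhash_signature`, `estimate_jaccard`) and the use of salted BLAKE2b as the hash family are assumptions made for this example.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set[str]:
    """Return the set of word n-grams in `text` (the whole text if it is shorter than n words)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash64(seed: int, item: str) -> int:
    # One 64-bit hash function per seed, derived by salting BLAKE2b (illustrative choice).
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], n_hashes: int = 128) -> list[int]:
    """For each hash function, keep the minimum hash value over all shingles."""
    return [min(_hash64(seed, item) for item in items) for seed in range(n_hashes)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = " ".join(f"w{i}" for i in range(40))
    doc_b = doc_a + " extra"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print(f"estimated Jaccard: {estimate_jaccard(sig_a, sig_b):.3f}")
```

The signature length plays the role of `n_hashes` in the config and the shingle width the role of `ngram`; more hashes give a tighter estimate at a higher hashing cost.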

Quickstart

python -m stillwater.examples.make_synth --out /tmp/synth.jsonl --n 200
stillwater all --input /tmp/synth.jsonl --output /tmp/stw_out --config configs/cc_en.yaml

CLI Examples

# extract and normalize
stillwater extract   --input /path/input.jsonl  --output /path/out/01_extract   --config configs/cc_en.yaml
stillwater normalize --input /path/out/01_extract --output /path/out/02_normalize --config configs/cc_en.yaml

# filters
stillwater lid          --input /path/out/02_normalize --output /path/out/03_lid          --config configs/cc_en.yaml
stillwater pii          --input /path/out/03_lid       --output /path/out/04_pii          --config configs/cc_en.yaml
stillwater harmful      --input /path/out/04_pii       --output /path/out/05_harmful      --config configs/cc_en.yaml
stillwater quality_rules --input /path/out/05_harmful  --output /path/out/06_quality_rules --config configs/cc_en.yaml
stillwater quality_clf   --input /path/out/06_quality_rules --output /path/out/07_quality_clf --config configs/cc_en.yaml

# dedup (exact, MinHash, SimHash)
stillwater dedup_exact  --input /path/out/07_quality_clf --output /path/out/08_exact --config configs/cc_en.yaml
stillwater dedup_fuzzy  --input /path/out/08_exact       --output /path/out/08b_minhash --config configs/cc_en.yaml
stillwater dedup_simhash --input /path/out/08_exact      --output /path/out/08c_simhash --config configs/cc_en.yaml

# edges to clusters
stillwater dedup_edges  --input /path/out/08_exact       --output /path/out/08d_edges --config configs/cc_en.yaml
stillwater clusters     --input /path/out/08d_edges      --output /path/out/08e_clusters --config configs/cc_en.yaml

# tokenize, pack and export
stillwater tokenize --input /path/out/08b_minhash --output /path/out/09_tok  --config configs/cc_en.yaml
stillwater pack     --input /path/out/09_tok      --output /path/out/10_pack --config configs/cc_en.yaml
stillwater mixture  --input /path/out/10_pack     --output /path/out/11_mix  --config configs/cc_en.yaml
stillwater export   --input /path/out/10_pack     --output /path/out/12_export --config configs/cc_en.yaml
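The "edges to clusters" step above turns pairwise near-duplicate matches into groups of documents. A minimal sketch of that idea is a union-find over the edge list, shown below. The edge format (pairs of comparable document ids) and the function name `cluster_edges` are assumptions for illustration; the actual table schema written by `dedup_edges` may differ.

```python
from collections.abc import Hashable, Iterable


def cluster_edges(edges: Iterable[tuple[Hashable, Hashable]]) -> dict:
    """Map every document id to a cluster representative (the smallest id in its cluster)."""
    parent: dict = {}

    def find(x):
        parent.setdefault(x, x)
        # Walk up to the root of x's tree.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the walk directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Always attach the larger root under the smaller one,
            # so each root is the smallest id in its component.
            parent[max(ra, rb)] = min(ra, rb)

    return {doc: find(doc) for doc in list(parent)}


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("d", "e")]
    clusters = cluster_edges(edges)
    keep = set(clusters.values())  # one survivor per cluster
    print(clusters, keep)
```

Keeping one representative per cluster and dropping the rest is a common dedup policy; which document to keep (smallest id, longest text, highest quality score) is a design choice this sketch does not settle.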

config

dedup:
  minhash:
    n_hashes: 128
    n_bands: 32     # if 0, banding is chosen by optimal_param
    ngram: 5
    jaccard_min: 0.8
    min_token_len: 1
    use_vectorized: true
    token_hash: sha1   # or xxh3
    prune_type: tfidf  # none|edit|jaccard|tfidf
    prune_threshold: 0.7
    emit_all: false
    emit_edges: false
    max_band_group: 2000
    granularity: doc    # line|doc|domain
    incremental: false
platform:
  backend: ray         # ray|spark
  num_executors: 4
tables:
  format: parquet      # parquet|iceberg|delta
  dedup_edges: tables/dedup_edges
  clusters: tables/clusters
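To see how `n_hashes`, `n_bands`, and `jaccard_min` interact, it helps to look at the standard LSH banding curve: with `b` bands of `r` rows each, two documents with Jaccard similarity `s` become candidates with probability `1 - (1 - s^r)^b`. The sketch below evaluates that curve and searches for a banding that balances false positives and false negatives around a threshold. It illustrates the general technique; the function names are hypothetical, and this is not necessarily how stillwater's `optimal_param` is computed.

```python
def candidate_probability(s: float, n_bands: int, rows_per_band: int) -> float:
    """Probability that two docs with Jaccard similarity `s` share at least one LSH band."""
    return 1.0 - (1.0 - s ** rows_per_band) ** n_bands


def approx_threshold(n_bands: int, rows_per_band: int) -> float:
    """Rule-of-thumb similarity at which the S-curve rises steeply: (1/b)^(1/r)."""
    return (1.0 / n_bands) ** (1.0 / rows_per_band)


def _integrate(f, lo: float, hi: float, steps: int = 200) -> float:
    # Midpoint rule; accurate enough to compare bandings.
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def pick_bands(n_hashes: int, threshold: float,
               fp_weight: float = 0.5, fn_weight: float = 0.5) -> tuple[int, int]:
    """Choose (bands, rows) with bands * rows <= n_hashes minimizing weighted FP + FN area."""
    best = None
    for b in range(1, n_hashes + 1):
        for r in range(1, n_hashes // b + 1):
            # False positives: pairs below the threshold that still become candidates.
            fp = _integrate(lambda s: candidate_probability(s, b, r), 0.0, threshold)
            # False negatives: pairs above the threshold that are never compared.
            fn = _integrate(lambda s: 1.0 - candidate_probability(s, b, r), threshold, 1.0)
            err = fp_weight * fp + fn_weight * fn
            if best is None or err < best[0]:
                best = (err, b, r)
    return best[1], best[2]


if __name__ == "__main__":
    # The config above: 128 hashes in 32 bands, i.e. 4 rows per band.
    print(approx_threshold(32, 4))
    print(candidate_probability(0.8, 32, 4))
    print(pick_bands(128, 0.8))
```

With 32 bands of 4 rows, the curve rises around a similarity of about 0.42, so nearly every pair at `jaccard_min: 0.8` becomes a candidate, but so do many less similar pairs (about 0.23 of pairs at similarity 0.3). Under this reading, a second-stage check such as the `prune_type` options would be what filters those extra candidates; fewer bands with more rows would move the curve closer to 0.8 at the cost of more missed pairs.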

ops helpers

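from stillwater.ops.tables import compact_parquet, vacuum_dir, snapshot_dir

# compact small parquet files to ~512MB parts
compacted = compact_parquet('/path/out/08b_minhash', target_mb=512)
# make a directory snapshot copy of the compacted output
snapshot = snapshot_dir(compacted)
# remove temp files and empty dirs from the original directory
vacuum_dir('/path/out/08b_minhash')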

PyTorch loader

from stillwater.loaders.torch_iterable import PackedIterableDataset
ds = PackedIterableDataset(manifest_dir='/path/out/10_pack', shuffle=True, shuffle_buffer=20000)
for row in ds:
  ids = row["packed_tokens"]  # list[int]
  # feed to your collator/model
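As a rough illustration of the "feed to your collator" step, the sketch below groups rows into padded batches using only the standard library. It assumes each row is a dict with a `packed_tokens` list of ints, as in the snippet above; the function name `batch_rows` and the `pad_id` default are hypothetical. In a real training loop you would more likely pass a collate function to `torch.utils.data.DataLoader` and convert to tensors there.

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def batch_rows(rows: Iterable[dict], batch_size: int, pad_id: int = 0) -> Iterator[dict]:
    """Yield batches of token ids, right-padded to the longest sequence in each batch."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        seqs = [list(row["packed_tokens"]) for row in chunk]
        width = max(len(seq) for seq in seqs)
        yield {
            # Packed sequences are often already fixed-length, in which case no padding is added.
            "input_ids": [seq + [pad_id] * (width - len(seq)) for seq in seqs],
            "attention_mask": [[1] * len(seq) + [0] * (width - len(seq)) for seq in seqs],
        }


if __name__ == "__main__":
    fake_rows = [{"packed_tokens": [1, 2, 3]}, {"packed_tokens": [4]}, {"packed_tokens": [5, 6]}]
    for batch in batch_rows(fake_rows, batch_size=2):
        print(batch)
```

Because the function only needs an iterable of dicts, it can wrap the dataset from the example above directly, for instance `batch_rows(ds, batch_size=8)`.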

ops CLI

# compact small parquet files to ~512MB parts
stillwater ops compact --input /path/out/08b_minhash --target-mb 512 --output /path/out/08b_minhash_compact

# remove the temp files and empty dirs
stillwater ops vacuum --input /path/out/08b_minhash

# make a directory snapshot copy
stillwater ops snapshot --input /path/out/08b_minhash_compact --target-root /path/_snapshots

bloom filter exact dedup

stillwater dedup_bloom \
  --input /path/out/07_quality_clf \
  --output /path/out/08_exact_bloom \
  --config configs/cc_en.yaml
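The trade-off behind Bloom-filter dedup is that membership is tracked in a fixed-size bit array instead of a full set of hashes, at the cost of a small false-positive rate (a few unique documents may be dropped as "seen"). The sketch below shows the general technique in plain Python. The class, its sizing formulas (the standard ones for a target error rate), and the double-hashing scheme over SHA-256 are illustrative assumptions, not stillwater's implementation.

```python
import hashlib
import math
from collections.abc import Iterable, Iterator


class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        self.n_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.n_hashes = max(1, round(self.n_bits / capacity * math.log(2)))
        self.bits = bytearray((self.n_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, item: str) -> bool:
        """Insert `item`; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def dedup_exact(docs: Iterable[str], capacity: int) -> list[str]:
    """Keep the first occurrence of each document text, in input order."""
    bloom = BloomFilter(capacity)
    return [doc for doc in docs if not bloom.add(doc)]


if __name__ == "__main__":
    print(dedup_exact(["a", "b", "a", "c", "b"], capacity=1000))
```

Sizing the filter for the expected number of documents matters: if far more items are inserted than `capacity`, the false-positive rate rises well above the target.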

About

End-to-end pre-training data curation toolkit built on Smallpond (DuckDB, 3FS, Arrow) and Ray
