A structured, public-facing dataset repository for 4 product lines:
- Protein Entity
- RNA Entity
- Small Molecule Entity
- Cross-Entity Interactions (PPI / PSI / RPI)
If you only care about downloading datasets, start here:
- Unified product index: `release/index.json`
- Product pointers: `products/*/current.json`
- One-command downloader: `scripts/download_dataset.py`

Download the latest release of each product:

    # Protein
    python3 scripts/download_dataset.py --product protein --version latest
    # RNA
    python3 scripts/download_dataset.py --product rna --version latest --merge-chunks
    # Molecule
    python3 scripts/download_dataset.py --product molecule --version latest
    # Interaction
    python3 scripts/download_dataset.py --product interaction --version latest --merge-chunks

| Product | Scope | Latest pointer | Distribution mode | Primary manifest |
|---|---|---|---|---|
| Protein | Human protein L1 entity + provenance | `products/protein/current.json` | `repository_snapshot` | `pipelines/protein/reports/protein_master_v6.manifest.json` |
| RNA | Human RNA L1/L2 | `products/rna/current.json` | `github_release` | `build/releases/rna-l1l2-v2/manifest.json` |
| Molecule | Small-molecule L1/L2 | `products/molecule/current.json` | `github_release` | `dist/molecule-l1l2-v2/manifest.json` |
| Interaction | Cross-entity L2 (PPI/PSI/RPI) | `products/interaction/current.json` | `github_release` | `pipelines/interaction_release_local/reports/interaction_l2_v1.release_assets_manifest.json` |

Source of truth for latest versions: `release/index.json`
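
To inspect the unified index and the product pointers programmatically instead of going through the downloader, a minimal sketch could look like the following. It assumes only that `release/index.json` and each `products/*/current.json` are valid JSON; the helper name `load_entry_points` is hypothetical, and no field names are assumed because the schema is not reproduced in this README.

```python
import json
from pathlib import Path


def load_entry_points(repo_root="."):
    """Load the unified index and every product pointer as plain Python objects.

    Hypothetical helper: it assumes only that the files are valid JSON and
    makes no assumption about the fields they contain.
    """
    root = Path(repo_root)
    index_path = root / "release" / "index.json"
    index = json.loads(index_path.read_text(encoding="utf-8"))

    pointers = {}
    for pointer_path in sorted(root.glob("products/*/current.json")):
        product = pointer_path.parent.name  # directory name, e.g. "protein"
        pointers[product] = json.loads(pointer_path.read_text(encoding="utf-8"))
    return index, pointers
```

Calling `load_entry_points(".")` from the repository root returns the parsed index plus a dictionary keyed by product directory name, which is enough to explore the structure before relying on any particular field.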
After running the downloader:

    downloads/
      protein/<version>/
      rna/<version>/
      molecule/<version>/
      interaction/<version>/

For large assets:
- use `--merge-chunks` to merge `*.part.000` files
- use `--decompress` to decompress `.gz`/`.zst`
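
What merging and decompressing amount to can be pictured with a short Python sketch. This is not the downloader's actual implementation. It assumes chunks are named `<asset>.part.000`, `<asset>.part.001`, and so on, and that they are plain byte slices to be concatenated in suffix order. It covers only `.gz` through the standard-library `gzip` module, since `.zst` would need a third-party package. Both function names are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def merge_chunks(first_chunk, out_path=None):
    """Concatenate <name>.part.000, <name>.part.001, ... into <name>.

    Assumption (not confirmed by the repository): chunks are plain byte
    slices of the original file and sort correctly by their suffix.
    """
    first_chunk = Path(first_chunk)
    if ".part." not in first_chunk.name:
        raise ValueError(f"not a chunk file: {first_chunk}")
    base_name = first_chunk.name.rsplit(".part.", 1)[0]
    out = Path(out_path) if out_path else first_chunk.with_name(base_name)

    # Zero-padded suffixes sort lexicographically in the right order.
    chunks = sorted(first_chunk.parent.glob(base_name + ".part.*"))
    with open(out, "wb") as merged:
        for chunk in chunks:
            with open(chunk, "rb") as part:
                shutil.copyfileobj(part, merged)
    return out


def decompress_gz(gz_path):
    """Write the decompressed content next to the .gz file and return its path."""
    gz_path = Path(gz_path)
    out = gz_path.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out
```

Streaming with `shutil.copyfileobj` keeps memory use flat, which matters for the large assets these flags are meant for.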

Run before trusting a release snapshot:

    # rebuild unified index
    python3 scripts/build_release_index.py
    # validate release metadata schema + path integrity
    python3 scripts/validate_release_index.py \
      --index release/index.json \
      --schema release/schema/index.schema.json \
      --repo-root .
    # consistency checks (manifest / validation / asset integrity)
    python3 scripts/check_release_consistency.py \
      --index release/index.json \
      --out release/consistency_report.json

Optional regression tests:

    pytest -q tests/release

Repository layout:

    products/    # product metadata and latest release pointers
    release/     # unified index + schema + consistency report
    pipelines/   # per-domain ETL contracts/reports/scripts
    docs/        # architecture, release policy, quickstart
    scripts/     # index build / validation / consistency / download tooling
    tests/       # release metadata regression tests

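
As a rough illustration of the asset-integrity part of the consistency checks mentioned earlier, the sketch below recomputes SHA-256 digests and compares them with expected values. The `expected` mapping (relative path to hex digest) is a hypothetical input: how the real manifests record checksums is not shown in this README, and `scripts/check_release_consistency.py` may work differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_integrity_problems(expected, root="."):
    """Return {relative_path: reason} for every missing or mismatching asset.

    `expected` is a hypothetical {relative_path: sha256_hex} mapping.
    """
    problems = {}
    for rel_path, expected_hex in expected.items():
        asset = Path(root) / rel_path
        if not asset.is_file():
            problems[rel_path] = "missing"
        elif sha256_of(asset) != expected_hex.lower():
            problems[rel_path] = "checksum mismatch"
    return problems
```

An empty result means every listed asset exists and matches its expected digest.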
Key docs:
- `docs/architecture.md`
- `docs/release-policy.md`
- `docs/quickstart.md`

This repository contains historical pipeline modules. For external consumption, do not start from random pipeline folders.
Always start from:
- `release/index.json`
- `products/*/current.json`
- `scripts/download_dataset.py`

This is the stable public contract.
MIT — see LICENSE.