Research-use only: PRDT is a lab notebook for data-cleaning and QA. Do not deploy it for diagnosis or treatment decisions unless your workflow is IRB-approved, validated, and overseen by licensed clinicians with HIPAA-compliant safeguards.
Psych Research Data Toolkit (PRDT) is a reproducible CLI for cleaning, anonymizing, and summarizing mental-health CSVs. It handles HMAC-based identifiers, PHI scrubbing, descriptive stats, reliability checks, and visualization so trainees can ship trustworthy analyses without reinventing pipelines.
- `prdt demo` — one-command demo with bundled sample data (outputs in `outputs/demo_*`)
- `prdt --config configs/anxiety.toml` — full pipeline with the bundled profile
- `prdt run --input data/examples/surveys.csv --outdir outputs/demo --score-cols phq9_total gad7_total`
- `prdt doctor` — environment check (Python, deps, key, input path if provided)
- `--dry-run` to validate without writing outputs; `--html-report` to emit a simple HTML summary
- Requirements: Python 3.9+ with pandas, numpy, matplotlib (installed via the wheel or `pip install -e ".[dev]"`)
Record a 60-second screen capture of `prdt run` plus sample plots and place it in `docs/` (e.g., `docs/prdt-demo.gif`). Use it in admissions decks or lab walk-throughs.
- Normalize headers and basic CSV cleaning
- HMAC-based ID anonymization via `PRDT_ANON_KEY` (see the sketch after this list)
- Descriptives, Pearson correlations, Cronbach's alpha & McDonald's ω (overall + per-scale), missingness counts + percents (JSON)
- Optional alert thresholds for reliability and column-level missingness
- Automatic PHI detector (emails/phones/SSNs/etc.) with quarantine + alerts
- Built-in scoring for PHQ-9, GAD-7, PCL-5, AUDIT + custom scale definitions (alpha, omega, item-total stats)
- Data dictionary + run manifest per execution for reproducibility
- Drift detection compares scale means vs last run (`drift.json` + alerts)
- Histograms for selected score columns + missingness bar chart
- Simple time-trend plot by participant
- CLI subcommands for focused workflows (`clean`, `stats`, `plot`, `run`)
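For intuition, the HMAC step derives a stable pseudonym from each participant ID and the secret key: the same ID always maps to the same token, and the mapping cannot be reversed without the key. A minimal sketch of the idea using `openssl` (PRDT's exact digest and formatting may differ):

```bash
# Sketch only: keyed HMAC-SHA256 of a participant ID yields a stable,
# non-reversible pseudonym. PRDT's internal implementation may differ.
export PRDT_ANON_KEY="$(openssl rand -hex 32)"
printf '%s' "P001" | openssl dgst -sha256 -hmac "$PRDT_ANON_KEY" | awk '{print $2}'
```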
- Create and activate the virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install PRDT and dev dependencies:

  ```bash
  pip install -e ".[dev]"
  # optional: pinned environment
  pip install -r requirements-lock.txt
  ```

- Configure environment variables (export before running PRDT):
| Variable | Purpose | Example |
|---|---|---|
| `PRDT_ANON_KEY` | Required when `participant_id` exists; used for HMAC anonymization | `export PRDT_ANON_KEY="$(openssl rand -hex 32)"` |
| `MPLCONFIGDIR` / `XDG_CACHE_HOME` (optional) | Point Matplotlib caches at writable dirs on locked-down machines | `export MPLCONFIGDIR=$PWD/.cache/mpl` |
| `PRDT_DISABLE_PLOTS` (optional) | Skip plot rendering for headless/CI runs | `export PRDT_DISABLE_PLOTS=1` |
On Windows PowerShell use `setx PRDT_ANON_KEY (New-Guid)` and restart the shell.

Headless or locked-down machines: set `PRDT_DISABLE_PLOTS=1` to skip Matplotlib renders and point `MPLCONFIGDIR`/`XDG_CACHE_HOME` at writable paths (examples above) to avoid cache errors.
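For example, a headless CI job might combine the optional variables like this (paths are illustrative):

```bash
# Headless/CI setup: skip Matplotlib rendering and keep caches in the workspace
export PRDT_DISABLE_PLOTS=1
export MPLCONFIGDIR="$PWD/.cache/mpl"
export XDG_CACHE_HOME="$PWD/.cache"
mkdir -p "$MPLCONFIGDIR"
```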
- Run the sample workflow to verify everything works:

  ```bash
  prdt run --input data/examples/surveys.csv --outdir outputs/run1 \
    --score-cols phq9_total gad7_total
  ```
Use `python -m prdt.cli run ...` if the console script is unavailable. Add `--skip-anon` if you need to retain `participant_id` for local debugging only. Fastest first run: `prdt demo` (bundled sample data, outputs under `outputs/demo_*`, no config/key required).
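Put together, a local debugging run without the console script might look like this (flags as documented above):

```bash
# Module invocation; --skip-anon retains raw participant_id values,
# so keep the output strictly local
python -m prdt.cli run --input data/examples/surveys.csv --outdir outputs/debug \
  --score-cols phq9_total gad7_total --skip-anon
```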
| Command | Purpose | Example |
|---|---|---|
| `prdt clean` | Clean + anonymize CSV, emit `interim_clean.csv` and data dictionary | `prdt clean --input data/examples/surveys.csv --outdir outputs/clean` |
| `prdt stats [--alpha]` | Validate score columns, compute descriptives/reliability/missingness; `--alpha` prints Cronbach's α (formula below) for `--score-cols` | `prdt stats --input data/examples/surveys.csv --outdir outputs/stats --score-cols phq9_total gad7_total --alpha` |
| `prdt plot` | Generate histogram, trend, and missingness plots for selected columns | `prdt plot --input data/examples/surveys.csv --outdir outputs/plots --score-cols phq9_total` |
| `prdt run` | Full pipeline (clean + stats + plot); default command when omitted | `prdt run --input data/examples/surveys.csv --outdir outputs/run1` |
| `prdt doctor` | Environment check (Python, deps, key, input path) | `prdt doctor` |
| `--dry-run` | Validate inputs/config and stop before writing outputs | `prdt run ... --dry-run` |
| `--html-report` | Save a simple HTML summary alongside `report.json` | `prdt stats ... --html-report` |
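For reference, the Cronbach's α that `prdt stats --alpha` reports is the standard internal-consistency coefficient for a k-item scale (assuming PRDT uses the textbook definition):

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_{y_i}}{\sigma^2_x}\right)$$

where $\sigma^2_{y_i}$ is the variance of item $i$ and $\sigma^2_x$ is the variance of the summed total score.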
- Create a TOML profile to avoid repeating CLI flags. Paths in the file are resolved relative to the config's directory.
- Define reliability groups under `[prdt.scales.<name>]` so each scale gets its own Cronbach's alpha and McDonald's ω entries in `report.json`.
- Configure custom scale scoring under `[prdt.score]` and `[prdt.score.definitions.*]` (items, method, output column).
- Configure alert thresholds under `[prdt.alerts]` to highlight high missingness or low reliability in `report.json`.
- Example (`configs/anxiety.toml`):

  ```toml
  [prdt]
  command = "run"
  input = "../data/examples/surveys.csv"
  outdir = "../outputs/anxiety-profile"
  score_cols = ["phq9_total", "gad7_total"]
  skip_anon = false

  [prdt.score]
  scales = ["phq9", "gad7", "phq2_custom"]

  [prdt.score.definitions.phq2_custom]
  items = ["phq9_item1", "phq9_item2"]
  method = "sum"
  output = "phq2_score"

  [prdt.scales.phq9]
  items = ["phq9_item1", "phq9_item2"]

  [prdt.scales.gad7]
  items = ["gad7_item1", "gad7_item2"]

  [prdt.alerts]
  missing_pct = 10.0

  [prdt.alerts.reliability]
  cronbach_alpha_min = 0.75
  mcdonald_omega_min = 0.75

  [prdt.schema]
  required = ["participant_id", "date"]

  [prdt.schema.types]
  phq9_item1 = "numeric"
  gad7_item1 = "numeric"

  [prdt.schema.ranges.phq9_item1]
  min = 0
  max = 3

  [prdt.phi]
  keywords = ["contact", "address"]
  ignore_columns = ["note"]
  ```
- Add additional `prdt.schema.ranges.*` tables for any numeric column that must stay within known bounds (alerts and manifests report violations).
- Invoke with `prdt --config configs/anxiety.toml` (you can still override any option on the command line).
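For instance, to reuse the bundled profile but redirect outputs, the CLI flag takes precedence over the TOML value:

```bash
prdt --config configs/anxiety.toml --outdir outputs/anxiety-rerun
```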
- Ensure your virtualenv is active and `PRDT_ANON_KEY` is set (see Quickstart).
- Execute the bundled profile (mirrors a typical PHQ-9/GAD-7 workflow):

  ```bash
  prdt --config configs/anxiety.toml
  ```
- Inspect outputs under `outputs/anxiety-profile/`:
  - `interim_clean.csv`: cleaned + anonymized data.
  - `report.json`: descriptives, correlations, reliability, missingness, alerts.
  - `alerts.json`: only present when a threshold is exceeded.
  - `data_dictionary.csv`: snapshot of every column's dtype and completeness.
  - `run_manifest.json`: provenance (version, git SHA, config hash, input hash, timestamps).
  - `phi_quarantine.csv`: columns removed for PHI risk (e.g., emails in `contact`).
  - `hist_phq9_total.png`, `hist_gad7_total.png`, `trend_phq9_total.png`, `missingness.png`.
  - `scale_scores` section inside `report.json` summarizing mean/std and severity labels based on cutoffs.
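To spot-check a run from the shell, `jq` (if installed) makes the JSON outputs easy to skim; the key names below follow the samples in this README:

```bash
jq '.alerts' outputs/anxiety-profile/report.json    # alert entries, if any
jq '.' outputs/anxiety-profile/run_manifest.json    # provenance for the run
```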
Sample `alerts.json` (generated because every `note` entry is missing, GAD-7 reliability is low, and `contact` contains emails in the example data):

```json
[
  {"type": "missingness", "column": "note", "percent": 100.0, "threshold": 10.0},
  {"type": "reliability", "target": "gad7", "metric": "cronbach_alpha", "value": 0.0, "threshold": 0.75},
  {"type": "phi", "column": "contact", "matches": [{"pattern": "email", "count": 2}], "message": "PHI-like data detected in column 'contact'. Column removed from outputs."}
]
```

The CLI also prints a short summary so you notice issues immediately.
Run the same profile again after a new batch of data and PRDT will also emit `drift.json` whenever a scale's mean shifts by ≥1 point compared with the previous run.
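In practice the drift check is just a second run of the same profile after new data arrives:

```bash
# Re-run after appending a new batch to the input CSV; drift.json only
# appears when a scale mean moved >= 1 point vs the prior run
prdt --config configs/anxiety.toml
cat outputs/anxiety-profile/drift.json 2>/dev/null || echo "no drift recorded"
```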
Dataset A lives outside the repo (secure share). The `configs/dataset_a.toml` profile describes how to clean/anonymize it:

```bash
export PRDT_ANON_KEY="..."  # rotate per KEY_HANDLING.md
prdt --config configs/dataset_a.toml
```

- `input` points at `data/dataset_a/raw.csv` (drop-in placeholder; update to the secure path on your machine).
- Outputs land in `outputs/dataset-a/` with the full manifest/report stack so you can attach them to OSF once PHI checks pass.
- Score columns cover PHQ-9, GAD-7, and PCL-5 totals; adjust the config if the instrument list changes.
- `scripts/run_dataset_a.sh` wraps the CLI so you can override `DATASET_A_INPUT`/`DATASET_A_OUTDIR` without editing the profile.
- See `docs/dataset_a.md` for the full checklist plus demo artifacts under `docs/assets/dataset-a-demo/`.
- OSF demo bundle (synthetic Dataset A) with manifest + config: Stephen M. Jerge — Clinical Science Lab (DOI: https://doi.org/10.17605/OSF.IO/BX76K); PRDT component: https://osf.io/qs8ag/; Dataset A sub-component: https://osf.io/n4buw/ (generated Nov 14, 2025 via `scripts/run_dataset_a_osf_demo.sh`).
Copy this section into `docs/PortfolioHub.md` once Dataset A publishes to OSF so the workflow is discoverable.
When admissions reviewers or collaborators need the Dataset A artifact, regenerate it with the helper script and stash the outputs inside `docs/assets/dataset-a-osf/`:

```bash
export PRDT_ANON_KEY="$(openssl rand -hex 32)"
DATASET_A_INPUT="/secure/raw/dataset_a.csv" \
DATASET_A_OUTDIR="$(pwd)/outputs/dataset-a-osf" \
scripts/run_dataset_a.sh
```

Copy the sanitized outputs (`interim_clean.csv`, `report.json`, plots, alerts) into a staging folder, zip them (`dataset-a-osf-bundle-YYYYMMDD.zip`), and drop the archive under `docs/assets/dataset-a-osf/bundle/`. Move the latest `run_manifest_*.json` into `docs/assets/dataset-a-osf/provenance/` so the OSF README can cite the exact command + git SHA, as sketched below. See `docs/assets/dataset-a-osf/README.md` for the upload checklist.
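A sketch of those packaging steps as shell commands (the file list and date stamp are illustrative; adjust to what your run actually produced):

```bash
STAGE="$(mktemp -d)"
cp outputs/dataset-a-osf/interim_clean.csv outputs/dataset-a-osf/report.json "$STAGE"/
cp outputs/dataset-a-osf/alerts.json "$STAGE"/ 2>/dev/null || true  # only exists when thresholds fired
cp outputs/dataset-a-osf/*.png "$STAGE"/
zip -j "dataset-a-osf-bundle-$(date +%Y%m%d).zip" "$STAGE"/*
mkdir -p docs/assets/dataset-a-osf/bundle docs/assets/dataset-a-osf/provenance
mv dataset-a-osf-bundle-*.zip docs/assets/dataset-a-osf/bundle/
mv outputs/dataset-a-osf/run_manifest_*.json docs/assets/dataset-a-osf/provenance/
```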
- `/docs/README.md`: links to a non-technical walkthrough, concept notes, and a copy/paste quickstart for new teammates or admissions reviewers.
- `/docs/clinician_quickstart.md`: one-page instructions + troubleshooting for clinicians.
`report.json` contains an `alerts` array. Each entry describes one of:

- `type = "missingness"` when a column's missing percent exceeds `missing_pct`.
- `type = "reliability"` when Cronbach's α or McDonald's ω drops below the configured minimum (overall or per-scale).
- `type = "phi"` when the PHI scanner removes columns (emails, phones, MRNs, etc.).

Missingness and reliability alerts are informational only; the CLI still writes outputs so you can review and decide on follow-up cleaning. When alerts exist, the CLI prints a brief summary and writes `alerts.json` for quick review.
PRDT now aborts when PHI-like columns are detected to prevent accidental exports.
- Inspect the columns listed in the error and clean or drop them.
- If the columns are expected (e.g., you plan to scrub them downstream), either list them under `[prdt.phi.allow_columns]` or pass `--allow-phi-export` (config equivalent: `allow_phi_export = true`).
- `phi_quarantine.csv` still records the flagged data for auditing, even when the guardrail fires.
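For example, when the flagged columns are expected and will be scrubbed downstream, acknowledge them explicitly with the documented override:

```bash
# Guardrail override: proceed despite PHI-like columns;
# phi_quarantine.csv still records them for auditing
prdt run --input data/examples/surveys.csv --outdir outputs/run1 --allow-phi-export
```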
- `interim_clean.csv`
- `report.json` (descriptives, correlations, reliability, missingness, alerts)
- `alerts.json` (only created when thresholds trigger)
- `data_dictionary.csv` (column name, dtype, missing pct, example)
- `run_manifest.json` (PRDT version, git SHA, config hash, input hash)
- `phi_quarantine.csv` (only created when columns are removed for PHI risk)
- `drift.json` (only created when scale means change ≥ 1 point vs prior run)
- `hist_*.png`, `trend_*.png`, `missingness.png`
- `scale_summary.png`, `scale_items_<scale>.png` (only when scale scoring is enabled)
- Build artifacts (already present under `dist/`, or run `python -m build`).
- Install the wheel anywhere; no repository clone required:

  ```bash
  pip install dist/prdt-0.1.4-py3-none-any.whl
  ```

Attach the wheel to GitHub Releases so reviewers can `pip install` PRDT directly.
- Never commit PHI/PII; keep only synthetic data in-repo
- Externalize secrets via `PRDT_ANON_KEY`
- Read `KEY_HANDLING.md` for best practices (long random key, `.env` usage, rotation)
- Prefer small, incremental commits with clear messages
- Record version/tags in release notes
- Additional reliability metrics (e.g., McDonald’s ω)
- More granular missingness visualizations
- Configurable alert thresholds for reliability/missingness
- CI job to run `pytest`