Schema-driven extraction for pediatric and NICU genetic testing reports, with REDCap-compatible import/export and a Quartz-oriented vLLM workflow.
```text
.
├── README.md
├── JOURNEY.md
├── main.py
├── pull_container.sbatch
├── redcap_columns_from_dictionary.json
├── run.sbatch
└── schema.py
```
This repository does four things:
- extracts structured genetics data from report text or PDFs
- preserves report-level linkage among test metadata, findings, HPO terms, and diagnoses
- imports from and exports back to the current flat REDCap layout
- runs batch extraction on IU Quartz with a local vLLM server inside Apptainer
- `main.py` — main CLI for extraction and REDCap import/export
- `schema.py` — single source of truth for the extraction schema and validation
- `run.sbatch` — main Quartz batch entry point; starts vLLM and runs extraction
- `pull_container.sbatch` — one-time Slurm job to build the Apptainer image
- `redcap_columns_from_dictionary.json` — current derived REDCap column order
- `JOURNEY.md` — working status, blockers, and next steps
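The real extraction schema lives in `schema.py`. As a rough illustration of the linked-record idea — test metadata, findings, HPO terms, and diagnoses kept together per report — a Pydantic v2 model might look like the sketch below. All field names here are hypothetical; the actual fields and validators are defined in `schema.py`.

```python
from typing import List, Optional
from pydantic import BaseModel

class Finding(BaseModel):
    # One reported variant; field names are illustrative only.
    gene: str
    variant: Optional[str] = None
    classification: Optional[str] = None  # e.g. "pathogenic", "VUS"

class ReportExtraction(BaseModel):
    # One record per report, preserving linkage among sections.
    report_id: str
    test_type: Optional[str] = None
    findings: List[Finding] = []
    hpo_terms: List[str] = []
    diagnoses: List[str] = []

rec = ReportExtraction(
    report_id="rpt-001",
    findings=[Finding(gene="SCN1A", variant="c.2589+3A>T", classification="VUS")],
    hpo_terms=["HP:0001250"],
)
```

Keeping findings nested under the report (rather than as flat numbered columns) is what preserves report-level linkage until export time.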
Create the environment once inside the repo.
```sh
python3.12 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip setuptools wheel
python -m pip install "pydantic>=2,<3"
```

Optional extras:
```sh
# PDF input
python -m pip install "pymupdf>=1.24" "pdfplumber>=0.11"

# data dictionary parsing
python -m pip install "openpyxl>=3.1"
```

Batch jobs should use the repo-local `.venv` directly. Do not rely on Quartz Python modules inside `run.sbatch`.
Keep model and container caches on project storage, not in home.
```sh
export HF_HOME=/N/project/textattn/hf_cache
mkdir -p "$HF_HOME"
printf '%s' "$HF_TOKEN" > "$HF_HOME/token"
chmod 600 "$HF_HOME/token"
```

Recommended Apptainer cache locations:
```sh
export APPTAINER_CACHEDIR=/N/project/textattn/apptainer_cache
export APPTAINER_TMPDIR=/N/project/textattn/apptainer_tmp
mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"
```

`run.sbatch` reads the Hugging Face token from `$HF_HOME/token` automatically.
Do this through Slurm, not on the login node.
```sh
sbatch pull_container.sbatch
```

Expected image:

```text
containers/vllm-openai-latest.sif
```
A successful sanity check looks like:
```sh
module purge
module load apptainer/1.3.6
apptainer exec --cleanenv containers/vllm-openai-latest.sif /bin/sh -lc 'python3 -c "import vllm; print(vllm.__version__)"'
```

`run.sbatch` is the main entry point and should remain the single production script.
It is responsible for:
- loading Apptainer
- using the repo-local `.venv`
- writing the GPT-OSS Hopper config
- starting a local vLLM server inside Apptainer
- waiting for readiness
- running `main.py` over the report files
- writing JSON, CSV, and status outputs
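The readiness wait amounts to polling the local server until it responds. A minimal sketch, assuming the vLLM OpenAI-compatible server's `/health` endpoint and a default port of 8000 (both assumptions; `run.sbatch` may use different values):

```python
import time
import urllib.error
import urllib.request

def wait_for_vllm(base_url: str = "http://127.0.0.1:8000",
                  timeout: float = 1800, interval: float = 10) -> bool:
    """Poll the server's /health endpoint until it answers 200,
    or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; keep waiting
        time.sleep(interval)
    return False
```

A generous timeout matters here: loading a 120B model onto the GPUs can take many minutes before the server starts answering.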
```sh
mkdir -p logs outputs

# smoke test: one file
sbatch --export=ALL,LIMIT=1 run.sbatch
# small batch
sbatch --export=ALL,LIMIT=5 run.sbatch
# full run: LIMIT=0 means all files
sbatch --export=ALL,LIMIT=0 run.sbatch
```

- On Quartz, `gpu` and `gpu-interactive` are V100 partitions.
- On Quartz, `hopper` is the H100 partition. `gpt-oss-120b` should run on `hopper`, not on the V100 partitions.
- Hopper access is currently allocation-dependent: if the Slurm account does not have a Hopper QOS, jobs fail with `Invalid qos specification`.
- Check the live partition layout with:

  ```sh
  sinfo -o "%P %G %N" | egrep 'hopper|gpu|interactive'
  ```

- Check account/QOS associations with:

  ```sh
  sacctmgr -nP show assoc where user=$USER format=cluster,account,user,partition,qos%50,defaultqos%30
  ```

Useful overrides at submission time:
- `INPUT_DIR` — folder containing report `.txt` files
- `LIMIT` — number of files to process; `0` means all files
- `MODEL` — defaults to `openai/gpt-oss-120b`
- `HF_HOME` — Hugging Face cache root
- `IMG_FILE` — Apptainer image path
- `RUN_DIR` — output directory for the current job
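The `LIMIT` convention (`0` means all files) is easy to get wrong on the consuming side. A sketch of how a driver script might apply it, under the assumption that reports are plain `.txt` files in `INPUT_DIR` (the helper name is hypothetical):

```python
import os
from pathlib import Path

def select_reports(input_dir: Path, limit: int = 0) -> list[Path]:
    """Apply the LIMIT convention: 0 (or negative) means all files,
    N > 0 means the first N in sorted order."""
    files = sorted(input_dir.glob("*.txt"))
    return files if limit <= 0 else files[:limit]

# LIMIT arrives via the environment when submitted with sbatch --export.
limit = int(os.environ.get("LIMIT", "0"))
```

Sorting before slicing keeps `LIMIT=1` deterministic across resubmissions, which matters for comparing repeated smoke tests.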
Example:
```sh
sbatch --export=ALL,INPUT_DIR=/N/project/_A-Aa-a0-9/note/report,LIMIT=10 run.sbatch
```

Log files:

```text
logs/medace_quartz_test_<jobid>.log
logs/medace_quartz_test_<jobid>.err
logs/vllm_<jobid>.log
logs/vllm_<jobid>.err
logs/<report>_<jobid>.client.log
logs/<report>_<jobid>.client.err
```

The Slurm `.err` file is the main progress log.
```text
outputs/quartz_<jobid>/json/*.extracted.json
outputs/quartz_<jobid>/extractions.csv
outputs/quartz_<jobid>/status.tsv
```
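Downstream analysis usually starts by gathering the per-report JSON files back into one collection. A minimal sketch, assuming only the directory layout above (the function name and the example job id are hypothetical):

```python
import json
from pathlib import Path

def collect_extractions(run_dir: Path) -> list[dict]:
    """Load every per-report JSON under <run_dir>/json/ into one list,
    in sorted (deterministic) filename order."""
    return [
        json.loads(p.read_text())
        for p in sorted(run_dir.glob("json/*.extracted.json"))
    ]

# e.g. records = collect_extractions(Path("outputs/quartz_12345"))
```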
For debugging or non-Slurm use:
```sh
source .venv/bin/activate
python main.py --help
```

Common tasks include:
- report extraction
- REDCap import to linked JSON
- linked JSON export back to REDCap-compatible rows
- printing canonical REDCap column order from a data dictionary
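The export step is where linkage is deliberately flattened back into REDCap's single-row layout. A sketch of the idea — every column name below is illustrative; the canonical names and order come from `redcap_columns_from_dictionary.json`:

```python
def to_redcap_row(record: dict, max_findings: int = 3) -> dict:
    """Flatten one linked record into numbered flat columns.
    Column names here are hypothetical stand-ins for the real
    dictionary-derived names."""
    row = {"report_id": record.get("report_id", "")}
    # Repeated findings become numbered column groups, REDCap-flat style.
    for i, finding in enumerate(record.get("findings", [])[:max_findings], start=1):
        row[f"finding_{i}_gene"] = finding.get("gene", "")
        row[f"finding_{i}_variant"] = finding.get("variant", "")
    # Multi-valued HPO terms collapse into one delimited cell.
    row["hpo_terms"] = ";".join(record.get("hpo_terms", []))
    return row
```

The cap on `max_findings` is forced by the flat layout itself: a fixed column set can only hold a fixed number of repeated findings per report.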
- If `apptainer pull` is killed on the login node, use `pull_container.sbatch`.
- If a Hopper job fails with `Invalid qos specification`, the Slurm account does not currently have Hopper permission.
- If a batch job prints noisy Lmod dependency messages at startup, do not chase them first; confirm whether the repo `.venv` and the Apptainer launch are actually failing.
- If the vLLM server exits before readiness, inspect `logs/vllm_<jobid>.err` first.
The current project state, blockers, and next actions live in `JOURNEY.md`.
MIT