DMP Chef is an open-source (MIT License), Python-based pipeline that drafts funder-compliant Data Management and Sharing Plans (DMPs) using a Large Language Model (LLM), such as Llama 3.3.
It supports two modes entirely in Python:
- RAG: Retrieves related guidance from an indexed document collection and uses it to ground the draft. In this mode, the pipeline can ingest documents, build and search an index, and draft a DMP.
- No-RAG: Generates the draft only from the user's project inputs (no retrieval).
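For intuition, the difference between the two modes boils down to whether the prompt is grounded in retrieved guidance. The following is a simplified, self-contained sketch: the real pipeline uses an indexed vector store rather than word overlap, and the function names here are hypothetical, not the package's actual API.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank guidance snippets by word overlap with the query."""
    words = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(words & set(doc.lower().split())))
    return scored[:k]

def build_prompt(project_inputs: str, corpus: list[str], rag: bool) -> str:
    """Assemble the LLM prompt; RAG mode grounds it in retrieved guidance."""
    if rag:
        context = "\n".join(retrieve(project_inputs, corpus))
        return f"Guidance:\n{context}\n\nDraft a DMP for:\n{project_inputs}"
    return f"Draft a DMP for:\n{project_inputs}"

guidance = ["NIH requires a data sharing timeline.", "Describe data types and formats."]
rag_prompt = build_prompt("genomic data sharing project", guidance, rag=True)
no_rag_prompt = build_prompt("genomic data sharing project", guidance, rag=False)
```

In both modes the project inputs drive the draft; RAG only changes what surrounding context the model sees.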
This project is part of a broader extension of the DMP Tool platform. The ultimate goal is to integrate the DMP Chef pipeline into the DMP Tool platform, providing researchers with a familiar and convenient user interface that does not require any coding knowledge.
Learn more: DMP-Chef.
The overall codebase is organized in alignment with the FAIR-BioRS guidelines. All Python code follows PEP 8 conventions, including consistent formatting, inline comments, and docstrings. Project dependencies are fully captured in requirements.txt. We also retain the dmp-template used inside the prompt template for the DMP generation workflow.
- main.py — Command-line entry point for running the pipeline end-to-end.
- demo.ipynb — Jupyter demo showing how to import and run the pipeline.
```
dmpchef/
├── main.py              # CLI entry point (run pipeline end-to-end)
├── README.md            # Project overview + usage
├── requirements.txt     # Python dependencies
├── setup.py             # Packaging (editable installs via pip install -e .)
├── pyproject.toml       # Build system config (wheel builds)
├── MANIFEST.in          # Include non-code files in distributions
├── demo.ipynb           # Notebook demo: import + run generate()
├── LICENSE
├── .gitignore
├── .env                 # Local env vars (do not commit)
│
├── dmpchef/             # Installable Python package (public API)
│   ├── __init__.py      # Exports: generate, draft
│   └── api.py           # Importable API used by notebooks/backends
│
├── config/              # Configuration
│   ├── config.yaml      # Main settings (models, paths, retriever params)
│   ├── config_schema.py # Pydantic schema for DMPCHEF-Pipeline config
│   └── schema_validate.py  # Validation/schema helpers for input.json
│
├── data/                # Local workspace data + artifacts (not guaranteed in wheel)
│   ├── inputs/          # Templates + examples
│   │   ├── nih-dms-plan-template.docx  # NIH blank Word template
│   │   └── input.json   # Example request file
│   ├── vector_db/       # Vector index artifacts (e.g., FAISS)
│   │   ├── DMPtools_db/
│   │   ├── NIH_all_db/
│   │   └── NIH_sharing_db/
│   ├── data_ingestion/  # Source PDFs and text from DMPTool, NIH, NIH_sharing, etc.
│   └── outputs/         # Generated artifacts
│       ├── markdown/    # Generated Markdown DMPs
│       ├── docx/        # Generated DOCX DMPs (template-preserving)
│       ├── json/        # DMPTool-compatible JSON outputs
│       └── pdf/         # Optional PDFs converted from DOCX
│
├── src/                 # Core implementation
│   ├── __init__.py
│   ├── core_pipeline.py # Pipeline logic (RAG/no-RAG)
│   ├── Build_index.py   # Build the vector DB index
│   └── NIH_data_ingestion.py  # NIH/DMPTool crawl → export PDFs to data/database
│
├── prompt/              # Prompt templates/utilities
│   └── prompt_library.py
│
├── utils/               # Shared helpers
│   ├── config_loader.py
│   ├── model_loader.py
│   ├── dmptool_json.py
│   ├── nih_docx_writer.py
│   └── download_vector_db.py
│
├── logger/              # Logging utilities
│   ├── __init__.py
│   └── custom_logger.py
│
├── exception/           # Custom exceptions
│   ├── __init__.py
│   └── custom_exception.py
│
├── notebook_DMP_RAG/    # Notebooks/experiments (non-production)
└── venv/                # Local virtualenv
```
```
git clone https://github.com/fairdataihub/dmpchef.git
cd dmpchef
code .
```

Windows (cmd):

```
python -m venv venv
venv\Scripts\activate.bat
```

macOS/Linux:

```
python -m venv venv
source venv/bin/activate
```

Install dependencies:

```
pip install -r requirements.txt
# or (recommended for local dev)
pip install -e .
```

Llama 3.3 (via Ollama)

- Install Ollama from: https://ollama.com/
- Pull llama3.3:

```
ollama pull llama3.3:70b
```

Use demo.ipynb.
Use main.py.
- input.json: A single JSON file (e.g., data/inputs/input.json) that tells the pipeline what to generate. Before execution, the request is validated against the schema using the schema_validate validator.
```json
{
  "config": { ... },
  "inputs": { ... }
}
```

- config.funding.agency: Funder key (string; NIH|NSF|OTHER)
- config.funding.subagency: Sub-agency (string; optional)
- config.pipeline.rag: true/false (boolean; if omitted, the pipeline uses the YAML default rag.enabled)
- config.pipeline.llm: LLM settings (e.g., provider, model_name)
- config.export: Output formats (boolean flags: md, docx, pdf, dmptool_json)
- inputs: A dictionary of user/project fields used to draft the plan, including research_context, data_types, data_source, human_subjects, consent_status, data_volume, etc.
- Markdown: The generated funder-aligned DMP narrative (currently NIH structure).
- DOCX: generated using the funder template (NIH template today) to preserve official formatting.
- PDF: created by converting the DOCX (platform-dependent; typically works on Windows/macOS with Word).
- JSON: a DMPTool-compatible JSON file.
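Each enabled export format lands in its own subfolder of data/outputs/ (per the repository tree). A minimal sketch of that routing, assuming the folder layout shown earlier; the helper name output_path is hypothetical, not part of the package API:

```python
from pathlib import Path

# Export flag -> output subfolder and file extension (per the repo layout).
EXPORT_DIRS = {"md": "markdown", "docx": "docx", "pdf": "pdf", "dmptool_json": "json"}
EXPORT_EXTS = {"md": ".md", "docx": ".docx", "pdf": ".pdf", "dmptool_json": ".json"}

def output_path(base: str, fmt: str, stem: str) -> Path:
    """Map an export flag to the destination path for a generated artifact."""
    return Path(base) / EXPORT_DIRS[fmt] / f"{stem}{EXPORT_EXTS[fmt]}"

md_path = output_path("data/outputs", "md", "my_dmp")
json_path = output_path("data/outputs", "dmptool_json", "my_dmp")
```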
This work is licensed under the MIT License. See LICENSE for more information.
Use GitHub Issues to submit feedback, report problems, or suggest improvements.
You can also fork the repository and submit a Pull Request with your changes.
If you use this code, please cite this repository using the versioned DOI on Zenodo for the specific release you used (instructions will be added once the Zenodo record is available). For now, you can reference the repository here: fairdataihub/dmpchef.