This project is in its early development stages, so stability is not guaranteed, and documentation is limited. We welcome your feedback and contributions as we refine and expand this project together!
This package provides Python classes and utilities for working with metadata based on the Data Documentation Initiative (DDI), an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences.
Detailed documentation is available at https://dataartifex.org/docs/dartfx-ddi/
There are three major flavors of DDI. This package currently supports two of them:
- DDI-Codebook 2.6: The lightweight version of the standard, intended primarily to document simple survey data.
- DDI-CDI 1.0: The new Cross Domain Integration specification. This package uses generated Pydantic models directly aligned with the official DDI-CDI 1.0 specifications.

Key features:
- DDI-Codebook XML Processing: Load, parse, and extract structured metadata from DDI-Codebook documents.
- DDI-CDI Model (v1.0.0): Use definitive, spec-generated Pydantic classes for the full DDI-CDI implementation.
- Assistant Framework: A high-level API (CdiClassAssistant) that simplifies CDI resource creation, automated identifier generation, and method proxying.
- RDF Serialization: Built-in support for serializing CDI models to RDF graphs.
- Cross-Format Conversion: Transform DDI-Codebook metadata into DDI-CDI resources via the CDIF profile.
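
For orientation, the DDI-Codebook documents mentioned above are plain XML: variables are `var` elements under `dataDscr`, each with a `name` attribute and an optional `labl` child. The sketch below uses only the Python standard library and a hand-written, heavily simplified fragment (an assumption for illustration; real files declare an XML namespace and carry far more metadata) to show that shape. It is not how this package parses documents, and `list_variables` is a hypothetical helper, not part of the toolkit.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment for illustration only.
SAMPLE = """\
<codeBook>
  <dataDscr>
    <var name="AGE"><labl>Age of respondent</labl></var>
    <var name="SEX"/>
  </dataDscr>
</codeBook>
"""


def list_variables(xml_text: str) -> list[tuple[str, str]]:
    """Return (name, label) pairs for every var element under dataDscr."""
    root = ET.fromstring(xml_text)
    pairs = []
    # "{*}" matches any namespace or none (Python 3.8+), so the same
    # query also works on namespaced documents.
    for var in root.findall("./{*}dataDscr/{*}var"):
        labl = var.find("{*}labl")
        has_text = labl is not None and labl.text
        label = labl.text.strip() if has_text else "No label"
        pairs.append((var.get("name", ""), label))
    return pairs


if __name__ == "__main__":
    for name, label in list_variables(SAMPLE):
        print(f"Variable: {name}, Label: {label}")
```

The package's own loader exposes the same elements as attributes (dataDscr, var, labl), so the nesting here mirrors what the usage examples later in this README traverse.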
The project uses hatch as the build backend. For faster package management and virtual environment handling, uv is the preferred tool.
# Clone the repository
git clone https://github.com/DataArtifex/ddi-toolkit.git
cd ddi-toolkit
# Install dependencies using uv
uv pip install -e .
# Or using standard pip
pip install -e .

# Install with development dependencies
uv pip install -e ".[dev]"

To load a DDI-Codebook document and list its variables:

from dartfx.ddi import ddicodebook

# Load from file
my_codebook = ddicodebook.loadxml('mycodebook.xml')

# Access variables from data files
if my_codebook.dataDscr:
    for var in my_codebook.dataDscr.var:
        print(f"Variable: {var.name}, Label: {var.labl.content if var.labl else 'No label'}")

The Assistant framework provides a streamlined way to work with DDI-CDI without manually managing complex relationships or identifiers.
from dartfx.ddi.ddicdi import model_1_0_0 as model
from dartfx.ddi.ddicdi.assistants import CdiClassAssistant
# 1. Create a resource (Handles DDI Identification automatically)
dataset = CdiClassAssistant.create(model.DataSet, name="MyDataset")
# 2. Add elements (Methods are bound to the model instances)
variable = CdiClassAssistant.create(model.InstanceVariable, name="AGE")
dataset.add_variable(variable)
# 3. Serialize to RDF
graph = dataset.to_rdf_graph()
print(graph.serialize(format="turtle"))

You can transform legacy DDI-Codebook 2.6 metadata into DDI-CDI 1.0 resources following the CDIF (Cross-Domain Integration Framework) profile.
from dartfx.ddi import ddicodebook
from dartfx.ddi.ddicodebook import utils as cb_utils
# 1. Load the DDI-Codebook XML
cb = ddicodebook.loadxml('my_codebook.xml')
# 2. Convert to DDI-CDI Graph
graph = cb_utils.codebook_to_cdif_graph(cb)
# 3. Output as Turtle
print(graph.serialize(format="turtle"))

The toolkit provides a CLI utility, dartfx-ddi, to perform conversions and other operations directly from the terminal.
# Convert DDI-Codebook to CDI (default: Turtle output)
dartfx-ddi ddic2cdi my_codebook.xml
# Convert DDI-Codebook to CDI in XML format
dartfx-ddi ddic2cdi my_codebook.xml --format xml

For advanced users who need to introspect the DDI-CDI specification itself:
from dartfx.ddi.ddicdi.specification import DdiCdiModel
# Load the model from specification files
cdi_spec = DdiCdiModel(root_dir="path/to/ddi-cdi-sources")
# Query classes and relationships
classes = cdi_spec.get_ucmis_classes()

ddi-toolkit/
├── src/dartfx/ddi/
│   ├── ddicodebook/         # DDI-Codebook subpackage
│   │   ├── model.py         # DDI-Codebook 2.6 models
│   │   └── utils.py         # Codebook-specific utilities (e.g., conversion)
│   ├── ddicdi/              # DDI-CDI subpackage
│   │   ├── model_1_0_0.py   # Definitive generated Pydantic models
│   │   ├── assistants.py    # High-level Assistant framework
│   │   ├── specification.py # DDI-CDI spec introspection tools
│   │   └── utils.py         # CDI-specific utilities (e.g., validation)
│   └── utils.py             # Experimental simplified data models
├── tests/                   # Test suite
└── docs/                    # Documentation
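
Because the layout above maps directly onto import paths, one quick way to confirm that an editable install picked everything up is to probe each module. This is a minimal sketch using only the standard library; the module list is derived from the tree above and is an assumption that may drift as the project evolves, and `is_importable` is a hypothetical helper, not part of the toolkit.

```python
import importlib.util

# Module paths derived from the project layout; adjust if it changes.
MODULES = [
    "dartfx.ddi.ddicodebook",
    "dartfx.ddi.ddicodebook.model",
    "dartfx.ddi.ddicodebook.utils",
    "dartfx.ddi.ddicdi.model_1_0_0",
    "dartfx.ddi.ddicdi.assistants",
    "dartfx.ddi.ddicdi.specification",
    "dartfx.ddi.ddicdi.utils",
    "dartfx.ddi.utils",
]


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located in the current environment.

    Parent packages are imported while resolving a dotted name, but the
    target module itself is only located, not executed.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or has no usable spec.
        return False


if __name__ == "__main__":
    for name in MODULES:
        status = "ok" if is_importable(name) else "MISSING"
        print(f"{status:8} {name}")
```

Running the script inside the virtual environment used for installation prints one line per module, which makes a missing or renamed subpackage easy to spot.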

Roadmap:
- Migrate to Pydantic-based models (model_1_0_0.py)
- Implement robust Assistant Framework for resource management
- Automated DDI Identifier and URI management
- CDIF Profile conversion (Codebook to CDI)
- Comprehensive test coverage
- Complete documentation and API reference
- Enhanced RDF deserializer (Graph back to Assistant/Model)
- SQL schema generators and DCAT integration
- Enhanced DDI-Codebook to DDI-CDI conversion mappings
- Integration with LLMs for metadata enrichment

To contribute:
- Fork it!
- Create your feature branch: git checkout -b my-new-feature
- Commit your changes: git commit -am 'Add some feature'
- Push to the branch: git push origin my-new-feature
- Submit a pull request :D