2 changes: 2 additions & 0 deletions .gitignore
@@ -14,3 +14,5 @@ sandbox/
venv/
venvs/
dandischema/_version.py
uv.lock
.cache/
138 changes: 138 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,138 @@
# CLAUDE.md

This file provides guidance to Claude Code when working with code in this
repository.

## Project Overview

dandischema defines the Pydantic v2 metadata models for the DANDI
neurophysiology data archive. It is used by both the dandi-cli client and the
dandi-archive server. Key concerns: model definitions, JSON Schema generation,
metadata validation, schema migration between versions, and asset metadata
aggregation.

## Build/Test Commands

```bash
tox -e py3 # Run full test suite (preferred)
pytest dandischema/ # Run tests directly in active venv
pytest dandischema/tests/test_metadata.py -v -k "test_name" # Single test
tox -e lint # codespell + flake8
tox -e typing # mypy (strict, with pydantic plugin)
```

- `filterwarnings = error` is active — new warnings will fail tests.
- Coverage is collected by default (`--cov=dandischema`).

## Code Style

- **Formatter**: Black (no explicit line-length override → default 88)
- **Import sorting**: isort with `profile = "black"`, `force_sort_within_sections`,
`reverse_relative`
- **Linting**: flake8 (max-line-length=100, ignores E203/W503)
- **Type checking**: mypy strict — `no_implicit_optional`, `warn_return_any`,
`warn_unreachable`, pydantic plugin enabled
- **Pre-commit hooks**: trailing-whitespace, end-of-file-fixer, check-yaml,
check-added-large-files, black, isort, codespell, flake8
- Imports at top of file; avoid function-level imports unless there is a
concrete reason (circular deps, heavy transitive imports)

## Architecture

### Key Modules

| File | Role |
|------|------|
| `models.py` | All Pydantic models (~2000 lines). Class hierarchy rooted at `DandiBaseModel`. |
| `metadata.py` | `validate()`, `migrate()`, `aggregate_assets_summary()`. |
| `consts.py` | `DANDI_SCHEMA_VERSION`, `ALLOWED_INPUT_SCHEMAS`, `ALLOWED_TARGET_SCHEMAS`. |
| `conf.py` | Instance configuration via env vars (`DANDI_INSTANCE_NAME`, etc.). |
| `types.py` | Custom Pydantic types (`ByteSizeJsonSchema`). |
| `utils.py` | JSON schema helpers, `version2tuple()`, `name2title()`. |
| `exceptions.py` | `ValidationError`, `JsonschemaValidationError`, `PydanticValidationError`. |
| `digests/` | `DandiETag` multipart-upload checksum calculation. |
| `datacite/` | DataCite DOI metadata conversion. |

### Model Hierarchy (simplified)

```
DandiBaseModel
├── PropertyValue # recursive (self-referencing)
├── BaseType
│ ├── StandardsType # name, identifier, version, extensions (recursive)
│ ├── ApproachType, AssayType, SampleType, Anatomy, ...
│ └── MeasurementTechniqueType
├── Person, Organization # Contributor subclasses
├── BioSample # recursive (wasDerivedFrom)
├── AssetsSummary # aggregated stats
└── CommonModel
├── Dandiset → PublishedDandiset
└── BareAsset → Asset → PublishedAsset
```

Several models are **self-referencing** (PropertyValue, BioSample,
StandardsType). These require `model_rebuild()` after the class definition.
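That pattern, in miniature (a minimal sketch assuming Pydantic v2; `StandardsTypeSketch` is an illustrative stand-in, not the real model):

```python
from typing import List, Optional

from pydantic import BaseModel


class StandardsTypeSketch(BaseModel):
    # Simplified stand-in for models.StandardsType
    name: Optional[str] = None
    # Forward reference to the class being defined
    extensions: Optional[List["StandardsTypeSketch"]] = None


# Resolve the self-reference after the class body exists
StandardsTypeSketch.model_rebuild()

nested = StandardsTypeSketch(
    name="NWB", extensions=[StandardsTypeSketch(name="ndx-events")]
)
assert nested.extensions[0].name == "ndx-events"
```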

### Data Flow: Asset Metadata Aggregation

1. dandi-cli calls `asset.get_metadata()` → populates `BareAsset` including
per-asset `dataStandard` list
2. Asset metadata is serialized via `model_dump(mode="json", exclude_none=True)`
3. Server calls `aggregate_assets_summary(assets)` →
`_add_asset_to_stats()` per asset → `AssetsSummary`
4. `_add_asset_to_stats()` collects: numberOfBytes, numberOfFiles, approach,
measurementTechnique, variableMeasured, species, subjects, dataStandard
5. `dataStandard` has deprecated path/encoding heuristic fallbacks for old
clients (remove after 2026-12-01)
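The dedup behavior of `_add_asset_to_stats()` can be sketched in isolation (a simplified stand-in for the helper in `metadata.py`, not the real implementation):

```python
def add_if_missing(stats: dict, standard: dict) -> None:
    # Append only unseen standards, comparing whole dicts
    # (not just identifiers), as the real helper does.
    stats.setdefault("dataStandard", [])
    if standard not in stats["dataStandard"]:
        stats["dataStandard"].append(standard)


bids = {
    "schemaKey": "StandardsType",
    "name": "Brain Imaging Data Structure (BIDS)",
    "identifier": "RRID:SCR_016124",
}
stats: dict = {}

# Declared per-asset AND matched by the path heuristic -> still one entry
add_if_missing(stats, bids)
add_if_missing(stats, bids)
assert len(stats["dataStandard"]) == 1
```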

### Pre-instantiated Standard Constants

```python
nwb_standard # RRID:SCR_015242
bids_standard # RRID:SCR_016124
ome_ngff_standard # DOI:10.25504/FAIRsharing.9af712
hed_standard # RRID:SCR_014074
```

These are dicts (`model_dump(mode="json", exclude_none=True)`) used by both
dandischema (heuristic fallbacks) and dandi-cli (per-asset population).

### Vendorization

The schema supports deployment for different DANDI instances. Environment
variables (`DANDI_INSTANCE_NAME`, `DANDI_INSTANCE_IDENTIFIER`,
`DANDI_DOI_PREFIX`, etc.) must be set **before** importing
`dandischema.models`. This dynamically adjusts identifier patterns, DOI
prefixes, license enums, and URL patterns. CI tests multiple vendored
configurations.
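In practice the ordering constraint looks like this (env-var values are illustrative, not real instance settings):

```python
import os

# dandischema.models reads these at import time, so they must be set
# before the FIRST import of the module anywhere in the process.
os.environ["DANDI_INSTANCE_NAME"] = "EXAMPLE"
os.environ["DANDI_INSTANCE_IDENTIFIER"] = "RRID:SCR_000000"  # illustrative
os.environ["DANDI_DOI_PREFIX"] = "10.99999"  # illustrative

# Only now is it safe to import the models module:
# import dandischema.models
```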

## Schema Change Checklist

When adding or removing fields from any model (BareAsset, Dandiset,
AssetsSummary, etc.):

1. **Update `_FIELDS_INTRODUCED` in `metadata.py:migrate()`** if adding a new
**top-level field to Dandiset metadata** — `migrate()` only processes
Dandiset-level dicts (not Asset metadata). Fields on BareAsset or nested
inside existing structures (e.g. new fields on StandardsType) do not need
entries here.

2. **Update `consts.py`** if bumping `DANDI_SCHEMA_VERSION` or adding to
`ALLOWED_INPUT_SCHEMAS`.

3. **Add tests** covering migration/aggregation with the new field.

4. **Coordinate with dandi-cli** — new fields that dandi-cli populates need
backward-compat guards there (check `"field" in Model.model_fields`) until
the minimum dandischema dependency is bumped.
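The guard in step 4 amounts to the following pattern (`set_if_supported` is a hypothetical helper for illustration, not actual dandi-cli code):

```python
def set_if_supported(metadata: dict, model_fields: dict, field: str, value) -> None:
    """Populate `field` only if the installed dandischema model defines it."""
    if field in model_fields:
        metadata[field] = value


# With a new-enough dandischema, model_fields contains the field:
meta: dict = {}
set_if_supported(meta, {"dataStandard": object()}, "dataStandard", [])
assert "dataStandard" in meta

# With an older dandischema, the field is silently skipped:
old_meta: dict = {}
set_if_supported(old_meta, {}, "dataStandard", [])
assert old_meta == {}
```

In real code the second argument would be e.g. `BareAsset.model_fields` from the installed dandischema.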

## Testing Notes

- Tests use `filterwarnings = error` — any new deprecation warning will fail.
- The `clear_dandischema_modules_and_set_env_vars` fixture (conftest.py)
supports testing vendored configurations by clearing cached modules and
setting env vars.
- Network-dependent tests are skipped when `DANDI_TESTS_NONETWORK` is set.
- DataCite tests require `DATACITE_DEV_LOGIN` / `DATACITE_DEV_PASSWORD`.
- `test_models.py:test_duplicate_classes` checks for duplicate field qnames
across models; allowed duplicates are listed explicitly.
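The module-clearing mechanic of that fixture can be sketched as follows (a simplified stand-in, not the actual conftest.py code — the real fixture also sets env vars):

```python
import sys
import types


def clear_dandischema_modules() -> None:
    # Drop cached dandischema modules so the next import re-evaluates
    # the env-var-driven constants in dandischema.models.
    for name in [m for m in sys.modules if m.split(".", 1)[0] == "dandischema"]:
        del sys.modules[name]


# Demonstrate with fake cached modules:
sys.modules["dandischema"] = types.ModuleType("dandischema")
sys.modules["dandischema.models"] = types.ModuleType("dandischema.models")
clear_dandischema_modules()
assert "dandischema" not in sys.modules
assert "dandischema.models" not in sys.modules
```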
3 changes: 2 additions & 1 deletion dandischema/consts.py
@@ -1,6 +1,6 @@
from packaging.version import Version as _Version

DANDI_SCHEMA_VERSION = "0.7.0"
DANDI_SCHEMA_VERSION = "0.8.0"
ALLOWED_INPUT_SCHEMAS = [
"0.4.4",
"0.5.1",
@@ -16,6 +16,7 @@
"0.6.8",
"0.6.9",
"0.6.10",
"0.7.0",
DANDI_SCHEMA_VERSION,
]

23 changes: 22 additions & 1 deletion dandischema/metadata.py
@@ -459,6 +459,22 @@ def migrate(
if "schemaKey" not in obj_migrated:
obj_migrated["schemaKey"] = "Dandiset"

# Prune fields introduced in newer schema versions when target is older.
# Each entry: (version_introduced, list_of_top_level_field_paths_to_remove)
#
# Note: migrate() only processes Dandiset metadata (not Asset metadata).
# BareAsset.dataStandard (new in 0.7.0) does not need pruning here;
# AssetsSummary.dataStandard (nested in assetsSummary) pre-dates 0.7.0.
# StandardsType.version and .extensions are structural additions to an
# existing type — older consumers ignore unknown fields.
_FIELDS_INTRODUCED: list[tuple[str, list[str]]] = [
# ("0.7.0", ["field_on_dandiset"]),
]
for ver, fields in _FIELDS_INTRODUCED:
if target_ver_tuple < version2tuple(ver):
for field_path in fields:
obj_migrated.pop(field_path, None)

# Always update schemaVersion when migrating
obj_migrated["schemaVersion"] = to_version
return obj_migrated
@@ -543,9 +559,14 @@ def add_if_missing(standard: dict) -> None:
if standard not in stats["dataStandard"]:
stats["dataStandard"].append(standard)

# Collect per-asset dataStandard declarations populated by dandi-cli
for standard in assetmeta.get("dataStandard") or []:
add_if_missing(standard)

# DEPRECATED: path/encoding heuristic fallbacks for older clients that do not
# populate per-asset dataStandard. Remove after 2026-12-01.
if "nwb" in assetmeta["encodingFormat"]:
add_if_missing(models.nwb_standard)
# TODO: RF assumption that any .json implies BIDS
if Path(assetmeta["path"]).name == "dataset_description.json":
add_if_missing(models.bids_standard)
if Path(assetmeta["path"]).suffixes == [".ome", ".zarr"]:
27 changes: 27 additions & 0 deletions dandischema/models.py
@@ -860,11 +860,27 @@ class MeasurementTechniqueType(BaseType):
class StandardsType(BaseType):
"""Identifier for data standard used"""

version: Optional[str] = Field(
None,
description="Version of the standard used.",
json_schema_extra={"nskey": "schema"},
)
extensions: Optional[List["StandardsType"]] = Field(
None,
description="Extensions to the standard used in this dataset "
"(e.g. NWB extensions like ndx-*, HED library schemas).",
)
# TODO: consider how to formalize BIDS extensions (BEPs) once BIDS
# has a machine-readable way to declare them.
schemaKey: Literal["StandardsType"] = Field(
"StandardsType", validate_default=True, json_schema_extra={"readOnly": True}
)


# Self-referencing model needs rebuild after class definition
# https://docs.pydantic.dev/latest/concepts/postponed_annotations/#self-referencing-or-recursive-models
StandardsType.model_rebuild()

nwb_standard = StandardsType(
name="Neurodata Without Borders (NWB)",
identifier="RRID:SCR_015242",
@@ -880,6 +896,11 @@ class StandardsType(BaseType):
identifier="DOI:10.25504/FAIRsharing.9af712",
).model_dump(mode="json", exclude_none=True)

hed_standard = StandardsType(
name="Hierarchical Event Descriptors (HED)",
identifier="RRID:SCR_014074",
).model_dump(mode="json", exclude_none=True)


class ContactPoint(DandiBaseModel):
email: Optional[EmailStr] = Field(
@@ -1841,6 +1862,12 @@ class BareAsset(CommonModel):
json_schema_extra={"nskey": "prov"},
)

dataStandard: Optional[List[StandardsType]] = Field(
None,
description="Data standard(s) applicable to this asset.",
json_schema_extra={"readOnly": True},
)

# Bare asset is to be just Asset.
schemaKey: Literal["Asset"] = Field(
"Asset", validate_default=True, json_schema_extra={"readOnly": True}
49 changes: 49 additions & 0 deletions dandischema/tests/test_metadata.py
@@ -758,6 +758,55 @@ def test_aggregation_bids() -> None:
) # only a single entry so we do not duplicate them


def test_hed_standard_structure() -> None:
from dandischema.models import hed_standard

assert hed_standard["schemaKey"] == "StandardsType"
assert hed_standard["name"] == "Hierarchical Event Descriptors (HED)"
assert hed_standard["identifier"] == "RRID:SCR_014074"


def test_aggregate_per_asset_datastandard() -> None:
"""Per-asset dataStandard entries are collected into the summary."""
from dandischema.models import hed_standard

data = [
{
"schemaKey": "Asset",
"schemaVersion": "0.6.0",
"path": "dataset_description.json",
"encodingFormat": "application/json",
"contentSize": 512,
"digest": {"dandi:dandi-etag": "abc123-1"},
"dataStandard": [hed_standard],
},
]
summary = aggregate_assets_summary(data)
assert hed_standard in summary["dataStandard"]
# dataset_description.json also triggers BIDS via deprecated heuristic
assert sum("BIDS" in s.get("name", "") for s in summary["dataStandard"]) == 1


def test_aggregate_per_asset_datastandard_no_duplication() -> None:
"""No duplication when a standard is declared both per-asset and via heuristic."""
from dandischema.models import bids_standard

data = [
{
"schemaKey": "Asset",
"schemaVersion": "0.6.0",
"path": "dataset_description.json",
"encodingFormat": "application/json",
"contentSize": 512,
"digest": {"dandi:dandi-etag": "abc123-1"},
"dataStandard": [bids_standard],
},
]
summary = aggregate_assets_summary(data)
bids_count = sum("BIDS" in s.get("name", "") for s in summary["dataStandard"])
assert bids_count == 1, "BIDS should appear exactly once, not duplicated"


class TestValidateObjJson:
"""
Tests for `_validate_obj_json()`
7 changes: 6 additions & 1 deletion dandischema/tests/test_models.py
@@ -582,7 +582,12 @@ def check_qname(qname: str, klass: type) -> None:
"RelatedParticipant",
):
return
if qname in "dandi:approach" and klass.__name__ in (
if qname == "dandi:approach" and klass.__name__ in (
"Asset",
"AssetsSummary",
):
return
if qname == "dandi:dataStandard" and klass.__name__ in (
"Asset",
"AssetsSummary",
):