
Enforce Schema validation and preserve null values#499

Open
ruizmaa wants to merge 39 commits intodevelopfrom
cr-396-clean

Conversation


@ruizmaa ruizmaa commented Jan 7, 2026

This PR addresses issue #396 from scratch, superseding PR #470.

Main changes

1. Schema enforcement

  • Schema requirement: writing operations now strictly require a Schema object in dataset.structure. validate_schema_exists has been implemented to enforce this, raising an Invalid error if the dataset has no schema to convert against.
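A minimal sketch of what this check could look like (the `Schema` and `Invalid` classes below are stand-ins for the library's own types, and the function body is an assumption based on the description above, not the PR's actual code):

```python
class Invalid(Exception):
    """Stand-in for the library's Invalid error."""


class Schema:
    """Stand-in for the library's Schema class."""


def validate_schema_exists(dataset) -> None:
    # Writers call this before serialising: anything other than a
    # Schema instance (e.g. None or a plain URN string) is rejected.
    if not isinstance(getattr(dataset, "structure", None), Schema):
        raise Invalid("Writing requires a Schema in dataset.structure")
```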

2. Nulls are no longer converted to empty strings

  • Removed .replace({"nan": "", "<NA>": ""}), .fillna("") and .replace("nan", "") from the writers.
  • Replaced pd.read_csv(na_values=[""], ...) with pd.read_csv(na_values=[], ...) in the readers.
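As a rough illustration of the reading side (assuming the readers also force string dtypes and disable pandas' default NA token list, which is not spelled out above):

```python
import io

import pandas as pd

csv = io.StringIO("DIM,VALUE\nA,\nB,NaN\n")
# With an empty na_values list (and the default NA tokens disabled),
# empty cells stay "" and the literal string "NaN" is read verbatim
# instead of both being collapsed into pd.NA:
df = pd.read_csv(csv, dtype=str, na_values=[], keep_default_na=False)
print(df["VALUE"].tolist())  # ['', 'NaN']
```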

3. Normalize nulls when reading

  • Kept the XML readers' (__ss_aux_reader.py) null normalization .replace(np.nan, "")
  • Updated the CSV readers' null replacement from .replace({"nan": "", "<NA>": ""}) to .replace({"nan": "NaN", "<NA>": "NaN"})
  • Added null normalization to the generic reader, matching the non-generic ones: .replace(np.nan, "") in __reading_generic_groups(), __reading_generic_series() and __reading_generic_all()
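The two normalizations behave roughly like this (illustrative values only, not the PR's actual reader code):

```python
import numpy as np
import pandas as pd

# XML readers: real NaN values are normalized to empty strings
xml_col = pd.Series(["x", np.nan])
print(xml_col.replace(np.nan, "").tolist())  # ['x', '']

# CSV readers: stringified pandas nulls become the literal "NaN"
csv_col = pd.Series(["x", "nan", "<NA>"])
print(csv_col.replace({"nan": "NaN", "<NA>": "NaN"}).tolist())  # ['x', 'NaN', 'NaN']
```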

4. Deleted __remove_optional_attributes_empty_data() and added transform_dataframe_for_writing()

transform_dataframe_for_writing() now receives a pd.DataFrame and a Schema and converts nulls and empty values depending on whether each component is required.

It relies on auxiliary functions:

  • _get_value_to_write: decides, based on the value's dtype and whether the component is required, whether the value should be written as-is (if it holds an actual value), skipped (if the component is optional and the value is an empty string "") or written as a null representation (if the value was originally null (NaN or #N/A), or the component is required and the value is an empty string ""). It relies on further auxiliary functions:
    • _is_null_value
    • _is_nullable_integer_dtype: a helper to write numbers in the correct format (avoids 1 being written as 1.0, or 1.0 as 1)
    • _get_null_representation: returns "NaN" for numeric types and "#N/A" for the rest, based on the NUMERIC_TYPES set:
```python
NUMERIC_TYPES = {
    DataType.BIG_INTEGER,
    DataType.COUNT,
    DataType.DECIMAL,
    DataType.DOUBLE,
    DataType.FLOAT,
    DataType.INTEGER,
    DataType.LONG,
    DataType.SHORT,
}
```

So the new behavior is:

| | Original file value | Read value | Written value (CSV) | Written value (XML) |
|---|---|---|---|---|
| Required | `"NaN"` | `"NaN"` | `"NaN"` | `"NaN"` |
| Required | `"#N/A"` | `"#N/A"` | `"#N/A"` | `"#N/A"` |
| Required | `""` (CSV: `,,` / XML: `value=""`) | `""` | `"#N/A"` (if not numeric) / `"NaN"` (if numeric) | `"#N/A"` (if not numeric) / `"NaN"` (if numeric) |
| Required | missing value (XML MissingTags) | `""` | `"#N/A"` (if not numeric) / `"NaN"` (if numeric) | `"#N/A"` (if not numeric) / `"NaN"` (if numeric) |
| Optional | `"NaN"` | `"NaN"` | `"NaN"` | `"NaN"` |
| Optional | `"#N/A"` | `"#N/A"` | `"#N/A"` | `"#N/A"` |
| Optional | `""` (CSV: `,,` / XML: `value=""`) | `""` | `""` | Skipped (no tag) |
| Optional | missing value (XML MissingTags) | `""` | `""` | Skipped (no tag) |

“Missing value” in XML (absent attribute tag) is normalized by the reader to an empty string "", so it follows the same rules as explicit empty values.
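On the number-formatting point (`_is_nullable_integer_dtype`), pandas' nullable integer dtypes are what make this possible: a column containing nulls can stay integer instead of being upcast to float. A small illustration (not the PR's actual code):

```python
import pandas as pd

# A plain int column with a null is upcast to float64, so 1 becomes 1.0:
print([str(v) for v in pd.Series([1, None])])  # ['1.0', 'nan']

# The nullable Int64 dtype keeps integers intact alongside nulls:
print([str(v) for v in pd.Series([1, None], dtype="Int64")])  # ['1', '<NA>']
```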

Tests

Fixed to fit the new Schema requirement

CSV test writers

  • tests/io/csv/sdmx10/test_writer_v1.py: added a Schema fixture and fixed sample files (reference.csv and reference_attch_atts.csv) to match the schema's dataflow id
  • tests/io/csv/sdmx20/test_writer_v2.py: added one Schema fixture with context datastructure and another with context dataprovision. Fixed sample files (reference.csv, reference_attch_atts.csv and reference_with_action.csv) to match the schema's dataflow id
  • tests/io/csv/sdmx21/test_writer_v21.py: added one Schema fixture with context datastructure and another with context dataprovision. Fixed sample files (reference.csv, reference_attch_atts.csv, reference_with_action.csv and reference_with_append_action.csv) to match the schema's dataflow id

General tests

  • tests/io/test_general_writer.py: added a Schema fixture
  • tests/io/test_writing_validation.py:
    • Updated test_data_rwr_no_structure; it should now raise an Invalid error (because it has no structure and a Schema is required)
    • Added a Schema to test_read_xml_write_csv_10 and test_read_xml_write_csv_20

XML tests

  • tests/io/xml/sdmx21/test/reader::test_read_write_structure_specific_all: added a Schema to the structure.
  • tests/io/xml/sdmx21/test/writer
    • test_data_writing.py::test_data_scape_quote: added a Schema
    • Fixed sample files (gen_ser.xml and test_generic_with_groups.xml)
  • tests/io/xml/sdmx30/test/reader::test_data_dataflow_31: added a Schema

Fixed to fit the new null-value handling

  • tests/io/test_writing_validation.py
    • test_data_rwr and test_write_sdmx_csv_read_back: updated the expected data to fit the new null and empty-value handling
  • tests/io/xmls/sdmx21/test_writing_validation::test_data_write_nullable_nulltypes: extended the test to verify it works as expected
  • tests/io/xmls/sdmx30/test_writing_validation::test_data_write_nullable_nulltypes: extended the test to verify it works as expected

Added to test the new null values handling

@ruizmaa ruizmaa force-pushed the cr-396-clean branch 3 times, most recently from 74eada3 to 9bd8cd0 Compare January 7, 2026 13:46
…der.read_sdxm() to avoid error: `ImportError: The 'xml' extra is required to run the reading and writing of SDMX-ML Messages`
…m_dataframe_for_writing() ensures values are not None
@ruizmaa ruizmaa linked an issue Jan 13, 2026 that may be closed by this pull request
@ruizmaa ruizmaa requested a review from javihern98 January 14, 2026 10:16
@javihern98

@ruizmaa Need to check conflicts on this PR, please address them 😊

@javihern98 javihern98 marked this pull request as ready for review January 28, 2026 16:31
@javihern98 javihern98 requested review from sosna and stratosn January 28, 2026 16:32
@sosna sosna added bug Something isn't working labels Feb 19, 2026
…nces

- Fix _get_value_to_write to skip optional null values (nan/None/pd.NA)
  instead of writing explicit #N/A or NaN representations. Required
  components still get explicit null representations.
- Reorder CSV columns following SDMX-CSV conventions: dimensions first,
  then measures, then attributes.
- Fix __write_keys to use get_codes() order instead of DataFrame column
  iteration order for deterministic key construction.
- Fix schema_generator to handle dataflow.structure when it is already
  a DataStructureDefinition instance instead of a URN string (#534).

Apply the isinstance check for DataStructureDefinition to both the
dataflow and provisionagreement branches in schema_generator, and add
tests covering both paths to restore 100% coverage.

Extract _resolve_dsd and _build_schema to eliminate duplicated DSD
resolution and Schema construction logic between the dataflow and
provisionagreement branches. Removes the # noqa: C901 complexity override.

The DSD instance handling fix and schema_generator simplification
have been moved to the cr-534 branch to keep concerns separated.

Structure-specific formats support partial updates where unchanged
attribute values don't need to be resent. The validation was too
strict, requiring all required non-dataset-level attributes to be
present in the DataFrame columns. Now only dimensions and measures
are mandatory.

Related issues filed for 1.14.0:
- #540: Add writer support for Time Series formats (STRTS / GENTS)
- #541: Dataflow short_urn becomes DataStructure in output
- #542: Accept string for dimension_at_observation parameter

Resolve merge conflicts integrating partial keys feature (develop)
with schema validation and null value handling (cr-396-clean).

Fixes branch coverage on Python 3.9 where pandas converts None to NaN.

Replaces pd.isna() with str() comparison to avoid branch coverage
issues across Python versions (3.9 vs 3.12 handle None/NaN differently).


Development

Successfully merging this pull request may close these issues.

Add more options to represent missing values

3 participants