
Enforce Schema validation and preserve null values#499

Open
ruizmaa wants to merge 39 commits intodevelopfrom
cr-396-clean

Conversation


@ruizmaa ruizmaa commented Jan 7, 2026

This PR addresses issue #396 from scratch, superseding PR #470.

Main changes

1. Schema enforcement

  • Schema requirement: writing operations now strictly require a Schema object in dataset.structure. validate_schema_exists has been implemented to enforce this, raising an Invalid error if the dataset has no schema to convert against.
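A minimal sketch of what this check could look like (the `Schema` and `Invalid` classes below are stand-ins for the library's own types, and the function body is an assumption based on the description above, not the PR's actual code):

```python
class Invalid(Exception):
    """Stand-in for the library's Invalid error."""


class Schema:
    """Stand-in for the library's Schema class."""


def validate_schema_exists(dataset) -> None:
    # Writers call this before serialising: anything other than a
    # Schema instance (e.g. None or a plain URN string) is rejected.
    if not isinstance(getattr(dataset, "structure", None), Schema):
        raise Invalid("Writing requires a Schema in dataset.structure")
```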

2. Nulls are no longer converted to empty strings

  • Removed .replace({"nan": "", "<NA>": ""}), .fillna("") and .replace("nan", "") from the writers.
  • Replaced pd.read_csv(na_values=[""], ...) with pd.read_csv(na_values=[], ...) in the readers.
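As a rough illustration of the reading side (assuming the readers also force string dtypes and disable pandas' default NA token list, which is not spelled out above):

```python
import io

import pandas as pd

csv = io.StringIO("DIM,VALUE\nA,\nB,NaN\n")
# With an empty na_values list (and the default NA tokens disabled),
# empty cells stay "" and the literal string "NaN" is read verbatim
# instead of both being collapsed into pd.NA:
df = pd.read_csv(csv, dtype=str, na_values=[], keep_default_na=False)
print(df["VALUE"].tolist())  # ['', 'NaN']
```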

3. Normalize nulls when reading

  • Kept the XML readers' (__ss_aux_reader.py) null normalization .replace(np.nan, "")
  • Updated the CSV readers' null replacement from .replace({"nan": "", "<NA>": ""}) to .replace({"nan": "NaN", "<NA>": "NaN"})
  • Added null normalization to the generic reader, matching the non-generic ones: .replace(np.nan, "") in __reading_generic_groups(), __reading_generic_series() and __reading_generic_all()
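The two normalizations behave roughly like this (illustrative values only, not the PR's actual reader code):

```python
import numpy as np
import pandas as pd

# XML readers: real NaN values are normalized to empty strings
xml_col = pd.Series(["x", np.nan])
print(xml_col.replace(np.nan, "").tolist())  # ['x', '']

# CSV readers: stringified pandas nulls become the literal "NaN"
csv_col = pd.Series(["x", "nan", "<NA>"])
print(csv_col.replace({"nan": "NaN", "<NA>": "NaN"}).tolist())  # ['x', 'NaN', 'NaN']
```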

4. Deleted __remove_optional_attributes_empty_data() and added transform_dataframe_for_writing()

transform_dataframe_for_writing() now receives a pd.DataFrame and a Schema and converts nulls and empty values depending on whether each component is required.

It relies on auxiliary functions:

  • _get_value_to_write: decides, based on the value's dtype and whether the component is required, whether the value should be written as-is (if it holds an actual value), skipped (if the component is optional and the value is an empty string "") or written as a null representation (if the value was originally null (NaN or #N/A), or the component is required and the value is an empty string ""). It relies on further auxiliary functions:
    • _is_null_value
    • _is_nullable_integer_dtype: a helper to write numbers in the correct format (avoids 1 being written as 1.0, or 1.0 as 1)
    • _get_null_representation: returns "NaN" for numeric types and "#N/A" for the rest, based on the NUMERIC_TYPES set:
```python
NUMERIC_TYPES = {
    DataType.BIG_INTEGER,
    DataType.COUNT,
    DataType.DECIMAL,
    DataType.DOUBLE,
    DataType.FLOAT,
    DataType.INTEGER,
    DataType.LONG,
    DataType.SHORT,
}
```

So the new behavior is:

| | Original file value | Read value | Written value (CSV) | Written value (XML) |
|---|---|---|---|---|
| Required | `"NaN"` | `"NaN"` | `"NaN"` | `"NaN"` |
| Required | `"#N/A"` | `"#N/A"` | `"#N/A"` | `"#N/A"` |
| Required | `""` (CSV: `,,` / XML: `value=""`) | `""` | `"#N/A"` (if not numeric) / `"NaN"` (if numeric) | `"#N/A"` (if not numeric) / `"NaN"` (if numeric) |
| Required | missing value (XML MissingTags) | `""` | `"#N/A"` (if not numeric) / `"NaN"` (if numeric) | `"#N/A"` (if not numeric) / `"NaN"` (if numeric) |
| Optional | `"NaN"` | `"NaN"` | `"NaN"` | `"NaN"` |
| Optional | `"#N/A"` | `"#N/A"` | `"#N/A"` | `"#N/A"` |
| Optional | `""` (CSV: `,,` / XML: `value=""`) | `""` | `""` | Skipped (no tag) |
| Optional | missing value (XML MissingTags) | `""` | `""` | Skipped (no tag) |

“Missing value” in XML (absent attribute tag) is normalized by the reader to an empty string "", so it follows the same rules as explicit empty values.
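On the number-formatting point (`_is_nullable_integer_dtype`), pandas' nullable integer dtypes are what make this possible: a column containing nulls can stay integer instead of being upcast to float. A small illustration (not the PR's actual code):

```python
import pandas as pd

# A plain int column with a null is upcast to float64, so 1 becomes 1.0:
print([str(v) for v in pd.Series([1, None])])  # ['1.0', 'nan']

# The nullable Int64 dtype keeps integers intact alongside nulls:
print([str(v) for v in pd.Series([1, None], dtype="Int64")])  # ['1', '<NA>']
```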

Tests

Fixed to fit the new Schema requirement

CSV test writers

  • tests/io/csv/sdmx10/test_writer_v1.py: added a Schema fixture and fixed sample files (reference.csv and reference_attch_atts.csv) to match the schema's dataflow id
  • tests/io/csv/sdmx20/test_writer_v2.py: added one Schema fixture with context datastructure and another with context dataprovision. Fixed sample files (reference.csv, reference_attch_atts.csv and reference_with_action.csv) to match the schema's dataflow id
  • tests/io/csv/sdmx21/test_writer_v21.py: added one Schema fixture with context datastructure and another with context dataprovision. Fixed sample files (reference.csv, reference_attch_atts.csv, reference_with_action.csv and reference_with_append_action.csv) to match the schema's dataflow id

General tests

  • tests/io/test_general_writer.py: added a Schema fixture
  • tests/io/test_writing_validation.py:
    • Updated test_data_rwr_no_structure; it should now raise an Invalid error (because it has no structure and a Schema is required)
    • Added a Schema to test_read_xml_write_csv_10 and test_read_xml_write_csv_20

XML tests

  • tests/io/xml/sdmx21/test/reader::test_read_write_structure_specific_all: added a Schema to the structure.
  • tests/io/xml/sdmx21/test/writer
    • test_data_writing.py::test_data_scape_quote: added a Schema
    • Fixed sample files (gen_ser.xml and test_generic_with_groups.xml)
  • tests/io/xml/sdmx30/test/reader::test_data_dataflow_31: added a Schema

Fixed to fit the new null-value handling

  • tests/io/test_writing_validation.py
    • test_data_rwr and test_write_sdmx_csv_read_back: updated the expected data to fit the new null and empty-value handling
  • tests/io/xmls/sdmx21/test_writing_validation::test_data_write_nullable_nulltypes: extended the test to verify it works as expected
  • tests/io/xmls/sdmx30/test_writing_validation::test_data_write_nullable_nulltypes: extended the test to verify it works as expected

Added to test the new null values handling

@ruizmaa ruizmaa force-pushed the cr-396-clean branch 3 times, most recently from 74eada3 to 9bd8cd0 Compare January 7, 2026 13:46
…der.read_sdxm() to avoid error: `ImportError: The 'xml' extra is required to run the reading and writing of SDMX-ML Messages`
…m_dataframe_for_writing() ensures values are not None
@ruizmaa ruizmaa linked an issue Jan 13, 2026 that may be closed by this pull request
@ruizmaa ruizmaa requested a review from javihern98 January 14, 2026 10:16
@javihern98

@ruizmaa Need to check conflicts on this PR, please address them 😊

@javihern98 javihern98 marked this pull request as ready for review January 28, 2026 16:31
@javihern98 javihern98 requested review from sosna and stratosn January 28, 2026 16:32
@sosna sosna added bug Something isn't working labels Feb 19, 2026
…nces

- Fix _get_value_to_write to skip optional null values (nan/None/pd.NA)
  instead of writing explicit #N/A or NaN representations. Required
  components still get explicit null representations.
- Reorder CSV columns following SDMX-CSV conventions: dimensions first,
  then measures, then attributes.
- Fix __write_keys to use get_codes() order instead of DataFrame column
  iteration order for deterministic key construction.
- Fix schema_generator to handle dataflow.structure when it is already
  a DataStructureDefinition instance instead of a URN string (#534).

Apply the isinstance check for DataStructureDefinition to both the
dataflow and provisionagreement branches in schema_generator, and add
tests covering both paths to restore 100% coverage.

Extract _resolve_dsd and _build_schema to eliminate duplicated DSD
resolution and Schema construction logic between the dataflow and
provisionagreement branches. Removes the # noqa: C901 complexity override.

The DSD instance handling fix and schema_generator simplification
have been moved to the cr-534 branch to keep concerns separated.

Structure-specific formats support partial updates where unchanged
attribute values don't need to be resent. The validation was too
strict, requiring all required non-dataset-level attributes to be
present in the DataFrame columns. Now only dimensions and measures
are mandatory.

Related issues filed for 1.14.0:
- #540: Add writer support for Time Series formats (STRTS / GENTS)
- #541: Dataflow short_urn becomes DataStructure in output
- #542: Accept string for dimension_at_observation parameter

Resolve merge conflicts integrating partial keys feature (develop)
with schema validation and null value handling (cr-396-clean).

Fixes branch coverage on Python 3.9 where pandas converts None to NaN.

Replaces pd.isna() with str() comparison to avoid branch coverage
issues across Python versions (3.9 vs 3.12 handle None/NaN differently).


Development

Successfully merging this pull request may close these issues.

Add more options to represent missing values

3 participants