Enforce strict null value handling and schema validation #470

Draft
ruizmaa wants to merge 73 commits into develop from cr-396

@ruizmaa ruizmaa commented Nov 26, 2025

This PR fixes issue #396, the handling of missing values in SDMX messages. It focuses on preserving the distinction between empty strings and null values, and on enforcing schema requirements during writing.

Main changes

This PR implements the following changes:

1. Schema enforcement

  • Schema requirement: Writing operations now strictly require a Schema object in dataset.structure. _validate_schema_exists has been implemented to enforce this, raising an Invalid error if the dataset to convert has no schema.
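
A minimal sketch of the kind of check described above. Invalid, Schema and Dataset are simplified stand-ins defined here for illustration only, and validate_schema_exists is a hypothetical name; none of this is the actual pysdmx code.

```python
from dataclasses import dataclass
from typing import Optional


class Invalid(Exception):
    """Stand-in for the library's Invalid error (hypothetical definition)."""


@dataclass
class Schema:
    """Stand-in schema; the real class carries far more detail."""

    id: str


@dataclass
class Dataset:
    """Stand-in dataset whose structure may be a Schema, a reference or None."""

    structure: Optional[object] = None


def validate_schema_exists(dataset: Dataset) -> Schema:
    """Return the dataset's Schema, or raise Invalid when there is none."""
    if not isinstance(dataset.structure, Schema):
        raise Invalid("Writing requires a Schema in dataset.structure.")
    return dataset.structure
```

With this shape, a dataset that only carries a string reference to its structure is rejected before any writing starts.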

2. CSV reading

  • Preserve explicit nulls: Updated the pd.read_csv configuration (keep_default_na=False, na_values=[]) in all readers (v1.0, v2.0, v2.1). This ensures that strings such as NaN and #N/A are read literally instead of being converted to pandas nulls. Previously, na_values=[""] converted empty strings to NaN. Now, with na_values=[], empty cells remain empty strings, preserving the distinction between truly missing data and empty values.
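
The difference between the two configurations can be seen in a small sketch. The CSV text is made up for illustration, and dtype=str is an assumption added here only to keep the example deterministic; the readers in this PR may not pass it.

```python
import io

import pandas as pd

# Illustrative CSV: one empty cell, one literal "NaN", one literal "#N/A".
CSV_TEXT = "FREQ,OBS_VALUE,OBS_CONF\nA,NaN,\nA,14206.490766,#N/A\n"

# New configuration: no string is treated as a null marker.
new = pd.read_csv(
    io.StringIO(CSV_TEXT),
    dtype=str,  # assumption, see the note above
    keep_default_na=False,
    na_values=[],
)

# Previous configuration: empty strings are converted to NaN.
old = pd.read_csv(
    io.StringIO(CSV_TEXT),
    dtype=str,
    keep_default_na=False,
    na_values=[""],
)

print(new.loc[0, "OBS_CONF"] == "")   # empty cell kept as an empty string
print(pd.isna(old.loc[0, "OBS_CONF"]))  # same cell read as a pandas null
```

In both cases the literal NaN and #N/A cells stay as strings, because keep_default_na=False disables the default null markers.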

3. XML reading

  • Complete attribute coverage: Updated __reading_generic_all in generic.py to ensure all observations contain all attributes defined in the structure, filling absent ones with "". This ensures consistent DataFrame structure and prevents missing columns in the output.
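
As a rough illustration of the filling step, assuming observations are held as plain dictionaries (the real reader works on its own parsed structures, and fill_missing_attributes is a hypothetical name):

```python
from typing import Dict, Iterable, List


def fill_missing_attributes(
    observations: Iterable[Dict[str, str]],
    attribute_ids: List[str],
) -> List[Dict[str, str]]:
    """Return observations where every structure attribute is present.

    Attributes absent from an observation are added with "" as value;
    values already present are left untouched.
    """
    defaults = {attr_id: "" for attr_id in attribute_ids}
    return [{**defaults, **obs} for obs in observations]


rows = fill_missing_attributes(
    [
        {"TIME_PERIOD": "2003", "OBS_STATUS": "M"},
        {"TIME_PERIOD": "2004", "OBS_CONF": "#N/A"},
    ],
    attribute_ids=["OBS_STATUS", "OBS_CONF"],
)
```

Every resulting row has the same keys, so building a DataFrame from them yields the same columns regardless of which attributes each observation carried.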

4. CSV writing

  • Structural logic: Introduced _csv_prepare_df to apply the structural logic before writing. This ensures that required columns exist (even if null) and empty optional columns are removed.
  • Removed string replacement: Eliminated the global .replace({"nan": "", "<NA>": ""}) logic. The writer now relies on the clean DataFrame state from _csv_prepare_df.
  • Code refactoring: Split writer logic into helper functions (_csv_structure_ref_and_id, _csv_determine_action, _csv_insert_labels_action).
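
The structural step could look roughly like the sketch below. prepare_df and its parameters are hypothetical names, and two points are assumptions made for illustration: a column counts as empty when every value is null or "", and missing required columns are created with "" as the fill value.

```python
from typing import List

import pandas as pd


def prepare_df(
    df: pd.DataFrame,
    required: List[str],
    optional: List[str],
    fill_value: str = "",  # assumption: placeholder for missing required columns
) -> pd.DataFrame:
    """Add missing required columns and drop optional columns with no values."""
    out = df.copy()
    for col in required:
        if col not in out.columns:
            out[col] = fill_value
    empty_optional = [
        col
        for col in optional
        if col in out.columns and (out[col].isna() | (out[col] == "")).all()
    ]
    return out.drop(columns=empty_optional)


df = pd.DataFrame(
    {"FREQ": ["A", "A"], "OBS_CONF": ["", ""], "OBS_STATUS": ["M", ""]}
)
prepared = prepare_df(
    df, required=["FREQ", "OBS_VALUE"], optional=["OBS_CONF", "OBS_STATUS"]
)
```

Here OBS_CONF is dropped because it has no values, OBS_STATUS is kept because one row has a value, and OBS_VALUE is added because it is required.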

5. XML writing

  • Attribute omission: Implemented logic (_format_observation_attributes) to filter attributes based on the schema.
    • Required attributes: Always written (as empty strings if null).
    • Optional attributes: Only written if they contain a value; otherwise, they are omitted from the XML tag.
  • Series/Observation processing: Refactored __process_series_observations and __format_ser_str. The logic now identifies and skips observations/series that are entirely empty or null, preventing the generation of empty <Obs> tags.
  • Removed string replacement: Eliminated .replace({"nan": "", "<NA>": ""}) logic.
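
The attribute rule can be sketched as follows. format_attributes is a hypothetical helper, not the actual _format_observation_attributes, and it assumes nulls arrive as None (pandas nulls would need an extra check).

```python
from typing import Dict, List, Optional
from xml.sax.saxutils import quoteattr


def format_attributes(
    values: Dict[str, Optional[str]],
    required: List[str],
    optional: List[str],
) -> str:
    """Render XML attributes: required always, optional only when filled."""
    parts = []
    for name in required:
        value = values.get(name)
        # Required attributes are written even when null, as an empty string.
        parts.append(f"{name}={quoteattr('' if value is None else str(value))}")
    for name in optional:
        value = values.get(name)
        # Optional attributes without a value are left out of the tag.
        if value is None or value == "":
            continue
        parts.append(f"{name}={quoteattr(str(value))}")
    return " ".join(parts)


rendered = format_attributes(
    {"OBS_STATUS": None, "OBS_CONF": "#N/A", "COLLECTION": ""},
    required=["OBS_STATUS"],
    optional=["OBS_CONF", "COLLECTION"],
)
```

For this input, rendered is OBS_STATUS="" OBS_CONF="#N/A": the null required attribute is kept as an empty string and the empty optional COLLECTION is omitted.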

Tests

  • Tests updated to use Schema for writers, ensuring compliance with the new validation logic.

New tests

  • Schema validation

    • Added test_write_sdmx_dataset_without_structure to ensure that writing without a Schema raises an Invalid error.
  • Null value reading

    • Added test_read_csv_v{1,2,21}_nulls and test_read_xml_v{21,30,31}_nulls to verify NaN and #N/A are read as literal strings, not converted to pandas nulls.
  • Attribute handling

    • test_attributes_preservation_csv_source verifies required attributes are filled with "#N/A", optional empty attributes are dropped, and literal strings like "Nan" are preserved.
    • test_xml_to_csv_attributes_preservation_xml_source performs the same validation starting from an XML source.
  • Roundtrip & schema validation

    • test_data_rwr / test_write_sdmx_csv_read_back verify data integrity through format conversions.
    • test_cross_format_integrity verifies data integrity by reading equivalent CSV and XML files, then writing them back to both formats to ensure that the four outputs (CSV->CSV, CSV->XML, XML->CSV and XML->XML) match the two originals.

@ruizmaa ruizmaa linked an issue Nov 26, 2025 that may be closed by this pull request
@ruizmaa ruizmaa marked this pull request as draft November 26, 2025 09:34
@ruizmaa ruizmaa requested review from javihern98 and sosna November 26, 2025 09:34
@ruizmaa ruizmaa self-assigned this Nov 26, 2025
@ruizmaa ruizmaa added the bug Something isn't working label Nov 26, 2025
@ruizmaa ruizmaa added this to the 1.10.0 milestone Nov 26, 2025
@ruizmaa ruizmaa removed the request for review from sosna November 26, 2025 11:41
…utils (_fill_na_values, _validate_explicit_null_values and NUMERIC_TYPES)
@javihern98 javihern98 removed this from the 1.10.0 milestone Nov 27, 2025
@ruizmaa ruizmaa requested a review from javihern98 December 9, 2025 16:48
@javihern98
Contributor

Hi @sosna, this is ready for your review now that we have gone over the changes discussed in the meeting.

@javihern98 javihern98 requested a review from sosna December 15, 2025 15:55
@sosna sosna left a comment

I have tested using the agreed input files sent a week ago.

This is the CSV input:

STRUCTURE,STRUCTURE_ID,ACTION,FREQ,DER_TYPE,DER_INSTR,DER_RISK,DER_REP_CTY,DER_SECTOR_CPY,DER_CPC,DER_SECTOR_UDL,DER_CURR_LEG1,DER_CURR_LEG2,DER_ISSUE_MAT,DER_RATING,DER_EX_METHOD,DER_BASIS,AVAILABILITY,COLLECTION,TIME_FORMAT,TIME_PERIOD,OBS_STATUS,OBS_VALUE,OBS_CONF
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,#N/A,A,,,,,
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,,,,2003,M,NaN,
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,,,,2004,,14206.490766,#N/A
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,,,P1M,,,,

This is the output in XML:

<?xml version="1.0" encoding="UTF-8"?>
<mes:StructureSpecificData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message" xmlns:ss="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/data/structurespecific" xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/common" xmlns:ns1="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0):ObsLevelDim:AllDimensions" xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message https://registry.sdmx.org/schemas/v3_0/SDMXMessage.xsd">
	<mes:Header>
		<mes:ID>b76888bf-3f87-411f-a456-c95da43995ee</mes:ID>
		<mes:Test>false</mes:Test>
		<mes:Prepared>2025-12-18T08:01:43Z</mes:Prepared>
		<mes:Sender id="ZZZ"/>
		<mes:Structure structureID="BIS_DER" namespace="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)" dimensionAtObservation="AllDimensions">
			<com:StructureUsage>urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)</com:StructureUsage>
		</mes:Structure>
	</mes:Header>
	<mes:DataSet ss:structureRef="BIS_DER" xsi:type="ns1:DataSetType" action="Append">
		<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="" OBS_VALUE="" OBS_STATUS="" AVAILABILITY="#N/A" COLLECTION="A" />
		<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="2003" OBS_VALUE="NaN" OBS_STATUS="M" AVAILABILITY="" COLLECTION="" />
		<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="2004" OBS_VALUE="14206.490766" OBS_STATUS="" AVAILABILITY="" COLLECTION="" OBS_CONF="#N/A" />
		<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="" OBS_VALUE="" TIME_FORMAT="P1M" OBS_STATUS="" AVAILABILITY="" COLLECTION="" />
	</mes:DataSet>
</mes:StructureSpecificData>

The issues reported previously still seem to be there, namely:

  • Everything is an observation but Series were expected. Even worse, not only is it an Obs but TIME_PERIOD is empty (TIME_PERIOD=""), which can never be the case (i.e. if you claim something is an Obs, then no dimension can be missing).
  • There are empty attributes, which is also unexpected (e.g. AVAILABILITY="" COLLECTION="").

As discussed, the expected output file was:

<?xml version="1.0" encoding="UTF-8"?>
<mes:StructureSpecificData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message" xmlns:ss="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/data/structurespecific" xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/common" xmlns:ns1="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0):ObsLevelDim:AllDimensions" xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message https://registry.sdmx.org/schemas/v3_0/SDMXMessage.xsd">
	<mes:Header>
		<mes:ID>d6c7fe91-2879-43fc-9b07-4f0d71bfb43f</mes:ID>
		<mes:Test>false</mes:Test>
		<mes:Prepared>2025-11-28T13:28:53Z</mes:Prepared>
		<mes:Sender id="ZZZ"/>
		<mes:Structure structureID="BIS_DER" namespace="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)" dimensionAtObservation="AllDimensions">
			<com:StructureUsage>urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)</com:StructureUsage>
		</mes:Structure>
	</mes:Header>
	<mes:DataSet ss:structureRef="BIS_DER" xsi:type="ns1:DataSetType" action="Information">
		<Series FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" AVAILABILITY="#N/A" COLLECTION="A">
			<Obs TIME_PERIOD="2003" OBS_STATUS="M" OBS_VALUE="NaN" />
			<Obs TIME_PERIOD="2004" OBS_CONF="#N/A" OBS_VALUE="14206.490766" />
		</Series>
		<Series FREQ="M" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_FORMAT="P1M"/>
	</mes:DataSet>
</mes:StructureSpecificData>

In the other direction (from XML to CSV), things worked better. I was expecting one more row, with the series information. Unfortunately, it's not there, because the information has been pushed down to the observations, but at least it has been pushed down in a correct way, so this might be ok. I emphasize "might" because this means that, if an observation exists, everything is pushed down (though this is unnecessary), but if no observation exists, a "series" row is created, i.e. consumers have to handle two different ways to extract series information.

Collaborator

Thanks, @ruizmaa. But is this new type really needed on your side?

I'm asking because then it needs to be handled wherever types are used, in pysdmx (e.g. https://github.com/bis-med-it/pysdmx/blob/develop/src/pysdmx/toolkit/pd/__init__.py) but also outside of it (e.g. in some of our internal systems that use pysdmx and that have mappings based on pysdmx.model.DataType).

In short, adding a new type has a cost and I want to be sure this is really needed. Thanks!

@javihern98
Contributor

We need to review #479 here: its second and third checks are specified as "after merging #470".

Comment on lines +54 to +55
INCREMENTAL = "Incremental"
"""An integer type that increases sequentially."""

javihern98 commented Dec 19, 2025

Hi @sosna. We will focus here only on the explicit null values, since what you describe in your comment #470 (review) suggests that we are not properly writing the attributes related to a series.

Therefore, we should address this in a separate issue (I created issue #492 to discuss this). We will nevertheless ensure we are not writing empty attributes.

@ruizmaa ruizmaa changed the title Enforce strict null value handling, schema validation and extend DataTypes Enforce strict null value handling and schema validation Dec 19, 2025
@ruizmaa ruizmaa marked this pull request as draft January 4, 2026 02:15
Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add more options to represent missing values

3 participants