Enforce strict null value handling and schema validation by ruizmaa · Pull Request #470 · bis-med-it/pysdmx

ruizmaa · 2025-11-26T09:32:34Z

This PR fixes issue #396 on the handling of missing values in SDMX messages. It focuses on preserving the distinction between empty strings and null values and on enforcing schema requirements during writing.

Main changes

This PR implements the following changes:

1. Schema enforcement

Schema requirement: Writing operations now strictly require a Schema object in dataset.structure. _validate_schema_exists has been implemented to enforce this, raising an Invalid error if the dataset to convert has no schema.
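
A minimal sketch of the kind of guard described above. The Schema, Dataset and Invalid classes below are simplified stand-ins, not the pysdmx classes, and the function body is an assumption about how such a check could look, not the code in this PR.

```python
from dataclasses import dataclass
from typing import Any


class Invalid(Exception):
    """Stand-in for the library's Invalid error (hypothetical)."""


@dataclass
class Schema:
    """Simplified stand-in for a Schema object (hypothetical)."""

    id: str


@dataclass
class Dataset:
    """Stand-in dataset whose structure may be a Schema or a plain reference."""

    structure: Any


def validate_schema_exists(dataset: Dataset) -> Schema:
    """Return the dataset's Schema, or raise Invalid if none is attached."""
    schema = dataset.structure
    if not isinstance(schema, Schema):
        raise Invalid("Dataset has no Schema in its structure; cannot write it.")
    return schema
```

Returning the Schema instead of a boolean lets the caller keep a narrowed type for the rest of the writer.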

2. CSV reading

Preserve explicit nulls: Updated pd.read_csv configuration (keep_default_na=False, na_values=[]) in all readers (v1.0, v2.0, v2.1). This ensures strings are read literally, so they can be handled correctly. Previously, na_values=[""] converted empty strings to NaN. Now with na_values=[], empty cells remain as empty strings, preserving the distinction between truly missing data and empty values.
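
The effect of these two options can be checked with pandas alone. The snippet below is only an illustration: the sample data is made up, and dtype=str is an assumption added here so that every column is read as text.

```python
from io import StringIO

import pandas as pd

csv_text = "FREQ,OBS_VALUE,OBS_CONF\nA,NaN,\nA,14206.490766,#N/A\n"

# Default behaviour: "NaN", "#N/A" and empty cells all become missing values.
default_df = pd.read_csv(StringIO(csv_text), dtype=str)

# Configuration described above: nothing is treated as NA,
# so every cell is kept as the literal string found in the file.
literal_df = pd.read_csv(
    StringIO(csv_text),
    dtype=str,
    keep_default_na=False,
    na_values=[],
)

print(default_df["OBS_CONF"].isna().tolist())
print(literal_df["OBS_CONF"].tolist())
```

With the second configuration an empty cell and the literal "#N/A" remain distinguishable, which is what allows later steps to treat them differently.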

3. XML reading

Complete attribute coverage: Updated __reading_generic_all in generic.py to ensure all observations contain all attributes defined in the structure, filling absent ones with "". This ensures consistent DataFrame structure and prevents missing columns in the output.

4. CSV writing

Structural logic: Introduced _csv_prepare_df to apply the structural logic before writing. This ensures that required columns exist (even if null) and empty optional columns are removed.
Removed string replacement: Eliminated the global .replace({"nan": "", "<NA>": ""}) logic. The writer now relies on the clean DataFrame state from _csv_prepare_df.
Code refactoring: Split writer logic into helper functions (_csv_structure_ref_and_id, _csv_determine_action, _csv_insert_labels_action).
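
The structural rule (required columns always present, empty optional columns dropped) can be sketched without pandas. This is not the _csv_prepare_df implementation; the function name, the row representation and the choice to treat both None and "" as "empty" are assumptions made for the example.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def prepare_rows(
    rows: List[Row], required: List[str], optional: List[str]
) -> List[Row]:
    """Keep all required columns and only the optional ones that hold a value."""
    # An optional column survives only if at least one row has a value in it.
    kept_optional = [
        col
        for col in optional
        if any(row.get(col) not in (None, "") for row in rows)
    ]
    columns = list(required) + kept_optional
    # Required columns missing from a row are added with a null value.
    return [{col: row.get(col) for col in columns} for row in rows]


rows = [
    {"FREQ": "A", "OBS_CONF": "", "COLLECTION": "A"},
    {"FREQ": "M", "OBS_CONF": None},
]
prepared = prepare_rows(rows, ["FREQ", "OBS_STATUS"], ["OBS_CONF", "COLLECTION"])
```

Here OBS_STATUS is added to every row because it is required, OBS_CONF is dropped because no row has a value for it, and COLLECTION is kept.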

5. XML writing

Attribute omission: Implemented logic (_format_observation_attributes) to filter attributes based on the schema.
- Required attributes: Always written (as empty strings if null).
- Optional attributes: Only written if they contain a value; otherwise, they are omitted from the XML tag.
Series/Observation processing: Refactored __process_series_observations and __format_ser_str. The logic now identifies and skips observations/series that are entirely empty or null, preventing the generation of empty <Obs> tags.
Removed string replacement: Eliminated .replace({"nan": "", "<NA>": ""}) logic.
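
The required/optional rule listed above can be illustrated with a small helper. It is a sketch under assumptions, not the _format_observation_attributes code from this PR: the signature, the dict-based input and the treatment of "" as "no value" for optional attributes are all hypothetical.

```python
from typing import Dict, List, Optional
from xml.sax.saxutils import quoteattr


def format_observation_attributes(
    values: Dict[str, Optional[str]],
    required: List[str],
    optional: List[str],
) -> str:
    """Render XML attributes: required ones always, optional ones only if set."""
    parts = []
    for name in required:
        value = values.get(name)
        # Required attributes are always written, as an empty string if null.
        parts.append(f"{name}={quoteattr('' if value is None else value)}")
    for name in optional:
        value = values.get(name)
        # Optional attributes are omitted when they carry no value.
        if value not in (None, ""):
            parts.append(f"{name}={quoteattr(value)}")
    return " ".join(parts)


attrs = format_observation_attributes(
    {"OBS_STATUS": None, "OBS_CONF": "#N/A", "COLLECTION": ""},
    required=["OBS_STATUS"],
    optional=["OBS_CONF", "COLLECTION"],
)
```

In this example attrs is OBS_STATUS="" OBS_CONF="#N/A": the null required attribute is written empty, the literal "#N/A" is preserved, and the empty optional COLLECTION is left out.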

Tests

Tests updated to use Schema for writers, ensuring compliance with the new validation logic.

New tests

Schema validation
- Added test_write_sdmx_dataset_without_structure to ensure writing without a Schema raises Invalid error.
Null value reading
- Added test_read_csv_v{1,2,21}_nulls and test_read_xml_v{21,30,31}_nulls to verify NaN and #N/A are read as literal strings, not converted to pandas nulls.
Attribute handling
- test_attributes_preservation_csv_source verifies required attributes are filled with "#N/A", optional empty attributes are dropped, and literal strings like "Nan" are preserved.
- test_xml_to_csv_attributes_preservation_xml_source same validation starting from XML source.
Roundtrip & schema validation
- test_data_rwr / test_write_sdmx_csv_read_back verify data integrity through format conversions.
- test_cross_format_integrity verifies data integrity by reading equivalent CSV and XML files, then writing them back to both formats to ensure the four outputs (CSV->CSV, CSV->XML, XML->CSV and XML->XML) match the two originals.

javihern98 · 2025-12-15T15:55:30Z

Hi @sosna, ready for your review after we reviewed the changes discussed in the meeting

sosna

I have tested using the agreed input files sent a week ago.

This is the CSV input:

STRUCTURE,STRUCTURE_ID,ACTION,FREQ,DER_TYPE,DER_INSTR,DER_RISK,DER_REP_CTY,DER_SECTOR_CPY,DER_CPC,DER_SECTOR_UDL,DER_CURR_LEG1,DER_CURR_LEG2,DER_ISSUE_MAT,DER_RATING,DER_EX_METHOD,DER_BASIS,AVAILABILITY,COLLECTION,TIME_FORMAT,TIME_PERIOD,OBS_STATUS,OBS_VALUE,OBS_CONF
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,#N/A,A,,,,,
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,,,,2003,M,NaN,
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,,,,2004,,14206.490766,#N/A
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,,,P1M,,,,

This is the output in XML:

<?xml version="1.0" encoding="UTF-8"?>
<mes:StructureSpecificData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message" xmlns:ss="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/data/structurespecific" xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/common" xmlns:ns1="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0):ObsLevelDim:AllDimensions" xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message https://registry.sdmx.org/schemas/v3_0/SDMXMessage.xsd">
	<mes:Header>
		<mes:ID>b76888bf-3f87-411f-a456-c95da43995ee</mes:ID>
		<mes:Test>false</mes:Test>
		<mes:Prepared>2025-12-18T08:01:43Z</mes:Prepared>
		<mes:Sender id="ZZZ"/>
		<mes:Structure structureID="BIS_DER" namespace="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)" dimensionAtObservation="AllDimensions">
			<com:StructureUsage>urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)</com:StructureUsage>
		</mes:Structure>
	</mes:Header>
	<mes:DataSet ss:structureRef="BIS_DER" xsi:type="ns1:DataSetType" action="Append">
		<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="" OBS_VALUE="" OBS_STATUS="" AVAILABILITY="#N/A" COLLECTION="A" />
		<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="2003" OBS_VALUE="NaN" OBS_STATUS="M" AVAILABILITY="" COLLECTION="" />
		<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="2004" OBS_VALUE="14206.490766" OBS_STATUS="" AVAILABILITY="" COLLECTION="" OBS_CONF="#N/A" />
		<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="" OBS_VALUE="" TIME_FORMAT="P1M" OBS_STATUS="" AVAILABILITY="" COLLECTION="" />
	</mes:DataSet>
</mes:StructureSpecificData>

The issues reported previously still seem to be there, namely:

Everything is an observation but Series were expected. Even worse, not only is it an Obs but TIME_PERIOD is empty (TIME_PERIOD=""), which can never be the case (i.e. if you claim something is an Obs, then no dimension can be missing).
There are empty attributes, which is also unexpected (e.g. AVAILABILITY="" COLLECTION="").

As discussed, the expected output file was:

<?xml version="1.0" encoding="UTF-8"?>
<mes:StructureSpecificData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message" xmlns:ss="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/data/structurespecific" xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/common" xmlns:ns1="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0):ObsLevelDim:AllDimensions" xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message https://registry.sdmx.org/schemas/v3_0/SDMXMessage.xsd">
	<mes:Header>
		<mes:ID>d6c7fe91-2879-43fc-9b07-4f0d71bfb43f</mes:ID>
		<mes:Test>false</mes:Test>
		<mes:Prepared>2025-11-28T13:28:53Z</mes:Prepared>
		<mes:Sender id="ZZZ"/>
		<mes:Structure structureID="BIS_DER" namespace="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)" dimensionAtObservation="AllDimensions">
			<com:StructureUsage>urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)</com:StructureUsage>
		</mes:Structure>
	</mes:Header>
	<mes:DataSet ss:structureRef="BIS_DER" xsi:type="ns1:DataSetType" action="Information">
		<Series FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" AVAILABILITY="#N/A" COLLECTION="A">
			<Obs TIME_PERIOD="2003" OBS_STATUS="M" OBS_VALUE="NaN" />
			<Obs TIME_PERIOD="2004" OBS_CONF="#N/A" OBS_VALUE="14206.490766" />
		</Series>
		<Series FREQ="M" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_FORMAT="P1M"/>
	</mes:DataSet>
</mes:StructureSpecificData>

In the other direction (from XML to CSV), things worked better. I was expecting one more row, with the series information. Unfortunately, it's not there, because the information has been pushed down to the observations, but at least it has been pushed down in a correct way, so this might be ok. I emphasized might because this means that, if an observation exists, everything is pushed down (though this is unnecessary), but if no observation exists, a "series" row is created, i.e. consumers have to handle two different ways to extract series information.

sosna · 2025-12-18T07:47:39Z

tests/model/test_data_types.py

Thanks, @ruizmaa. But is this new type really needed on your side?

I'm asking because then it needs to be handled wherever types are used, in pysdmx (e.g. https://github.com/bis-med-it/pysdmx/blob/develop/src/pysdmx/toolkit/pd/__init__.py) but also outside of it (e.g. in some of our internal systems that use pysdmx and that have mappings based on pysdmx.model.DataType).

In short, adding a new type has a cost and I want to be sure this is really needed. Thanks!

javihern98 · 2025-12-18T18:50:35Z

Need to review here #479, second and third check specified "after merging #470"

javihern98 · 2025-12-18T18:53:38Z

src/pysdmx/model/concept.py

+    INCREMENTAL = "Incremental"
+    """An integer type that increases sequentially."""


@ruizmaa Please remove this data type, see https://github.com/bis-med-it/pysdmx/pull/470/changes#r2632275417

javihern98 · 2025-12-19T15:32:54Z

Hi @sosna, we will focus here only on the explicit null values, since what is described in your comment #470 (review) suggests that we are not writing the attributes related to a series properly.

Therefore we should address this in a separate issue (I created issue #492 to discuss this). We will nevertheless ensure we are not writing empty attributes.

ruizmaa added 13 commits November 20, 2025 19:12

feat(toolkit): implement schema-aware null value handling by adding f…

009d83e

…ill_na_values utility function to handle missing values

feat(io/csv): enforce schema validation and correct null handling

45759ec

feat(writer): enhance null value handling by integrating fill_na_valu…

f05bde4

…es utility function

feat(validation): add schema validation for datasets in CSV writer

30f4140

fix(tests/csv): update tests and references to match schema and issue…

d568d16

… requirements

fix(tests/io): adapt tests and samples to supply Schema to writer

36306ad

test: reach 100% code coverage

890644f

refactor(tests): expose NUMERIC_TYPES at module level for easier test…

8d2a1ca

… imports

fix: ruff and mypy minor fixex

784ebe5

fix: refactor to pass ruff complexity check

0822f18

feat(model/data_types): add new DataType values and corresponding tests

f3bfe20

feat(reader): disable default NA conversion and implement null valida…

4eab870

…tion

feat(tests): add validation tests for explicit null values handling

cb02efa

ruizmaa linked an issue Nov 26, 2025 that may be closed by this pull request

Add more options to represent missing values #396

Open

ruizmaa marked this pull request as draft November 26, 2025 09:34

ruizmaa requested review from javihern98 and sosna November 26, 2025 09:34

ruizmaa self-assigned this Nov 26, 2025

ruizmaa and others added 3 commits November 26, 2025 10:50

Merge branch 'develop' into cr-396

62e941e

fix(tests/io): general writer tests failures after merge

1718b7c

fix(test/csv): fix CSV writer tests failing in data-only environment

d91fe37

ruizmaa added the bug Something isn't working label Nov 26, 2025

ruizmaa added this to the 1.10.0 milestone Nov 26, 2025

ruizmaa removed the request for review from sosna November 26, 2025 11:41

refactor: move internal pandas utils from toolkit to pysdmx.util._pd_…

17dc97e

…utils (_fill_na_values, _validate_explicit_null_values and NUMERIC_TYPES)

javihern98 removed this from the 1.10.0 milestone Nov 27, 2025

javihern98 mentioned this pull request Nov 27, 2025

Add VTL dataset to Pandas dataset conversion #461

Merged

ruizmaa added 3 commits November 27, 2025 12:01

refactor(pd_utils): update _fill_na_values and _validate_explicit_nul…

09a8811

…l_values to use Schema type for structure parameter

chore(pd_utils): moved _pd_utils.py from utils/ to io/

54252c4

fix(mypy): extract schema to variable for proper type narrowing

ff8dc97

ruizmaa requested a review from javihern98 December 9, 2025 16:48

ruizmaa and others added 11 commits December 9, 2025 18:13

refactor (tests/io): remove test_writer_invalid_structure (already te…

e8fe709

…sted with every format at test_write_sdmx_dataset_without_structure)

fix ruff

47a56b5

Merge branch 'develop' into cr-396

f60bdaa

refactor: remove _fill_na_values function and its usages

b1c9281

fix ruff

0beeb19

enhace(io/xml): avoid writing mandatory empty attributes as empty str…

1b2b6f3

…ings

refactor: remove redundant function __series_processing

ae3c380

enhance(tests): parameterize test_cross_format_integrity for XML and …

e4d22c8

…CSV formats

refactor(writer): improve handling of optional attributes and series …

a984aed

…processing

ruff fix

99da7da

refactor(generic writter): improve readability and unify logic in gen…

8ab6373

…eric series processing

javihern98 requested a review from sosna December 15, 2025 15:55

ruizmaa added 4 commits December 15, 2025 18:12

refactor(writer): enhance observation attribute handling to include r…

a40ff28

…equired attributes

refactor: restore original __series_processing function name

95caf17

refactor(writer): remove redundant attribute addition in dataset proc…

a5aac2e

…essing

refactor(io/xml): __series_processing to improve readability and redu…

348b5ee

…ce duplication

javihern98 mentioned this pull request Dec 17, 2025

Fix exception when using dimensionAtObservation=AllDimensions #481

Merged

sosna requested changes Dec 18, 2025

javihern98 reviewed Dec 18, 2025

ruizmaa and others added 2 commits December 18, 2025 20:55

refactor(data types): remove INCREMENTAL type from DataType enum

5797967

Merge branch 'develop' into cr-396

2201b24

javihern98 mentioned this pull request Dec 19, 2025

Writing attributes at series level as a new series with no obs #492

Closed

ruizmaa changed the title ~~Enforce strict null value handling, schema validation and extend DataTypes~~ Enforce strict null value handling and schema validation Dec 19, 2025

ruizmaa marked this pull request as draft January 4, 2026 02:15

Merge branch 'develop' into cr-396

f1a784c

ruizmaa mentioned this pull request Jan 7, 2026

Enforce Schema validation and preserve null values #499

Open

ruizmaa wants to merge 73 commits into develop from cr-396

3 participants
