…ill_na_values utility function to handle missing values
…es utility function
…utils (_fill_na_values, _validate_explicit_null_values and NUMERIC_TYPES)
…l_values to use Schema type for structure parameter
…sted with every format at test_write_sdmx_dataset_without_structure)
…eric series processing
Hi @sosna, this is ready for your review now that we have gone over the changes discussed in the meeting.
…equired attributes
sosna
left a comment
I have tested using the agreed input files sent a week ago.
This is the CSV input:
STRUCTURE,STRUCTURE_ID,ACTION,FREQ,DER_TYPE,DER_INSTR,DER_RISK,DER_REP_CTY,DER_SECTOR_CPY,DER_CPC,DER_SECTOR_UDL,DER_CURR_LEG1,DER_CURR_LEG2,DER_ISSUE_MAT,DER_RATING,DER_EX_METHOD,DER_BASIS,AVAILABILITY,COLLECTION,TIME_FORMAT,TIME_PERIOD,OBS_STATUS,OBS_VALUE,OBS_CONF
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,#N/A,A,,,,,
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,,,,2003,M,NaN,
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,,,,2004,,14206.490766,#N/A
dataflow,BIS:BIS_DER(1.0),M,A,U,A,B,5J,A,1E,A,HKD,TO1,A,A,3,C,,,P1M,,,,
This is the output in XML:
<?xml version="1.0" encoding="UTF-8"?>
<mes:StructureSpecificData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message" xmlns:ss="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/data/structurespecific" xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/common" xmlns:ns1="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0):ObsLevelDim:AllDimensions" xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message https://registry.sdmx.org/schemas/v3_0/SDMXMessage.xsd">
<mes:Header>
<mes:ID>b76888bf-3f87-411f-a456-c95da43995ee</mes:ID>
<mes:Test>false</mes:Test>
<mes:Prepared>2025-12-18T08:01:43Z</mes:Prepared>
<mes:Sender id="ZZZ"/>
<mes:Structure structureID="BIS_DER" namespace="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)" dimensionAtObservation="AllDimensions">
<com:StructureUsage>urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)</com:StructureUsage>
</mes:Structure>
</mes:Header>
<mes:DataSet ss:structureRef="BIS_DER" xsi:type="ns1:DataSetType" action="Append">
<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="" OBS_VALUE="" OBS_STATUS="" AVAILABILITY="#N/A" COLLECTION="A" />
<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="2003" OBS_VALUE="NaN" OBS_STATUS="M" AVAILABILITY="" COLLECTION="" />
<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="2004" OBS_VALUE="14206.490766" OBS_STATUS="" AVAILABILITY="" COLLECTION="" OBS_CONF="#N/A" />
<Obs FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_PERIOD="" OBS_VALUE="" TIME_FORMAT="P1M" OBS_STATUS="" AVAILABILITY="" COLLECTION="" />
</mes:DataSet>
</mes:StructureSpecificData>
The issues reported previously still seem to be there, namely:
- Everything is an observation, but Series were expected. Even worse, not only is it an Obs, but TIME_PERIOD is empty (`TIME_PERIOD=""`), which can never be the case (i.e. if you claim something is an Obs, then no dimension can be missing).
- There are empty attributes, which is also unexpected (e.g. `AVAILABILITY="" COLLECTION=""`).
As discussed, the expected output file was:
<?xml version="1.0" encoding="UTF-8"?>
<mes:StructureSpecificData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message" xmlns:ss="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/data/structurespecific" xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/common" xmlns:ns1="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0):ObsLevelDim:AllDimensions" xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v3_0/message https://registry.sdmx.org/schemas/v3_0/SDMXMessage.xsd">
<mes:Header>
<mes:ID>d6c7fe91-2879-43fc-9b07-4f0d71bfb43f</mes:ID>
<mes:Test>false</mes:Test>
<mes:Prepared>2025-11-28T13:28:53Z</mes:Prepared>
<mes:Sender id="ZZZ"/>
<mes:Structure structureID="BIS_DER" namespace="urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)" dimensionAtObservation="AllDimensions">
<com:StructureUsage>urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:BIS_DER(1.0)</com:StructureUsage>
</mes:Structure>
</mes:Header>
<mes:DataSet ss:structureRef="BIS_DER" xsi:type="ns1:DataSetType" action="Information">
<Series FREQ="A" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" AVAILABILITY="#N/A" COLLECTION="A">
<Obs TIME_PERIOD="2003" OBS_STATUS="M" OBS_VALUE="NaN" />
<Obs TIME_PERIOD="2004" OBS_CONF="#N/A" OBS_VALUE="14206.490766" />
</Series>
<Series FREQ="M" DER_TYPE="U" DER_INSTR="A" DER_RISK="B" DER_REP_CTY="5J" DER_SECTOR_CPY="A" DER_CPC="1E" DER_SECTOR_UDL="A" DER_CURR_LEG1="HKD" DER_CURR_LEG2="TO1" DER_ISSUE_MAT="A" DER_RATING="A" DER_EX_METHOD="3" DER_BASIS="C" TIME_FORMAT="P1M"/>
</mes:DataSet>
</mes:StructureSpecificData>
In the other direction (from XML to CSV), things worked better. I was expecting one more row, with the series information. Unfortunately, it's not there, because the information has been pushed down to the observations, but at least it has been pushed down in a correct way, so this might be ok. I emphasize "might" because this means that, if an observation exists, everything is pushed down (though this is unnecessary), but if no observation exists, a "series" row is created, i.e. consumers have to handle two different ways to extract series information.
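The Series/Obs layout expected in this review can be sketched roughly as follows. This is only an illustration under stated assumptions, not pysdmx code: `group_rows` is a hypothetical helper, the series key is shortened to two dimensions, and the split between series-level and observation-level attributes is hard-coded here, whereas in practice it would come from the data structure definition.

```python
# Hypothetical, simplified split of components (assumed for illustration only).
DIMENSIONS = ["FREQ", "DER_TYPE"]                    # shortened series key
SERIES_ATTRS = ["AVAILABILITY", "COLLECTION", "TIME_FORMAT"]
OBS_FIELDS = ["OBS_STATUS", "OBS_VALUE", "OBS_CONF"]


def group_rows(rows):
    """Group flat CSV-like rows into series with nested observations.

    Empty strings are treated as "not reported" and are never emitted,
    so no empty attributes end up in the result.
    """
    series = {}
    for row in rows:
        key = tuple(row[d] for d in DIMENSIONS)
        entry = series.setdefault(key, {"attrs": {}, "obs": []})
        # Series-level attributes are attached to the series, not pushed down.
        for name in SERIES_ATTRS:
            if row.get(name, "") != "":
                entry["attrs"][name] = row[name]
        # A row only yields an observation when TIME_PERIOD is present.
        if row.get("TIME_PERIOD", "") != "":
            obs = {"TIME_PERIOD": row["TIME_PERIOD"]}
            obs.update({n: row[n] for n in OBS_FIELDS if row.get(n, "") != ""})
            entry["obs"].append(obs)
    return series


empty = {name: "" for name in SERIES_ATTRS + OBS_FIELDS + ["TIME_PERIOD"]}
rows = [
    {**empty, "FREQ": "A", "DER_TYPE": "U", "AVAILABILITY": "#N/A", "COLLECTION": "A"},
    {**empty, "FREQ": "A", "DER_TYPE": "U", "TIME_PERIOD": "2003",
     "OBS_STATUS": "M", "OBS_VALUE": "NaN"},
    {**empty, "FREQ": "A", "DER_TYPE": "U", "TIME_PERIOD": "2004",
     "OBS_VALUE": "14206.490766", "OBS_CONF": "#N/A"},
]
grouped = group_rows(rows)
```

With these three rows, the single series carries `AVAILABILITY` and `COLLECTION`, and only the two rows that have a `TIME_PERIOD` become observations.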
Thanks, @ruizmaa. But is this new type really needed on your side?
I'm asking because it then needs to be handled wherever types are used, in pysdmx (e.g. https://github.com/bis-med-it/pysdmx/blob/develop/src/pysdmx/toolkit/pd/__init__.py) but also outside of it (e.g. in some of our internal systems that use pysdmx and that have mappings based on pysdmx.model.DataType).
In short, adding a new type has a cost and I want to be sure this is really needed. Thanks!
src/pysdmx/model/concept.py
Outdated
INCREMENTAL = "Incremental"
"""An integer type that increases sequentially."""
@ruizmaa Please remove this data type, see https://github.com/bis-med-it/pysdmx/pull/470/changes#r2632275417
Hi @sosna. We will focus here only on the explicit null values. What you describe in your comment #470 (review) suggests that we are not writing the attributes related to a series properly, so we should address that in a separate issue (I created issue #492 to discuss it). We will nevertheless ensure we are not writing empty attributes.
This PR fixes issue #396 (handling of missing values in SDMX messages). It focuses on preserving the distinction between empty strings and null values and on enforcing schema requirements during writing.
Main changes
This PR implements the following changes:
1. Schema enforcement
- …`Schema` object in `dataset.structure`. `_validate_schema_exists` has been implemented to ensure this, raising an `Invalid` error if there is no schema in the dataset to convert.
2. CSV reading
- …`pd.read_csv` configuration (`keep_default_na=False`, `na_values=[]`) in all readers (v1.0, v2.0, v2.1). This ensures strings are read literally, so they can be handled correctly. Previously, `na_values=[""]` converted empty strings to `NaN`. Now, with `na_values=[]`, empty cells remain empty strings, preserving the distinction between truly missing data and empty values.
3. XML reading
- …`__reading_generic_all` in `generic.py` to ensure all observations contain all attributes defined in the structure, filling absent ones with `""`. This ensures a consistent DataFrame structure and prevents missing columns in the output.
4. CSV writing
- …`_csv_prepare_df` to apply the structural logic before writing. This ensures that required columns exist (even if null) and empty optional columns are removed.
- …`.replace({"nan": "", "<NA>": ""})` logic. The writer now relies on the clean DataFrame state from `_csv_prepare_df`.
- …(`_csv_structure_ref_and_id`, `_csv_determine_action`, `_csv_insert_labels_action`).
5. XML writing
- …(`_format_observation_attributes`) to filter attributes based on the schema.
- …`__process_series_observations` and `__format_ser_str`. The logic now identifies and skips observations/series that are entirely empty or null, preventing the generation of empty `<Obs>` tags.
- …`.replace({"nan": "", "<NA>": ""})` logic.
Tests
- …`Schema` for writers, ensuring compliance with the new validation logic.
New tests
Schema validation
- …`test_write_sdmx_dataset_without_structure` to ensure writing without a `Schema` raises an `Invalid` error.
Null value reading
- …`test_read_csv_v{1,2,21}_nulls` and `test_read_xml_v{21,30,31}_nulls` to verify `NaN` and `#N/A` are read as literal strings, not converted to pandas nulls.
Attribute handling
- `test_attributes_preservation_csv_source` verifies required attributes are filled with `"#N/A"`, optional empty attributes are dropped, and literal strings like "Nan" are preserved.
- `test_xml_to_csv_attributes_preservation_xml_source`: same validation starting from an XML source.
Roundtrip & schema validation
- `test_data_rwr` / `test_write_sdmx_csv_read_back` verify data integrity through format conversions.
- `test_cross_format_integrity` verifies data integrity by reading equivalent CSV and XML files, then writing them back to both formats to ensure the four outputs (CSV->CSV, CSV->XML, XML->CSV and XML->XML) match the two originals.
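
The effect of the `pd.read_csv` settings described under "CSV reading" can be illustrated with a small sketch. It is not the pysdmx reader itself: the helper names `read_literal` and `read_default` and the cut-down CSV are assumptions made for the example, and `dtype=str` is added here only to keep every column textual.

```python
import io

import pandas as pd

# Cut-down sample with only the observation-level columns (illustrative).
CSV_TEXT = (
    "TIME_PERIOD,OBS_STATUS,OBS_VALUE,OBS_CONF\n"
    "2003,M,NaN,\n"
    "2004,,14206.490766,#N/A\n"
)


def read_literal(text):
    """Read every cell as the literal string found in the file."""
    return pd.read_csv(
        io.StringIO(text),
        dtype=str,
        keep_default_na=False,  # do not treat "NaN", "#N/A", "" etc. as missing
        na_values=[],           # and do not add any custom missing markers
    )


def read_default(text):
    """Read with pandas defaults, for comparison."""
    return pd.read_csv(io.StringIO(text), dtype=str)


literal = read_literal(CSV_TEXT)
default = read_default(CSV_TEXT)

# Literal read: the three cases stay distinguishable.
print(repr(literal.loc[0, "OBS_VALUE"]))  # the string "NaN"
print(repr(literal.loc[1, "OBS_CONF"]))   # the string "#N/A"
print(repr(literal.loc[0, "OBS_CONF"]))   # the empty string

# Default read: all three collapse into a pandas missing value.
print(default[["OBS_VALUE", "OBS_CONF"]].isna())
```

With the default settings, `NaN`, `#N/A` and empty cells become indistinguishable, which is the loss of information the literal read avoids.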