Enforce Schema validation and preserve null values#499
…ema can be the structure
Force-pushed 74eada3 to 9bd8cd0
…der.read_sdmx() to avoid error: `ImportError: The 'xml' extra is required to run the reading and writing of SDMX-ML Messages`
…rite_csv_10 and test_read_xml_write_csv_20)
…med to nulls and not required empties are skipped)
…m_dataframe_for_writing() ensures values are not None
@ruizmaa Need to check conflicts on this PR, please address them 😊
…ready performed in writing_validation() before
…nces
- Fix `_get_value_to_write` to skip optional null values (`nan`/`None`/`pd.NA`) instead of writing explicit `#N/A` or `NaN` representations. Required components still get explicit null representations.
- Reorder CSV columns following SDMX-CSV conventions: dimensions first, then measures, then attributes.
- Fix `__write_keys` to use `get_codes()` order instead of DataFrame column iteration order for deterministic key construction.
- Fix `schema_generator` to handle `dataflow.structure` when it is already a `DataStructureDefinition` instance instead of a URN string (#534).
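The skipping rule described in the commit above can be sketched as follows. This is an illustrative sketch, not the library's actual code: `NUMERIC_TYPES`, `_is_null` and the exact signatures are assumptions.

```python
import pandas as pd

# Illustrative set of numeric component types; the library defines its own.
NUMERIC_TYPES = {"Integer", "Long", "Float", "Double", "Numeric"}

def _get_null_representation(dtype: str) -> str:
    # "NaN" for numeric components, "#N/A" for everything else
    return "NaN" if dtype in NUMERIC_TYPES else "#N/A"

def _is_null(value) -> bool:
    # value != value catches float NaN; the `is` checks run first so
    # pd.NA never reaches the comparison (bool(pd.NA) would raise).
    return value is None or value is pd.NA or value != value

def get_value_to_write(value, dtype: str, required: bool):
    """Return the string to write, or None to skip the component."""
    if _is_null(value):
        # Optional nulls are skipped; required ones get an explicit marker.
        return _get_null_representation(dtype) if required else None
    return str(value)
```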
Apply the isinstance check for DataStructureDefinition to both the dataflow and provisionagreement branches in schema_generator, and add tests covering both paths to restore 100% coverage.
Extract _resolve_dsd and _build_schema to eliminate duplicated DSD resolution and Schema construction logic between the dataflow and provisionagreement branches. Removes the # noqa: C901 complexity override.
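The extraction above can be sketched as a shared resolution-plus-construction path. This is a minimal sketch under assumptions: the URN lookup, `SimpleNamespace` stand-in for the Schema, and parameter names are illustrative.

```python
from types import SimpleNamespace

def _resolve_dsd(artefact, dsds_by_urn):
    # `structure` may already be a resolved DSD instance, or still a URN string
    structure = artefact.structure
    if isinstance(structure, str):
        return dsds_by_urn[structure]  # still a URN: resolve it
    return structure                   # already a DSD instance

def _build_schema(context, dsd):
    # Single place where the Schema is assembled for both branches
    return SimpleNamespace(context=context, dsd=dsd)

def schema_generator(context, artefact, dsds_by_urn):
    # The dataflow and provisionagreement branches now share this path,
    # removing the duplicated logic (and the C901 complexity override).
    return _build_schema(context, _resolve_dsd(artefact, dsds_by_urn))
```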
The DSD instance handling fix and schema_generator simplification have been moved to the cr-534 branch to keep concerns separated.
Structure-specific formats support partial updates where unchanged attribute values don't need to be resent. The validation was too strict, requiring all required non-dataset-level attributes to be present in the DataFrame columns. Now only dimensions and measures are mandatory.

Related issues filed for 1.14.0:
- #540: Add writer support for Time Series formats (STRTS / GENTS)
- #541: Dataflow short_urn becomes DataStructure in output
- #542: Accept string for dimension_at_observation parameter
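The relaxed validation described above can be sketched as a simple column check. The component lists, function name, and error type here are illustrative assumptions, not the library's API.

```python
import pandas as pd

def validate_mandatory_columns(df, dimensions, measures):
    # Only dimensions and measures must appear as columns; attributes
    # may be omitted to support partial updates.
    missing = (set(dimensions) | set(measures)) - set(df.columns)
    if missing:
        raise ValueError(f"Missing mandatory components: {sorted(missing)}")

df = pd.DataFrame({"FREQ": ["A"], "REF_AREA": ["ES"], "OBS_VALUE": [1.0]})
validate_mandatory_columns(df, ["FREQ", "REF_AREA"], ["OBS_VALUE"])  # passes
```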
Resolve merge conflicts integrating partial keys feature (develop) with schema validation and null value handling (cr-396-clean).
Fixes branch coverage on Python 3.9 where pandas converts None to NaN.
Replaces pd.isna() with str() comparison to avoid branch coverage issues across Python versions (3.9 vs 3.12 handle None/NaN differently).
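The workaround can be sketched as below: comparing the string form of the value hits a single branch whether pandas hands back `None` (3.9) or `NaN` (3.12). The set name and function name are illustrative.

```python
import numpy as np
import pandas as pd

# String forms of the null-like values pandas may produce.
NULL_STRINGS = {"None", "nan", "<NA>"}

def is_null_like(value) -> bool:
    # str(None) == "None", str(np.nan) == "nan", str(pd.NA) == "<NA>",
    # so all null flavors take the same branch regardless of version.
    return str(value) in NULL_STRINGS
```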
This PR addresses issue #396 from scratch, superseding PR #470.
Main changes
1. Schema enforcement
A `Schema` object in `dataset.structure` is now enforced: `validate_schema_exists` has been implemented to check for it, raising an `Invalid` error if there is no schema in the dataset to convert.

2. No more conversion of nulls to empty strings
- Removed `.replace({"nan": "", "<NA>": ""})`, `.fillna("")` and `.replace("nan", "")` from the writers.
- Changed `pd.read_csv(na_values=[""], ...)` to `pd.read_csv(na_values=[], ...)` in the readers.

3. Normalize nulls when reading
- In the structure-specific reader (`__ss_aux_reader.py`), changed the null normalization from `.replace(np.nan, "")` and `.replace({"nan": "", "<NA>": ""})` to `.replace({"nan": "NaN", "<NA>": "NaN"})`.
- Removed `.replace(np.nan, "")` in `__reading_generic_groups()`, `__reading_generic_series()` and `__reading_generic_all()`.

4. Deleted `__remove_optional_attributes_empty_data()` and added `transform_dataframe_for_writing()`
`transform_dataframe_for_writing()` receives a `pd.DataFrame` and a `Schema` and converts nulls and empty values depending on whether each component is required. It relies on auxiliary functions:
- `_get_value_to_write`: decides, based on the value's `dtype` and whether the component is `required`, whether the value should be written (if it has any value or is null), skipped (if it is not required and has an empty value `""`), or written as null (if it is originally null (`NaN` or `#N/A`), or if it is required and has an empty value `""`). It uses further helpers such as `_is_null_value`.
- `_is_nullable_integer_dtype`: a helper to write numbers in the correct format (avoiding `1` being written as `1.0`, or `1.0` as `1`).
- `_get_null_representation`: returns `"NaN"` for numeric components and `"#N/A"` for the others, based on the `NUMERIC_TYPES` set.

So the new behavior is:
| Original value | Required component | Optional component |
| --- | --- | --- |
| null (`NaN` / `#N/A`) | `"NaN"` (numeric) / `"#N/A"` (otherwise) | `"NaN"` (numeric) / `"#N/A"` (otherwise) |
| empty `""` | `"NaN"` (numeric) / `"#N/A"` (otherwise) | `""` (CSV: `,,` / XML: `value=""`) |

Tests
Fixed to fit into the new Schema requirement
CSV test writers
- `tests/io/csv/sdmx10/test_writer_v1.py`: added a `Schema` fixture and fixed sample files (`reference.csv` and `reference_attch_atts.csv`) to match the schema's dataflow id.
- `tests/io/csv/sdmx20/test_writer_v2.py`: added `Schema` fixtures, one with context `datastructure` and another with context `dataprovision`. Fixed sample files (`reference.csv`, `reference_attch_atts.csv` and `reference_with_action.csv`) to match the schema's dataflow id.
- `tests/io/csv/sdmx21/test_writer_v21.py`: added `Schema` fixtures, one with context `datastructure` and another with context `dataprovision`. Fixed sample files (`reference.csv`, `reference_attch_atts.csv`, `reference_with_action.csv` and `reference_with_append_action.csv`) to match the schema's dataflow id.

General tests
- `tests/io/test_general_writer.py`: added a `Schema` fixture.
- `tests/io/test_writing_validation`: `test_data_rwr_no_structure` should now raise an `Invalid` error (because it has no structure and a `Schema` is required); `test_read_xml_write_csv_10` and `test_read_xml_write_csv_20` were also fixed.

XML tests
- `tests/io/xml/sdmx21/test/reader::test_read_write_structure_specific_all`: added a Schema to the structure.
- `tests/io/xml/sdmx21/test/writer`: `test_data_writing.py::test_data_scape_quote`: added a Schema (`gen_ser.xml` and `test_generic_with_groups.xml`).
- `tests/io/xml/sdmx30/test/reader::test_data_dataflow_31`: added a Schema.

Fixed to fit into the new null values handling
- `tests/io/test_writing_validation.py`: `test_data_rwr` and `test_write_sdmx_csv_read_back`: updated the expected data to fit the new null and empty handling.
- `tests/io/xmls/sdmx21/test_writing_validation::test_data_write_nullable_nulltypes`: extended the test to verify it works as expected.
- `tests/io/xmls/sdmx30/test_writing_validation::test_data_write_nullable_nulltypes`: extended the test to verify it works as expected.

Added to test the new null values handling