Setdiff operator returns matching values with nulls #611

@mla2001

Description

Summary

Currently, the implementation of the SetDiff operator performs a left join between the two operands on their identifiers. If an identifier combination from the left operand is not present in the right operand, the right operand's measures and attributes appear as null in the joined result.

A filter is then applied that checks for null values in order to select the rows to return.

The problem with this implementation is that null values already present in the right operand's data also pass the filter. Rows whose identifiers exist in both operands are therefore returned when they should be excluded.
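The behaviour described above can be reproduced with plain pandas. The following is a minimal sketch, not the vtlengine implementation: the helper names `setdiff_null_filter` and `setdiff_indicator` are hypothetical, and the "any null" filter is an assumption based on the description. The second helper shows one possible alternative that relies on the `indicator` option of `DataFrame.merge` instead of inspecting nulls.

```python
import pandas as pd


def setdiff_null_filter(left, right, ids):
    # Assumed shape of the current approach: left join on the identifiers,
    # then keep rows where any right-hand column is null.
    merged = left.merge(right, on=ids, how="left", suffixes=("", "_right"))
    right_cols = [c for c in merged.columns if c.endswith("_right")]
    mask = merged[right_cols].isnull().any(axis=1)
    return merged.loc[mask, left.columns].reset_index(drop=True)


def setdiff_indicator(left, right, ids):
    # Possible alternative: join only on the identifiers and let pandas
    # report where each row came from, so data nulls play no role.
    merged = left.merge(
        right[ids].drop_duplicates(), on=ids, how="left", indicator=True
    )
    keep = merged["_merge"] == "left_only"
    return merged.loc[keep, left.columns].reset_index(drop=True)


ds_1 = pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [1, 2, 3], "At_1": [1, 2, 3]})
# At_1 is entirely null in the right operand
ds_2 = pd.DataFrame(
    {"Id_1": [3, 4, 5], "Me_1": [1, 2, 3], "At_1": [float("nan")] * 3}
)

# Id_1 = 3 exists in both operands but is still returned by the null filter
print(setdiff_null_filter(ds_1, ds_2, ["Id_1"]))
# Only Id_1 = 1 and Id_1 = 2 are returned
print(setdiff_indicator(ds_1, ds_2, ["Id_1"]))
```

In this sketch the null filter keeps the row with Id_1 = 3 because its joined At_1 value is null, even though the identifier matched. The indicator-based variant decides membership from the join itself, so it is unaffected by nulls in the data.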

Reproducible Example

import pandas as pd
from vtlengine import run

script = "DS_r <- setdiff(DS_1, DS_2);"

data_structures = {
    "datasets": [
        {
            "name": "DS_1",
            "DataStructure": [
                {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
                {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
                {"name": "At_1", "type": "Number", "role": "Measure", "nullable": True},
            ],
        },
        {
            "name": "DS_2",
            "DataStructure": [
                {"name": "Id_1", "type": "Integer", "role": "Identifier", "nullable": False},
                {"name": "Me_1", "type": "Number", "role": "Measure", "nullable": True},
                {"name": "At_1", "type": "Number", "role": "Measure", "nullable": True},
            ],
        },
    ]
}

datapoints = {
    "DS_1": pd.DataFrame({"Id_1": [1, 2, 3], "Me_1": [1, 2, 3], "At_1": [1, 2, 3]}),
    # At_1 not defined, will be filled with nulls
    "DS_2": pd.DataFrame({"Id_1": [3, 4, 5], "Me_1": [1, 2, 3]}),
}

print(run(script=script, data_structures=data_structures, datapoints=datapoints))

vtlengine version

1.6.0

Python version

Any

OS

Any
