
Discriminate between datasets and scalars in DatasetSchedule #571

@javihern98

Description

Problem Description

DatasetSchedule.global_inputs is a flat list of all external dependencies without distinguishing datasets from scalars. This causes two problems:

  1. Imprecise validation: _validate_extra_datasets checks datasets against all global_inputs, which mixes datasets and scalars together (check to be introduced in #569, "Validate that data_structures does not contain datasets not referenced by the script").
  2. Missing scalar tracking: Scalars used inside RegularAggregation (calc, filter, rename, etc.) end up in unknown_variables and are never included in global_inputs at all — a gap in dependency tracking.

Current behavior:

from vtlengine.API import create_ast
from vtlengine.AST.DAG import DAGAnalyzer

script = """
    SC_r := SC_1 + SC_2;
    DS_r <- DS_1[calc Me_2 := Me_1 + SC_r];
"""

ast = create_ast(script)
ds = DAGAnalyzer.ds_structure(ast)

print(ds.global_inputs)  # ['DS_1'] — SC_1 and SC_2 are missing!

Proposed Solution

Split the external dependencies in DatasetSchedule into four categories, keeping global_inputs as their union (no duplicates):

  • global_input_datasets: definite datasets (dataset operand of a RegularAggregation, an Identifier with kind="DatasetID", or a name that feeds into dataset ops). The user must provide a dataset.
  • global_input_scalars: definite scalars (the name feeds only scalar chains seeded from constants, identified via propagation). The user must provide a scalar.
  • global_input_dataset_or_scalar: top-level VarID ambiguity (e.g., DS_r <- X + 2, where X could be a dataset or a scalar). The user may provide either.
  • global_input_component_or_scalar: ambiguity inside a RegularAggregation (e.g., DS_1[calc Me_2 := Me_1 + X]). The user may provide a scalar, but semantic error 1-1-6-11 is raised if it collides with a component name.
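
A minimal sketch of how the extended model could look, assuming DatasetSchedule in src/vtlengine/AST/DAG/_models.py is a dataclass and that global_inputs becomes a derived property; the field types, ordering, and the property itself are assumptions, not the existing implementation:

from dataclasses import dataclass, field

@dataclass
class DatasetSchedule:
    # Existing fields omitted; only the proposed split is sketched here.
    global_input_datasets: list[str] = field(default_factory=list)
    global_input_scalars: list[str] = field(default_factory=list)
    global_input_dataset_or_scalar: list[str] = field(default_factory=list)
    global_input_component_or_scalar: list[str] = field(default_factory=list)

    @property
    def global_inputs(self) -> list[str]:
        # Union of the four categories, order-preserving and without duplicates.
        merged = (
            self.global_input_datasets
            + self.global_input_scalars
            + self.global_input_dataset_or_scalar
            + self.global_input_component_or_scalar
        )
        return list(dict.fromkeys(merged))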

Expected behavior:

from vtlengine.API import create_ast
from vtlengine.AST.DAG import DAGAnalyzer

script = """
    SC_r := SC_1 + SC_2;
    DS_r <- DS_1[calc Me_2 := Me_1 + SC_r];
"""

ast = create_ast(script)
ds = DAGAnalyzer.ds_structure(ast)

print(ds.global_input_datasets)            # ['DS_1']
print(ds.global_input_scalars)             # ['SC_1', 'SC_2']
print(ds.global_input_dataset_or_scalar)   # []
print(ds.global_input_component_or_scalar) # []
print(ds.global_inputs)                    # ['DS_1', 'SC_1', 'SC_2']
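
For the two ambiguous categories (both empty above), a script along the lines of the table's examples would be classified as follows; the names X and Y and the printed results are illustrative of the proposal, not output from existing code:

from vtlengine.API import create_ast
from vtlengine.AST.DAG import DAGAnalyzer

# X is a top-level VarID (dataset or scalar); Y appears only inside a calc clause.
script = """
    DS_r <- X + 2;
    DS_2 <- DS_1[calc Me_2 := Me_1 + Y];
"""

ast = create_ast(script)
ds = DAGAnalyzer.ds_structure(ast)

print(ds.global_input_datasets)            # ['DS_1']
print(ds.global_input_dataset_or_scalar)   # ['X'] (the user may provide either)
print(ds.global_input_component_or_scalar) # ['Y'] (scalar, unless it collides with a component name)
print(ds.global_inputs)                    # ['DS_1', 'X', 'Y']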

Classification approach (pure AST analysis):

  1. Track which statements involve dataset operations (has_dataset_op flag) — set when visiting RegularAggregation, JoinOp, Aggregation, Analytic, HROperation, DPValidation, or Identifier with kind == "DatasetID".
  2. Identify scalar outputs via fixed-point propagation: seed from constant-only assignments (a := 1), then propagate through statements with no dataset ops where all inputs are known scalars.
  3. Classify global inputs using the decision tree:
    • In unknown_variables (from RegularAggregation) → global_input_component_or_scalar
    • Used as input to a statement with has_dataset_op=True → global_input_datasets
    • Feeds ONLY into scalar chains (no dataset ops) → global_input_scalars
    • Otherwise → global_input_dataset_or_scalar
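
A minimal sketch of steps 2 and 3, assuming the AST visitor from step 1 has already been reduced to a per-statement summary; StatementInfo, scalar_outputs, and classify_global_input are illustrative names, not existing vtlengine code:

from dataclasses import dataclass, field

@dataclass
class StatementInfo:
    # Hypothetical per-statement summary collected while visiting the AST (step 1).
    output: str
    inputs: set[str] = field(default_factory=set)
    has_dataset_op: bool = False   # RegularAggregation, JoinOp, Aggregation, Analytic, ...
    constant_only: bool = False    # e.g. `a := 1;`

def scalar_outputs(statements: list[StatementInfo]) -> set[str]:
    # Step 2: fixed-point propagation. Seed with constant-only assignments, then keep
    # marking outputs of dataset-op-free statements whose inputs are all known scalars.
    scalars = {s.output for s in statements if s.constant_only}
    changed = True
    while changed:
        changed = False
        for s in statements:
            if s.output not in scalars and not s.has_dataset_op and s.inputs and s.inputs <= scalars:
                scalars.add(s.output)
                changed = True
    return scalars

def classify_global_input(
    name: str,
    unknown_variables: set[str],         # names seen inside a RegularAggregation clause
    feeds_dataset_op: set[str],          # names used as input to a statement with has_dataset_op=True
    feeds_only_scalar_chains: set[str],  # names whose every use sits on a scalar chain (no dataset ops)
) -> str:
    # Step 3: the decision tree, in the order given above.
    if name in unknown_variables:
        return "global_input_component_or_scalar"
    if name in feeds_dataset_op:
        return "global_input_datasets"
    if name in feeds_only_scalar_chains:
        return "global_input_scalars"
    return "global_input_dataset_or_scalar"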

Consumers updated:

  • _validate_extra_datasets → use global_input_datasets + global_input_dataset_or_scalar
  • _extract_input_datasets → return global_input_datasets + global_input_dataset_or_scalar
  • _save_datapoints_efficient → use global_input_datasets
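
As a rough illustration of the first consumer change, assuming _validate_extra_datasets receives the user-supplied dataset names and the analyzed schedule (the signature, set arithmetic, and error type are assumptions about shape, not the actual vtlengine implementation):

def _validate_extra_datasets(provided_dataset_names: set[str], schedule) -> None:
    # Only names that may legitimately be satisfied by a dataset are accepted.
    allowed = set(schedule.global_input_datasets) | set(schedule.global_input_dataset_or_scalar)
    extra = provided_dataset_names - allowed
    if extra:
        raise ValueError(f"Datasets not referenced by the script: {sorted(extra)}")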

Alternatives Considered

  • API-layer classification: Cross-reference global_inputs with loaded datasets/scalars dicts after both DAG analysis and data loading. Simpler but doesn't fix the missing scalar tracking gap (scalars in RegularAggregation never appear in global_inputs).
  • Two categories only (global_input_datasets + global_input_scalars): Doesn't capture the ambiguity cases where the user can provide either type, or where component/scalar collision detection is needed.

Additional Context

Key files:

  • src/vtlengine/AST/DAG/_models.py: StatementDeps, DatasetSchedule
  • src/vtlengine/AST/DAG/__init__.py: DAGAnalyzer, _ds_usage_analysis()
  • src/vtlengine/API/__init__.py: _validate_extra_datasets, _extract_input_datasets
  • src/vtlengine/Interpreter/__init__.py: _save_datapoints_efficient
