Description
Initial Checks
- I have searched existing issues for duplicates
Problem Description
`DatasetSchedule.global_inputs` is a flat list of all external dependencies that does not distinguish datasets from scalars. This causes two problems:
- Imprecise validation: `_validate_extra_datasets` checks datasets against all of `global_inputs`, which mixes datasets and scalars together (to be introduced in "Validate that data_structures does not contain datasets not referenced by the script" #569).
- Missing scalar tracking: scalars used inside `RegularAggregation` (calc, filter, rename, etc.) end up in `unknown_variables` and are never included in `global_inputs` at all, leaving a gap in dependency tracking.
Current behavior:

```python
from vtlengine.API import create_ast
from vtlengine.AST.DAG import DAGAnalyzer

script = """
SC_r := SC_1 + SC_2;
DS_r <- DS_1[calc Me_2 := Me_1 + SC_r];
"""

ast = create_ast(script)
ds = DAGAnalyzer.ds_structure(ast)
print(ds.global_inputs)  # ['DS_1']: SC_1 and SC_2 are missing!
```

Proposed Solution
Split `global_inputs` into four categories in `DatasetSchedule`, with `global_inputs` as their union (no duplicates):

| Field | Context | Behavior |
|---|---|---|
| `global_input_datasets` | Definite datasets (dataset operand of `RegularAggregation`, `Identifier` with `kind="DatasetID"`, feeds into dataset ops) | Must provide a dataset |
| `global_input_scalars` | Definite scalars (feeds only scalar chains from constants, identified via propagation) | Must provide a scalar |
| `global_input_dataset_or_scalar` | Top-level `VarID` ambiguity (e.g., `DS_r <- X + 2` where `X` could be a dataset or a scalar) | User may provide either |
| `global_input_component_or_scalar` | Ambiguity inside `RegularAggregation` (e.g., `DS_1[calc Me_2 := Me_1 + X]`) | User may provide a scalar, but semantic error 1-1-6-11 is raised if it collides with a component name |
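The split fields could be modelled roughly as below. This is a hypothetical sketch, not the actual `DatasetSchedule` in `src/vtlengine/AST/DAG/_models.py`; the field names follow the table above, and the `global_inputs` property implements the "union, no duplicates" rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetSchedule:
    """Sketch of the proposed split; the real model in _models.py may differ."""

    global_input_datasets: List[str] = field(default_factory=list)
    global_input_scalars: List[str] = field(default_factory=list)
    global_input_dataset_or_scalar: List[str] = field(default_factory=list)
    global_input_component_or_scalar: List[str] = field(default_factory=list)

    @property
    def global_inputs(self) -> List[str]:
        # Union of the four buckets, order-preserving, without duplicates.
        seen = set()
        union: List[str] = []
        for bucket in (
            self.global_input_datasets,
            self.global_input_scalars,
            self.global_input_dataset_or_scalar,
            self.global_input_component_or_scalar,
        ):
            for name in bucket:
                if name not in seen:
                    seen.add(name)
                    union.append(name)
        return union
```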
Expected behavior:

```python
from vtlengine.API import create_ast
from vtlengine.AST.DAG import DAGAnalyzer

script = """
SC_r := SC_1 + SC_2;
DS_r <- DS_1[calc Me_2 := Me_1 + SC_r];
"""

ast = create_ast(script)
ds = DAGAnalyzer.ds_structure(ast)
print(ds.global_input_datasets)             # ['DS_1']
print(ds.global_input_scalars)              # ['SC_1', 'SC_2']
print(ds.global_input_dataset_or_scalar)    # []
print(ds.global_input_component_or_scalar)  # []
print(ds.global_inputs)                     # ['DS_1', 'SC_1', 'SC_2']
```

Classification approach (pure AST analysis):
- Track which statements involve dataset operations (`has_dataset_op` flag), set when visiting `RegularAggregation`, `JoinOp`, `Aggregation`, `Analytic`, `HROperation`, `DPValidation`, or `Identifier` with `kind == "DatasetID"`.
- Identify scalar outputs via fixed-point propagation: seed from constant-only assignments (`a := 1`), then propagate through statements with no dataset ops where all inputs are known scalars.
- Classify global inputs using the decision tree:
  - In `unknown_variables` (from `RegularAggregation`) → `global_input_component_or_scalar`
  - Used as input to a statement with `has_dataset_op=True` → `global_input_datasets`
  - Feeds ONLY into scalar chains (no dataset ops) → `global_input_scalars`
  - Otherwise → `global_input_dataset_or_scalar`
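To make the decision tree concrete, here is a toy, self-contained sketch of the classification pass. The statement dicts, the `ds_inputs`/`sc_inputs` split (dataset-position vs scalar-position operands), and the function name are all assumptions for illustration; the real pass in `DAGAnalyzer` walks the AST directly and assumes an acyclic statement graph, as this sketch does.

```python
def classify_globals(statements, unknown_variables):
    """Toy classification of external inputs into the four proposed buckets.

    Each statement is modelled as a dict with:
      output         - the name it produces
      ds_inputs      - operands used in dataset position
      sc_inputs      - operands used in scalar position (e.g. a calc RHS)
      has_dataset_op - True if the statement contains a dataset operation
    """
    produced = {s["output"] for s in statements}
    consumers = {}
    for s in statements:
        for name in s["ds_inputs"] + s["sc_inputs"]:
            consumers.setdefault(name, []).append(s)

    def provably_scalar(name):
        # "Feeds only into scalar chains": every use either sits in a scalar
        # position of a dataset op, or flows through a statement with no
        # dataset ops whose own output is again provably scalar.
        uses = consumers.get(name, [])
        if not uses:
            return False  # an unconsumed result proves nothing
        for s in uses:
            if s["has_dataset_op"]:
                if name in s["ds_inputs"]:
                    return False  # consumed as a dataset operand
                continue  # scalar position inside a dataset op is fine
            if not provably_scalar(s["output"]):
                return False
        return True

    buckets = {"datasets": [], "scalars": [],
               "dataset_or_scalar": [], "component_or_scalar": []}
    for name in sorted(n for n in consumers if n not in produced):
        if name in unknown_variables:
            buckets["component_or_scalar"].append(name)
        elif any(s["has_dataset_op"] and name in s["ds_inputs"]
                 for s in consumers[name]):
            buckets["datasets"].append(name)
        elif provably_scalar(name):
            buckets["scalars"].append(name)
        else:
            buckets["dataset_or_scalar"].append(name)
    return buckets
```

For the script from the issue (`SC_r := SC_1 + SC_2; DS_r <- DS_1[calc Me_2 := Me_1 + SC_r];`) this yields `datasets=['DS_1']` and `scalars=['SC_1', 'SC_2']`, while a bare `DS_r <- X + 2` lands `X` in the dataset-or-scalar bucket.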
Consumers updated:
- `_validate_extra_datasets` → use `global_input_datasets` + `global_input_dataset_or_scalar`
- `_extract_input_datasets` → return `global_input_datasets` + `global_input_dataset_or_scalar`
- `_save_datapoints_efficient` → use `global_input_datasets`
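A minimal sketch of how the first consumer might change. The function name, the `schedule` dict, and the error message are stand-ins; the real `_validate_extra_datasets` lives in `src/vtlengine/API/__init__.py` and raises the engine's own error types.

```python
def validate_extra_datasets(schedule, datasets):
    """Reject loaded datasets that no statement can consume as a dataset.

    `schedule` stands in for a DatasetSchedule; `datasets` is the mapping of
    dataset names supplied by the user.
    """
    allowed = set(schedule["global_input_datasets"])
    # Ambiguous names may legitimately be satisfied by a dataset.
    allowed |= set(schedule["global_input_dataset_or_scalar"])
    extra = sorted(set(datasets) - allowed)
    if extra:
        raise ValueError(f"Datasets not referenced by the script: {extra}")
```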
Alternatives Considered
- API-layer classification: cross-reference `global_inputs` with the loaded datasets/scalars dicts after both DAG analysis and data loading. Simpler, but does not fix the missing scalar tracking gap (scalars inside `RegularAggregation` never appear in `global_inputs`).
- Two categories only (`global_input_datasets` + `global_input_scalars`): does not capture the ambiguity cases where the user can provide either type, or where component/scalar collision detection is needed.
Additional Context
Key files:
- `src/vtlengine/AST/DAG/_models.py` (`StatementDeps`, `DatasetSchedule`)
- `src/vtlengine/AST/DAG/__init__.py` (`DAGAnalyzer`, `_ds_usage_analysis()`)
- `src/vtlengine/API/__init__.py` (`_validate_extra_datasets`, `_extract_input_datasets`)
- `src/vtlengine/Interpreter/__init__.py` (`_save_datapoints_efficient`)