Merged
26 changes: 26 additions & 0 deletions CONTRIBUTING.md
@@ -1,5 +1,31 @@
# Contributing

## Test Suite Tiers

The test suite is intentionally stratified so release-critical coverage stays visible.

- `core`: public contract and canonical workflow
- config and strict mode
- path contract relative to `dataset.yml`
- `run all`, `validate all`, `status`, `inspect paths`
- end-to-end RAW -> CLEAN -> MART
- run records, resume, validation layers
- `advanced`: supported but non-happy-path or secondary engine behavior
- detailed read modes and selection logic
- extractors and plugin registry
- profiling details
- artifact policy and logging helpers
- `compat`: compatibility-only behavior
- deprecated import shims and legacy-only coverage

Useful commands:

```bash
py -m pytest -m core
py -m pytest -m "core or advanced"
py -m pytest -m compat
```

## Git Hook

Install the lightweight pre-commit guardrail before contributing:
26 changes: 24 additions & 2 deletions README.md
@@ -16,6 +16,20 @@ Roles of the related repos:

This repo is not the organization hub and does not replicate org-wide documentation: it stays focused on the engine and its technical contract.

## Toolkit Boundaries

The toolkit exposes a deliberately narrow perimeter:

- core runtime: `raw`, `clean`, `mart`, `run`, `validate`, `status`, `inspect`
- advanced tooling: `resume`, `profile raw`, partial per-layer runs
- compatibility only: legacy aliases and deprecated shims

Rule of thumb:

- new dataset repos: stay on the canonical workflow
- recovery or diagnostics: use the advanced tools
- bootstrap or compatibility: do not treat them as part of the stable contract

## Goals

- keep the project structure simple: `dataset.yml` + `sql/`
@@ -31,7 +45,7 @@ The toolkit includes:
- `raw`, `clean`, `mart` pipeline
- post-layer validation gate integrated into `run`
- persistent run tracking in `data/_runs/...`
- CLI commands `run`, `resume`, `status`, `validate`, `profile`, `gen-sql`
- CLI commands `run`, `resume`, `status`, `validate`, `profile`, `inspect`
- offline `project-example/` for local smoke tests

## Installation

@@ -102,6 +116,7 @@ Full schema and supported legacy: [docs/config-schema.md](docs/config-schema.
Advanced flows and secondary tooling: [docs/advanced-workflows.md](docs/advanced-workflows.md)
Stability matrix: [docs/feature-stability.md](docs/feature-stability.md)
Notebook/output contract: [docs/notebook-contract.md](docs/notebook-contract.md)
Runtime boundaries and non-core surfaces: [docs/runtime-boundaries.md](docs/runtime-boundaries.md)
For shared policies and organization-wide community health, refer to the organization's `.github` repo.

Expected artifacts:
@@ -189,7 +204,7 @@ toolkit inspect paths --config dataset.yml --year 2024 --json
toolkit run all --config dataset.yml --dry-run --strict-config
```

`resume`, `profile raw`, `run raw|clean|mart`, `gen-sql` and the full artifacts policy remain available, but they are advanced tooling: see [docs/advanced-workflows.md](docs/advanced-workflows.md).
`resume`, `profile raw`, `run raw|clean|mart` and the full artifacts policy remain available, but they are advanced tooling: see [docs/advanced-workflows.md](docs/advanced-workflows.md).

## Local Notebooks

@@ -208,6 +223,13 @@ Official helper to avoid duplicated path logic in notebooks:
toolkit inspect paths --config dataset.yml --year 2024 --json
```

Stable roles of the outputs:

- `metadata.json`: the layer's rich payload
- `manifest.json`: the layer's stable summary, with pointers to metadata and validation
- `data/_runs/.../<run_id>.json`: run state used by `status` and `resume`
- `inspect paths --json`: read-only discovery helper for local notebooks and scripts
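
The summary above can be consumed directly; a minimal sketch of gating a notebook on the layer summary, assuming the `ok`/`errors_count` fields that `manifest.json` summarizes (the payload below is illustrative, not a real manifest):

```python
def layer_is_green(manifest: dict) -> bool:
    # `ok` and `errors_count` are the manifest summary fields;
    # anything else counts as a failed layer.
    return bool(manifest.get("ok")) and manifest.get("errors_count", 0) == 0


# Illustrative payload; in a dataset repo you would json.load() the real
# manifest from the path returned by `inspect paths --json`.
manifest = {"ok": True, "errors_count": 0, "warnings_count": 2}
print(layer_is_green(manifest))  # True
```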

This keeps the contract between the toolkit and dataset repos simple:

- the toolkit produces stable artifacts and metadata
16 changes: 6 additions & 10 deletions docs/advanced-workflows.md
@@ -9,6 +9,12 @@ Canonical path:
- `toolkit status --dataset <dataset> --year <year> --latest --config dataset.yml`
- local notebooks that read outputs and metadata under `root/data/...`

This category also includes support tooling that should not be confused with the toolkit's main runtime:

- `toolkit.profile`
- `resume`
- partial per-layer runs

## Single Steps

Useful for debugging or for repeating only part of the pipeline:
@@ -80,16 +86,6 @@ Rule of thumb:

`legacy_aliases` remains supported for compatibility, but should not be promoted in new dataset repos.

## gen-sql

`toolkit gen-sql --config dataset.yml` remains available as a bootstrap helper driven by a declarative mapping.

Recommended status:

- supported
- useful for guided bootstrap
- not part of the standard operational workflow
- to be considered frozen: yes to bugfixes, expansions only if real usage emerges

## Legacy Compat

18 changes: 0 additions & 18 deletions docs/config-schema.md
@@ -65,8 +65,6 @@ Relative paths are always resolved against the directory that contains `datase
| `clean.read_mode` | `strict \| fallback \| robust` | `fallback` |
| `clean.read_source` | `auto \| config_only \| null` | `null` |
| `clean.read` | `CleanRead \| null` | `null` |
| `clean.mapping` | `dict[str, CleanMappingSpec] \| null` | `null` |
| `clean.derive` | `dict[str, CleanDeriveFieldConfig] \| null` | `null` |
| `clean.required_columns` | `list[str]` | `[]` |
| `clean.validate` | `CleanValidate` | `{}` |

@@ -114,22 +112,6 @@ Relative paths are always resolved against the directory that contains `datase
| `min` | `float \| null` | `null` |
| `max` | `float \| null` | `null` |

`CleanMappingSpec.parse` minimal shape:

| Field | Type | Default |
|---|---|---|
| `kind` | `string \| null` | `null` |
| `locale` | `string \| null` | `null` |
| `options` | `dict[string,any] \| null` | `null` |

`clean.derive` minimal shape:

| Field | Type | Default |
|---|---|---|
| `expr` | `string` | none |

`clean.mapping.*.parse` and `clean.derive.*` must be YAML objects, not strings or lists.

## mart

| Field | Type | Default |
1 change: 1 addition & 0 deletions docs/conventions.md
@@ -20,6 +20,7 @@ This page collects the toolkit's stable operational contracts for the pipelin
- The canonical JSON file is `raw_profile.json`, but it is written only for the `standard|debug` policies.
- `profile.json` remains an optional compatibility alias, controlled by `output.legacy_aliases`.
- `suggested_read.yml` is the contract used by CLEAN for format hints and remains required only when `clean.read.source: auto`.
- `suggested_mapping.yml` remains an optional diagnostic artifact for human use; it is not an input to the toolkit's canonical runtime.

## Artifacts Policy

8 changes: 5 additions & 3 deletions docs/feature-stability.md
@@ -16,9 +16,11 @@ This matrix serves to clarify what the toolkit considers the canonical path, what
| artifact policy `minimal|standard|debug` | supported / advanced | operational tuning |
| `legacy_aliases` | compatibility only | do not promote in new repos |
| legacy config | compatibility only | use `--strict-config` in new repos |
| `gen-sql` | frozen helper | guided bootstrap, not standard workflow |
| `api_json_paged` | experimental | use only with real evidence |
| `html_table` | experimental | use only with real evidence |

An equivalent reading at the package level:

- core runtime: `toolkit.raw`, `toolkit.clean`, `toolkit.mart`, `toolkit.cli` (`run`, `validate`, `status`, `inspect`)
- advanced tooling: `toolkit.profile`, `resume`, partial runs
- compatibility only: legacy config and historical aliases

Rule of thumb:

11 changes: 11 additions & 0 deletions docs/notebook-contract.md
@@ -15,12 +15,23 @@ Useful files:
- CLEAN: `<dataset>_<year>_clean.parquet`, `manifest.json`, `metadata.json`
- MART: `<table>.parquet`, `manifest.json`, `metadata.json`

File roles:

- `metadata.json`: the layer's rich payload. It contains inputs, outputs, `config_hash` and layer-specific fields.
- `manifest.json`: the layer's stable summary. It points to metadata and validation and summarizes `ok/errors_count/warnings_count`.
- run record in `data/_runs/...`: run state (`run_id`, layer, validations, status), useful for `status` and `resume`.
- `inspect paths --json`: read-only helper for local notebooks and scripts; it returns the runtime's useful absolute paths, including `latest_run`.

To avoid duplicating path logic in notebooks:

- read `dataset.yml`
- use `toolkit inspect paths --config dataset.yml --year <year> --json`
- then open parquet, metadata, manifest, validation and run records from the returned paths

Practical note:

- `inspect paths` returns absolute paths on the local machine: it is meant for notebooks and scripts in the same environment, not as a portable format across machines.
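
A sketch of consuming that output in a notebook; only `latest_run` is named by this contract, so the other key below is hypothetical:

```python
import json

# Hypothetical `toolkit inspect paths --json` payload; in practice it would
# come from the CLI, e.g. via subprocess.run with capture_output=True.
payload = json.loads("""
{
  "clean_parquet": "/home/user/repo/data/clean/demo/2024/demo_2024_clean.parquet",
  "latest_run": "/home/user/repo/data/_runs/demo/2024/run_0001.json"
}
""")

# Absolute, machine-local paths: resolve and use them immediately,
# never persist them across environments.
print(payload["latest_run"])
```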

Rule of thumb:

- the toolkit produces
39 changes: 39 additions & 0 deletions docs/runtime-boundaries.md
@@ -0,0 +1,39 @@
# Runtime Boundaries

This note clarifies which parts of the `toolkit/` package are truly part of the main runtime and which are not.

## Core Runtime

These areas define the toolkit's stable contract:

- `toolkit.raw`
- `toolkit.clean`
- `toolkit.mart`
- `toolkit.cli` for `run`, `validate`, `status`, `inspect`
- `toolkit.core` for config, paths, metadata, run tracking and validation

These are the surfaces that dataset repos and the `project-template` should treat as central.

## Advanced Tooling

These areas remain supported, but are not part of the canonical path:

- `toolkit.profile`
- `toolkit.cli.cmd_resume`
- `toolkit.cli.cmd_profile`
- partial execution via `run raw|clean|mart`

They exist for recovery, diagnostics and messy cases, not as the baseline for new repos.

## Compatibility Only

The compatibility the toolkit maintains mostly covers legacy config, documented aliases and a few historical CLI surfaces.

It should not be treated as part of the stable contract for new repos.

## Builtin Sources

The builtin sources supported by the canonical runtime are:

- `local_file`
- `http_file`
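
For orientation only, a minimal sketch of how a project might declare one of these sources; the `dataset`/`raw`/`clean`/`mart` skeleton matches the config shapes used elsewhere in this diff, while the `source` key is hypothetical (this page names the builtin kinds, not where they are declared — see docs/config-schema.md for the real schema):

```yaml
dataset:
  name: demo
  years: [2024]
raw:
  # hypothetical key: where the builtin kind (`local_file` or
  # `http_file`) actually goes is defined by the config schema
  source: local_file
clean: {}
mart: {}
```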
4 changes: 4 additions & 0 deletions pytest.ini
@@ -1,2 +1,6 @@
[pytest]
pythonpath = .
markers =
core: release-critical tests for the public contract and canonical workflow
advanced: supported but non-happy-path or secondary engine behavior
compat: compatibility and legacy-shim coverage
2 changes: 0 additions & 2 deletions scripts/usage_matrix.py
@@ -5,8 +5,6 @@


TARGETS = {
"toolkit.core.validators": "toolkit.core.validators",
"toolkit.core.validation_summary": "toolkit.core.validation_summary",
"dataset.test.yml": "dataset.test.yml",
"clean_ispra_test.sql": "clean_ispra_test.sql",
}
3 changes: 2 additions & 1 deletion smoke/README.md
@@ -8,13 +8,14 @@ Included projects:
- `smoke/local_file_csv`: fully offline `local_file`
- `smoke/zip_http_csv`: `http_file` + ZIP extractor (`unzip_first_csv`) against a local server
- `smoke/bdap_http_csv`: `http_file` against a public BDAP CSV
- `smoke/finanze_http_zip_2023`: `http_file` against a real public ZIP, best-effort

Each project includes:

- a minimal `dataset.yml`
- `sql/clean.sql`
- `sql/mart/mart_ok.sql`
- `README.md` with the `toolkit run raw/profile/clean/mart/status` commands
- `README.md` with the smoke case's commands, including an initial `--dry-run --strict-config`

Prerequisite:

50 changes: 50 additions & 0 deletions tests/conftest.py
@@ -0,0 +1,50 @@
from __future__ import annotations

import pytest


CORE_TESTS = {
"test_cli_all_commands.py",
"test_cli_inspect_paths.py",
"test_cli_path_contract.py",
"test_cli_resume.py",
"test_cli_status.py",
"test_config.py",
"test_metadata_hash.py",
"test_paths.py",
"test_project_example_e2e.py",
"test_run_context.py",
"test_run_dry_run.py",
"test_run_validation_gate.py",
"test_smoke_tiny_e2e.py",
"test_validate_layers.py",
"test_validate_rules.py",
}

ADVANCED_TESTS = {
"test_artifacts_policy.py",
"test_clean_csv_columns.py",
"test_clean_duckdb_read.py",
"test_clean_input_selection.py",
"test_extractors.py",
"test_logging_context.py",
"test_profile_sniff.py",
"test_raw_ext_inference.py",
"test_raw_profile_hints.py",
"test_registry.py",
}

COMPAT_TESTS = {
"test_deprecated_shims.py",
}


def pytest_collection_modifyitems(items: list[pytest.Item]) -> None:
for item in items:
name = item.path.name
if name in CORE_TESTS:
item.add_marker(pytest.mark.core)
elif name in ADVANCED_TESTS:
item.add_marker(pytest.mark.advanced)
elif name in COMPAT_TESTS:
item.add_marker(pytest.mark.compat)
6 changes: 3 additions & 3 deletions tests/test_clean_input_selection.py
@@ -9,7 +9,7 @@

from toolkit.clean.input_selection import list_raw_candidates, select_inputs
from toolkit.clean.run import run_clean
from toolkit.core.manifest import write_manifest
from toolkit.core.manifest import write_raw_manifest


class _NoopLogger:
@@ -139,7 +139,7 @@ def test_run_clean_uses_manifest_primary(tmp_path: Path, monkeypatch):
selected_file.write_text("a\n1\n", encoding="utf-8")
other_file = raw_dir / "other.csv"
other_file.write_text("a\n2\n", encoding="utf-8")
write_manifest(
write_raw_manifest(
raw_dir,
{
"dataset": "demo",
@@ -342,7 +342,7 @@ def test_clean_manifest_points_missing_file_falls_back_and_warns(tmp_path: Path,
raw_dir = tmp_path / "data" / "raw" / "demo" / "2024"
_write_csv(raw_dir / "small.csv", "a\n1\n")
large_file = _write_csv(raw_dir / "large.csv", "a\n" + ("1\n" * 20))
write_manifest(
write_raw_manifest(
raw_dir,
{
"dataset": "demo",
41 changes: 0 additions & 41 deletions tests/test_config.py
@@ -774,47 +774,6 @@ def test_load_config_model_rejects_unknown_section_keys_in_strict_mode(
""".strip(),
"raw.extractor.args",
),
(
"""
dataset:
name: demo
years: [2022]
raw: {}
clean:
derive: []
mart: {}
""".strip(),
"clean.derive",
),
(
"""
dataset:
name: demo
years: [2022]
raw: {}
clean:
derive:
total: "value * 2"
mart: {}
""".strip(),
"clean.derive.total",
),
(
"""
dataset:
name: demo
years: [2022]
raw: {}
clean:
mapping:
totale:
from: valore
type: float
parse: "number_it"
mart: {}
""".strip(),
"clean.mapping.totale.parse",
),
],
)
def test_load_config_model_rejects_wrong_shape_for_typed_subsections(
Expand Down