23 changes: 23 additions & 0 deletions docs/config-schema.md
@@ -57,6 +57,29 @@ Relative paths are always resolved against the directory that contains `datase

`raw.sources[].args` and `raw.extractor.args` must always be YAML objects (mappings), never lists or strings.

Example for `ckan`:

```yaml
raw:
  sources:
    - name: bdap_lea
      type: ckan
      client:
        timeout: 60
        retries: 2
      args:
        portal_url: https://bdap-opendata.rgs.mef.gov.it/SpodCkanApi/api/3
        dataset_id: "d598ebd9-949d-4214-bb33-cd9c1be08f15"
        resource_id: "33344"
```

Practical notes for `ckan`:

- the toolkit queries `resource_show` before downloading
- if `resource_show` is unavailable or does not resolve the file, the toolkit falls back to `package_show`
- if the portal returns a file URL starting with `http://`, the toolkit automatically rewrites it to `https://`
- if `filename` is not declared, the toolkit tries to infer the extension from the resolved URL
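
The last two notes can be illustrated with a minimal sketch. It is not the toolkit's actual code: `force_https`, `infer_ext`, the `KNOWN_EXTS` set and the `.bin` default are hypothetical names and values chosen for illustration. The only behaviour taken from this change is that an `http://` file URL is rewritten to `https://` and that the extension is read from the resolved URL, skipping wrapper suffixes such as `.php`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit

# Assumed set of recognised data extensions (illustrative, not the toolkit's list).
KNOWN_EXTS = {".csv", ".zip", ".json", ".xlsx"}


def force_https(url: str) -> str:
    """Rewrite an http:// URL to https://; any other scheme is left untouched."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def infer_ext(url: str, default: str = ".bin") -> str:
    """Return the last recognised suffix of the URL path, ignoring the query string.

    The ".bin" default is an assumption made for this sketch.
    """
    suffixes = PurePosixPath(urlsplit(url).path).suffixes
    for suffix in reversed(suffixes):
        if suffix.lower() in KNOWN_EXTS:
            return suffix.lower()
    return default


resolved = force_https("http://portal.example.org/export/data.csv")
print(resolved, infer_ext(resolved))
```

Scanning the suffixes from right to left is what lets a URL such as `archive.zip.php` map to `.zip` while a bare `download.php?id=42` never yields `.php`.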

## clean

| Field | Type | Default |
2 changes: 1 addition & 1 deletion smoke/README.md
@@ -16,10 +16,10 @@ As a result:

Included projects:

- `smoke/ispra_http_csv`: `http_file` against a local `http.server` server
- `smoke/local_file_csv`: `local_file`, fully offline
- `smoke/zip_http_csv`: `http_file` + ZIP extractor (`unzip_first_csv`) against a local server
- `smoke/bdap_http_csv`: `http_file` against a public BDAP CSV
- `smoke/bdap_ckan_csv`: `ckan` against OpenBDAP, with `package_show` fallback and forced `https`
- `smoke/finanze_http_zip_2023`: `http_file` against a real public ZIP, best-effort

Each project includes:
28 changes: 28 additions & 0 deletions smoke/bdap_ckan_csv/README.md
@@ -0,0 +1,28 @@
# bdap_ckan_csv

Manual smoke test for `ckan` against OpenBDAP, exercising the `resource_show -> package_show` fallback and the forced `https` rewrite.

## Commands

```bash
toolkit run raw --config dataset.yml
toolkit profile raw --config dataset.yml
toolkit run clean --config dataset.yml
toolkit run mart --config dataset.yml
toolkit status --dataset bdap_ckan_csv --year 2022 --latest --config dataset.yml
```

## Expected checks

- `./_smoke_out/data/raw/bdap_ckan_csv/2022/manifest.json`
- `./_smoke_out/data/raw/bdap_ckan_csv/2022/raw_validation.json`
- `./_smoke_out/data/raw/bdap_ckan_csv/2022/_profile/raw_profile.json`
- `./_smoke_out/data/raw/bdap_ckan_csv/2022/_profile/suggested_read.yml`
- `./_smoke_out/data/clean/bdap_ckan_csv/2022/metadata.json` with `read_params_source`, `read_source_used`, `read_params_used`
- `./_smoke_out/data/mart/bdap_ckan_csv/2022/mart_ok.parquet`
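
As a convenience, the checklist above could be verified with a small script. This is a sketch under the assumption that the commands were run from this directory with `root: "./_smoke_out"`; `EXPECTED_ARTIFACTS` and `missing_artifacts` are hypothetical names, not part of the toolkit. It only checks that the files exist, not what they contain.

```python
from pathlib import Path

# Paths copied from the checklist above, relative to the smoke output root.
EXPECTED_ARTIFACTS = [
    "data/raw/bdap_ckan_csv/2022/manifest.json",
    "data/raw/bdap_ckan_csv/2022/raw_validation.json",
    "data/raw/bdap_ckan_csv/2022/_profile/raw_profile.json",
    "data/raw/bdap_ckan_csv/2022/_profile/suggested_read.yml",
    "data/clean/bdap_ckan_csv/2022/metadata.json",
    "data/mart/bdap_ckan_csv/2022/mart_ok.parquet",
]


def missing_artifacts(root: str = "./_smoke_out") -> list[str]:
    """Return the expected artifacts that are not present under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_ARTIFACTS if not (base / rel).is_file()]
```

Calling `missing_artifacts()` after the five commands should return an empty list when every stage produced its output. The keys expected in `metadata.json` still need to be inspected separately.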

## Notes

- this smoke test uses a real OpenBDAP dataset
- the portal exposes `package_show`, but `resource_show` does not resolve: this case exists precisely to verify the plugin's fallback
- the file URL returned by the portal may start with `http://`: the toolkit rewrites it to `https://` before downloading
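
The fallback described in these notes amounts to looking up the requested resource inside the `package_show` result. The sketch below is one way to do that and is not the plugin's actual implementation: `find_resource_url` is a hypothetical helper, and the payload shape (a `resources` list whose entries carry `id` and `url`) follows the standard CKAN `package_show` response. IDs are compared as strings because a portal may return a numeric `id` while the config declares `resource_id: "33344"`.

```python
from typing import Any, Optional


def find_resource_url(package_result: dict[str, Any], resource_id: str) -> Optional[str]:
    """Return the URL of the resource matching `resource_id`, or None if absent."""
    for resource in package_result.get("resources", []):
        # Compare as strings so that 33344 (int) matches "33344" (str).
        if str(resource.get("id")) == str(resource_id):
            return resource.get("url")
    return None


# Illustrative payload with placeholder URLs, not real portal output.
package_result = {
    "resources": [
        {
            "id": 33344,
            "name": "csv dump",
            "format": "CSV",
            "url": "http://portal.example.org/api/3/datastore/dump/dataset.csv",
        }
    ]
}
print(find_resource_url(package_result, "33344"))
```

Returning `None` for an unknown ID leaves it to the caller to raise an error rather than silently downloading a different resource from the same package.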
29 changes: 29 additions & 0 deletions smoke/bdap_ckan_csv/dataset.yml
@@ -0,0 +1,29 @@
root: "./_smoke_out"

dataset:
  name: "bdap_ckan_csv"
  years: [2022]

raw:
  sources:
    - type: "ckan"
      args:
        portal_url: "https://bdap-opendata.rgs.mef.gov.it/SpodCkanApi/api/3"
        dataset_id: "d598ebd9-949d-4214-bb33-cd9c1be08f15"
        resource_id: "33344"
        filename: "bdap_lea_2024.csv"

clean:
  sql: "sql/clean.sql"
  read:
    delim: ";"
    decimal: "."
    encoding: "utf-8"
    header: true
    columns: null
  validate: {}

mart:
  tables:
    - name: "mart_ok"
      sql: "sql/mart/mart_ok.sql"
16 changes: 16 additions & 0 deletions smoke/bdap_ckan_csv/sql/clean.sql
@@ -0,0 +1,16 @@
WITH base AS (
  SELECT
    TRY_CAST(TRIM(CAST("Anno di Riferimento" AS VARCHAR)) AS INTEGER) AS anno,
    TRY_CAST(TRIM(CAST("Codice Regione" AS VARCHAR)) AS INTEGER) AS codice_regione,
    TRIM(CAST("Descrizione Regione" AS VARCHAR)) AS regione,
    TRY_CAST(TRIM(CAST("Codice Ente SSN" AS VARCHAR)) AS INTEGER) AS codice_ente_ssn,
    TRIM(CAST("Descrizione Ente" AS VARCHAR)) AS descrizione_ente,
    TRIM(CAST("Codice Voce Contabile" AS VARCHAR)) AS codice_voce_contabile,
    TRIM(CAST("Descrizione Voce Contabile" AS VARCHAR)) AS descrizione_voce_contabile,
    TRY_CAST(TRIM(CAST("Importo Totale" AS VARCHAR)) AS DOUBLE) AS importo_totale
  FROM raw_input
)

SELECT *
FROM base
WHERE anno IS NOT NULL;
10 changes: 10 additions & 0 deletions smoke/bdap_ckan_csv/sql/mart/mart_ok.sql
@@ -0,0 +1,10 @@
SELECT
  anno,
  regione,
  descrizione_voce_contabile,
  COUNT(*) AS righe,
  SUM(importo_totale) AS importo_totale
FROM clean_input
WHERE anno IS NOT NULL
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3
35 changes: 0 additions & 35 deletions smoke/ispra_http_csv/README.md

This file was deleted.

31 changes: 0 additions & 31 deletions smoke/ispra_http_csv/dataset.yml

This file was deleted.

3 changes: 0 additions & 3 deletions smoke/ispra_http_csv/fixtures/ispra_http_sample.csv

This file was deleted.

1 change: 0 additions & 1 deletion smoke/ispra_http_csv/sql/clean.sql

This file was deleted.

1 change: 0 additions & 1 deletion smoke/ispra_http_csv/sql/mart/mart_ok.sql

This file was deleted.

133 changes: 133 additions & 0 deletions tests/test_ckan_plugin.py
@@ -0,0 +1,133 @@
from toolkit.core.exceptions import DownloadError
from toolkit.plugins.ckan import CkanSource


class _FakeResponse:
    def __init__(self, status_code: int, *, json_data=None, content: bytes = b"", url: str = "https://example.org"):
        self.status_code = status_code
        self._json_data = json_data
        self.content = content
        self.url = url

    def json(self):
        return self._json_data


def test_ckan_fetch_resource_show_forces_https(monkeypatch):
    calls = []

    def _fake_get(url, params=None, timeout=None, headers=None):
        calls.append((url, params))
        if "resource_show" in url:
            return _FakeResponse(
                200,
                json_data={
                    "success": True,
                    "result": {"url": "http://portal.example.org/export/data.csv"},
                },
                url=f"{url}?id=abc",
            )
        return _FakeResponse(
            200,
            content=b"a,b\n1,2\n",
            url="https://portal.example.org/export/data.csv",
        )

    monkeypatch.setattr("toolkit.plugins.ckan.requests.get", _fake_get)

    payload, origin = CkanSource().fetch("https://portal.example.org/api/3", resource_id="abc")

    assert payload == b"a,b\n1,2\n"
    assert origin == "https://portal.example.org/export/data.csv"
    assert calls[1][0] == "https://portal.example.org/export/data.csv"


def test_ckan_fetch_falls_back_to_package_show(monkeypatch):
    calls = []

    def _fake_get(url, params=None, timeout=None, headers=None):
        calls.append((url, params))
        if "resource_show" in url:
            return _FakeResponse(404, json_data={}, url=f"{url}?id=33344")
        if "package_show" in url:
            return _FakeResponse(
                200,
                json_data={
                    "success": True,
                    "result": {
                        "resources": [
                            {
                                "id": 33344,
                                "name": "csv dump",
                                "format": "CSV",
                                "url": "http://portal.example.org/api/3/datastore/dump/dataset.csv",
                            }
                        ]
                    },
                },
                url=f"{url}?id=dataset-id",
            )
        return _FakeResponse(
            200,
            content=b"a,b\n1,2\n",
            url="https://portal.example.org/api/3/datastore/dump/dataset.csv",
        )

    monkeypatch.setattr("toolkit.plugins.ckan.requests.get", _fake_get)

    payload, origin = CkanSource().fetch(
        "https://portal.example.org/api/3",
        resource_id="33344",
        dataset_id="dataset-id",
    )

    assert payload == b"a,b\n1,2\n"
    assert origin == "https://portal.example.org/api/3/datastore/dump/dataset.csv"
    assert any("package_show" in call[0] for call in calls)


def test_ckan_fetch_requires_identifier():
    try:
        CkanSource().fetch("https://portal.example.org/api/3")
    except DownloadError as exc:
        assert "resource_id or dataset_id" in str(exc)
    else:
        raise AssertionError("Expected DownloadError")


def test_ckan_fetch_rejects_package_fallback_when_resource_id_missing(monkeypatch):
    def _fake_get(url, params=None, timeout=None, headers=None):
        if "resource_show" in url:
            return _FakeResponse(404, json_data={}, url=f"{url}?id=99999")
        if "package_show" in url:
            return _FakeResponse(
                200,
                json_data={
                    "success": True,
                    "result": {
                        "resources": [
                            {
                                "id": 33344,
                                "name": "csv dump",
                                "format": "CSV",
                                "url": "http://portal.example.org/api/3/datastore/dump/dataset.csv",
                            }
                        ]
                    },
                },
                url=f"{url}?id=dataset-id",
            )
        raise AssertionError(f"Unexpected download request to {url}")

    monkeypatch.setattr("toolkit.plugins.ckan.requests.get", _fake_get)

    try:
        CkanSource().fetch(
            "https://portal.example.org/api/3",
            resource_id="99999",
            dataset_id="dataset-id",
        )
    except DownloadError as exc:
        assert "resource_id=99999" in str(exc)
    else:
        raise AssertionError("Expected DownloadError")
30 changes: 30 additions & 0 deletions tests/test_raw_ext_inference.py
@@ -20,11 +20,41 @@ def test_infer_ext_http_csv_php_and_zip_php():
    assert _infer_ext("http_file", {"url": "https://example.org/archive.zip.php"}) == ".zip"


def test_infer_ext_ckan_uses_resolved_origin():
    assert _infer_ext("ckan", {}, origin="https://example.org/dump/data.csv") == ".csv"
    assert _infer_ext("ckan", {}, origin="https://example.org/archive.zip.php") == ".zip"


def test_infer_ext_never_returns_php():
    assert _infer_ext("http_file", {"url": "https://example.org/download.php?id=42"}) != ".php"
    assert _infer_ext("local_file", {"path": "C:/tmp/file.php"}) != ".php"


def test_run_raw_ckan_filename_inferred_from_resolved_url(monkeypatch, tmp_path: Path):
    def _fake_fetch_payload(_stype: str, _client: dict, _formatted_args: dict):
        return b"a,b\n1,2\n", "https://example.org/data.csv"

    monkeypatch.setattr("toolkit.raw.run._fetch_payload", _fake_fetch_payload)

    raw_cfg = {
        "sources": [
            {
                "name": "bdap_resource",
                "type": "ckan",
                "args": {
                    "portal_url": "https://portal.example.org/SpodCkanApi/api/3",
                    "resource_id": "33344",
                },
            }
        ]
    }

    run_raw("demo", 2024, str(tmp_path), raw_cfg, _NoopLogger())

    out_dir = tmp_path / "data" / "raw" / "demo" / "2024"
    assert (out_dir / "bdap_resource.csv").exists()


def test_run_raw_filename_override_has_priority(monkeypatch, tmp_path: Path):
    def _fake_fetch_payload(_stype: str, _client: dict, _formatted_args: dict):
        return b"a,b\n1,2\n", "https://example.org/dataset.csv.php"
1 change: 1 addition & 0 deletions tests/test_registry.py
@@ -21,5 +21,6 @@ def test_register_builtin_plugins_registers_present_plugins():
    register_builtin_plugins(registry_obj=r)

    plugins = r.list_plugins()
    assert "ckan" in plugins
    assert "http_file" in plugins
    assert "local_file" in plugins