Aggregate próximo pozo (next-jackpot) estimates from vetted community mirrors, enforce provenance, and publish Google Sheets updates without touching polla.cl.
- Orchestrates multi-source ingestion with a unified registry (`pozos`, `resultadoslotochile`, `openloto`) and deterministic fallbacks.
- Ensures data integrity via SHA-256 content-hash verification and magnitude-based consensus quarantine (10% threshold).
- Publishes structured JSONL outputs and comparison reports with full provenance traceability.
- Ships a Click-based CLI (`run`, `publish`, `pozos`, `health`) with dry-run diffing and automated guardrails.
- Handles rate-limiting gracefully with jittered exponential backoff and polite robots.txt enforcement.
- Locks behaviour with fixture-driven pytest suites and doctests executed in CI for documentation drift.
- Simplifies day-to-day DX with Make targets, Black/Ruff/Mypy automation, and GitHub Actions parity.
- Python 3.10+, Click CLI, Requests + BeautifulSoup parsers
- Google Sheets integration via `gspread` + `google-auth`
- Testing: Pytest (+ doctests), Faker fixtures
- Tooling: Ruff, Black, Mypy, GitHub Actions (tests, docs, health)
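
The integrity checks named in the list above (SHA-256 content hashing and a 10% consensus quarantine) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the use of the median as the consensus reference, and the shape of the `estimates` mapping are all assumptions made for the example.

```python
import hashlib
from statistics import median

QUARANTINE_THRESHOLD = 0.10  # the 10% threshold mentioned in the feature list


def content_hash(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a fetched payload."""
    return hashlib.sha256(payload).hexdigest()


def split_by_consensus(
    estimates: dict[str, int], threshold: float = QUARANTINE_THRESHOLD
) -> tuple[dict[str, int], dict[str, int]]:
    """Partition sources into (accepted, quarantined).

    Assumption: consensus is the median across sources, and a source is
    quarantined when its relative deviation from it exceeds the threshold.
    """
    accepted: dict[str, int] = {}
    quarantined: dict[str, int] = {}
    if not estimates:
        return accepted, quarantined
    reference = median(estimates.values())
    for source, amount in estimates.items():
        if reference == 0:
            # Avoid dividing by zero: only an exact match agrees with zero.
            deviation = 0.0 if amount == 0 else float("inf")
        else:
            deviation = abs(amount - reference) / abs(reference)
        target = quarantined if deviation > threshold else accepted
        target[source] = amount
    return accepted, quarantined
```

Comparing a stored digest with `content_hash(new_payload)` detects a changed page, while `split_by_consensus` keeps a single outlying mirror from reaching the published sheet.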
%%{init: {"themeVariables": {"fontSize":"16px"}, "flowchart": {"htmlLabels": false, "wrap": true}}}%%
flowchart TB
A[CLI command] --> B[Pipeline Orchestrator]
B --> C{Source loader}
C -->|ResultadosLotoChile| D[Primary scrape]
C -->|OpenLoto fallback| E[Fallback scrape]
D --> F[Normalizer]
E --> F[Normalizer]
F --> G["Artifacts<br/>(JSONL, reports, state)"]
G --> H{Publish?}
H -->|Yes| I[Google Sheets via gspread]
H -->|No| J[Quarantine + logs]
B --> K["Structured logging<br/>(spans + metrics)"]
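
The source-loader branch in the diagram (primary scrape first, fallback second), combined with the jittered exponential backoff listed among the features, might look like the sketch below. The helper names, the retry count, and the base and cap delays are hypothetical; the real orchestrator may structure this differently.

```python
import random
import time
from typing import Callable, Optional, Sequence


def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0) -> list[float]:
    """Full-jitter backoff: retry n waits uniform(0, min(cap, base * 2**n)) seconds."""
    return [random.uniform(0.0, min(cap, base * 2**attempt)) for attempt in range(retries)]


def load_first_available(
    loaders: Sequence[tuple[str, Callable[[], dict]]],
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, dict]:
    """Try each loader in order; retry with backoff, then fall through to the next."""
    last_error: Optional[Exception] = None
    for name, loader in loaders:
        # One immediate attempt, followed by `retries` delayed attempts.
        for delay in [0.0, *backoff_delays(retries)]:
            if delay:
                sleep(delay)
            try:
                return name, loader()
            except Exception as exc:  # a real pipeline would catch narrower errors
                last_error = exc
    raise RuntimeError("all sources failed") from last_error
```

Passing the loaders as an ordered sequence keeps the fallback order deterministic, and injecting `sleep` lets tests run without waiting.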
- Ensure Python 3.10+ is available (use `pyenv local 3.10.13` or your preferred manager).
- Create an isolated environment and install dependencies:

      python -m venv .venv
      source .venv/bin/activate
      pip install -r requirements.txt
      pip install -r requirements-dev.txt

- Run the pozos pipeline locally:

      python -m polla_app run \
        --sources pozos \
        --normalized artifacts/normalized.jsonl \
        --comparison-report artifacts/comparison_report.json \
        --summary artifacts/run_summary.json

- Optional: dry-run publishing to Google Sheets once credentials are configured:

      python -m polla_app publish \
        --normalized artifacts/normalized.jsonl \
        --comparison-report artifacts/comparison_report.json \
        --summary artifacts/run_summary.json \
        --worksheet "Normalized" \
        --discrepancy-tab "Discrepancies" \
        --dry-run
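
To inspect the normalized artifact written by the run step before publishing, a small reader along these lines can help. It assumes only the JSONL convention (one JSON object per non-empty line) and makes no assumption about the field names the pipeline emits; `parse_jsonl` and `summarize` are illustrative helpers, not part of `polla_app`.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line, failing loudly on malformed rows."""
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        record = json.loads(text)
        if not isinstance(record, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        yield record


def summarize(path: str) -> tuple[int, list[str]]:
    """Return the record count and the sorted set of keys seen in the file."""
    with open(path, encoding="utf-8") as handle:
        records = list(parse_jsonl(handle))
    keys = sorted({key for record in records for key in record})
    return len(records), keys
```

For example, `summarize("artifacts/normalized.jsonl")` reports how many records the run produced and which fields they carry.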
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| `GOOGLE_SPREADSHEET_ID` | string | - | For publish | Target spreadsheet key for Google Sheets publishing. |
| `GOOGLE_SERVICE_ACCOUNT_JSON` | JSON string | - | Conditional | Inline service account credentials (alternative to file). |
| `GOOGLE_CREDENTIALS` / `CREDENTIALS` | JSON string | - | Conditional | Legacy env vars recognised for service account auth. |
| `service_account.json` | file | - | Conditional | Disk-based credentials if env vars are not supplied. |
| `ALT_SOURCE_URLS` | JSON string | `{}` | No | Override source URLs for mirrors or testing. |
| `POLLA_USER_AGENT` | string | Library default | No | Custom HTTP user agent for polite scraping. |
| `POLLA_RATE_LIMIT_RPS` | float | unset | No | Per-host requests-per-second throttle. |
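
One plausible way to resolve the settings in the table is sketched below. The lookup order (inline JSON first, then the legacy variables, then the file on disk) is an assumption inferred from the descriptions, and the helper names are hypothetical rather than taken from `polla_app`.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumed precedence; the project may check these in a different order.
CREDENTIAL_ENV_VARS = ("GOOGLE_SERVICE_ACCOUNT_JSON", "GOOGLE_CREDENTIALS", "CREDENTIALS")


def resolve_service_account(
    env: Mapping[str, str] = os.environ,
    file_path: str = "service_account.json",
) -> Optional[dict]:
    """Return service account info from env vars, else from disk, else None."""
    for name in CREDENTIAL_ENV_VARS:
        raw = env.get(name)
        if raw:
            return json.loads(raw)
    path = Path(file_path)
    if path.is_file():
        return json.loads(path.read_text(encoding="utf-8"))
    return None


def alt_source_urls(env: Mapping[str, str] = os.environ) -> dict:
    """Parse ALT_SOURCE_URLS, falling back to the documented default of {}."""
    return json.loads(env.get("ALT_SOURCE_URLS", "{}"))


def rate_limit_rps(env: Mapping[str, str] = os.environ) -> Optional[float]:
    """Return the per-host throttle, or None when the variable is unset."""
    raw = env.get("POLLA_RATE_LIMIT_RPS")
    return float(raw) if raw else None
```

The resolved dictionary could then be handed to `gspread.service_account_from_dict`, with the spreadsheet opened via `open_by_key` using `GOOGLE_SPREADSHEET_ID`.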
- `pytest -q` – executes unit/integration suites with offline fixtures; expect `N passed` in <10s.
- `ruff check polla_app tests` – enforces linting, naming, and import hygiene.
- `mypy polla_app` – verifies strict typing (3rd-party stubs ignored where unavailable).
- `black --check polla_app tests` – maintains consistent formatting.
- `pytest --doctest-glob='*.md' README.md docs -q` – ensures documentation examples stay executable.
CI mirrors these commands through `.github/workflows/tests.yml` and `.github/workflows/docs.yml` so local runs match automation. Add `pytest --cov=polla_app` when you need a coverage report.
- Scheduled `health.yml` workflow exercises offline health checks daily to catch data source drift before operators do.
- `scripts/benchmark_pozos_parsing.py` offers a quick regression guard for parsing speed; keep median scrape under 150 ms on commodity hardware.
- Structured metrics emitted via `polla_app.obs.metrics` simplify alerting and feed SLO reviews (`docs/SLOs.md`).
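
The 150 ms median budget mentioned above can be checked with a timing loop like the one below. This is a generic sketch and not a reproduction of `scripts/benchmark_pozos_parsing.py`, whose internals are not shown here; `median_runtime` and the run count are illustrative choices.

```python
import statistics
import time
from typing import Callable

BUDGET_SECONDS = 0.150  # the 150 ms median target from the list above


def median_runtime(func: Callable[[], object], runs: int = 25) -> float:
    """Call func `runs` times and return the median wall-clock duration in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def within_budget(func: Callable[[], object], budget: float = BUDGET_SECONDS) -> bool:
    """True when the median runtime of func stays under the budget."""
    return median_runtime(func) < budget
```

Using the median rather than the mean keeps a single slow run, for instance a cold cache, from failing the guard.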
- Expand publish command to surface mismatch deltas via Slack/webhooks for quicker operator response.
- Wire Codecov and fail PRs below agreed coverage thresholds.
- Add smoke-test fixtures for newly emerging aggregator mirrors.
- Demonstrates operational empathy: dry-run defaults, quarantine support, and explicit provenance reduce on-call stress.
- Highlights disciplined scraping practices respectful of third-party infrastructure and legal boundaries.
- Shows ability to automate reliability checks end-to-end (health workflow, observability hooks, structured metrics).
- Illustrates developer-experience focus through reproducible CLI, Make targets, and strict typing/linting gates.
- Proves comfort with secure credential handling when integrating with Google Workspace APIs.
Contributions are welcome; see `CONTRIBUTING.md` for style, testing, and review expectations.
This project is distributed under the MIT License.