Releases: seven7-AI/CBK-scraper
v1.0.0 – CBK Treasury Scraper & OCR Pipeline
Summary
This release introduces a production-ready pipeline for downloading and processing
Central Bank of Kenya (CBK) Treasury Bond and Treasury Bill result PDFs, with
deduplication, OCR processing, Redis tracking, and Windows-compatible daily jobs.
Features
- Treasury Bonds scraper
  - Scrapes https://www.centralbank.go.ke/bills-bonds/treasury-bonds/
  - Handles DataTables pagination / “Show All”
  - Downloads all result PDFs into `downloads/bonds/`
- Treasury Bills scraper
  - Scrapes https://www.centralbank.go.ke/bills-bonds/treasury-bills/ for:
    - 91-day (`#table_2`)
    - 182-day (`#table_3`)
    - 364-day (`#table_4`)
  - Stronger DOM waits for DataTables so 91/182/364 PDF links are reliably captured
  - Downloads PDFs into `downloads/bills/`
- Download deduplication (SQLite + Redis)
  - SQLite registry (`data/registry.db`) tracks `(url, local_path, downloaded_at, source)`
  - Redis set `cbk:scraper:downloaded_urls` prevents re-downloading already-scraped URLs
  - Idempotent runs: safe to run multiple times per day
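The deduplication flow described above can be sketched roughly as follows. This is an illustration only, not the project's actual code: the table name `downloads`, the helpers `open_registry` and `record_download`, and the `InMemorySet` class (a stand-in that mimics the `sadd` / `sismember` calls of a Redis client so the example runs without a server) are all assumptions.

```python
import sqlite3
from datetime import datetime, timezone

DEDUP_KEY = "cbk:scraper:downloaded_urls"  # Redis set name from the release notes

# Hypothetical schema; only the four column names come from the release notes.
SCHEMA = """
CREATE TABLE IF NOT EXISTS downloads (
    url TEXT PRIMARY KEY,
    local_path TEXT NOT NULL,
    downloaded_at TEXT NOT NULL,
    source TEXT NOT NULL
)
"""


class InMemorySet:
    """Stand-in for a Redis client, exposing only sadd and sismember."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, value):
        members = self._sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1

    def sismember(self, key, value):
        return value in self._sets.get(key, set())


def open_registry(path=":memory:"):
    """Open (or create) the SQLite registry and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_download(conn, cache, url, local_path, source):
    """Return True if the URL was new and recorded, False if already known."""
    # Fast path: the set answers without touching SQLite.
    if cache.sismember(DEDUP_KEY, url):
        return False
    # Fallback: the set may be empty (e.g. after a flush) while SQLite still knows the URL.
    known = conn.execute("SELECT 1 FROM downloads WHERE url = ?", (url,)).fetchone()
    if known is not None:
        cache.sadd(DEDUP_KEY, url)  # re-warm the set
        return False
    conn.execute(
        "INSERT INTO downloads (url, local_path, downloaded_at, source) VALUES (?, ?, ?, ?)",
        (url, local_path, datetime.now(timezone.utc).isoformat(), source),
    )
    conn.commit()
    cache.sadd(DEDUP_KEY, url)
    return True
```

Checking the set first and falling back to SQLite is one plausible way to keep repeated runs idempotent even if Redis is flushed; the real scraper may order these checks differently.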
- Structured JSON logging
  - Shared JSON formatter in `cbk_common.logging_utils`
  - Scraper and OCR logs include: `event`, `pdf_url`, `pdf_path`, `source`, `file_size_bytes`, `pages`, `duration_ms` (where applicable)
  - Logs written to `logs/` and stdout
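A JSON formatter of the kind described above might look like the sketch below. It uses only the standard `logging` module; the class name `JsonFormatter`, the helper `build_logger`, and the envelope keys `ts`, `level`, `logger`, and `message` are assumptions, and the actual contents of `cbk_common.logging_utils` may differ.

```python
import json
import logging
import sys

# Field names listed in the release notes; each is optional per log line.
EXTRA_FIELDS = (
    "event", "pdf_url", "pdf_path", "source",
    "file_size_bytes", "pages", "duration_ms",
)


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy only the structured fields that the caller passed via `extra=`.
        for field in EXTRA_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as `build_logger("cbk.scraper").info("downloaded", extra={"event": "pdf_downloaded", "source": "bonds"})` would then emit one JSON line carrying those two fields alongside the envelope keys.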
- Redis metrics
  - Scraper per-run hash: `cbk:scraper:run:<YYYYMMDD>` (downloaded / skipped / failed)
  - OCR per-run hash: `cbk:ocr:run:<YYYYMMDD>` (processed / skipped / failed by source)
  - No TTL on dedup sets; TTL on run hashes (14 days) for recent history
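To illustrate the per-run hashes and their 14-day TTL, here is a minimal sketch. The helper names `run_key` and `record_outcome` are hypothetical; `client` is assumed to be any object exposing `hincrby` and `expire` with the same shape as the redis-py client methods of those names.

```python
from datetime import date

RUN_TTL_SECONDS = 14 * 24 * 60 * 60  # 14 days, as stated in the release notes


def run_key(prefix, day=None):
    """Build a per-run key such as cbk:scraper:run:20250102."""
    day = day or date.today()
    return f"{prefix}:run:{day:%Y%m%d}"


def record_outcome(client, prefix, outcome, day=None):
    """Increment one counter in the run hash and refresh its TTL.

    `outcome` is a hash field such as "downloaded", "skipped", or "failed".
    """
    key = run_key(prefix, day)
    client.hincrby(key, outcome, 1)
    client.expire(key, RUN_TTL_SECONDS)
    return key
```

Refreshing the TTL on every increment is a simplification: it means a hash expires 14 days after its last update rather than 14 days after the run date. How the OCR hash splits counts "by source" is not specified in the notes, so it is left out here.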
- OCR processing pipeline (`cbk_ocr`)
  - `TextFirstOcrEngine` using `pdfplumber` to extract per-page text
  - Walks `downloads/bonds/` and `downloads/bills/`
  - Skips files already processed (Redis: `cbk:ocr:processed_files`)
  - Outputs:
    - Markdown: `processed/markdown/{bonds,bills}/file.md`
    - JSON: `processed/json/{bonds,bills}/file.json` with pages + metadata
  - CLI:
    - `python -m cbk_ocr.run_ocr`
    - `python -m cbk_ocr.run_ocr --limit 5` (for testing)
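The text-first extraction and the output layout above could be approximated as in the sketch below. It is not the `TextFirstOcrEngine` implementation: the function names, the `## Page N` Markdown headings, and the JSON keys (`file`, `source`, `page_count`, `pages`) are assumptions chosen for illustration. Only `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are real library calls.

```python
import json
from pathlib import Path


def extract_pages(pdf_path):
    """Return the text of each page; pages with no text layer yield ''."""
    import pdfplumber  # third-party; imported lazily so the helpers below work without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def output_paths(pdf_path, processed_root="processed"):
    """Map downloads/<source>/<name>.pdf to its Markdown and JSON targets."""
    pdf_path = Path(pdf_path)
    source = pdf_path.parent.name  # "bonds" or "bills"
    root = Path(processed_root)
    return (
        root / "markdown" / source / f"{pdf_path.stem}.md",
        root / "json" / source / f"{pdf_path.stem}.json",
    )


def write_outputs(pdf_path, pages, processed_root="processed"):
    """Write one Markdown file and one JSON file for the extracted pages."""
    md_path, json_path = output_paths(pdf_path, processed_root)
    for path in (md_path, json_path):
        path.parent.mkdir(parents=True, exist_ok=True)

    markdown = "\n\n".join(
        f"## Page {number}\n\n{text}" for number, text in enumerate(pages, start=1)
    )
    md_path.write_text(markdown, encoding="utf-8")

    payload = {
        "file": Path(pdf_path).name,
        "source": Path(pdf_path).parent.name,
        "page_count": len(pages),
        "pages": [
            {"page": number, "text": text}
            for number, text in enumerate(pages, start=1)
        ],
    }
    json_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return md_path, json_path
```

With `pdfplumber` installed, `write_outputs(p, extract_pages(p))` for `p = "downloads/bills/result.pdf"` would produce `processed/markdown/bills/result.md` and `processed/json/bills/result.json`.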
- Daily scheduling (Windows)
  - Legacy single-task script:
    - `scripts/schedule_daily_windows.ps1` → one daily scraper task
  - Dual-task script:
    - `scripts/schedule_daily_jobs_windows.ps1` creates:
      - `CBK-Scraper-10AM` → `python -m cbk_scraper.run`
      - `CBK-OCR-12PM` → `python -m cbk_ocr.run_ocr`
  - Times are configurable via `-ScraperHour` / `-ScraperMinute` / `-OcrHour` / `-OcrMinute`