Releases · seven7-AI/CBK-scraper

Summary

This release introduces a production-ready pipeline for downloading and processing
Central Bank of Kenya (CBK) Treasury Bond and Treasury Bill result PDFs, with
deduplication, OCR processing, Redis tracking, and Windows-compatible daily jobs.

Features

Treasury Bonds scraper
- Scrapes https://www.centralbank.go.ke/bills-bonds/treasury-bonds/
- Handles DataTables pagination / “Show All”
- Downloads all result PDFs into downloads/bonds/
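
As a rough illustration of the download step, the sketch below derives a local file name from a PDF URL and saves the file under downloads/bonds/. The helper names `local_path_for` and `download_pdf` are hypothetical and are not taken from the repository; the release notes do not say how the scraper names files or fetches them.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def local_path_for(url, dest_dir="downloads/bonds"):
    """Map a PDF URL to a path inside dest_dir (hypothetical naming rule)."""
    # Use only the URL path, so query strings do not leak into the file name.
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    if not name:
        raise ValueError(f"URL has no file name: {url}")
    return Path(dest_dir) / name


def download_pdf(url, dest_dir="downloads/bonds", timeout=30):
    """Download url into dest_dir unless the file is already on disk."""
    target = local_path_for(url, dest_dir)
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url, timeout=timeout) as response:
        target.write_bytes(response.read())
    return target
```

Checking for the file on disk is only a cheap first guard; the release tracks downloads in SQLite and Redis rather than by file presence.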

Treasury Bills scraper
- Scrapes https://www.centralbank.go.ke/bills-bonds/treasury-bills/ for:
  - 91-day (#table_2)
  - 182-day (#table_3)
  - 364-day (#table_4)
- Waits for DataTables to finish rendering so the 91-, 182-, and 364-day PDF links are captured reliably
- Downloads PDFs into downloads/bills/
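
The per-tenor link capture can be sketched with the standard-library HTML parser, assuming the page HTML has already been rendered (for example by a browser automation tool, since DataTables fills the tables with JavaScript). The class `PdfLinkParser` and the function `extract_pdf_links` are hypothetical names; only the table ids and the page URL come from the notes above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.centralbank.go.ke/bills-bonds/treasury-bills/"
# Table ids as listed in the release notes.
TENOR_TABLES = {"table_2": "91-day", "table_3": "182-day", "table_4": "364-day"}


class PdfLinkParser(HTMLParser):
    """Collect .pdf links that appear inside the three tenor tables."""

    def __init__(self):
        super().__init__()
        self.current = None  # tenor of the table being parsed, if any
        self.depth = 0       # <table> nesting depth inside that table
        self.links = {tenor: [] for tenor in TENOR_TABLES.values()}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            if self.current is None and attrs.get("id") in TENOR_TABLES:
                self.current = TENOR_TABLES[attrs["id"]]
                self.depth = 1
            elif self.current is not None:
                self.depth += 1
        elif tag == "a" and self.current is not None:
            href = attrs.get("href") or ""
            if href.lower().endswith(".pdf"):
                url = urljoin(BASE_URL, href)
                if url not in self.links[self.current]:
                    self.links[self.current].append(url)

    def handle_endtag(self, tag):
        if tag == "table" and self.current is not None:
            self.depth -= 1
            if self.depth == 0:
                self.current = None


def extract_pdf_links(html):
    """Return {tenor: [absolute PDF URLs]} for already-rendered page HTML."""
    parser = PdfLinkParser()
    parser.feed(html)
    return parser.links
```

Links outside the three tables are ignored, so unrelated PDFs elsewhere on the page do not end up in downloads/bills/.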

Download deduplication (SQLite + Redis)
- SQLite registry (data/registry.db) tracks (url, local_path, downloaded_at, source)
- Redis set cbk:scraper:downloaded_urls prevents re-downloading already-scraped URLs
- Idempotent runs: safe to run multiple times per day
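
A minimal sketch of how the two stores could work together is shown below. The column names and the Redis key come from the notes above; the table name `downloads`, the function names, and the choice to check Redis first with SQLite as fallback are assumptions. `redis_client` is expected to behave like a redis-py client (`sismember`, `sadd`).

```python
import sqlite3
from datetime import datetime, timezone

REDIS_SET = "cbk:scraper:downloaded_urls"  # key name from the release notes


def open_registry(path="data/registry.db"):
    """Open the SQLite registry; the data/ directory must already exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS downloads (
               url TEXT PRIMARY KEY,
               local_path TEXT NOT NULL,
               downloaded_at TEXT NOT NULL,
               source TEXT NOT NULL
           )"""
    )
    return conn


def already_downloaded(conn, redis_client, url):
    """Check the Redis set first, then fall back to the SQLite registry."""
    if redis_client.sismember(REDIS_SET, url):
        return True
    row = conn.execute("SELECT 1 FROM downloads WHERE url = ?", (url,)).fetchone()
    return row is not None


def record_download(conn, redis_client, url, local_path, source):
    """Record a finished download in both stores; repeat calls are no-ops."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO downloads VALUES (?, ?, ?, ?)",
            (url, local_path, datetime.now(timezone.utc).isoformat(), source),
        )
    redis_client.sadd(REDIS_SET, url)
```

Because the URL is the primary key and inserts use `INSERT OR IGNORE`, recording the same URL twice leaves a single row, which is what makes repeated runs safe.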

Structured JSON logging
- Shared JSON formatter in cbk_common.logging_utils
- Scraper and OCR logs include:
  - event, pdf_url, pdf_path, source
  - file_size_bytes, pages, duration_ms (where applicable)
- Logs written to logs/ and stdout
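
The formatter in cbk_common.logging_utils is not shown in these notes; the sketch below is one way such a formatter might look, using only the standard `logging` module. `JsonFormatter` and `get_logger` are hypothetical names, and the envelope fields (`ts`, `level`, `logger`, `message`) are assumptions; the extra field names are the ones listed above.

```python
import json
import logging
import sys

# Structured fields named in the release notes.
EXTRA_FIELDS = (
    "event", "pdf_url", "pdf_path", "source",
    "file_size_bytes", "pages", "duration_ms",
)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy only the structured fields that were passed via `extra=`.
        for field in EXTRA_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def get_logger(name):
    """Return a logger that writes JSON lines to stdout."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

A call such as `get_logger("cbk_scraper").info("downloaded", extra={"event": "pdf_downloaded", "source": "bonds"})` would then emit a single JSON line; writing to logs/ as well would only need a second handler with the same formatter.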

Redis metrics
- Scraper per-run hash: cbk:scraper:run:<YYYYMMDD> (downloaded / skipped / failed)
- OCR per-run hash: cbk:ocr:run:<YYYYMMDD> (processed / skipped / failed by source)
- Dedup sets never expire; run hashes expire after 14 days, so only recent run history is kept
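
The per-run counters can be sketched as a hash increment followed by an expiry refresh. The key pattern and the 14-day TTL come from the notes above; the function names are hypothetical, and `redis_client` is assumed to expose redis-py's `hincrby` and `expire`.

```python
from datetime import date

RUN_TTL_SECONDS = 14 * 24 * 60 * 60  # 14 days, as stated in the release notes


def run_key(prefix, day):
    """Build a per-run key such as cbk:scraper:run:20250105."""
    return f"cbk:{prefix}:run:{day:%Y%m%d}"


def record_outcome(redis_client, prefix, outcome, day=None):
    """Increment one counter in the run hash and refresh its TTL."""
    key = run_key(prefix, day or date.today())
    redis_client.hincrby(key, outcome, 1)
    redis_client.expire(key, RUN_TTL_SECONDS)
    return key
```

For example, `record_outcome(client, "scraper", "skipped")` would bump the `skipped` field of the current day's scraper hash. How the OCR hash splits counts by source is not specified in the notes, so that detail is left out here.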

OCR processing pipeline (cbk_ocr)
- TextFirstOcrEngine using pdfplumber to extract per-page text
- Walks downloads/bonds/ and downloads/bills/
- Skips files already processed (Redis: cbk:ocr:processed_files)
- Outputs:
  - Markdown: processed/markdown/{bonds,bills}/file.md
  - JSON: processed/json/{bonds,bills}/file.json with pages + metadata
- CLI:
  - python -m cbk_ocr.run_ocr
  - python -m cbk_ocr.run_ocr --limit 5 (for testing)
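
The text-first extraction and the two output files might look roughly like the sketch below. It is not the actual TextFirstOcrEngine: the function names and the JSON layout (`metadata` plus a `pages` list) are assumptions, while `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are the library's real calls. pdfplumber is imported lazily so the path and rendering helpers work without it.

```python
import json
from pathlib import Path


def output_paths(pdf_path, processed_root="processed"):
    """Map downloads/<source>/<name>.pdf to its Markdown and JSON outputs."""
    pdf_path = Path(pdf_path)
    source = pdf_path.parent.name  # "bonds" or "bills"
    root = Path(processed_root)
    return (
        root / "markdown" / source / f"{pdf_path.stem}.md",
        root / "json" / source / f"{pdf_path.stem}.json",
    )


def extract_pages(pdf_path):
    """Return one text string per page; requires the pdfplumber package."""
    import pdfplumber  # third-party dependency, imported only when needed

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


def render_markdown(name, pages):
    """Render page texts as a Markdown document with one section per page."""
    parts = [f"# {name}"]
    for number, text in enumerate(pages, start=1):
        parts.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(parts) + "\n"


def write_outputs(pdf_path, pages, processed_root="processed"):
    """Write the Markdown and JSON files and return their paths."""
    pdf_path = Path(pdf_path)
    md_path, json_path = output_paths(pdf_path, processed_root)
    for path in (md_path, json_path):
        path.parent.mkdir(parents=True, exist_ok=True)
    md_path.write_text(render_markdown(pdf_path.name, pages), encoding="utf-8")
    document = {
        "metadata": {
            "file": pdf_path.name,
            "source": pdf_path.parent.name,
            "page_count": len(pages),
        },
        "pages": [{"page": n, "text": t} for n, t in enumerate(pages, start=1)],
    }
    json_path.write_text(
        json.dumps(document, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return md_path, json_path
```

A typical call would be `write_outputs(path, extract_pages(path))` for each PDF that is not yet in the processed-files set.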

Daily scheduling (Windows)
- Legacy single-task script:
  - scripts/schedule_daily_windows.ps1 → one daily scraper task
- Dual-task script:
  - scripts/schedule_daily_jobs_windows.ps1 creates:
    - CBK-Scraper-10AM → python -m cbk_scraper.run
    - CBK-OCR-12PM → python -m cbk_ocr.run_ocr
  - Times are configurable via -ScraperHour/-ScraperMinute/-OcrHour/-OcrMinute
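
The PowerShell scripts themselves are not reproduced in these notes. As a rough equivalent, the sketch below builds a Windows `schtasks` command for a daily task with a configurable hour and minute. This is an illustrative alternative, not what the scripts do internally, and `schtasks_args` is a hypothetical helper.

```python
def schtasks_args(task_name, command, hour, minute=0):
    """Build a schtasks command line for a daily task at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return [
        "schtasks", "/Create", "/F",      # /F overwrites an existing task
        "/SC", "DAILY",
        "/TN", task_name,
        "/TR", command,
        "/ST", f"{hour:02d}:{minute:02d}",  # 24-hour HH:MM start time
    ]


# Task names and commands from the release notes, with their implied times.
SCRAPER_TASK = schtasks_args("CBK-Scraper-10AM", "python -m cbk_scraper.run", 10)
OCR_TASK = schtasks_args("CBK-OCR-12PM", "python -m cbk_ocr.run_ocr", 12)
```

On Windows, each list could be passed to `subprocess.run(args, check=True)` to register the task.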

v1.0.0 – CBK Treasury Scraper & OCR Pipeline
