Releases: seven7-AI/CBK-scraper

v1.0.0 – CBK Treasury Scraper & OCR Pipeline

02 Mar 08:47

Summary

This release introduces a production-ready pipeline for downloading and processing
Central Bank of Kenya (CBK) Treasury Bond and Treasury Bill result PDFs, with
deduplication, OCR processing, Redis tracking, and Windows-compatible daily jobs.

Features

  • Treasury Bonds scraper

    • Scrapes https://www.centralbank.go.ke/bills-bonds/treasury-bonds/
    • Handles DataTables pagination / “Show All”
    • Downloads all result PDFs into downloads/bonds/
  • Treasury Bills scraper

    • Scrapes https://www.centralbank.go.ke/bills-bonds/treasury-bills/ for:
      • 91-day (#table_2)
      • 182-day (#table_3)
      • 364-day (#table_4)
    • Stronger DOM waits for DataTables so that the 91-, 182-, and 364-day PDF links are reliably captured
    • Downloads PDFs into downloads/bills/
  • Download deduplication (SQLite + Redis)

    • SQLite registry (data/registry.db) tracks (url, local_path, downloaded_at, source)
    • Redis set cbk:scraper:downloaded_urls prevents re-downloading already-scraped URLs
    • Idempotent runs: safe to run multiple times per day
  • Structured JSON logging

    • Shared JSON formatter in cbk_common.logging_utils
    • Scraper and OCR logs include:
      • event, pdf_url, pdf_path, source
      • file_size_bytes, pages, duration_ms (where applicable)
    • Logs written to logs/ and stdout
  • Redis metrics

    • Scraper per-run hash: cbk:scraper:run:<YYYYMMDD> (downloaded / skipped / failed)
    • OCR per-run hash: cbk:ocr:run:<YYYYMMDD> (processed / skipped / failed by source)
    • Dedup sets have no TTL; run hashes expire after 14 days, so only recent run history is kept
  • OCR processing pipeline (cbk_ocr)

    • TextFirstOcrEngine using pdfplumber to extract per-page text
    • Walks downloads/bonds/ and downloads/bills/
    • Skips files already processed (Redis: cbk:ocr:processed_files)
    • Outputs:
      • Markdown: processed/markdown/{bonds,bills}/file.md
      • JSON: processed/json/{bonds,bills}/file.json with pages + metadata
    • CLI:
      • python -m cbk_ocr.run_ocr
      • python -m cbk_ocr.run_ocr --limit 5 (for testing)
  • Daily scheduling (Windows)

    • Legacy single-task script:
      • scripts/schedule_daily_windows.ps1 → one daily scraper task
    • Dual-task script:
      • scripts/schedule_daily_jobs_windows.ps1 creates:
        • CBK-Scraper-10AM → python -m cbk_scraper.run
        • CBK-OCR-12PM → python -m cbk_ocr.run_ocr
      • Times are configurable via -ScraperHour/-ScraperMinute/-OcrHour/-OcrMinute
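
The download deduplication and per-run Redis metrics listed under Features can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the table name `downloads`, the functions `open_registry` and `record_download`, and the `InMemoryRedis` stand-in are all hypothetical. The key names and the 14-day TTL come from the notes above, and the stand-in mirrors the redis-py methods `sismember`, `sadd`, `hincrby`, and `expire` so that a real client could be passed in instead.

```python
import sqlite3
from datetime import datetime, timezone

DEDUP_SET = "cbk:scraper:downloaded_urls"  # key name from the release notes
RUN_TTL_SECONDS = 14 * 24 * 3600           # 14-day TTL on run hashes


class InMemoryRedis:
    """Hypothetical stand-in exposing the few redis-py methods used below."""

    def __init__(self):
        self.sets, self.hashes, self.ttls = {}, {}, {}

    def sismember(self, key, member):
        return member in self.sets.get(key, set())

    def sadd(self, key, member):
        self.sets.setdefault(key, set()).add(member)

    def hincrby(self, key, field, amount=1):
        bucket = self.hashes.setdefault(key, {})
        bucket[field] = bucket.get(field, 0) + amount
        return bucket[field]

    def expire(self, key, seconds):
        self.ttls[key] = seconds


def open_registry(path=":memory:"):
    """Open the SQLite registry; the schema is an assumption based on the tracked fields."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        "url TEXT PRIMARY KEY, local_path TEXT, downloaded_at TEXT, source TEXT)"
    )
    return conn


def record_download(conn, r, url, local_path, source, now=None):
    """Return 'downloaded' for a new URL, 'skipped' for one already in the dedup set."""
    now = now or datetime.now(timezone.utc)
    run_key = f"cbk:scraper:run:{now:%Y%m%d}"
    if r.sismember(DEDUP_SET, url):
        outcome = "skipped"
    else:
        # The actual PDF download would happen here, before the bookkeeping.
        conn.execute(
            "INSERT OR IGNORE INTO downloads VALUES (?, ?, ?, ?)",
            (url, local_path, now.isoformat(), source),
        )
        conn.commit()
        r.sadd(DEDUP_SET, url)
        outcome = "downloaded"
    # Per-run counters live in a hash whose TTL is refreshed on every update.
    r.hincrby(run_key, outcome, 1)
    r.expire(run_key, RUN_TTL_SECONDS)
    return outcome
```

Because the dedup check runs before any write, calling `record_download` twice with the same URL only increments the `skipped` counter the second time, which is what makes repeated daily runs safe.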
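
The OCR output layout described above (Markdown and JSON files mirrored under processed/) could look something like the sketch below. The function names `output_paths` and `build_outputs`, the `## Page N` Markdown headings, and the exact JSON field names are assumptions made for illustration; the release notes only state that the JSON holds pages plus metadata. The per-page text is taken as a plain list here, and in the real pipeline it would presumably come from pdfplumber.

```python
import json
from pathlib import Path


def output_paths(pdf_path, processed_root="processed"):
    """Map downloads/<source>/<name>.pdf to its Markdown and JSON output paths."""
    pdf = Path(pdf_path)
    source = pdf.parent.name  # "bonds" or "bills", taken from the parent folder
    root = Path(processed_root)
    return (
        root / "markdown" / source / f"{pdf.stem}.md",
        root / "json" / source / f"{pdf.stem}.json",
    )


def build_outputs(pdf_path, page_texts):
    """Build the Markdown text and JSON string for one PDF's extracted pages."""
    pdf = Path(pdf_path)
    # Treat a missing page text (None) as an empty string.
    texts = [(t or "").strip() for t in page_texts]
    markdown = "\n\n".join(
        f"## Page {i}\n\n{text}" for i, text in enumerate(texts, start=1)
    )
    document = {
        "metadata": {
            "file": pdf.name,
            "source": pdf.parent.name,
            "pages": len(texts),
        },
        "pages": [{"page": i, "text": text} for i, text in enumerate(texts, start=1)],
    }
    return markdown, json.dumps(document, indent=2)
```

Deriving the source from the parent directory keeps the bonds/bills split consistent between downloads/ and processed/ without a separate lookup.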