Scraper resilience: HTML structure change detection #61

@AndreRobitaille

Summary

The scraper jobs parse the Two Rivers city website HTML to discover meetings and extract agenda data. If the city website changes its HTML structure (new CMS, redesign, layout tweaks), the scrapers will silently produce incorrect or empty results with no alerting.

Current Risk

  • Scrapers::DiscoverMeetingsJob parses table rows from the meetings listing page
  • Scrapers::ParseMeetingPageJob parses detail page structure for agenda items, documents, motions
  • Scrapers::ParseAgendaJob parses agenda HTML format
  • None of these have tests or structural validation
  • A website change could result in zero meetings discovered, missing documents, or lost agenda items, all without any error or alert

Proposed Mitigations

1. Structural Assertions in Scraper Jobs

Add validation checks that raise an error or log a warning when expected HTML elements are missing:

  • Meetings page: Assert table with expected columns exists
  • Detail page: Assert expected sections (agenda, documents, motions) are present
  • Log warnings when a scrape run produces zero results or significantly fewer results than previous runs
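
A minimal sketch of such an assertion for the meetings table. It assumes the job has already extracted the header cell text with whatever HTML parser it uses. `HtmlStructureChangedError`, `StructureCheck`, and the column names are hypothetical and do not come from the existing codebase.

```ruby
# Hypothetical error class; a dedicated type makes structural failures
# easy to tell apart from network or database errors in job logs.
class HtmlStructureChangedError < StandardError; end

module StructureCheck
  module_function

  # headers:  header-cell text the job extracted from the meetings table.
  # expected: column names the parser depends on (placeholders in the usage below).
  # Returns true when every expected column is present, raises otherwise.
  def assert_columns!(headers, expected)
    normalized = headers.map { |h| h.to_s.strip.downcase }
    missing = expected.reject { |name| normalized.include?(name.downcase) }
    return true if missing.empty?

    raise HtmlStructureChangedError,
          "meetings table is missing expected columns: #{missing.join(', ')}"
  end
end

# Usage with placeholder column names; extra columns are tolerated.
StructureCheck.assert_columns!(["Date ", "Meeting", "Agenda"], %w[Date Meeting])
```

Raising rather than only logging means the job fails visibly in the queue instead of finishing with empty results. Whether a given check should raise or warn is a per-check judgment call.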

2. Canary Test (Integration)

A test that hits the live Two Rivers website and validates that the HTML structure matches what the scrapers expect:

  • Run periodically (not on every CI run, which would be too slow and fragile)
  • Validates: page loads, expected table structure exists, at least N meetings found
  • Can be triggered manually: bin/rails test test/integration/scraper_canary_test.rb
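
One possible shape for the canary, written as a standalone Minitest file. The environment variables `CANARY_MEETINGS_URL` and `CANARY_MIN_MEETINGS`, the `CanaryCheck` helper, and the regex-based row count are all assumptions made for this sketch. A real version would more likely reuse the scrapers' own selectors and require the app's `test_helper`.

```ruby
require "minitest/autorun"
require "net/http"
require "uri"

# Pure helper so the structural check can be exercised without the network.
# The regexes are a dependency-free stand-in for the scrapers' real selectors.
module CanaryCheck
  module_function

  # Returns a list of problems; an empty list means the page looks as expected.
  def problems(html, min_meetings:)
    return ["no <table> element found"] unless html.match?(/<table\b/i)

    # Rows that contain <td> cells are treated as meeting rows.
    data_rows = html.scan(/<tr\b.*?<\/tr>/im).count { |row| row.match?(/<td\b/i) }
    return [] if data_rows >= min_meetings

    ["only #{data_rows} meeting rows, expected at least #{min_meetings}"]
  end
end

class ScraperCanaryTest < Minitest::Test
  def test_meetings_listing_structure
    url = ENV["CANARY_MEETINGS_URL"] # assumed variable; the real URL is not given here
    skip "set CANARY_MEETINGS_URL to run against the live site" unless url

    # Note: get_response does not follow redirects.
    response = Net::HTTP.get_response(URI(url))
    assert_kind_of Net::HTTPSuccess, response, "meetings page did not load"

    min = Integer(ENV.fetch("CANARY_MIN_MEETINGS", "1"))
    assert_empty CanaryCheck.problems(response.body, min_meetings: min)
  end
end
```

Because the test skips when the URL variable is unset, it is safe to leave in the normal test directory. A scheduled job can set the variable and run only this file.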

3. Monitoring/Alerting

  • Track meetings discovered per scrape run
  • Alert if a run discovers zero meetings (likely structural change)
  • Alert if document download rate drops significantly
  • Consider a simple admin dashboard metric or Solid Queue job failure tracking
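
A rough sketch of the per-run comparison, assuming the count of discovered meetings is recorded somewhere after each run. `ScrapeRunMonitor` and the `drop_ratio` default of 0.5 are hypothetical; the right threshold depends on how much the meeting count normally varies.

```ruby
module ScrapeRunMonitor
  module_function

  # current:    meetings discovered in this run.
  # previous:   counts from earlier runs (may be empty on the first run).
  # drop_ratio: assumed threshold; 0.5 flags a run that finds fewer than
  #             half the average of the previous runs.
  def evaluate(current, previous, drop_ratio: 0.5)
    return :zero_results if current.zero?
    return :ok if previous.empty?

    average = previous.sum.to_f / previous.size
    current < average * drop_ratio ? :significant_drop : :ok
  end
end

ScrapeRunMonitor.evaluate(0, [12, 11])   # => :zero_results
ScrapeRunMonitor.evaluate(4, [12, 11])   # => :significant_drop
ScrapeRunMonitor.evaluate(10, [12, 11])  # => :ok
```

The caller decides what each symbol maps to, for example a logged warning for `:significant_drop` and a raised error or notification for `:zero_results`.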

4. Fixture-Based Regression Tests

  • Save snapshots of real HTML pages as test fixtures
  • Run scraper parsing against fixtures to catch regressions
  • Update fixtures when intentional changes are made
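
A small sketch of the snapshot step, assuming fixtures are stored as plain `.html` files. `HtmlFixtures` and the fixture name are hypothetical, and the usage below writes to a temporary directory rather than the app's real fixture path.

```ruby
require "fileutils"
require "tmpdir"

module HtmlFixtures
  module_function

  # Writes the raw HTML to <dir>/<name>.html and returns the path.
  def save(dir, name, html)
    FileUtils.mkdir_p(dir)
    path = File.join(dir, "#{name}.html")
    File.write(path, html)
    path
  end

  # Reads a previously saved snapshot back as a string.
  def read(dir, name)
    File.read(File.join(dir, "#{name}.html"))
  end
end

# Usage: snapshot a page once, then feed it to the parsing code in tests.
Dir.mktmpdir do |dir|
  HtmlFixtures.save(dir, "meetings_listing", "<table><tr><td>Jan 1</td></tr></table>")
  html = HtmlFixtures.read(dir, "meetings_listing")
  # A regression test would pass `html` to the scraper's parsing method
  # and compare the result with the values expected for this snapshot.
end
```

Keeping snapshots byte-for-byte as fetched means a failing regression test points at the parser, while a failing canary points at the live site.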
