Open
Description
Summary
The scraper jobs parse the Two Rivers city website HTML to discover meetings and extract agenda data. If the city website changes its HTML structure (new CMS, redesign, layout tweaks), the scrapers will silently produce incorrect or empty results with no alerting.
Current Risk
- Scrapers::DiscoverMeetingsJob parses table rows from the meetings listing page
- Scrapers::ParseMeetingPageJob parses detail page structure for agenda items, documents, motions
- Scrapers::ParseAgendaJob parses agenda HTML format
- None of these have tests or structural validation
- A website change could result in: zero meetings discovered, missing documents, lost agenda items — all silently
Proposed Mitigations
1. Structural Assertions in Scraper Jobs
Add validation checks that raise/log warnings when expected HTML elements are missing:
- Meetings page: Assert table with expected columns exists
- Detail page: Assert expected sections (agenda, documents, motions) are present
- Log warnings when a scrape run produces zero results or significantly fewer results than previous runs
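The first two checks could be sketched roughly as below. This is a minimal sketch, not the jobs' actual code: `ScraperStructureError`, `StructureCheck`, and the column and section labels are all hypothetical, and the helper assumes the job has already extracted header or heading text from the page (for example with its existing HTML parser).

```ruby
# Hypothetical names throughout; nothing here is taken from the existing jobs.
class ScraperStructureError < StandardError; end

module StructureCheck
  # Raises when any expected label is absent from what was found on the page.
  # `found` is a list of strings already extracted by the scraper, e.g.
  # header-cell text for the meetings table or section headings for a detail page.
  def self.assert_present!(page, found, expected)
    normalized = found.map { |item| item.to_s.strip.downcase }
    missing = expected.reject { |label| normalized.include?(label.downcase) }
    return true if missing.empty?

    raise ScraperStructureError, "#{page}: missing expected elements: #{missing.join(', ')}"
  end
end

# Assumed labels for illustration; the real column and section names may differ.
MEETING_COLUMNS = %w[Date Meeting Agenda].freeze
DETAIL_SECTIONS = %w[Agenda Documents Motions].freeze
```

A job could call `StructureCheck.assert_present!("meetings page", headers, MEETING_COLUMNS)` before parsing rows, so a layout change fails loudly instead of yielding an empty result.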
2. Canary Test (Integration)
A test that hits the live Two Rivers website and validates that the HTML structure matches what the scrapers expect:
- Run periodically (not on every CI run — too slow and fragile)
- Validates: page loads, expected table structure exists, at least N meetings found
- Can be triggered manually:
bin/rails test test/integration/scraper_canary_test.rb
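One possible shape for the check behind that test is sketched below. It is an assumption-heavy illustration: `ScraperCanary`, the `CANARY_URL` environment variable, and the threshold of 5 are hypothetical, the listing URL is not stated in this issue so it must be supplied by the caller, and the regex-based row count is a stand-in for whatever selectors the scrapers really use.

```ruby
require "net/http"
require "uri"

# Hypothetical canary helper; names and thresholds are illustrative only.
module ScraperCanary
  # Returns a list of human-readable problems; an empty list means the page
  # still looks like what the scrapers expect.
  def self.problems(html, min_meetings: 1)
    problems = []
    problems << "no <table> element found" unless html.match?(/<table\b/i)
    # Count rows that contain data cells, which skips header-only rows.
    rows = html.scan(/<tr\b.*?<\/tr>/mi).count { |row| row.match?(/<td\b/i) }
    problems << "expected at least #{min_meetings} meeting rows, found #{rows}" if rows < min_meetings
    problems
  end

  # Fetches the live page; raises on any non-2xx response (redirects included).
  def self.fetch(url)
    response = Net::HTTP.get_response(URI(url))
    raise "canary: HTTP #{response.code} for #{url}" unless response.is_a?(Net::HTTPSuccess)

    response.body
  end
end

# Only touches the network when a URL is explicitly provided.
if ENV["CANARY_URL"]
  found = ScraperCanary.problems(ScraperCanary.fetch(ENV["CANARY_URL"]), min_meetings: 5)
  abort("Canary failed: #{found.join('; ')}") unless found.empty?
  puts "Canary passed"
end
```

Keeping the structural check separate from the fetch means the same logic can be exercised offline against saved HTML, while the network call stays out of normal CI runs.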
3. Monitoring/Alerting
- Track meetings discovered per scrape run
- Alert if a run discovers zero meetings (likely structural change)
- Alert if document download rate drops significantly
- Consider a simple admin dashboard metric or Solid Queue job failure tracking
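The alert rules above might look something like this sketch. `RunStats`, `ScrapeRunMonitor`, and the 25-point drop threshold are hypothetical; how per-run counts are stored and where alerts are delivered is left open.

```ruby
# Hypothetical per-run counters; field names are assumptions.
RunStats = Struct.new(:meetings, :documents_attempted, :documents_downloaded, keyword_init: true) do
  # Fraction of attempted downloads that succeeded, or nil when none were attempted.
  def download_rate
    return nil if documents_attempted.zero?

    documents_downloaded.to_f / documents_attempted
  end
end

module ScrapeRunMonitor
  # Compares the current run with the previous one and returns alert messages.
  # max_rate_drop is the largest tolerated fall in download rate (0.25 = 25 points).
  def self.alerts(current, previous = nil, max_rate_drop: 0.25)
    alerts = []
    alerts << "zero meetings discovered (possible structural change)" if current.meetings.zero?

    if previous && current.download_rate && previous.download_rate
      drop = previous.download_rate - current.download_rate
      if drop > max_rate_drop
        alerts << format("document download rate fell from %.0f%% to %.0f%%",
                         previous.download_rate * 100, current.download_rate * 100)
      end
    end

    alerts
  end
end
```

A job could build a `RunStats` at the end of each run and pass the resulting messages to the logger or whatever notification channel the app already uses.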
4. Fixture-Based Regression Tests
- Save snapshots of real HTML pages as test fixtures
- Run scraper parsing against fixtures to catch regressions
- Update fixtures when intentional changes are made
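As a rough sketch of the fixture approach, the helper below compares a parser's output for a saved page against a stored expected result. The layout (`<name>.html` next to `<name>.json`) and the `FixtureRegression` name are assumptions, and `parser` stands in for whatever parsing entry point the scraper jobs expose.

```ruby
require "json"

# Hypothetical helper; the fixture layout and names are assumptions.
module FixtureRegression
  # Expects <dir>/<name>.html (saved page) and <dir>/<name>.json (expected parse result).
  # `parser` is any callable that takes the HTML string and returns plain data.
  def self.check(dir, name, parser)
    html = File.read(File.join(dir, "#{name}.html"))
    expected = JSON.parse(File.read(File.join(dir, "#{name}.json")))
    # Round-trip through JSON so symbol keys compare equal to the stored strings.
    actual = JSON.parse(JSON.generate(parser.call(html)))
    { ok: actual == expected, expected: expected, actual: actual }
  end
end
```

When the city site changes intentionally, both the `.html` snapshot and the `.json` expectation would be regenerated together so the pair stays consistent.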
Related
- #23 Scraper: ensure daily run parses recent meetings
- #54 Test coverage: Scraper jobs (ParseMeetingPage, ParseAgenda), covering unit tests for scraper jobs
- #20 Add Fly.io daily cron to run meeting discovery once