vallejo scraper #199

irenecasado · 2025-02-12T23:27:19Z

Description

Here is the Vallejo PD scraper. Its complexity comes from the multiple levels of nested folders on the site.

Notes

This scraper is built with Playwright.

stucka · 2025-04-03T23:58:07Z

@irenecasado, do you want me to look this one over?

irenecasado · 2025-04-04T22:54:26Z

sure!

stucka · 2025-04-09T21:21:39Z

These are Mike's notes, not intended to be for anyone else, likely won't make sense to him the next time either

Fixed: Use utils JSON export

Interesting failure mode:
2025-04-09 18:25:42,752 - root - The following folders were not able to be scraped:
2025-04-09 18:25:42,752 - root - - Loading...

Initial page load was too slow and the scraper failed. It might be worth doing a quick poke with requests or something first, then loading the page for real. Or just increase the timeout.
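One way the "quick poke, then load for real" idea could look is sketched below. This is only an illustration, not code from the PR: `wait_until_ready`, its parameters, and the linear backoff are all assumptions, and the probe is passed in as a callable so it could be a requests call, a Playwright load, or anything else that returns True when the site is responsive.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or the attempts run out.

    Hypothetical helper: the name and defaults are not from the scraper.
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            # Treat any probe error (timeout, connection reset) as "not ready yet".
            pass
        if attempt < attempts:
            # Linear backoff: wait a little longer after each failed try.
            sleep(delay * attempt)
    return False
```

A probe along the lines of `lambda: requests.head(url, timeout=10).ok` would fit here, assuming the site answers HEAD requests; the scraper would start the real Playwright load only once this returns True.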

Look into pagination processing (the "Attempting to navigate" ... messages). There may be a more graceful way of doing this, or maybe not. It could potentially save 10 seconds per case and remove a bunch of error messages along the way.
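A possible shape for a more graceful check, as a sketch only: ask whether the next page button is currently visible instead of waiting out the 10-second timeout. The selector is the one shown in the log output; `go_to_next_page` is a hypothetical name, and `page` is assumed to be a Playwright sync `Page`.

```python
def go_to_next_page(page, current_page: int) -> bool:
    """Click the next page button if it is visible.

    Returns False when there is no visible button, which is taken here
    to mean the last page has been reached (an assumption about the site).
    """
    selector = f"a.pageButton:has-text('{current_page + 1}')"
    button = page.locator(selector)
    # count() and is_visible() answer right away rather than waiting for a timeout.
    if button.count() == 0 or not button.first.is_visible():
        return False
    button.first.click()
    return True
```

Note that `:has-text` matches substrings, so a button labelled 13 would also match a search for 3. Whether that matters depends on how many page buttons the site renders at once.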

Sometimes returns vimeo.com .../folder/ ... URLs, which should get the same kind of treatment as the YouTube code; see issue #193.

Current case_id should be dropped into details as raw_case_id or some such. A cleaner case_id could probably be extracted with case_id.split("\n")[0].strip().
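As a minimal sketch of that suggestion, using a folder name of the shape seen in the logs: `split_case_id` and the returned dict layout are hypothetical, not the scraper's actual metadata schema.

```python
def split_case_id(raw_case_id: str) -> dict:
    """Keep the scraped value as raw_case_id and derive a cleaner case_id.

    The first line of the folder label is treated as the case ID; the
    trailing "N Item(s)" line stays only in the raw value.
    """
    case_id = raw_case_id.split("\n")[0].strip()
    return {"case_id": case_id, "details": {"raw_case_id": raw_case_id}}


record = split_case_id("VPD Case 12-11085\n16 Item(s)")
print(record["case_id"])  # VPD Case 12-11085
```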

details should be a subfolder but is not; some lower elements should move into it.

The current name field seems to be the title. Need to get a real filename going.

Do QA checks to verify some cases vs. actual returns. The duplication around a title of Download makes me wonder.

2025-04-09 17:01:22,628 - root - Processing page 2 of folder: Use of Force Resulting in Death or GBI > 2024-11-12 Release
2 Item(s) > VPD Case 12-11085
16 Item(s)
2025-04-09 17:01:22,661 - root - Found 12 file elements on https://www.vallejopd.net/public_information/codes_policies/penal_code_832_7__sb1421_ in folder 'VPD Case 12-11085
16 Item(s)'.
2025-04-09 17:01:22,710 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:22,765 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:22,841 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:22,917 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:22,987 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:23,081 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:23,105 - root - Subfolder count on current page: 0
2025-04-09 17:01:23,106 - root - No subfolders to process on this page.
2025-04-09 17:01:23,205 - root - Active page determined: 2
2025-04-09 17:01:23,205 - root - Attempting to navigate from page 2 to page 3.
2025-04-09 17:01:33,234 - root - Error navigating to the next page: Page.wait_for_selector: Timeout 10000ms exceeded.
Call log:
  - waiting for locator("a.pageButton:has-text('3')") to be visible
  -     24 × locator resolved to hidden <a tag="2" href="#" title="3 Page" class="pageButton number">3 </a>
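For the QA check on the repeated 'Download' titles, one option is to group the scraped entries by title and flag any title that points at more than one distinct URL, since skipping by title alone would then drop real files. This is a sketch under assumptions: the `title` and `url` keys and the function name are hypothetical, and the log above does not show whether the skipped entries were true duplicates.

```python
from collections import defaultdict


def titles_with_multiple_urls(files: list[dict]) -> dict[str, list[str]]:
    """Return titles that map to more than one distinct URL.

    Each item is assumed to be a dict with "title" and "url" keys.
    """
    urls_by_title: dict[str, set[str]] = defaultdict(set)
    for item in files:
        urls_by_title[item["title"]].add(item["url"])
    # Only titles with two or more different URLs are suspicious.
    return {
        title: sorted(urls)
        for title, urls in urls_by_title.items()
        if len(urls) > 1
    }
```

An empty result would suggest that deduplicating by title is safe for that folder; a non-empty one would suggest keying on the URL instead.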

Copilot

Pull Request Overview

This PR introduces a new web scraper for the Vallejo Police Department using Playwright. It implements recursive folder and subfolder processing, file pagination, and metadata extraction which is later saved as a JSON file.

vallejo scraper

d01dc34

irenecasado assigned newsroomdev and tarakc02 Feb 12, 2025

stucka added 2 commits April 3, 2025 19:58

Merge branch 'dev' into vallejo_pd

1b82542

Merge branch 'dev' into vallejo_pd

b3c1831

Merge branch 'dev' into vallejo_pd

493a2b3

stucka self-requested a review April 9, 2025 20:51

stucka marked this pull request as ready for review April 9, 2025 20:51

Use built-in json_write

2a5540d

newsroomdev requested a review from Copilot April 23, 2025 20:18

Copilot AI reviewed Apr 23, 2025

Merge branch 'dev' into vallejo_pd

d040f54


5 participants
