Conversation

@irenecasado
Collaborator

@irenecasado irenecasado commented Feb 12, 2025

Description

Here is the Vallejo PD scraper. Its complexity comes from the multiple levels of nesting it has to handle.

Notes

  • This scraper is built with Playwright.

@stucka
Contributor

stucka commented Apr 3, 2025

@irenecasado , do you want me to look this one over?

@irenecasado
Collaborator Author

sure!

@stucka stucka self-requested a review April 9, 2025 20:51
@stucka stucka marked this pull request as ready for review April 9, 2025 20:51
@stucka
Contributor

stucka commented Apr 9, 2025

These are Mike's notes, not intended for anyone else, and they likely won't make sense to him next time either.

Fixed: Use utils JSON export

Interesting failure mode:
2025-04-09 18:25:42,752 - root - The following folders were not able to be scraped:
2025-04-09 18:25:42,752 - root - - Loading...

The initial page load was too slow and the scraper failed. It might be worth doing a quick poke with requests or something first, then trying to load the page for real. Or just increase the timeout.
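
One way to sketch the "quick poke" idea is to hit the URL with a cheap HTTP request a few times before handing off to Playwright. The helper name `warm_up`, the retry defaults, and the injected `fetch` callable are assumptions for illustration, not code from this PR.

```python
import time
from typing import Callable


def warm_up(
    fetch: Callable[[], int],
    attempts: int = 3,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call fetch() until it returns an HTTP status below 400.

    `fetch` is a hypothetical zero-argument callable returning a status code,
    e.g. lambda: requests.get(url, timeout=30).status_code
    """
    for attempt in range(1, attempts + 1):
        try:
            if fetch() < 400:
                return True
        except OSError:
            # Network errors from requests and urllib both derive from OSError.
            pass
        if attempt < attempts:
            sleep(delay)
    return False
```

If the warm-up fails, the scraper could log that and bail early rather than reporting a folder named "Loading...". Raising the Playwright timeout (for example, the `timeout` argument to `page.goto`) is the simpler alternative.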

Look into pagination processing ("Attempting to navigate" ...). There may be a more graceful way of doing this. Or maybe not. It could potentially save 10 seconds per case and remove a bunch of error messages along the way.
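
The timeout in the log further down comes from waiting 10 seconds for a page button that exists but is hidden. A possible sketch, assuming `page` is a Playwright sync `Page` and that a hidden or missing button means there is no next page (worth verifying against the site): check visibility first and return right away. The function name `go_to_page` is hypothetical; the selector is copied from the log.

```python
def go_to_page(page, next_page: int) -> bool:
    """Click the numbered page button if it is visible; return False otherwise.

    Assumes `page` exposes Playwright's locator API
    (locator / first / is_visible / click).
    """
    button = page.locator(f"a.pageButton:has-text('{next_page}')").first
    # is_visible() answers immediately instead of waiting out a timeout.
    if not button.is_visible():
        return False
    button.click()
    return True
```

Note that `:has-text` matches substrings, so a folder with ten or more pages may need a stricter selector to tell "3" from "13".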

Sometimes returns vimeo.com .../folder/ ... These should get the same kind of treatment as the YouTube code; see issue #193.

The current case_id should be dropped into details as raw_case_id or some such. case_id could probably be extracted with case_id.split("\n")[0].strip().
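
A minimal sketch of that split, assuming the raw value looks like the folder label in the log (a case name, a newline, then an item count). The function name `split_case_id` and the exact dict shape are hypothetical.

```python
def split_case_id(raw_case_id: str) -> dict:
    """Keep the first line as case_id and preserve the raw value in details."""
    case_id = raw_case_id.split("\n")[0].strip()
    return {"case_id": case_id, "details": {"raw_case_id": raw_case_id}}


record = split_case_id("VPD Case 12-11085\n16 Item(s)")
print(record["case_id"])  # VPD Case 12-11085
```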

details should be a subfolder but is not; some lower elements should move into it.

The current name field seems to be the title. Need to get a real filename going.
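
One possible approach, assuming the download URL's last path segment is a usable filename (not confirmed for this site): derive the name from the URL and keep the current value as a title. The helper `filename_from_url` and the `fallback` parameter are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url: str, fallback: str = "") -> str:
    """Return the decoded last path segment of url, or fallback if there is none."""
    name = PurePosixPath(unquote(urlparse(url).path)).name
    return name or fallback
```

If the site serves files from opaque URLs with no filename in the path, the fallback would kick in and another source would be needed.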

Do QA checks to verify some cases vs. actual returns. The duplication around a title of Download makes me wonder.
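
The log below suggests duplicates are detected by title, which would drop distinct files that all share the generic title "Download". That is an inference from the log messages, not something the PR confirms. A hedged sketch of deduplicating on the file URL instead, with hypothetical `title` and `url` keys:

```python
def dedupe_files(files: list[dict]) -> list[dict]:
    """Drop entries whose URL was already seen, keeping first occurrences in order."""
    seen = set()
    unique = []
    for entry in files:
        key = entry.get("url")
        if key in seen:
            continue
        seen.add(key)
        unique.append(entry)
    return unique
```

Comparing the count this produces against the "Found 12 file elements" figure for a folder would be one quick QA check.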

2025-04-09 17:01:22,628 - root - Processing page 2 of folder: Use of Force Resulting in Death or GBI > 2024-11-12 Release
2 Item(s) > VPD Case 12-11085
16 Item(s)
2025-04-09 17:01:22,661 - root - Found 12 file elements on https://www.vallejopd.net/public_information/codes_policies/penal_code_832_7__sb1421_ in folder 'VPD Case 12-11085
16 Item(s)'.
2025-04-09 17:01:22,710 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:22,765 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:22,841 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:22,917 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:22,987 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:23,081 - root - Skipping duplicate file with title 'Download'.
2025-04-09 17:01:23,105 - root - Subfolder count on current page: 0
2025-04-09 17:01:23,106 - root - No subfolders to process on this page.
2025-04-09 17:01:23,205 - root - Active page determined: 2
2025-04-09 17:01:23,205 - root - Attempting to navigate from page 2 to page 3.
2025-04-09 17:01:33,234 - root - Error navigating to the next page: Page.wait_for_selector: Timeout 10000ms exceeded.
Call log:
  - waiting for locator("a.pageButton:has-text('3')") to be visible
  -     24 × locator resolved to hidden <a tag="2" href="#" title="3 Page" class="pageButton number">3 </a>

@newsroomdev newsroomdev requested a review from Copilot April 23, 2025 20:18

Copilot AI left a comment

Pull Request Overview

This PR introduces a new web scraper for the Vallejo Police Department using Playwright. It implements recursive folder and subfolder processing, file pagination, and metadata extraction which is later saved as a JSON file.

5 participants