vallejo scraper #199
base: dev
Conversation
@irenecasado, do you want me to look this one over?

sure!
These are Mike's notes, not intended for anyone else; they likely won't make sense to him next time either.
- Fixed: use utils JSON export.
- Interesting failure mode: the initial page load was too slow and the scraper failed. It might be worth doing a quick poke with requests or something first, then loading the page for real. Or just increase the timeout.
- Look into pagination processing ("Attempting to navigate" ...); there may be a more graceful way of doing this. Or maybe not. Potentially saves 10 seconds per case and removes a bunch of error messages along the way.
- Sometimes returns vimeo.com .../folder/ ...; these should get the same kind of treatment as the YouTube code; see issue #193.
- The current case_id should be dropped into details as raw_case_id or some such. case_id could be extracted with case_id.split("\n")[0].strip().
- details should be a subfolder but is not; some lower elements should move into it.
- The current name field seems to be the title. Need to get a real filename going.
- Do QA checks to verify some cases against actual returns. The duplication around a title of "Download" makes me wonder.
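As a quick illustration of the case_id cleanup suggested above, here is a minimal sketch; the raw value shown is hypothetical, standing in for whatever multi-line text the scraper currently captures:

```python
# Hypothetical raw value: the ID sits on the first line, with extra
# detail text captured below it (as the notes describe).
case_id = "23-001234\nSome trailing detail text"

# Preserve the raw value (e.g. as raw_case_id inside details) ...
raw_case_id = case_id

# ... then keep only the first line, stripped of whitespace.
case_id = case_id.split("\n")[0].strip()

print(case_id)  # → 23-001234
```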
Pull Request Overview
This PR introduces a new web scraper for the Vallejo Police Department using Playwright. It implements recursive folder and subfolder processing, file pagination, and metadata extraction, with the results saved as a JSON file.
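The recursive folder/subfolder walk is the core of the scraper's structure. The real implementation drives Playwright against the live site; as a language-level sketch, a nested dict can stand in for the site's folder tree (the folder names and URLs below are made up for illustration):

```python
def walk_folders(folder, path=()):
    """Yield (path, url) for every file in a nested folder tree."""
    for name, item in folder.items():
        if isinstance(item, dict):
            # Subfolder: recurse one level deeper.
            yield from walk_folders(item, path + (name,))
        else:
            # Leaf: a file entry with its download URL.
            yield path + (name,), item

# Hypothetical tree mimicking the site's case/folder nesting.
site = {
    "Case 23-001": {
        "Bodycam": {"clip1.mp4": "url1", "clip2.mp4": "url2"},
        "report.pdf": "url3",
    },
}

for path, url in walk_folders(site):
    print("/".join(path), "->", url)
```

In the actual scraper the recursion step would click into a folder and paginate through its files rather than iterate a dict, but the traversal shape is the same.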
Description
Here is the Vallejo PD scraper. The complexity of this scraper stems from the multiple levels of nesting.
Notes