This repo contains a scraper for collecting the relevant state codes and regulations from Justia.
There are two scrapers available:
- scraper.py: A simple, single-threaded scraper for one state at a time.
- ms.py: A multi-threaded scraper capable of scraping multiple states in parallel, with progress bars and the ability to resume interrupted downloads.
A separate script, ms_fl_scraper.py, scrapes state codes from FindLaw, whose pages are rendered with JavaScript. It launches real browsers via Playwright and streams results into a single JSONL file per state.
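Because the output is JSONL (one JSON object per line), it can be read back incrementally without loading the whole file. The following is a minimal sketch, not part of the repo: `iter_records` is a hypothetical helper, the path assumes the default output directory, and no assumption is made about which fields each record contains.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


if __name__ == "__main__":
    # Assumed location: default output directory plus <STATE>.jsonl.
    path = Path("findlaw_codes") / "NY.jsonl"
    if path.exists():
        count = sum(1 for _ in iter_records(path))
        print(f"{path}: {count} records")
    else:
        print(f"{path} not found; run the scraper first")
```

Reading line by line keeps memory use flat even for large states, which matches the way the scraper streams its results.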
Usage:
python ms_fl_scraper.py <state> \
[-o OUTPUT_DIR] [-p PROCESSES] [-t THREADS] [-c CHUNKS_PER_PROC]
Flags:
- -o, --output-dir: Directory for output (default: findlaw_codes). Writes <STATE>.jsonl inside it.
- -p, --processes: Number of browser processes (default: 6). Set to 1 for single-browser mode.
- -t, --threads: Threads per process for fetching leaf pages (default: 8).
- -c, --chunks-per-proc: Work chunk factor per process to improve progress responsiveness (default: 4).
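To make the flags and their defaults concrete, here is a hedged sketch of an argparse parser that would accept the same command line. It is an illustration only: `build_parser` is a hypothetical name, and the actual script may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="ms_fl_scraper.py")
    parser.add_argument("state", help="state to scrape, e.g. NY")
    parser.add_argument("-o", "--output-dir", default="findlaw_codes",
                        help="directory that receives <STATE>.jsonl")
    parser.add_argument("-p", "--processes", type=int, default=6,
                        help="number of browser processes")
    parser.add_argument("-t", "--threads", type=int, default=8,
                        help="threads per process for leaf pages")
    parser.add_argument("-c", "--chunks-per-proc", type=int, default=4,
                        help="work chunk factor per process")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["PA", "-p", "1"])
    # Threads are per process, so the total is processes * threads.
    print(args.processes * args.threads)
```

Since -t is a per-process setting, the defaults (6 processes, 8 threads each) give up to 48 leaf-fetching threads in total; lowering -p or -t reduces that load proportionally.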
Examples:
# Scrape New York to default folder with defaults
python ms_fl_scraper.py NY
# Scrape Pennsylvania with custom concurrency and output directory
python ms_fl_scraper.py PA -p 6 -t 8 -c 4 -o findlaw_codes
# Single-browser (more granular per-section leaf bars in the console)
python ms_fl_scraper.py KY -p 1
Notes:
- Requires Playwright with Chromium installed. If needed: pip install playwright, then playwright install chromium.
- Progress bars: the parent process shows a "Sections" bar; with -p 1 you’ll also see per-section leaf bars.
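Before starting a long scrape, it can help to confirm that Playwright and its Chromium build are actually usable. The snippet below is a minimal sketch of such a check; `chromium_available` is a hypothetical helper that is not part of this repo, and it relies only on Playwright's synchronous API.

```python
def chromium_available() -> bool:
    """Return True if Playwright can launch and close a Chromium browser."""
    try:
        from playwright.sync_api import sync_playwright
    except ImportError:
        return False  # usually fixed with: pip install playwright
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch()  # headless by default
            browser.close()
        return True
    except Exception:
        return False  # usually fixed with: playwright install chromium


if __name__ == "__main__":
    print("Chromium ready" if chromium_available() else "Chromium not ready")
```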
To download the codes for a single state, use:
> python scraper.py CA
To download regulations, use:
> python scraper.py CA -r
It is recommended to use ms.py for scraping multiple states or for large states.
To download codes for multiple states in parallel:
python ms.py --states CA TX NY
To scrape a range of states (alphabetically):
python ms.py --range AL AZ
To scrape all available states:
python ms.py --all
Add the -r or --regs flag to any of the above commands to download regulations instead of codes.
python ms.py --states CA TX -r
You can control the number of parallel threads with the -t or --threads flag:
python ms.py --all -t 8
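When ms.py is driven from another script (for example a scheduled job), the command line can be assembled programmatically. The sketch below uses only the flags documented above; `build_ms_command` and `run_ms` are hypothetical helpers, and combining -r with -t in one invocation is an assumption.

```python
import subprocess
import sys
from typing import List, Optional


def build_ms_command(states: List[str], regs: bool = False,
                     threads: Optional[int] = None) -> List[str]:
    """Build an argument list for ms.py using the documented flags."""
    cmd = [sys.executable, "ms.py", "--states", *states]
    if regs:
        cmd.append("-r")  # download regulations instead of codes
    if threads is not None:
        cmd += ["-t", str(threads)]  # number of parallel threads
    return cmd


def run_ms(cmd: List[str]) -> None:
    """Run the scraper and raise CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_ms_command(["CA", "TX"], regs=True, threads=8)
    print(" ".join(cmd))  # inspect the command; call run_ms(cmd) to execute
```

Passing the arguments as a list avoids shell quoting issues, and `sys.executable` ensures the scraper runs under the same interpreter as the calling script.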