Utility which can be used to scrape, in parallel, state codes and regulations for any given year.

reglab/lawscraper

State Law Scraper

This repo contains scrapers for collecting state codes and regulations from Justia, plus a separate scraper for state codes from FindLaw.

Two Justia scrapers are available:

  • scraper.py: A simple, single-threaded scraper for one state at a time.
  • ms.py: A multi-threaded scraper capable of scraping multiple states in parallel, with progress bars and the ability to resume interrupted downloads.

ms_fl_scraper.py (FindLaw scraper)

Scrapes state codes from FindLaw (JS-driven pages). It launches real browsers via Playwright and streams results into a single JSONL file per state.
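The "one JSONL file per state" output can be illustrated with a minimal sketch. It is not the scraper's actual code: the helper names (append_record, read_records) and the record fields (title, text) are assumptions made for illustration, and the real schema may differ.

```python
import json
import tempfile
from pathlib import Path


def append_record(path: Path, record: dict) -> None:
    """Append one record as a single JSON line (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path: Path):
    """Yield records back one line at a time, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Demo with made-up field names; the real output schema may differ.
demo_dir = Path(tempfile.mkdtemp())
out_file = demo_dir / "NY.jsonl"
append_record(out_file, {"title": "Example section", "text": "..."})
append_record(out_file, {"title": "Another section", "text": "..."})
records = list(read_records(out_file))
print(len(records))
```

Because each record sits on its own line, a file written this way can be read back incrementally without loading the whole state into memory.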

Usage:

python ms_fl_scraper.py <state> \
	[-o OUTPUT_DIR] [-p PROCESSES] [-t THREADS] [-c CHUNKS_PER_PROC]

Flags:

  • -o, --output-dir Directory for output (default: findlaw_codes). Writes <STATE>.jsonl inside it.
  • -p, --processes Number of browser processes (default: 6). Set to 1 for single-browser mode.
  • -t, --threads Threads per process for fetching leaf pages (default: 8).
  • -c, --chunks-per-proc Work chunk factor per process to improve progress responsiveness (default: 4).
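As a rough sketch of how these flags could be declared and combined, the example below mirrors the documented flags and defaults with argparse. The chunking helper (split_into_chunks), the placeholder work items, and the idea that the total chunk count is processes times chunks-per-proc are assumptions, not taken from the script itself.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented flags and defaults (a sketch, not the real parser)."""
    parser = argparse.ArgumentParser(prog="ms_fl_scraper.py")
    parser.add_argument("state", help="State abbreviation, e.g. NY")
    parser.add_argument("-o", "--output-dir", default="findlaw_codes")
    parser.add_argument("-p", "--processes", type=int, default=6)
    parser.add_argument("-t", "--threads", type=int, default=8)
    parser.add_argument("-c", "--chunks-per-proc", type=int, default=4)
    return parser


def split_into_chunks(items: list, n_chunks: int) -> list:
    """Split items into at most n_chunks contiguous, near-equal chunks (hypothetical)."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


args = build_parser().parse_args(["PA", "-p", "2", "-c", "3"])
sections = [f"section-{i}" for i in range(10)]  # placeholder work items
# Assumption: total chunks = processes * chunks-per-proc.
chunks = split_into_chunks(sections, args.processes * args.chunks_per_proc)
print(len(chunks), [len(c) for c in chunks])
```

Under this assumption, a larger -c value yields smaller chunks, so each process reports completed work more often.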

Examples:

# Scrape New York to default folder with defaults
python ms_fl_scraper.py NY

# Scrape Pennsylvania with custom concurrency and output directory
python ms_fl_scraper.py PA -p 6 -t 8 -c 4 -o findlaw_codes

# Single-browser (more granular per-section leaf bars in the console)
python ms_fl_scraper.py KY -p 1

Notes:

  • Requires Playwright with Chromium installed. To install them, run pip install playwright, then playwright install chromium.
  • Progress bars: parent process shows a "Sections" bar; with -p 1 you’ll also see per-section leaf bars.

scraper.py Usage

To download the codes for a single state:

python scraper.py CA

To download regulations instead, add the -r flag:

python scraper.py CA -r

ms.py (Multi-threaded Scraper) Usage

Use ms.py when scraping multiple states or large states.

Scraping Specific States

To download codes for multiple states in parallel:

python ms.py --states CA TX NY

Scraping a Range of States

To scrape a range of states (alphabetically):

python ms.py --range AL AZ
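One way an alphabetical range could be resolved is sketched below. It assumes that states are ordered by their two-letter abbreviation and that both endpoints are included; ms.py may order or bound the range differently. The function name states_in_range and the sample list are hypothetical.

```python
def states_in_range(all_states, start, end):
    """Return abbreviations between start and end, inclusive (assumed semantics)."""
    ordered = sorted(all_states)
    return [s for s in ordered if start <= s <= end]


# Sample list for illustration only, not the tool's real state list.
available = ["CA", "AL", "AZ", "AK", "AR", "TX"]
selected = states_in_range(available, "AL", "AZ")
print(selected)
```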

Scraping All States

To scrape all available states:

python ms.py --all

Scraping Regulations

Add the -r or --regs flag to any of the above commands to download regulations instead of codes.

python ms.py --states CA TX -r

Specifying Number of Threads

You can control the number of parallel threads with the -t or --threads flag:

python ms.py --all -t 8
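The effect of a thread count can be sketched with Python's standard ThreadPoolExecutor. This is an illustration of the general pattern, not the implementation in ms.py: scrape_state is a placeholder for the real per-state work, and scrape_all is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_state(state: str) -> str:
    """Placeholder for the real per-state download (hypothetical)."""
    return f"{state}: done"


def scrape_all(states, threads):
    """Run scrape_state for each state with at most `threads` workers at once."""
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(scrape_state, s): s for s in states}
        for fut in as_completed(futures):
            # Store each result under its state as soon as it finishes.
            results[futures[fut]] = fut.result()
    return results


results = scrape_all(["CA", "TX", "NY"], threads=2)
print(results)
```

With this pattern, raising the thread count lets more states download concurrently, at the cost of sending more simultaneous requests to the source site.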
