A web scraper and REST API for New Mexico Oil Conservation Division (OCD) well data.
## Installation

### Option 1: uv

```bash
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies and set up environment
uv sync
uv run playwright install chromium
```

### Option 2: pip and venv

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
playwright install chromium
```

### Option 3: Docker

```bash
docker compose up --build
```

Or, if a venv (or uv environment) is already activated:

```bash
python -m well_db docker up --build   # prefix with `uv run` when using uv
```

This starts the API server at http://localhost:8000. The Playwright browser is included in the image.
## Scraping Data

Populate the database from the CSV file containing API numbers:

```bash
# Using uv
uv run python -m well_db scrape

# With options
uv run python -m well_db scrape --concurrency 3 --missing
```

The `--missing` flag only scrapes APIs not already in the database. Use `--force` to re-scrape everything.
## Running the API Server

```bash
uv run python -m well_db serve
```

The server runs at http://127.0.0.1:8000. API docs are available at /docs.
## Polygon Search (CLI)

```bash
# Use the test polygon from the assignment
uv run python -m well_db polygon test -o results.csv

# Custom polygon
uv run python -m well_db polygon "[(32.81,-104.19),(32.66,-104.32),(32.54,-104.24)]" -o results.csv
```

## Utility Commands

```bash
uv run python -m well_db delete --yes    # Delete the database
uv run python -m well_db docker up       # Start Docker containers
uv run python -m well_db docker down     # Stop Docker containers
```

## API Endpoints

### Core

| Method | Endpoint | Description |
|---|---|---|
| GET | `/well?api_number=XX-XXX-XXXXX` | Get all data for a single well |
| GET | `/polygon-search?polygon=[(lat,lon),...]` | Find wells within a polygon |
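A quick way to exercise these two endpoints from Python — a sketch that assumes the server is running locally and returns JSON; `requests` is not a project dependency (any HTTP client works), and the API number shown is a made-up example:

```python
import requests

BASE = "http://localhost:8000"

# Fetch all data for a single well (30-015-12345 is a made-up
# API number in the XX-XXX-XXXXX format).
well = requests.get(f"{BASE}/well", params={"api_number": "30-015-12345"})
print(well.json())

# Find wells inside a polygon given as (lat, lon) pairs.
polygon = "[(32.81,-104.19),(32.66,-104.32),(32.54,-104.24)]"
hits = requests.get(f"{BASE}/polygon-search", params={"polygon": polygon})
print(hits.json())
```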
### Database and Scrape Management

| Method | Endpoint | Description |
|---|---|---|
| GET | `/db/status` | Database status and CSV comparison |
| POST | `/scrape/start` | Start background scrape job |
| GET | `/scrape/status` | Monitor scrape progress |
| POST | `/scrape/stop` | Stop running scrape job |
| GET | `/wells` | List all wells (paginated) |
| GET | `/wells/count` | Total well count |
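For example, a scrape job can be started and polled from Python. This is a sketch: the field name in the status payload (`running` here) is an assumption, so check `/docs` for the real schema:

```python
import time

import requests

BASE = "http://localhost:8000"

requests.post(f"{BASE}/scrape/start")  # kick off the background job

while True:
    status = requests.get(f"{BASE}/scrape/status").json()
    print(status)
    if not status.get("running", False):  # assumed field name
        break
    time.sleep(5)
```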
### Additional Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Health check |
| GET | `/well/scrape?api_number=...` | Force scrape a single well |
| GET | `/wells/random` | Get a random well (scrapes if missing) |
| POST | `/polygon-search` | Polygon search with JSON body |
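The POST variant accepts the polygon in the request body. A sketch, with the caveat that the JSON shape shown is an assumption (see `/docs` for the actual request schema):

```python
import requests

# Assumed body shape: a list of [lat, lon] pairs under a "polygon" key.
body = {"polygon": [[32.81, -104.19], [32.66, -104.32], [32.54, -104.24]]}
resp = requests.post("http://localhost:8000/polygon-search", json=body)
print(resp.json())
```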
## Scraper Design

The scraper uses Playwright with headless Chromium.

Key scraper features:

- Element-based waiting: uses `wait_for_selector("#general_information")` (the "General Well Information" section) instead of fixed delays, for reliable page-load detection (sketched below)
- Targeted extraction: extracts only the `fieldset.data_container` element rather than the full page text
- Worker-pool concurrency: uses an `asyncio.Queue` with configurable workers instead of a semaphore-based approach
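A minimal sketch of the first two features, using Playwright's async API; the well-page URL below is a placeholder, since the real one isn't shown here:

```python
from playwright.async_api import async_playwright

WELL_PAGE_URL = "https://example.invalid/well"  # placeholder for the real OCD URL

async def scrape_well(api_number: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(f"{WELL_PAGE_URL}?api={api_number}")
        # Wait for the "General Well Information" section instead of sleeping.
        await page.wait_for_selector("#general_information")
        # Extract only the data fieldset, not the full page text.
        text = await page.inner_text("fieldset.data_container")
        await browser.close()
        return text
```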
I initially tried `asyncio.gather()` with a semaphore, which caused request clustering. The final implementation uses a worker-pool pattern: N workers pull from a shared queue, each maintaining a 1.5-second delay between its own requests. This provides consistent rate limiting without overwhelming the target server.
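A sketch of that pattern, reusing the hypothetical `scrape_well()` from above:

```python
import asyncio

async def worker(queue: asyncio.Queue) -> None:
    while True:
        api_number = await queue.get()
        try:
            await scrape_well(api_number)  # hypothetical scrape coroutine from above
        except Exception as exc:
            print(f"{api_number} failed: {exc}")  # the real code retries with backoff
        finally:
            queue.task_done()
        await asyncio.sleep(1.5)  # per-worker delay between requests

async def run_pool(api_numbers: list[str], concurrency: int = 3) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for api in api_numbers:
        queue.put_nowait(api)
    workers = [asyncio.create_task(worker(queue)) for _ in range(concurrency)]
    await queue.join()  # blocks until every queued item is task_done()
    for w in workers:
        w.cancel()      # workers loop forever, so cancel them once the queue drains
```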
## Polygon Search Implementation

Uses GeoPandas with the WGS84 CRS (EPSG:4326) for geodetically correct point-in-polygon testing. A bounding-box pre-filter in SQL reduces the candidate set before the spatial join.
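A sketch of the containment test, assuming each candidate well row (already narrowed by the SQL bounding-box filter) carries latitude/longitude values; the column names are illustrative, and `within()` stands in for the spatial join:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

def wells_in_polygon(wells: list[dict], coords: list[tuple[float, float]]) -> gpd.GeoDataFrame:
    # coords are (lat, lon) pairs; shapely expects (x, y) = (lon, lat).
    poly = Polygon([(lon, lat) for lat, lon in coords])
    gdf = gpd.GeoDataFrame(
        wells,
        geometry=[Point(w["longitude"], w["latitude"]) for w in wells],  # assumed keys
        crs="EPSG:4326",  # WGS84
    )
    return gdf[gdf.geometry.within(poly)]
```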
## Data Model

Field ordering in the SQLAlchemy model matches the assignment specification. The `api` field serves as the primary key. Timestamps use `datetime.now(timezone.utc)` with lambda wrappers to avoid the deprecated `datetime.utcnow()`.
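The timestamp pattern looks roughly like this (a sketch in SQLAlchemy 2.0 style; the table name and columns other than `api` are illustrative):

```python
from datetime import datetime, timezone

from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Well(Base):
    __tablename__ = "wells"  # assumed table name

    api: Mapped[str] = mapped_column(primary_key=True)
    well_name: Mapped[str | None]  # illustrative column
    created_at: Mapped[datetime] = mapped_column(
        # The lambda defers evaluation to insert time, and timezone.utc
        # avoids the deprecated datetime.utcnow().
        default=lambda: datetime.now(timezone.utc)
    )
```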
## Error Handling

The scraper implements exponential backoff with 3 retries per API. Failed APIs are tracked and reported but don't halt the batch. The API's `/scrape/start` endpoint runs scraping in a background task with status polling via `/scrape/status`.
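The retry loop has roughly this shape (the backoff delays are illustrative, and `scrape_well()` is the hypothetical coroutine from the scraper sketch above):

```python
import asyncio

async def scrape_with_retries(api_number: str, retries: int = 3) -> str | None:
    for attempt in range(retries):
        try:
            return await scrape_well(api_number)
        except Exception:
            if attempt == retries - 1:
                return None  # give up; the caller records the failed API
            await asyncio.sleep(2 ** (attempt + 1))  # exponential backoff: 2 s, then 4 s
```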