
B Corps Scraper


A Python web scraper that extracts B Corporation listings from the B Corp directory, focused on UK-based companies. It uses async/await for high-performance parallel scraping.

Features

  • Async/Parallel Processing: Uses aiohttp and asyncio for concurrent scraping with configurable worker count
  • Smart Filtering: Filter companies by industry, sector, or name before scraping details
  • Comprehensive Data: Extracts company info, B Impact scores, and subscores
  • Configurable: Settings via environment variables or .env file

Installation

Prerequisites

  • Python 3.12+
  • Poetry for dependency management

Setup

# Clone the repository
git clone https://github.com/lauracabtay/bcorps-scraper.git
cd bcorps-scraper

# Install dependencies
poetry install

# Install Playwright browser (required for JavaScript-rendered pages)
poetry run playwright install chromium

Configuration (Optional)

The scraper works out of the box with sensible defaults. To customize, create a .env file or set environment variables. See SETTINGS.md for options.
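For example, a `.env` file might look like this (the variable names here are purely illustrative; the actual options are listed in SETTINGS.md):

```ini
# Hypothetical settings — check SETTINGS.md for the real variable names
WORKERS=20
OUTPUT_FILE=bcorps_uk.csv
```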

Usage

Run the scraper:

poetry run bcorps-scraper

With options:

# Custom number of parallel workers (default: 15)
poetry run bcorps-scraper --workers 20

# Custom output file
poetry run bcorps-scraper --output my_data.csv

Filtering

Create a filters.toml file to filter companies:

[filters]
industries = ["Technology", "Software"]
sectors = ["B2B"]
names = ["Green", "Sustainable"]

Filters use OR logic: companies matching any criterion are included.
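The OR semantics can be sketched as follows; `matches_filters` and the field names are illustrative, not the scraper's actual API:

```python
def matches_filters(company: dict, filters: dict) -> bool:
    """Return True if the company matches ANY configured criterion.

    With no filters configured, every company is included.
    """
    if not any(filters.values()):
        return True  # no filters set: include everything
    return (
        company.get("industry") in filters.get("industries", [])
        or company.get("sector") in filters.get("sectors", [])
        # name filters are substring matches, case-insensitive
        or any(
            name.lower() in company.get("name", "").lower()
            for name in filters.get("names", [])
        )
    )
```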

Output

The scraper generates a CSV file with:

| Column | Description |
|---|---|
| Company Name | Official company name |
| Description | Company description |
| Industry / Sector | Business classification |
| Headquarters | Location |
| Certified Since | B Corp certification date |
| Overall Score | B Impact score (80+ required) |
| Category Scores | Governance, Workers, Community, Environment, Customers |
| Subscores | Detailed breakdown per category |
| Website / URLs | Company website and B Corp profile |
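A minimal sketch of how rows with these columns could be written with the standard library (the column subset and function name are illustrative; the project's actual `exporter.py` may differ):

```python
import csv

# Illustrative subset of the columns listed above
FIELDS = ["Company Name", "Industry", "Headquarters", "Overall Score"]


def export_csv(rows: list[dict], path: str) -> None:
    """Write company rows to CSV: one header row, then one row per company."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops any keys not in FIELDS
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```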

Project Structure

bcorps-scraper/
├── src/bcorps_scraper/
│   ├── __init__.py      # Package exports
│   ├── crawler.py       # Web crawling with Playwright + aiohttp
│   ├── parser.py        # HTML parsing with BeautifulSoup
│   ├── models.py        # Data models (CompanyInfo, CompanyDetails)
│   ├── exporter.py      # CSV export functionality
│   ├── settings.py      # Pydantic settings configuration
│   └── main.py          # CLI entry point
├── tests/
│   ├── fixtures/        # HTML test fixtures
│   ├── conftest.py      # Pytest configuration
│   ├── test_crawler.py
│   ├── test_parser.py
│   ├── test_models.py
│   └── test_exporter.py
├── pyproject.toml       # Poetry + tool configuration
└── README.md

Development

Running Tests

# Run all tests
poetry run pytest

# With coverage
poetry run pytest --cov

# Verbose output
poetry run pytest -v

Code Quality

# Lint with ruff
poetry run ruff check .

# Type check with mypy
poetry run mypy src/

# Auto-fix issues
poetry run ruff check . --fix

How It Works

  1. Stage 1: Uses Playwright to navigate the paginated search results (JavaScript-rendered via Algolia)
  2. Stage 2: Scrapes individual company pages in parallel using aiohttp
  3. Stage 3: Parses HTML with BeautifulSoup to extract structured data
  4. Stage 4: Exports filtered results to CSV
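The parallel stage (Stage 2) can be sketched as a semaphore-bounded worker pool. Here `fetch_page` is a stand-in for the real aiohttp request, and all names are illustrative:

```python
import asyncio


async def fetch_page(url: str) -> str:
    """Stand-in for the real aiohttp GET; returns fake HTML."""
    await asyncio.sleep(0)  # yield control, as a real request would
    return f"<html>{url}</html>"


async def scrape_all(urls: list[str], workers: int = 15) -> list[str]:
    """Fetch all URLs concurrently, with at most `workers` in flight at once."""
    sem = asyncio.Semaphore(workers)

    async def bounded(url: str) -> str:
        async with sem:  # blocks when `workers` requests are already running
            return await fetch_page(url)

    # gather preserves input order regardless of completion order
    return list(await asyncio.gather(*(bounded(u) for u in urls)))
```

Capping concurrency with a semaphore (rather than spawning one task per URL unbounded) is what makes the `--workers` flag meaningful and keeps request pressure on the target site predictable.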

Performance

  • Default: 15 concurrent workers
  • Adjustable via --workers flag
  • Typical runtime: ~5 minutes for 500 companies

Notes

  • Playwright runs in headless mode (no browser window)
  • Be mindful of rate limits when adjusting worker count
  • If the website structure changes, parsing logic may need updates
