
B Corps Scraper


A Python web scraper that extracts B Corporation listings from the B Corp directory, focused on UK-based companies. It uses async/await for high-performance parallel scraping.

Features

  • Async/Parallel Processing: Uses aiohttp and asyncio for concurrent scraping with configurable worker count
  • Smart Filtering: Filter companies by industry, sector, or name before scraping details
  • Comprehensive Data: Extracts company info, B Impact scores, and subscores
  • Configurable: Settings via environment variables or .env file

Installation

Prerequisites

  • Python 3.12+
  • Poetry for dependency management

Setup

# Clone the repository
git clone https://github.com/lauracabtay/bcorps-scraper.git
cd bcorps-scraper

# Install dependencies
poetry install

# Install Playwright browser (required for JavaScript-rendered pages)
poetry run playwright install chromium

Configuration (Optional)

The scraper works out of the box with sensible defaults. To customize, create a .env file or set environment variables. See SETTINGS.md for options.
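For example, a `.env` file might look like this (the variable names here are purely illustrative; the actual options are listed in SETTINGS.md):

```ini
# Hypothetical settings — check SETTINGS.md for the real variable names
WORKERS=20
OUTPUT_FILE=bcorps_uk.csv
```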

Usage

Run the scraper:

poetry run bcorps-scraper

With options:

# Custom number of parallel workers (default: 15)
poetry run bcorps-scraper --workers 20

# Custom output file
poetry run bcorps-scraper --output my_data.csv

Filtering

Create a filters.toml file to filter companies:

[filters]
industries = ["Technology", "Software"]
sectors = ["B2B"]
names = ["Green", "Sustainable"]

Filters use OR logic: companies matching any criterion are included.
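The OR semantics can be sketched as follows; `matches_filters` and the field names are illustrative, not the scraper's actual API:

```python
def matches_filters(company: dict, filters: dict) -> bool:
    """Return True if the company matches ANY configured criterion.

    With no filters configured, every company is included.
    """
    if not any(filters.values()):
        return True  # no filters set: include everything
    return (
        company.get("industry") in filters.get("industries", [])
        or company.get("sector") in filters.get("sectors", [])
        # name filters are substring matches, case-insensitive
        or any(
            name.lower() in company.get("name", "").lower()
            for name in filters.get("names", [])
        )
    )
```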

Output

The scraper generates a CSV file with:

| Column | Description |
|---|---|
| Company Name | Official company name |
| Description | Company description |
| Industry / Sector | Business classification |
| Headquarters | Location |
| Certified Since | B Corp certification date |
| Overall Score | B Impact score (80+ required) |
| Category Scores | Governance, Workers, Community, Environment, Customers |
| Subscores | Detailed breakdown per category |
| Website / URLs | Company website and B Corp profile |
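A minimal sketch of how rows with these columns could be written with the standard library (the column subset and function name are illustrative; the project's actual `exporter.py` may differ):

```python
import csv

# Illustrative subset of the columns listed above
FIELDS = ["Company Name", "Industry", "Headquarters", "Overall Score"]


def export_csv(rows: list[dict], path: str) -> None:
    """Write company rows to CSV: one header row, then one row per company."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops any keys not in FIELDS
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```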

Project Structure

bcorps-scraper/
├── src/bcorps_scraper/
│   ├── __init__.py      # Package exports
│   ├── crawler.py       # Web crawling with Playwright + aiohttp
│   ├── parser.py        # HTML parsing with BeautifulSoup
│   ├── models.py        # Data models (CompanyInfo, CompanyDetails)
│   ├── exporter.py      # CSV export functionality
│   ├── settings.py      # Pydantic settings configuration
│   └── main.py          # CLI entry point
├── tests/
│   ├── fixtures/        # HTML test fixtures
│   ├── conftest.py      # Pytest configuration
│   ├── test_crawler.py
│   ├── test_parser.py
│   ├── test_models.py
│   └── test_exporter.py
├── pyproject.toml       # Poetry + tool configuration
└── README.md

Development

Running Tests

# Run all tests
poetry run pytest

# With coverage
poetry run pytest --cov

# Verbose output
poetry run pytest -v

Code Quality

# Lint with ruff
poetry run ruff check .

# Type check with mypy
poetry run mypy src/

# Auto-fix issues
poetry run ruff check . --fix

How It Works

  1. Stage 1: Uses Playwright to navigate the paginated search results (JavaScript-rendered via Algolia)
  2. Stage 2: Scrapes individual company pages in parallel using aiohttp
  3. Stage 3: Parses HTML with BeautifulSoup to extract structured data
  4. Stage 4: Exports filtered results to CSV
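The parallel stage (Stage 2) can be sketched as a semaphore-bounded worker pool. Here `fetch_page` is a stand-in for the real aiohttp request, and all names are illustrative:

```python
import asyncio


async def fetch_page(url: str) -> str:
    """Stand-in for the real aiohttp GET; returns fake HTML."""
    await asyncio.sleep(0)  # yield control, as a real request would
    return f"<html>{url}</html>"


async def scrape_all(urls: list[str], workers: int = 15) -> list[str]:
    """Fetch all URLs concurrently, with at most `workers` in flight at once."""
    sem = asyncio.Semaphore(workers)

    async def bounded(url: str) -> str:
        async with sem:  # blocks when `workers` requests are already running
            return await fetch_page(url)

    # gather preserves input order regardless of completion order
    return list(await asyncio.gather(*(bounded(u) for u in urls)))
```

Capping concurrency with a semaphore (rather than spawning one task per URL unbounded) is what makes the `--workers` flag meaningful and keeps request pressure on the target site predictable.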

Performance

  • Default: 15 concurrent workers
  • Adjustable via --workers flag
  • Typical runtime: ~5 minutes for 500 companies

Notes

  • Playwright runs in headless mode (no browser window)
  • Be mindful of rate limits when adjusting worker count
  • If the website structure changes, parsing logic may need updates
