A Python web scraper for extracting B Corporation listings from the B Corporation website, focused on UK-based companies. Uses async/await for high-performance parallel scraping.
- Async/Parallel Processing: Uses `aiohttp` and `asyncio` for concurrent scraping with configurable worker count
- Smart Filtering: Filter companies by industry, sector, or name before scraping details
- Comprehensive Data: Extracts company info, B Impact scores, and subscores
- Configurable: Settings via environment variables or a `.env` file
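The concurrency model can be sketched with the standard library alone. In the snippet below, `fetch` is a placeholder for the real `aiohttp` request, not the scraper's actual code:

```python
import asyncio

async def fetch(url: str) -> str:
    # Placeholder for the real aiohttp GET (`async with session.get(url)`).
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

async def scrape(urls: list[str]) -> list[str]:
    # Fan every request out concurrently and collect results in input order.
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(scrape(["https://example.com/1", "https://example.com/2"]))
print(len(pages))  # 2
```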
- Python 3.12+
- Poetry for dependency management
```bash
# Clone the repository
git clone https://github.com/lauracabtay/bcorps-scraper.git
cd bcorps-scraper

# Install dependencies
poetry install

# Install Playwright browser (required for JavaScript-rendered pages)
poetry run playwright install chromium
```

The scraper works out of the box with sensible defaults. To customize, create a `.env` file or set environment variables. See SETTINGS.md for options.
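For illustration only, a `.env` might look like the fragment below. The variable names here are guesses, not the package's documented settings — SETTINGS.md is the authoritative list:

```ini
# Hypothetical variable names; check SETTINGS.md for the real ones
WORKERS=20
OUTPUT_FILE=bcorps.csv
```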
Run the scraper:
```bash
poetry run bcorps-scraper
```

With options:

```bash
# Custom number of parallel workers (default: 15)
poetry run bcorps-scraper --workers 20

# Custom output file
poetry run bcorps-scraper --output my_data.csv
```

Create a `filters.toml` file to filter companies:
```toml
[filters]
industries = ["Technology", "Software"]
sectors = ["B2B"]
names = ["Green", "Sustainable"]
```

Filters use OR logic: companies matching any criterion are included.
The scraper generates a CSV file with:
| Column | Description |
|---|---|
| Company Name | Official company name |
| Description | Company description |
| Industry / Sector | Business classification |
| Headquarters | Location |
| Certified Since | B Corp certification date |
| Overall Score | B Impact score (80+ required) |
| Category Scores | Governance, Workers, Community, Environment, Customers |
| Subscores | Detailed breakdown per category |
| Website / URLs | Company website and B Corp profile |
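A minimal sketch of this kind of export using only the standard library; the column names and row values below are illustrative, not the exporter's actual output:

```python
import csv
import io

# One illustrative row; real rows come from the parsed company pages.
rows = [
    {"Company Name": "Example Co", "Overall Score": 91.2, "Certified Since": "2021-06"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```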
```
bcorps-scraper/
├── src/bcorps_scraper/
│   ├── __init__.py       # Package exports
│   ├── crawler.py        # Web crawling with Playwright + aiohttp
│   ├── parser.py         # HTML parsing with BeautifulSoup
│   ├── models.py         # Data models (CompanyInfo, CompanyDetails)
│   ├── exporter.py       # CSV export functionality
│   ├── settings.py       # Pydantic settings configuration
│   └── main.py           # CLI entry point
├── tests/
│   ├── fixtures/         # HTML test fixtures
│   ├── conftest.py       # Pytest configuration
│   ├── test_crawler.py
│   ├── test_parser.py
│   ├── test_models.py
│   └── test_exporter.py
├── pyproject.toml        # Poetry + tool configuration
└── README.md
```
```bash
# Run all tests
poetry run pytest

# With coverage
poetry run pytest --cov

# Verbose output
poetry run pytest -v
```

```bash
# Lint with ruff
poetry run ruff check .

# Type check with mypy
poetry run mypy src/

# Auto-fix issues
poetry run ruff check . --fix
```

- Stage 1: Uses Playwright to navigate the paginated search results (JavaScript-rendered via Algolia)
- Stage 2: Scrapes individual company pages in parallel using aiohttp
- Stage 3: Parses HTML with BeautifulSoup to extract structured data
- Stage 4: Exports filtered results to CSV
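The four stages can be sketched end to end with stdlib stand-ins for the real libraries (Playwright, aiohttp, BeautifulSoup); none of the bodies below reflect the scraper's actual code:

```python
import asyncio

async def collect_profile_urls() -> list[str]:
    # Stage 1 stand-in: the real crawler drives Playwright through the
    # Algolia-backed search results to gather profile URLs.
    return ["https://example.com/alpha", "https://example.com/beta"]

async def fetch_page(url: str) -> str:
    # Stage 2 stand-in: the real crawler fetches pages concurrently with aiohttp.
    await asyncio.sleep(0)
    return f"<h1>{url.rsplit('/', 1)[-1]}</h1>"

def parse_page(html: str) -> dict:
    # Stage 3 stand-in: the real parser uses BeautifulSoup, not string slicing.
    return {"name": html.removeprefix("<h1>").removesuffix("</h1>")}

def export(records: list[dict]) -> list[str]:
    # Stage 4 stand-in: the real exporter writes a CSV file.
    return [r["name"] for r in records]

async def pipeline() -> list[str]:
    urls = await collect_profile_urls()
    pages = await asyncio.gather(*(fetch_page(u) for u in urls))
    return export([parse_page(p) for p in pages])

names = asyncio.run(pipeline())
print(names)  # ['alpha', 'beta']
```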
- Default: 15 concurrent workers
- Adjustable via the `--workers` flag
- Typical runtime: ~5 minutes for 500 companies
- Playwright runs in headless mode (no browser window)
- Be mindful of rate limits when adjusting worker count
- If the website structure changes, parsing logic may need updates
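One common way to stay within rate limits when raising the worker count is to bound concurrency with a semaphore and pause between requests. This is an illustrative pattern, not the scraper's actual implementation:

```python
import asyncio
import time

async def polite_fetch(url: str, sem: asyncio.Semaphore, delay: float = 0.01) -> str:
    # Only `sem`'s worth of requests run at once; the sleep stands in for
    # the HTTP round trip plus a politeness delay.
    async with sem:
        await asyncio.sleep(delay)
        return url

async def main() -> tuple[list[str], float]:
    sem = asyncio.Semaphore(2)  # like running with --workers 2
    start = time.monotonic()
    results = await asyncio.gather(*(polite_fetch(f"page-{i}", sem) for i in range(4)))
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
print(len(results))  # 4
```

With 4 pages, 2 workers, and a 0.01 s delay, the requests run in two bounded batches instead of all at once.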