© Jesus Villota Miranda 2025. All rights reserved.
A comprehensive tool for scraping central bank speeches published by the Bank for International Settlements (BIS). This project provides an automated way to collect, download, and extract text from speech PDFs available on the BIS website.
The BIS Scraper is designed to automate the collection of central bank speeches for research, analysis, and archival purposes. It navigates through the BIS website's speech repository, extracts links to speech PDFs, downloads them, and optionally converts them to plain text for easier analysis.
- Automated scraping of BIS central bank speeches
- PDF downloading with configurable parameters
- Text extraction from downloaded PDFs
- Fully configurable via YAML configuration
- Command-line interface for easy execution
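
The text-extraction feature listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual `pdf_extractor.py`: the helper names `text_path_for` and `extract_pdf_text` are hypothetical, and the sketch assumes a PyPDF2 release that provides `PdfReader` (2.x or later).

```python
from pathlib import Path


def text_path_for(pdf_path: Path, text_dir: Path) -> Path:
    """Map e.g. downloads/r250715b.pdf to <text_dir>/r250715b.txt (hypothetical helper)."""
    return text_dir / (pdf_path.stem + ".txt")


def extract_pdf_text(pdf_path: Path, text_dir: Path) -> Path:
    """Extract the text of one PDF and write it to a .txt file (hypothetical helper)."""
    # Imported lazily so the path helper above works even without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can yield nothing for image-only pages, so fall back to "".
    pages = [page.extract_text() or "" for page in reader.pages]

    out_path = text_path_for(pdf_path, text_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n".join(pages), encoding="utf-8")
    return out_path
```

Keeping the path mapping separate from the extraction step makes it easy to skip PDFs whose text file already exists, if that behavior is wanted.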
BIS_Scraper/
├── bis_scraper/ # Core package containing scraping logic
│ ├── __init__.py # Package exports
│ ├── config_loader.py # Configuration handling
│ ├── scraper.py # Main scraping implementation
│ └── pdf_extractor.py # PDF text extraction utilities
├── downloads/ # Default directory for downloaded PDFs
├── texts/ # Directory for extracted text files
├── config.yaml # Configuration file
├── main.py # CLI entry point
├── pyproject.toml # Project metadata and dependencies
├── poetry.lock # Dependency lock file
├── requirements.txt # Traditional requirements file
└── README.md # This documentation
# Install dependencies using Poetry
poetry install
# Activate the virtual environment
poetry shell
# Alternatively, install dependencies with pip
pip install -r requirements.txt
Edit the config.yaml file to customize the scraper's behavior:
BASE_URL: "https://www.bis.org" # Base URL for the BIS website
DOWNLOAD_DIR: "downloads" # Directory for saving PDFs
TEXT_DIR: "texts" # Directory for saving extracted text files
INITIAL_DATE: "01/01/2000" # Start date for speeches (MM/DD/YYYY)
FINAL_DATE: "11/08/2025" # End date for speeches (MM/DD/YYYY)
PAGE_LENGTH: 10 # Number of results per page
MAX_PAGE: 2 # Maximum number of pages to scrape
Run the scraper with the default configuration:
poetry run python main.py
Extract text from downloaded PDFs:
poetry run python main.py --extract-text
Extract text from specific PDF files:
poetry run python main.py --extract-text --pdfs r250715b.pdf r250717h.pdf
Test the URL generation functionality:
poetry run python main.py --test-link
Or specify a custom configuration file:
poetry run python main.py --config custom_config.yaml
- Python 3.11 or higher
- Google Chrome installed
- ChromeDriver available on your system PATH (required for Selenium)
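
The command-line flags used in the usage commands above (`--config`, `--extract-text`, `--pdfs`, `--test-link`) could be wired up with `argparse` along these lines. This is only a sketch of one possible parser, not the actual contents of `main.py`: the function name `build_parser`, the help strings, and the `config.yaml` default are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags shown in the usage commands (hypothetical)."""
    parser = argparse.ArgumentParser(description="Scrape BIS central bank speeches.")
    parser.add_argument(
        "--config",
        default="config.yaml",  # assumed default; the README only shows a custom path
        help="Path to the YAML configuration file.",
    )
    parser.add_argument(
        "--extract-text",
        action="store_true",
        help="Extract text from downloaded PDFs instead of scraping.",
    )
    parser.add_argument(
        "--pdfs",
        nargs="+",
        default=None,
        metavar="PDF",
        help="Specific PDF files to extract text from.",
    )
    parser.add_argument(
        "--test-link",
        action="store_true",
        help="Test the URL generation functionality.",
    )
    return parser


# Example: the same arguments as the "specific PDF files" command above.
args = build_parser().parse_args(
    ["--extract-text", "--pdfs", "r250715b.pdf", "r250717h.pdf"]
)
```

With `nargs="+"`, `args.pdfs` is a list of file names, or `None` when the flag is omitted, which lets the caller fall back to every PDF in the download directory.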
- selenium: For browser automation
- requests: For downloading PDFs
- PyYAML: For configuration parsing
- PyPDF2: For PDF text extraction
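
Once PyYAML has parsed `config.yaml` into a dictionary (for example with `yaml.safe_load`), the values can be validated before any scraping starts. The sketch below is an assumption about how such validation might look, not the project's `config_loader.py`: `ScraperConfig` and `parse_config` are hypothetical names, and the dictionary literal stands in for the parsed YAML so the example runs with the standard library alone.

```python
from dataclasses import dataclass
from datetime import date, datetime

DATE_FORMAT = "%m/%d/%Y"  # MM/DD/YYYY, as documented in config.yaml


@dataclass(frozen=True)
class ScraperConfig:
    base_url: str
    download_dir: str
    text_dir: str
    initial_date: date
    final_date: date
    page_length: int
    max_page: int


def parse_config(raw: dict) -> ScraperConfig:
    """Validate a parsed config dictionary and convert it to typed values."""
    initial = datetime.strptime(raw["INITIAL_DATE"], DATE_FORMAT).date()
    final = datetime.strptime(raw["FINAL_DATE"], DATE_FORMAT).date()
    if initial > final:
        raise ValueError("INITIAL_DATE must not be after FINAL_DATE")

    page_length = int(raw["PAGE_LENGTH"])
    max_page = int(raw["MAX_PAGE"])
    if page_length < 1 or max_page < 1:
        raise ValueError("PAGE_LENGTH and MAX_PAGE must be positive")

    return ScraperConfig(
        base_url=raw["BASE_URL"].rstrip("/"),
        download_dir=raw["DOWNLOAD_DIR"],
        text_dir=raw["TEXT_DIR"],
        initial_date=initial,
        final_date=final,
        page_length=page_length,
        max_page=max_page,
    )


# Stand-in for the dictionary that yaml.safe_load would return for config.yaml.
raw_config = {
    "BASE_URL": "https://www.bis.org",
    "DOWNLOAD_DIR": "downloads",
    "TEXT_DIR": "texts",
    "INITIAL_DATE": "01/01/2000",
    "FINAL_DATE": "11/08/2025",
    "PAGE_LENGTH": 10,
    "MAX_PAGE": 2,
}
config = parse_config(raw_config)
```

Parsing the dates explicitly removes the ambiguity in a value such as `11/08/2025`, which under the documented MM/DD/YYYY format means 8 November 2025.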
For questions or contributions, feel free to open an issue or pull request.