A Python tool for automated data collection and processing from the SurvStat@RKI platform, Germany's interactive infectious disease surveillance system.
This project automates the download and processing of disease surveillance data from the Robert Koch Institute (RKI). Using Selenium, it downloads data for specified diseases and years, organized by week and German administrative districts (Kreise). The data is then processed and harmonized using regional identification codes.
For a walkthrough of the pipeline in this repository, see Extracting_survstat_data.ipynb in src.
- Automated Data Collection: Downloads disease data from SurvStat@RKI for multiple diseases and years
- Data Processing: Merges yearly files into standardized datasets
- Geographic Harmonization: Translates district names to official regional codes (Kreiskennziffern)
- Flexible Configuration: Easy configuration through YAML files
- DataProcessingOrchestrator: A (large and generic) class for chaining data processing steps
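The merge, harmonization, and step-chaining features listed above can be pictured with a small sketch. This is not the repository's `DataProcessingOrchestrator`: the `StepChain` class, the step functions, the row layout, and the `SK ...` district naming are all hypothetical, and the rows are made up for illustration. The real name-to-code mapping lives in the project's harmonization files.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]

# Illustrative lookup only (district name -> Kreiskennziffer).
KREIS_CODES = {
    "SK Hamburg": "02000",
    "SK München": "09162",
    "SK Köln": "05315",
}


class StepChain:
    """Hypothetical stand-in for a class that chains processing steps."""

    def __init__(self) -> None:
        self._steps: List[Step] = []

    def add(self, step: Step) -> "StepChain":
        # Return self so that calls can be chained fluently.
        self._steps.append(step)
        return self

    def run(self, rows: List[Row]) -> List[Row]:
        # Feed the output of each step into the next one.
        for step in self._steps:
            rows = step(rows)
        return rows


def add_kreis_code(rows: List[Row]) -> List[Row]:
    # Attach a regional code; unknown names get an empty string.
    return [{**row, "kreis_code": KREIS_CODES.get(row["district"], "")} for row in rows]


def drop_unmapped(rows: List[Row]) -> List[Row]:
    # Keep only rows whose district name could be translated.
    return [row for row in rows if row["kreis_code"]]


# Made-up yearly inputs standing in for two downloaded files.
rows_2024 = [{"district": "SK Hamburg", "week": "2024-W01", "cases": "3"}]
rows_2025 = [
    {"district": "SK Köln", "week": "2025-W01", "cases": "5"},
    {"district": "Unknown district", "week": "2025-W01", "cases": "1"},
]

merged = rows_2024 + rows_2025  # merge yearly data into one dataset
harmonized = StepChain().add(add_kreis_code).add(drop_unmapped).run(merged)
print(harmonized)
```

Returning `self` from `add` keeps the chain readable when many steps are involved; whether the project's own class works this way is not stated here.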
- Clone the repository:
git clone https://github.com/stends2001/survstat_data.git
cd survstat_loader
- Install dependencies:
pip install -r requirements.txt
- Configure the project by editing config.yaml if needed.
Run the main script to download and process current year data:
python src/update_survstatdata.py

from src.survstat_collecting.survstat_scraper import scrape_survstat_data
from src.survstat_collecting.casedata_processing import preprocess_survstat_data

# NOTE: directories_dict is not defined in this snippet. It is assumed to be a
# dict that maps keys such as 'dir_data_raw' to directory paths.

# Update current datafiles for the current year
scrape_survstat_data(
    disease_names={'campylobacter': 'Campylobacter'},
    years='2025',
    output_directory=directories_dict['dir_data_raw'],
    downloads_directory=directories_dict['dir_downloads']
)

# Process the downloaded data
preprocess_survstat_data(
    diseases=['Campylobacter'],
    years='2025',
    raw_data_dir=directories_dict['dir_data_raw'],
    processed_data_dir=directories_dict['dir_data_preprocessed'],
    how='update'
)

To see the tool in action with sample data:
- Generate sample data (if you have real data):
python src/create_github_sample.py
- View the demo notebook:
  - Open src/Demo_measles_visualization.ipynb to see a visualization of national measles data
  - The sample data contains real national measles data aggregated from the SurvStat system
- Sample data structure:
  - data/sample/measles_national.csv: National weekly measles cases
  - Contains: timestamp, cases columns
  - Safe to share on GitHub (national-level data only)
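As a rough illustration of working with the two sample columns, the sketch below sums weekly cases per calendar year. It assumes `timestamp` holds ISO dates (`YYYY-MM-DD`), which is not confirmed here, and it reads made-up rows from an in-memory string rather than the real file. The function name `yearly_totals` is hypothetical.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Made-up rows in the assumed layout of measles_national.csv (not real data).
SAMPLE_CSV = """timestamp,cases
2024-12-23,4
2024-12-30,2
2025-01-06,7
"""


def yearly_totals(handle):
    """Sum the 'cases' column per calendar year of 'timestamp'."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle):
        year = date.fromisoformat(row["timestamp"]).year
        totals[year] += int(row["cases"])
    return dict(totals)


totals = yearly_totals(io.StringIO(SAMPLE_CSV))
print(totals)
```

To run this against the committed sample, pass `open("data/sample/measles_national.csv", newline="")` instead of the in-memory string, adjusting the date parsing if the file uses a different timestamp format.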
survstat_loader/
├── src/
│ ├── dataprocessor/ # Data processing modules
│ ├── survstat_collecting/ # Web scraping and data collection
│ ├── utils/ # Utility functions
│ ├── update_survstatdata.py # Main execution script
│ ├── preview_epicurve.py # Show weekly case numbers of downloaded data
│ ├── create_github_sample.py # Create sample data for GitHub
│ └── Demo_measles_visualization.ipynb # Demo notebook
├── data/ # Data storage (not in git)
│ ├── raw/ # Raw downloaded files
│ ├── preprocessed/ # Processed datasets
│ ├── harmonization/ # Geographic mapping files
│ └── sample/ # Sample data for GitHub (committed)
├── config.yaml # Configuration file
└── requirements.txt # Python dependencies
- Real data: Stored in data/ directories and excluded from Git
- Sample data: Only national-level aggregated data is shared
- Regional data: Contains district-level information and is kept private
- Full datasets: Multiple diseases and years are excluded from the repository
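The distinction between private district-level data and shareable national data comes down to an aggregation step. The following is a minimal sketch of that idea, not a description of how `create_github_sample.py` works: the tuple layout, the week labels, and the function name `aggregate_national` are assumptions, and the counts are made up.

```python
from collections import Counter

# Made-up district-level rows: (Kreiskennziffer, week label, cases).
district_rows = [
    ("02000", "2025-W01", 3),
    ("09162", "2025-W01", 1),
    ("02000", "2025-W02", 2),
]


def aggregate_national(rows):
    """Collapse district rows into national weekly totals, dropping the district key."""
    totals = Counter()
    for _kreis_code, week, cases in rows:
        totals[week] += cases
    return sorted(totals.items())


national = aggregate_national(district_rows)
print(national)
```

Because the district code is discarded before anything is written out, the resulting rows carry no regional information.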
Edit config.yaml to customize:
- Data directory paths
- Download locations
- Processing options
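The usage example refers to a `directories_dict` with keys such as `dir_data_raw`, `dir_data_preprocessed`, and `dir_downloads`. One way such a dict could be derived from configuration values is sketched below. The plain `config` dict stands in for the parsed contents of config.yaml; its layout, the path values, and the helper `build_directories` are assumptions, not the project's actual schema.

```python
from pathlib import Path

# Stand-in for parsed config.yaml contents (hypothetical layout and values).
config = {
    "dir_data_raw": "data/raw",
    "dir_data_preprocessed": "data/preprocessed",
    "dir_downloads": "downloads",
}


def build_directories(config, project_root):
    """Resolve each configured relative path against the project root."""
    root = Path(project_root)
    return {key: root / value for key, value in config.items()}


directories_dict = build_directories(config, "project")
for key, path in directories_dict.items():
    print(f"{key}: {path}")
```

Keeping paths relative in the configuration and resolving them in one place makes it easier to move the data directories without touching the processing code.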
This project is licensed under the MIT License - see the LICENSE file for details.