StatsCan and IRCC data warehouse with LibreChat + FastMCP for natural language querying via Athena.
```
src/
  statscan/              # StatsCan data pipeline
    discover.py          # Discover datasets via API
    ingest.py            # Download and convert to parquet
    upload.py            # Upload to S3
    catalog.py           # Update catalog availability
    crawler.py           # Update Glue crawler
    utils.py             # S3 utilities
  mcp/                   # MCP server for Athena
    athena_mcp_server.py
docker/                  # Docker deployment (Dockerfile, docker-compose.yml)
tests/                   # Test suite (FC/IS architecture)
hooks/                   # Git hooks (pre-push)
```
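`src/mcp/athena_mcp_server.py` is the piece LibreChat talks to: an MCP server that exposes Athena querying as a tool. As a rough, illustrative sketch only (the real server may differ; the tool name `query_athena`, the database, and the result location below are assumptions), a FastMCP server with a single query tool could look like this:

```python
# Hypothetical sketch of an Athena query tool exposed over MCP.
# Assumes FastMCP and boto3; "query_athena", DATABASE, and OUTPUT_LOCATION
# are illustrative names, not the repo's actual values.
import os
import time

import boto3
from fastmcp import FastMCP

mcp = FastMCP("athena")
athena = boto3.client("athena", region_name=os.environ.get("AWS_REGION", "us-east-2"))

DATABASE = os.environ.get("ATHENA_DATABASE", "statscan")
OUTPUT_LOCATION = os.environ.get("ATHENA_OUTPUT", "s3://my-athena-results/")


@mcp.tool()
def query_athena(sql: str) -> list[dict]:
    """Run a SQL query against the Glue catalog via Athena and return rows."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # The first row of the result set holds the column headers.
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    headers = [col.get("VarCharValue", "") for col in rows[0]["Data"]]
    return [
        {h: col.get("VarCharValue") for h, col in zip(headers, row["Data"])}
        for row in rows[1:]
    ]


if __name__ == "__main__":
    mcp.run()
```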
Set up a Python environment and install the dev dependencies:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
```

Enable the pre-push hook to run tests before pushing:
```bash
ln -s ../../hooks/pre-push .git/hooks/pre-push
```

This runs the tests locally before each push, so issues are caught immediately.
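A hook of this kind typically just runs the test suite and blocks the push on failure. A minimal sketch (the actual `hooks/pre-push` in this repo may be a shell script and differ in detail):

```python
#!/usr/bin/env python3
# Illustrative pre-push hook: run pytest with the project's coverage gate
# and abort the push if it fails.
import subprocess
import sys

result = subprocess.run(
    ["pytest", "--cov=src", "--cov-report=term-missing", "--cov-fail-under=65"]
)
sys.exit(result.returncode)  # a non-zero exit code blocks the push
```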
Create a `.env` file with AWS credentials or export them directly:
```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-2
```
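boto3 reads these variables from the environment automatically. A quick, illustrative way to confirm the credentials are picked up (not part of the pipeline):

```python
# Quick credential check: boto3 picks up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
# and AWS_REGION from the environment, so a plain STS call is enough to verify them.
import boto3

sts = boto3.client("sts")
print(sts.get_caller_identity()["Account"])  # prints the AWS account ID if credentials work
```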
All pipeline scripts are Python modules:

```bash
# Discover datasets
python -m src.statscan.discover
# Ingest datasets (optional LIMIT env var; see the sketch after this block)
LIMIT=5 python -m src.statscan.ingest
# Upload to S3
python -m src.statscan.upload
# Update catalog
python -m src.statscan.catalog
# Update Glue crawler
python -m src.statscan.crawler
```
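For orientation, the ingest stage downloads each dataset's CSV and writes it back out as parquet, optionally capped by `LIMIT`. A rough sketch of that shape (the real `src/statscan/ingest.py` will differ; the catalog file name, its fields, and the output directory here are assumptions):

```python
# Illustrative ingest sketch: download CSVs and convert them to parquet.
# CATALOG_PATH, its {"id", "csv_url"} fields, and OUT_DIR are assumptions,
# not the repo's actual layout.
import json
import os
from pathlib import Path

import pandas as pd

CATALOG_PATH = Path("data/catalog.json")   # hypothetical list of {"id": ..., "csv_url": ...}
OUT_DIR = Path("data/parquet")


def main() -> None:
    datasets = json.loads(CATALOG_PATH.read_text())

    # Optional LIMIT env var caps the number of datasets, for testing.
    limit = os.environ.get("LIMIT")
    if limit:
        datasets = datasets[: int(limit)]

    OUT_DIR.mkdir(parents=True, exist_ok=True)
    for ds in datasets:
        df = pd.read_csv(ds["csv_url"])                 # download and parse the CSV
        df.to_parquet(OUT_DIR / f"{ds['id']}.parquet")  # write columnar parquet
        print(f"ingested {ds['id']}: {len(df)} rows")


if __name__ == "__main__":
    main()
```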
Run the test suite with coverage:

```bash
pytest --cov=src --cov-report=term-missing --cov-fail-under=65
```

Use the helper script to rebuild and run the pipeline:

```bash
# Test with limited datasets
./run-docker.sh 5

# Production run (all datasets)
./run-docker.sh

# Help
./run-docker.sh --help
```

The script automatically rebuilds the Docker image and runs the full pipeline.
If you need more control, use docker compose directly:
```bash
# Rebuild image
docker compose -f docker/docker-compose.yml build

# Run pipeline with limit
LIMIT=5 docker compose -f docker/docker-compose.yml up

# Run pipeline (all datasets)
docker compose -f docker/docker-compose.yml up
```

The pipeline executes in sequence:
1. `discover` - Fetch catalog from StatsCan API
2. `ingest` - Download and convert CSVs to parquet
3. `upload` - Upload to S3
4. `catalog` - Update catalog availability
5. `crawler` - Sync Glue crawler with S3
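These stages map directly onto the modules listed earlier, so a sequential runner (whatever the actual Docker entrypoint is, this is only a sketch) can simply invoke them in order:

```python
# Illustrative sequential runner for the five pipeline stages.
# The actual entrypoint used inside the Docker image may differ.
import subprocess
import sys

STAGES = [
    "src.statscan.discover",
    "src.statscan.ingest",
    "src.statscan.upload",
    "src.statscan.catalog",
    "src.statscan.crawler",
]

for module in STAGES:
    print(f"=== running {module} ===")
    # Each stage inherits the environment, so LIMIT and AWS_* variables apply.
    result = subprocess.run([sys.executable, "-m", module])
    if result.returncode != 0:
        sys.exit(f"stage {module} failed with exit code {result.returncode}")
```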
Environment variables:

- `LIMIT` - Number of datasets to process (optional, for testing)
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION` - AWS credentials