A two-process data pipeline that extracts GitHub Archive events and processes them independently for high availability.
The system consists of two independent processes with separate schedulers to avoid single points of failure:
Cloud Scheduler → Cloud Function → BigQuery → Pub/Sub Topic
- Runs hourly at `:00` (1:00, 2:00, 3:00...)
- Queries GitHub Archive dayparted tables (`githubarchive.day.YYYYMMDD`)
- Filters for: PullRequestEvent, IssuesEvent, ReleaseEvent, PushEvent
- Publishes events to a Pub/Sub topic with a 2-hour processing buffer
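The query step above can be sketched in Python. This is illustrative, not the actual Cloud Function source: the function name `build_query`, the selected columns, and the `hours_behind` parameter are assumptions layered on the behavior the list describes (dayparted tables, the four event types, and the 2-hour buffer).

```python
from datetime import datetime, timedelta, timezone

EVENT_TYPES = ("PullRequestEvent", "IssuesEvent", "ReleaseEvent", "PushEvent")

def build_query(now: datetime, hours_behind: int = 2) -> str:
    """Build a BigQuery query for one hour of GitHub Archive data.

    The hour queried is `hours_behind` hours before `now`, which gives
    the public dataset time to land (the 2-hour processing buffer).
    """
    target = now - timedelta(hours=hours_behind)
    table = f"githubarchive.day.{target:%Y%m%d}"  # dayparted table name
    types = ", ".join(f"'{t}'" for t in EVENT_TYPES)
    return (
        f"SELECT type, repo.name AS repo, actor.login AS actor, payload "
        f"FROM `{table}` "
        f"WHERE type IN ({types}) "
        f"AND EXTRACT(HOUR FROM created_at) = {target.hour}"
    )
```

Running at 03:00 UTC with the default offset would query the 01:00 hour of that day's table.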
GitHub Actions Cron → Python Processor → Pub/Sub Subscription
- Runs hourly at `:15` (1:15, 2:15, 3:15...), a 15-minute offset from Process 1
- Pulls all available messages from the subscription
- Processes events by type with individual ACKing
- Threaded processing for performance
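The combination of threading and individual ACKing above can be sketched as a small helper. The name `process_batch` and the stand-in `handler`/`ack` callables are hypothetical; in the real processor they would wrap the Pub/Sub client, but the pattern is the same: a message is ACKed only after its handler succeeds, so failures are redelivered rather than lost.

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(messages, handler, ack, max_workers=8):
    """Process pulled messages in parallel, ACKing each one
    individually only if its handler succeeds. Un-ACKed messages
    stay in the subscription and are redelivered by Pub/Sub."""
    def _one(msg):
        try:
            handler(msg)
            ack(msg)        # individual ACK on success
            return True
        except Exception:
            return False    # leave un-ACKed for redelivery
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_one, messages))
```

With this shape, one poison message fails in isolation while the rest of the batch is still processed and ACKed.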
- Independent Processes: Each has its own scheduler and failure domain
- No Single Point of Failure: Process failures are isolated
- Robust Message Handling: Individual ACKing, proper error handling
- Scalable Processing: Threaded Python processor handles all available messages
- Cost Effective: ~$2/month for typical usage
- Deploy Process 1 (Data Extraction): set the GitHub repository secret `GCP_SA_KEY`, then run the "Deploy to GCP" GitHub Action
- Process 2 runs automatically via the GitHub Actions cron
- `PROJECT_ID=evm-attest` - GCP project
- `PUBSUB_TOPIC_ID=github-events` - Pub/Sub topic name
- `HOURS_BEHIND=2` - Query offset (default: 2 hours)
- `BATCH_SIZE=100` - Pub/Sub batch size
- `GCP_SA_KEY` - Service account JSON key for deployment and processing
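A minimal sketch of reading these settings in the processor, assuming the defaults listed above; the function name `load_config` is illustrative and not taken from the repo:

```python
import os

def load_config(env=os.environ):
    """Read pipeline settings from environment variables, falling
    back to the documented defaults when a variable is unset."""
    return {
        "project_id": env.get("PROJECT_ID", "evm-attest"),
        "topic_id": env.get("PUBSUB_TOPIC_ID", "github-events"),
        "hours_behind": int(env.get("HOURS_BEHIND", "2")),
        "batch_size": int(env.get("BATCH_SIZE", "100")),
    }
```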
- Add the `GCP_SA_KEY` secret to your GitHub repository
- Run the "Deploy to GCP" GitHub Action workflow
- Process 2 will start automatically on the next `:15` interval
```shell
export PROJECT_ID=evm-attest
export PUBSUB_TOPIC_ID=github-events
./deploy.sh
```

The Python processor (`process_messages.py`) handles events by type:
- PullRequestEvent: PR creation, updates, merges
- IssuesEvent: Issue creation, closing, comments
- ReleaseEvent: New releases, tags
- PushEvent: Code commits
Extend the `_process_*` methods to add custom processing logic.
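The per-type dispatch can be sketched as below. The exact method suffixes and return values are assumptions (only the `_process_*` naming pattern is documented); the bodies are placeholders where custom logic would go.

```python
class EventProcessor:
    """Minimal sketch of dispatch by event type, assuming each message
    decodes to a dict with a `type` field as published by Process 1."""

    def process(self, event: dict) -> str:
        # Look up a _process_<type> method; fall back for unknown types.
        handler = getattr(self, f"_process_{event['type']}", self._process_unknown)
        return handler(event)

    def _process_PullRequestEvent(self, event):  # PR creation, updates, merges
        return "pr"

    def _process_IssuesEvent(self, event):       # issue creation, closing, comments
        return "issue"

    def _process_ReleaseEvent(self, event):      # new releases, tags
        return "release"

    def _process_PushEvent(self, event):         # code commits
        return "push"

    def _process_unknown(self, event):
        return "skipped"
```

Adding support for a new event type then means adding one `_process_*` method, with no change to the dispatcher.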
```shell
gcloud functions logs read github-events-etl --region=us-central1
gcloud scheduler jobs run github-etl-hourly --location=us-central1
```

- View workflow runs in the GitHub Actions tab
- Check logs for message processing details
```shell
# Check topic and subscription status
gcloud pubsub topics describe github-events
gcloud pubsub subscriptions describe github-events-processor
```

- Process 1 fails: Messages stop flowing; Process 2 processes the remaining backlog
- Process 2 fails: Messages accumulate in Pub/Sub, will be processed when recovered
- Both fail: No data loss, messages retained in Pub/Sub (7-day retention)
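The "both fail" guarantee follows from the retention window arithmetic; a tiny illustrative helper (the function name is hypothetical) makes the bound explicit:

```python
from datetime import timedelta

RETENTION = timedelta(days=7)  # Pub/Sub message retention used by this pipeline

def remaining_outage_budget(oldest_unacked_age: timedelta) -> timedelta:
    """How much longer Process 2 can stay down before the oldest
    un-ACKed message ages out of the retention window."""
    return max(RETENTION - oldest_unacked_age, timedelta(0))
```

For example, with a 2-day-old backlog there are still 5 days of outage budget before any message can be lost.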
- Cloud Functions: ~$1.20/month (hourly execution)
- Cloud Scheduler: Free (≤3 jobs)
- Pub/Sub: ~$0.40/month (1M messages)
- BigQuery: Minimal (queries public dataset)
- GitHub Actions: Free (public repo)
Total: ~$2/month