Vapor Source is a full-stack system for scraping, enriching, and exploring software job postings. The current implementation is a cost-minimized Go serverless pipeline; the original Swift + Vapor stack remains in `backend/swift/` as legacy. The pipeline feeds a React insights frontend in `frontend/vapor-source/`, which renders charts and tables directly against the published snapshots. My main motivation for this project is that job boards rarely expose reliable filters for years of experience or true remote eligibility.
- Web scraping: Go Lambda crawls worksourcewa.com.
- AI-powered enrichment: OpenAI structured outputs normalize modality, domain, degree, skills, and years of experience.
- Canonical storage: Jobs are deduped and stored in DynamoDB (`JobId` PK, `PostedDate` sort key, `PostedDate-Index` GSI).
- Job ID cache: In-memory dedupe set is seeded from the S3 `job-ids.txt` and persisted back, so runs remain idempotent across invocations.
- Snapshot export: Snapshot Lambda writes per-day JSONL files to S3 and refreshes `snapshot-manifest.json` for consumers (fronted by CloudFront).
- Legacy (Swift/Vapor): Kept for reference; no longer the canonical path.
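The job ID cache above can be sketched as a small round-trip over the `job-ids.txt` format. This is an illustrative sketch, not the repo's actual code: the function names are hypothetical, and it assumes the file is simply newline-delimited job IDs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// loadJobIDs parses a newline-delimited job-ids.txt payload into a set.
// Hypothetical helper; the real cache lives in backend/go's shared libs.
func loadJobIDs(raw string) map[string]struct{} {
	seen := make(map[string]struct{})
	for _, line := range strings.Split(raw, "\n") {
		if id := strings.TrimSpace(line); id != "" {
			seen[id] = struct{}{}
		}
	}
	return seen
}

// persistJobIDs serializes the set back to the file format, sorted so the
// persisted object is stable across runs.
func persistJobIDs(seen map[string]struct{}) string {
	ids := make([]string, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return strings.Join(ids, "\n") + "\n"
}

func main() {
	seen := loadJobIDs("abc123\ndef456\n")
	_, dup := seen["abc123"] // already crawled: skip enrichment and the DynamoDB write
	seen["ghi789"] = struct{}{}
	fmt.Println(dup, len(seen))
	fmt.Print(persistJobIDs(seen))
}
```

Seeding from S3 at startup and persisting at shutdown is what makes repeated invocations idempotent: a job ID seen in any earlier run never re-enters the pipeline.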
```
EventBridge (cron)
        ↓
Scraper Lambda (Go)
        ↕
In-memory dedupe set ⇄ S3 job ID cache (job-ids.txt)
        ↓
DynamoDB (JobId PK, PostedDate SK; GSI PostedDate-Index)
        ↓
Snapshot Lambda (Go)
        ↓
S3 (per-day JSONL + snapshot-manifest.json) → CloudFront (proxy)
        ↓
React Insights UI (fetches /snapshots/*)
```
- Scraper Lambda handles crawling, OpenAI enrichment, S3-synced job ID cache seeding/persisting, dedupe, and writes to DynamoDB.
- When new rows land, the scraper invokes the snapshot Lambda to emit per-day JSONL files.
- CloudFront serves snapshot artifacts without exposing the S3 bucket, and the React UI reads those artifacts for dashboards.
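The scraper-to-snapshot handoff can be sketched with a small interface so the trigger logic is testable without AWS. This is a sketch of the dataflow described above, not the repo's actual code; `SnapshotInvoker` and `maybeExport` are hypothetical names.

```go
package main

import "fmt"

// SnapshotInvoker abstracts the Lambda Invoke call so the trigger
// decision can be tested with a fake. Hypothetical interface.
type SnapshotInvoker interface {
	InvokeSnapshot() error
}

// maybeExport mirrors the behavior described above: fan out to the
// snapshot Lambda only when the scrape wrote new rows to DynamoDB.
func maybeExport(newRows int, inv SnapshotInvoker) (bool, error) {
	if newRows == 0 {
		return false, nil
	}
	return true, inv.InvokeSnapshot()
}

// countingInvoker is an in-memory fake used in place of the real client.
type countingInvoker struct{ calls int }

func (c *countingInvoker) InvokeSnapshot() error { c.calls++; return nil }

func main() {
	inv := &countingInvoker{}
	triggered, _ := maybeExport(0, inv) // nothing new: no export
	fmt.Println(triggered, inv.calls)
	triggered, _ = maybeExport(12, inv) // new rows landed: export fires
	fmt.Println(triggered, inv.calls)
}
```

Gating the invoke on new rows keeps snapshot refreshes tied to actual data changes rather than every cron tick.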
- Languages: Go (current serverless), Swift (legacy)
- AI & parsing: OpenAI structured outputs
- Data plane: DynamoDB, S3, CloudFront
- Compute: AWS Lambda (+ EventBridge triggers)
- IaC: Terraform (`infra/terraform/go-serverless`)
- Frontend: React 19 + TypeScript + Vite, Tailwind, TanStack Table, Plotly.js (`frontend/vapor-source/`)
- Hosting: Cloudflare Pages
- Legacy: Swift Vapor API + PostgreSQL (retained in `backend/swift/`)
- `backend/go/`: Go Lambdas (`cmd/scraper`, `cmd/snapshot`, `cmd/local`) and shared libs.
- `backend/swift/`: Legacy Swift Lambda + Vapor server.
- `frontend/vapor-source/`: React UI that reads the published snapshots and renders charts/tables.
- `infra/terraform/go-serverless/`: Terraform for the Go stack (Lambdas, DynamoDB, S3, CloudFront, EventBridge).
The `frontend/vapor-source/` app is a static React + TypeScript SPA. It fetches `snapshot-manifest.json` plus per-day JSONL exports from `/snapshots/`, filters to software-engineering rows, and renders:
- Plotly visualizations (YOE box plots, modality/domain/degree frequency, skills beeswarm, etc.) that aggregate across the fetched days.
- Latest jobs table powered by TanStack Table with client-side sorting/filters.
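The chart-side aggregation is simple frequency counting over the filtered rows. The real logic runs in the TypeScript UI; this Go version is only a sketch of the shape of that computation, with an illustrative predicate standing in for the UI's software-engineering filter.

```go
package main

import "fmt"

// row mirrors a couple of snapshot fields relevant to the charts.
type row struct {
	Domain   string
	Modality string
}

// modalityCounts filters rows with the supplied predicate and tallies
// modality frequency — the kind of data behind the frequency charts.
func modalityCounts(rows []row, keep func(row) bool) map[string]int {
	counts := make(map[string]int)
	for _, r := range rows {
		if keep(r) {
			counts[r.Modality]++
		}
	}
	return counts
}

func main() {
	rows := []row{
		{Domain: "Backend", Modality: "Remote"},
		{Domain: "Backend", Modality: "Onsite"},
		{Domain: "Nursing", Modality: "Onsite"},
		{Domain: "Frontend", Modality: "Remote"},
	}
	// Illustrative predicate; the UI's actual classification may differ.
	isSoftware := func(r row) bool { return r.Domain != "Nursing" }
	fmt.Println(modalityCounts(rows, isSoftware))
}
```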
- `cd frontend/vapor-source && npm ci`.
- Point the dev server at real snapshot data either by copying assets into `public/snapshots/` (e.g., `aws s3 sync s3://<bucket>/<prefix> public/snapshots`) or by setting `VITE_SNAPSHOT_PROXY_TARGET=https://<cloudfront-domain>/snapshots` so Vite proxies `/snapshots/*` to CloudFront.
- `npm run dev` launches the UI at http://localhost:5173; it will fetch `/snapshots/snapshot-manifest.json` via the option you chose above.
- `npm run build && npm run preview` validates the production bundle locally.
Important deployment details:

- `functions/snapshots/[[path]].ts` is the Cloudflare Pages Function that proxies `/snapshots/*` to the CloudFront distribution; bind `CLOUDFRONT_BASE_URL` to that origin.
- Because the UI is static, no backend is required beyond serving the snapshot artifacts; updating the S3 bucket automatically gives the UI new data.
- Local run: `cd backend/go && go run ./cmd/local` (requires `.env` with OpenAI key, AWS creds, query, etc.).
- Tests: `cd backend/go && go test ./...`.
- Package Lambdas: `cd backend/go && make zip-scraper && make zip-snapshot` → `bin/scraper/lambda.zip`, `bin/snapshot/lambda.zip`.
- Deploy (Terraform): `cd infra/terraform/go-serverless && terraform init && terraform apply -var-file=terraform.tfvars`.
- `OPENAI_API_KEY` (or `API_DRY_RUN=true`), `QUERY`, `MAX_PAGES`, `MAX_CONCURRENCY`.
- Cache flags: `USE_JOB_ID_FILE`, `USE_S3_JOB_ID_FILE`, `JOB_IDS_BUCKET`, `JOB_IDS_S3_KEY`.
- Data plane: `DYNAMODB_TABLE_NAME`, `SNAPSHOT_BUCKET`, `SNAPSHOT_S3_KEY`.
- Snapshot range overrides: `SNAPSHOT_START_DATE`, `SNAPSHOT_END_DATE`.
- `SNAPSHOT_LAMBDA_FUNCTION_NAME` so the scraper can trigger exports after new writes.
Per-day JSONL under `s3://<bucket>/<prefix>/<YYYY-MM-DD>.jsonl` plus `snapshot-manifest.json`. Example record:
```json
{
  "jobId": "abc123",
  "title": "Senior Backend Engineer",
  "company": "Example Tech",
  "location": "Seattle, WA",
  "modality": "Remote",
  "postedDate": "2025-12-03",
  "postedTime": "2025-12-03T14:35:00Z",
  "salary": "$150k - $180k",
  "url": "https://worksourcewa.com/...",
  "minYearsExperience": 5,
  "minDegree": "Bachelor's",
  "domain": "Backend",
  "languages": ["Go", "Python", "Java"],
  "technologies": ["S3", "DynamoDB", "K8s", "..."]
}
```

- No authentication or personalization; the UI is a public read-only view of whatever the latest snapshots contain.
