Vapor Source is a full-stack system for scraping, enriching, and exploring software job postings. The current implementation is a cost-minimized Go serverless pipeline; the original Swift + Vapor stack remains in `backend/swift/` as legacy. The pipeline feeds a React insights frontend in `frontend/vapor-source/`, which renders charts and tables directly against the published snapshots. My main motivation for this project is that job boards rarely expose reliable filters for years of experience or true remote eligibility.
- Web scraping: Go Lambda crawls worksourcewa.com.
- AI-powered enrichment: OpenAI structured outputs normalize modality, domain, degree, skills, and years of experience.
- Canonical storage: Jobs are deduped and stored in DynamoDB (`JobId` PK, `PostedDate` sort key, `PostedDate-Index` GSI).
- Job ID cache: In-memory dedupe set is seeded from the S3 `job-ids.txt` and persisted back, so runs remain idempotent across invocations.
- Snapshot export: Snapshot Lambda writes per-day JSONL files to S3 and refreshes `snapshot-manifest.json` for consumers (fronted by CloudFront).
- Legacy (Swift/Vapor): Kept for reference; no longer the canonical path.
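The job ID cache above can be sketched as a small round-trip over the `job-ids.txt` format. This is an illustrative sketch, not the repo's actual code: the function names are hypothetical, and it assumes the file is simply newline-delimited job IDs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// loadJobIDs parses a newline-delimited job-ids.txt payload into a set.
// Hypothetical helper; the real cache lives in backend/go's shared libs.
func loadJobIDs(raw string) map[string]struct{} {
	seen := make(map[string]struct{})
	for _, line := range strings.Split(raw, "\n") {
		if id := strings.TrimSpace(line); id != "" {
			seen[id] = struct{}{}
		}
	}
	return seen
}

// persistJobIDs serializes the set back to the file format, sorted so the
// persisted object is stable across runs.
func persistJobIDs(seen map[string]struct{}) string {
	ids := make([]string, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return strings.Join(ids, "\n") + "\n"
}

func main() {
	seen := loadJobIDs("abc123\ndef456\n")
	_, dup := seen["abc123"] // already crawled: skip enrichment and the DynamoDB write
	seen["ghi789"] = struct{}{}
	fmt.Println(dup, len(seen))
	fmt.Print(persistJobIDs(seen))
}
```

Seeding from S3 at startup and persisting at shutdown is what makes repeated invocations idempotent: a job ID seen in any earlier run never re-enters the pipeline.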
```
EventBridge (cron)
        ↓
Scraper Lambda (Go)
        ↕
In-memory dedupe set ⇄ S3 job ID cache (job-ids.txt)
        ↓
DynamoDB (JobId PK, PostedDate SK; GSI PostedDate-Index)
        ↓
Snapshot Lambda (Go)
        ↓
S3 (per-day JSONL + snapshot-manifest.json) → CloudFront (proxy)
        ↓
React Insights UI (fetches /snapshots/*)
```
- Scraper Lambda handles crawling, OpenAI enrichment, S3-synced job ID cache seeding/persisting, dedupe, and writes to DynamoDB.
- When new rows land, the scraper invokes the snapshot Lambda to emit per-day JSONL files.
- CloudFront serves snapshot artifacts without exposing the S3 bucket, and the React UI reads those artifacts for dashboards.
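The scraper-to-snapshot handoff can be sketched with a small interface so the trigger logic is testable without AWS. This is a sketch of the dataflow described above, not the repo's actual code; `SnapshotInvoker` and `maybeExport` are hypothetical names.

```go
package main

import "fmt"

// SnapshotInvoker abstracts the Lambda Invoke call so the trigger
// decision can be tested with a fake. Hypothetical interface.
type SnapshotInvoker interface {
	InvokeSnapshot() error
}

// maybeExport mirrors the behavior described above: fan out to the
// snapshot Lambda only when the scrape wrote new rows to DynamoDB.
func maybeExport(newRows int, inv SnapshotInvoker) (bool, error) {
	if newRows == 0 {
		return false, nil
	}
	return true, inv.InvokeSnapshot()
}

// countingInvoker is an in-memory fake used in place of the real client.
type countingInvoker struct{ calls int }

func (c *countingInvoker) InvokeSnapshot() error { c.calls++; return nil }

func main() {
	inv := &countingInvoker{}
	triggered, _ := maybeExport(0, inv) // nothing new: no export
	fmt.Println(triggered, inv.calls)
	triggered, _ = maybeExport(12, inv) // new rows landed: export fires
	fmt.Println(triggered, inv.calls)
}
```

Gating the invoke on new rows keeps snapshot refreshes tied to actual data changes rather than every cron tick.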
- Languages: Go (current serverless), Swift (legacy)
- AI & parsing: OpenAI structured outputs
- Data plane: DynamoDB, S3, CloudFront
- Compute: AWS Lambda (+ EventBridge triggers)
- IaC: Terraform (`infra/terraform/go-serverless`)
- Frontend: React 19 + TypeScript + Vite, Tailwind, TanStack Table, Plotly.js (`frontend/vapor-source/`)
- Hosting: Cloudflare Pages
- Legacy: Swift Vapor API + PostgreSQL (retained in `backend/swift/`)
- `backend/go/`: Go Lambdas (`cmd/scraper`, `cmd/snapshot`, `cmd/local`) and shared libs.
- `backend/swift/`: Legacy Swift Lambda + Vapor server.
- `frontend/vapor-source/`: React UI that reads the published snapshots and renders charts/tables.
- `infra/terraform/go-serverless/`: Terraform for the Go stack (Lambdas, DynamoDB, S3, CloudFront, EventBridge).
The `frontend/vapor-source/` app is a static React + TypeScript SPA. It fetches `snapshot-manifest.json` plus per-day JSONL exports from `/snapshots/`, filters to software-engineering rows, and renders:
- Plotly visualizations (YOE box plots, modality/domain/degree frequency, skills beeswarm, etc.) that aggregate across the fetched days.
- Latest jobs table powered by TanStack Table with client-side sorting/filters.
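The chart-side aggregation is simple frequency counting over the filtered rows. The real logic runs in the TypeScript UI; this Go version is only a sketch of the shape of that computation, with an illustrative predicate standing in for the UI's software-engineering filter.

```go
package main

import "fmt"

// row mirrors a couple of snapshot fields relevant to the charts.
type row struct {
	Domain   string
	Modality string
}

// modalityCounts filters rows with the supplied predicate and tallies
// modality frequency — the kind of data behind the frequency charts.
func modalityCounts(rows []row, keep func(row) bool) map[string]int {
	counts := make(map[string]int)
	for _, r := range rows {
		if keep(r) {
			counts[r.Modality]++
		}
	}
	return counts
}

func main() {
	rows := []row{
		{Domain: "Backend", Modality: "Remote"},
		{Domain: "Backend", Modality: "Onsite"},
		{Domain: "Nursing", Modality: "Onsite"},
		{Domain: "Frontend", Modality: "Remote"},
	}
	// Illustrative predicate; the UI's actual classification may differ.
	isSoftware := func(r row) bool { return r.Domain != "Nursing" }
	fmt.Println(modalityCounts(rows, isSoftware))
}
```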
- `cd frontend/vapor-source && npm ci`.
- Point the dev server at real snapshot data either by copying assets into `public/snapshots/` (e.g., `aws s3 sync s3://<bucket>/<prefix> public/snapshots`) or by setting `VITE_SNAPSHOT_PROXY_TARGET=https://<cloudfront-domain>/snapshots` so Vite proxies `/snapshots/*` to CloudFront.
- `npm run dev` launches the UI at http://localhost:5173; it will fetch `/snapshots/snapshot-manifest.json` via the option you chose above.
- `npm run build && npm run preview` validates the production bundle locally.
Important deployment details:

- `functions/snapshots/[[path]].ts` is the Cloudflare Pages Function that proxies `/snapshots/*` to the CloudFront distribution; bind `CLOUDFRONT_BASE_URL` to that origin.
- Because the UI is static, no backend is required beyond serving the snapshot artifacts; updating the S3 bucket automatically gives the UI new data.
- Local run: `cd backend/go && go run ./cmd/local` (requires `.env` with OpenAI key, AWS creds, query, etc.).
- Tests: `cd backend/go && go test ./...`.
- Package Lambdas: `cd backend/go && make zip-scraper && make zip-snapshot` → `bin/scraper/lambda.zip`, `bin/snapshot/lambda.zip`.
- Deploy (Terraform): `cd infra/terraform/go-serverless && terraform init && terraform apply -var-file=terraform.tfvars`.
- `OPENAI_API_KEY` (or `API_DRY_RUN=true`), `QUERY`, `MAX_PAGES`, `MAX_CONCURRENCY`.
- Cache flags: `USE_JOB_ID_FILE`, `USE_S3_JOB_ID_FILE`, `JOB_IDS_BUCKET`, `JOB_IDS_S3_KEY`.
- Data plane: `DYNAMODB_TABLE_NAME`, `SNAPSHOT_BUCKET`, `SNAPSHOT_S3_KEY`.
- Snapshot range overrides: `SNAPSHOT_START_DATE`, `SNAPSHOT_END_DATE`.
- `SNAPSHOT_LAMBDA_FUNCTION_NAME` so the scraper can trigger exports after new writes.
Per-day JSONL under `s3://<bucket>/<prefix>/<YYYY-MM-DD>.jsonl` plus `snapshot-manifest.json`. Example record:
```json
{
  "jobId": "abc123",
  "title": "Senior Backend Engineer",
  "company": "Example Tech",
  "location": "Seattle, WA",
  "modality": "Remote",
  "postedDate": "2025-12-03",
  "postedTime": "2025-12-03T14:35:00Z",
  "salary": "$150k - $180k",
  "url": "https://worksourcewa.com/...",
  "minYearsExperience": 5,
  "minDegree": "Bachelor's",
  "domain": "Backend",
  "languages": ["Go", "Python", "Java"],
  "technologies": ["S3", "DynamoDB", "K8s", "..."]
}
```

- No authentication or personalization; the UI is a public read-only view of whatever the latest snapshots contain.
