Job intelligence agent that turns generated searches into a ranked action queue.
Search all major job boards, get one de-duplicated, ranked list of best-match jobs.
```bash
# Clone and install
git clone https://github.com/ycb/job-finder.git
cd job-finder
npm install
npm link   # makes the 'jf' command available globally

# Initialize local database
jf init

# Open dashboard to review jobs
jf review
# Opens http://localhost:4311
# Enter search input in the dashboard and click "Find Jobs"
```
That's it. The dashboard shows de-duplicated jobs ranked from your search input.
The dashboard is designed around the actual workflow:
- enter search input and trigger intake from one place
- review one de-duped, prioritized queue instead of duplicate listings
- move completed applications out of the work queue and into a separate applied list
Job Finder is an intelligence agent for turning noisy job discovery into a ranked action queue. It runs locally and uses Codex and Playwright MCP.
Instead of acting like another job board, generic scraper, or chat wrapper, it models your search as a repeatable system:
- structured search input
- automatically generated searches across supported job sources
- browser-driven intake into a local database
- deterministic fit scoring against your target criteria
- a de-duped review queue with lightweight application tracking
This repo is intentionally opinionated about the workflow: automate the repetitive intake and triage, keep the decision-making local, and preserve human review before anything high-stakes.
The current implementation focuses on:
- search-input-driven intake and scoring
- browser-capture intake for LinkedIn / Wellfound / Ashby / Google / Indeed / ZipRecruiter / RemoteOK
- Built In ingestion from configured search URLs
- job intake into SQLite
- deterministic scoring with hard filters, AND/OR keyword mode, include/exclude terms, freshness, confidence, and history-aware signals
- capture-quality guardrails with quarantine/reject protection and source health telemetry
- run delta tracking (`new`, `updated`, `unchanged`) across source refreshes
- status-aware retention cleanup with per-status TTLs and audit logs
- shortlist generation
- de-duped review and application tracking
The current intake adapters support:
- live capture through the local browser bridge (`chrome_applescript` by default on macOS)
- source-level capture caching with TTL controls and `--force-refresh` overrides
- LinkedIn snapshot import from `output/playwright/<source-id>-snapshot.md` as a fallback
- HTTP fetch parsers for URL-driven ingestion flows during `sync`
The scoring and review pipeline stays the same across all of these intake paths.
The useful part of this project is not "AI chat for jobs." The differentiation is the system design:
- local-first control over search input, source execution, and application history
- browser automation tied to real search runs instead of a generic feed
- structured, inspectable scoring instead of opaque ranking
- de-dupe across overlapping searches so the review queue stays actionable
- a human-in-the-loop review loop that is fast enough to use daily
That makes it a stronger demonstration of AI-native product thinking than a thin wrapper around an LLM prompt. The current ranking is deterministic by design; the architecture leaves room for LLM-assisted drafting or orchestration later without making the core workflow depend on it.
- Start `npm run review` and open the dashboard.
- Enter search input and click `Find Jobs`.
- Let the system generate and run searches automatically.
- Review ranked jobs and mark outcomes (`applied`, `skip_for_now`, `rejected`).
- Iterate on your search input and rerun.
`run` and `review` are separate processes:

- `npm run run` updates data (capture, sync, score, shortlist).
- `npm run review` serves the local UI at `http://127.0.0.1:4311`.
- If you only run `run`, dashboard data changes will not appear until `review` is running and refreshed.
- Run `jf init` to initialize the SQLite database.
- Start the dashboard with `npm run review`.
- Use the dashboard search input + `Find Jobs` to run the pipeline.
- No `profile.json`/`my-goals.json` setup is required for scoring.
- No manual search/source creation is required in the normal workflow.
Fallback snapshot workflow:
- save Playwright snapshots under `output/playwright/<source-id>-snapshot.md`
- run `npm run capture:all`
- then run `npm run run`
For the full automated daily path:
```bash
npm run run
```
Daily workflow:
- `npm run review`
- `npm run run` (full pipeline: capture -> sync -> score -> shortlist -> list)
- `npm run run -- --force-refresh` (ignore capture cache TTLs)
- `npm run run -- --allow-quarantined` (operator override; ingest quarantine outcomes)
- `npm run run:safe`
- `npm run run:probe`
- `npm run run:mock`
Quality and diagnostics:
- `node src/cli.js check-source-contracts [--window 3] [--min-coverage 0.9] [--stale-days 30]`
- `node src/cli.js check-source-canaries [--include-disabled]`
- `node src/cli.js retention-policy` (show effective retention policy + policy path)
- `npm run sync -- --allow-quarantined` (override ingest gate for quarantine outcomes)
- Quality artifacts:
  - quarantine runs: `data/quality/quarantine/<source-id>/*.json`
  - source health history: `data/quality/source-health-history.json`
  - contract coverage history: `data/quality/source-coverage-history.json`
  - contract drift diagnostics: `data/quality/contract-drift/latest.json`
  - canary diagnostics: `data/quality/canary-checks/latest.json`
  - retention cleanup audit: `data/retention/cleanup-audit.jsonl`
  - analytics events/counters: `data/analytics/events.jsonl`, `data/analytics/counters.json`
Capture/source operations:
- `npm run sources`
- `node src/cli.js normalize-source-urls --dry-run`
- `npm run capture -- <source-id-or-label> [snapshot-path]`
- `npm run capture:all`
- `npm run capture:live -- <source-id-or-label> [snapshot-path]`
- `npm run capture:all:live`
- `npm run bridge`
- `node src/cli.js import-linkedin-snapshot <source-id-or-label> <snapshot-path>`
- `node src/cli.js open-source <source-id-or-label>`
- `node src/cli.js open-sources`
Legacy/manual source management (optional, not required for normal dashboard flow):
- `node src/cli.js add-source <label> <url>`
- `node src/cli.js add-builtin-source <label> <url>`
- `node src/cli.js add-google-source <label> <url> [any|1d|1w|1m]`
- `node src/cli.js add-wellfound-source <label> <url>`
- `node src/cli.js add-ashby-source <label> <url> [any|1d|1w|1m]`
- `node src/cli.js add-indeed-source <label> <url>`
- `node src/cli.js add-ziprecruiter-source <label> <url>`
- `node src/cli.js add-remoteok-source <label> <url>`
- `node src/cli.js set-source-url <source-id-or-label> <url>`
Other:
- `npm run init`
- `npm run score`
- `npm run shortlist` (writes `output/shortlist.json`)
- `npm run list`
- `npm run mark -- <job-id> <status>`
- `npm run review:safe`
- `npm run review:probe`
- `npm run review:mock`
- `npm run run:live` (compatibility alias for `run`)
`npm run review` starts the local dashboard for search management and job review.
The dashboard includes:
- top-level tabs: `Jobs`, `Searches`
- search input controls with a single `Find Jobs` action
- keyword targeting controls (`AND`/`OR` mode + include/exclude terms)
- automatically generated searches across supported sources
- a de-duped ranked queue with selected-job detail and `Prev`/`Next` navigation in `Jobs`
- job views: `All`, `New`, `Best Match`, `Applied`, `Skipped`, `Rejected`
- source-kind job filters in `Jobs` (for example, LinkedIn/Built In/Ashby)
- `Searches` tab with Enabled/Disabled source lifecycle controls, auth-aware actions, and funnel metrics: `Found`, `Filtered`, `Dupes`, `Imported`, `Avg Score`
- run-delta context in search/source status (`new`, `updated`, `unchanged`) for the latest refresh
- `Found` shown as `imported/expected` when expected totals are detectable, otherwise `imported/?`
- source refresh/capture status signals including cache/live state
- per-source criteria-accountability metadata (URL-applied, UI-bootstrap, post-capture, unsupported)
- per-source adapter health (`ok`/`degraded`/`failing`) with reason hints
- row click-through from `Searches` into a filtered `Jobs` view
- per-job attribution showing which source/search URLs surfaced the role
Jobs found in multiple searches are grouped into one review row and show which searches surfaced them.
Scoring is deterministic and driven by search input (`Find Jobs`).
Each job is evaluated into `high_signal`, `review_later`, or `reject` using weighted criteria:
- title match (35)
- keywords match ratio (25)
- location match (15)
- salary floor match (15)
- freshness target (`datePosted`) match (10)
Scoring notes:
- keyword terms are split from comma/semicolon/`and`-style input
- keyword mode defaults to `AND`; switching to `OR` treats any matched positive term as a keyword win
- `includeTerms` are merged into positive keyword matching
- `excludeTerms` are treated as hard filters before scoring (matched jobs are rejected)
- AI-like tokens (`ai`, `ml`, `llm`, `genai`) map to broader AI phrase matching
- title mismatch is strongly penalized (score cap path)
- source hard filters still run before scoring when configured in `sources.json`
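The weighted scoring above can be sketched like this. Only the weights come from this document; the signal names, the cap value, and the bucket thresholds are assumptions for illustration, not the project's actual cutoffs:

```javascript
// Weights from the criteria list above (total 100).
const WEIGHTS = {
  title: 35,     // title match
  keywords: 25,  // keyword match ratio
  location: 15,  // location match
  salary: 15,    // salary floor match
  freshness: 10, // datePosted freshness target
};

function scoreJob(signals) {
  // Each signal is a 0..1 match ratio; the total score is 0..100.
  let score = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    score += weight * (signals[name] ?? 0);
  }
  // Title mismatch is strongly penalized via a score cap (cap value assumed).
  if (signals.title === 0) score = Math.min(score, 40);
  return Math.round(score);
}

function bucketFor(score) {
  // Bucket thresholds are illustrative assumptions.
  if (score >= 70) return 'high_signal';
  if (score >= 40) return 'review_later';
  return 'reject';
}

const s = scoreJob({ title: 1, keywords: 0.8, location: 1, salary: 1, freshness: 0.5 });
console.log(s, bucketFor(s)); // 90 'high_signal'
```

The point of a fixed weight table is that the same inputs always produce the same rank order, which keeps the review queue inspectable.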
Global search-construction criteria in `config/source-criteria.json` (legacy `config/search-criteria.json` fallback still supported):

- `title`, `keywords`, `keywordMode`, `includeTerms`, `excludeTerms`, `location`, `minSalary`, `distanceMiles`, `datePosted`, `experienceLevel`
- these are used as the default canonical variables when constructing source URLs
- the dashboard `Jobs` tab `Find Jobs` control provides a single editor for these fields (`Title`, `Keyword`, `Keyword Mode`, `Include`, `Exclude`, `Location`, `Salary`, `Posted on`)
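A `config/source-criteria.json` file covering these canonical fields might look like the sketch below. The field names come from the list above; the value types and shapes are assumptions:

```json
{
  "title": "Product Manager",
  "keywords": "ai, llm, platform",
  "keywordMode": "AND",
  "includeTerms": ["genai"],
  "excludeTerms": ["staffing agency"],
  "location": "Remote",
  "minSalary": 150000,
  "distanceMiles": 25,
  "datePosted": "1w",
  "experienceLevel": "senior"
}
```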
Per-source quality/caching knobs in `config/sources.json`:

- `requiredTerms` (optional string array): every job must match all terms before it enters scoring/ranking.
- `cacheTtlHours` (optional number): source-level TTL override.
- `searchCriteria` (optional object): per-source override on top of global criteria for the same canonical fields.
- `hardFilter` (optional object): explicit filter controls before scoring.
  - `requiredAll` (string array): all terms must match in selected fields.
  - `requiredAny` (string array): at least one term must match in selected fields.
  - `excludeAny` (string array): any matching term drops the job before scoring.
  - `fields` (string array): fields to evaluate (`title`, `summary`, `description`, `location`, `company`).
  - `enforceContentOnSnippets` (boolean): when `false`, content checks can be deferred for thin snippets.
- `searchCriteria` currently stubs (no URL application) for `wellfound_search`; Wellfound is treated as a UI-bootstrap outlier.
- `searchCriteria.minSalary` is intentionally not applied to `ashby_search` URL construction; it remains available to scoring.
- Default TTLs: `12h` for HTTP sources (for example, Built In) and `24h` for browser-capture sources (LinkedIn/Wellfound/Ashby/Google/Indeed/ZipRecruiter/RemoteOK).
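Putting those knobs together, a single `config/sources.json` entry could look like this sketch. The knob names come from the list above; the surrounding entry shape (`id`, `type`, `label`, `url`) is assumed:

```json
{
  "id": "linkedin_pm_remote",
  "type": "linkedin_capture_file",
  "label": "LinkedIn - PM Remote",
  "url": "https://www.linkedin.com/jobs/search/?keywords=product%20manager",
  "requiredTerms": ["product"],
  "cacheTtlHours": 24,
  "searchCriteria": { "minSalary": 160000 },
  "hardFilter": {
    "requiredAny": ["ai", "ml", "llm"],
    "excludeAny": ["contract", "intern"],
    "fields": ["title", "summary", "description"],
    "enforceContentOnSnippets": false
  }
}
```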
Source contract governance:
- Canonical source mapping registry: `config/source-contracts.json`
- Drift-check command: `node src/cli.js check-source-contracts` (default `minCoverage` is `0.9`)
- Governance and update workflow: `docs/analysis/source-contract-governance.md`
Capture quality guardrails:
- Ingest evaluates each source capture as `accept`, `quarantine`, or `reject`.
- Default behavior ingests only `accept`; `quarantine`/`reject` runs are blocked from normal ingest.
- Operator override is explicit: `sync --allow-quarantined` or `run --allow-quarantined`.
- Quarantined/rejected runs persist diagnostic artifacts under `data/quality/quarantine/`.
- Source health is tracked over rolling runs in `data/quality/source-health-history.json`.
- Contract drift diagnostics are persisted under `data/quality/contract-drift/`.
- Canary checks are configurable in `config/source-canaries.json` and run with `check-source-canaries`.
Sync runtime telemetry:
- Every `sync` run prints aggregate run deltas (`new`, `updated`, `unchanged`).
- Per-source run deltas are persisted and surfaced in dashboard source status rows.
- Sync applies retention cleanup by status (`new`, `viewed`, `skip_for_now`, `rejected`; `applied` is protected by default).
- Retention cleanup writes audit rows to `data/retention/cleanup-audit.jsonl`.
- CLI and dashboard paths emit local analytics events (`data/analytics/events.jsonl`) and counters (`data/analytics/counters.json`), with optional PostHog forwarding when `POSTHOG_API_KEY` is set.
Refresh policy behavior:
- The UI `Refresh` action can serve cache or run live capture depending on policy/state.
- Source-level policy enforces a minimum interval, a daily cap, and a cooldown after challenge/captcha signals.
- Dashboard/API status fields now include:
  - `refreshMode` (`safe`, `probe`, `mock`)
  - `servedFrom` (`live` or `cache`)
  - `nextEligibleAt`
  - `cooldownUntil`
  - `statusLabel`/`statusReason`
- Invalid `JOB_FINDER_REFRESH_PROFILE` values now fail fast with an actionable error.
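A source status payload carrying these fields might look like the following; only the field names come from the list above, and the overall payload shape and example values are assumptions:

```json
{
  "refreshMode": "safe",
  "servedFrom": "cache",
  "nextEligibleAt": "2025-01-01T12:00:00Z",
  "cooldownUntil": null,
  "statusLabel": "Cached",
  "statusReason": "minimum refresh interval not yet elapsed"
}
```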
What the score is used for:
- rank ordering in `Jobs > Active`
- bucket assignment (`high_signal`, `review_later`, `reject`)
- per-source quality rollups in `Searches` (high signal %, average score)
Supported statuses: `new`, `viewed`, `applied`, `skip_for_now`, `rejected`.

- `new` and `viewed` stay in the actionable queue (`Jobs` tab active views)
- `applied` moves into `Jobs > Applied`
- `skip_for_now` moves into `Jobs > Skipped`
- `rejected` moves into `Jobs > Rejected`
- rejecting a job requires a reason, which is stored as a note
- sync pruning removes stale `new`/`viewed` records per source when they no longer appear in current capture results
- newly captured jobs can inherit existing application status by `normalized_hash` (for example, previously rejected duplicates stay rejected)
- status-aware retention cleanup runs during sync by default:
  - `new`: 30 days
  - `viewed`: 45 days
  - `skip_for_now`: 21 days
  - `rejected`: 14 days
  - `applied`: never auto-deleted
- override retention defaults with `config/retention-policy.json` (inspect effective policy via `jf retention-policy`)
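An override file at `config/retention-policy.json` mirroring the defaults above might look like this sketch; the statuses and day counts come from the defaults listed, while the key names and the use of `null` for "never auto-deleted" are assumptions:

```json
{
  "ttlDays": {
    "new": 30,
    "viewed": 45,
    "skip_for_now": 21,
    "rejected": 14,
    "applied": null
  }
}
```

Run `jf retention-policy` afterwards to confirm the effective policy that sync will apply.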
Browser-capture sources require a browser bridge service. `npm run run` and the dashboard `Find Jobs` action auto-start a local bridge when needed.

`capture-source-live` now auto-starts a persistent local bridge process when one is not running, so sequential source captures can reuse one Chrome automation window/tab instead of opening a new window per source.
Bridge safety boundary:
- MCP v1 bridge surface is read-only (`/health`, `/capture-source`, `/capture-linkedin-source`).
- Write-side browser actions are intentionally excluded from MCP v1 route registration.
Manual bridge startup is still available:
- Start it with `node src/cli.js bridge-server [port] [provider]`
- Default port: `4315`
- Default provider: `chrome_applescript`
- Stop a detached bridge process with `pkill -f "src/cli.js bridge-server"`
- Fastest automation on macOS: `chrome_applescript`
- Temporary fallback provider: `playwright_cli`
- Manual handoff fallback provider: `persistent_scaffold`
`chrome_applescript` captures directly from the active Chrome tab and does not require Playwright snapshots.
LinkedIn live capture now includes multi-page traversal (`start=0, 25, 50, ...`) and stores `expectedCount` when extractable for capture verification.
Browser-capture source types:
- `linkedin_capture_file`
- `wellfound_search`
- `ashby_search`
- `google_search`
- `indeed_search`
- `ziprecruiter_search`
- `remoteok_search`
One-time Chrome setup:
- In Chrome, open the `View` menu
- Open `Developer`
- Enable `Allow JavaScript from Apple Events`
`persistent_scaffold` is a stateful handoff flow:

- `capture-source-live` opens the saved search and writes a pending request file.
- You save a fresh Playwright snapshot to the requested path.
- You rerun the same capture command.
- The bridge detects the fresh snapshot, imports it, and completes the capture.
The `playwright_cli` provider still depends on:

- the Playwright MCP Bridge browser extension being connected
- a valid `PLAYWRIGHT_MCP_EXTENSION_TOKEN` in your environment or `~/.codex/config.toml`
- the local Playwright CLI wrapper at `~/.codex/skills/playwright/scripts/playwright_cli.sh`
If the browser bridge or provider is unavailable, use the snapshot import flow instead.
Dashboard feature flags:
- `JOB_FINDER_ENABLE_WELLFOUND=1`: enable Wellfound source visibility/creation in the review UI
- `JOB_FINDER_ENABLE_REMOTEOK=1`: enable RemoteOK source visibility/creation in the review UI
- Privacy policy: `PRIVACY.md`
- Terms of use: `TERMS.md`