feat: filter expired and stale job postings by posting date by matkatmusic · Pull Request #181 · santifer/career-ops

matkatmusic · 2026-04-11T05:10:39Z

Closes #163.

Summary

Adds freshness filtering to scan and pipeline so months-old cached search results are discarded before they consume evaluation tokens. The existing Step 7.5 liveness check in scan.md only runs via Playwright (unavailable in batch mode / claude -p), so batch workflows currently enforce no freshness check at all. On a first-time scan with 58 results, ~25% turned out to be 2-to-4-year-old LinkedIn postings or already-filled Greenhouse reqs that still showed apply buttons.

Approach

Extended the existing check-liveness.mjs rather than adding a new script. Detection is layered by cost:

LinkedIn URL ID heuristic — job IDs are sequential and leak posting year. Rejects 2y+ old LinkedIn URLs with zero network.
JSON-LD datePosted — embedded server-side by Greenhouse, Ashby, Lever, and most LinkedIn postings. Works through fetch() (no JS execution needed).
Inline "datePosted":"..." patterns — for minified embeds outside JSON-LD blocks.
Visible text patterns — Posted YYYY-MM-DD, Posted Aug 15, 2025, Posted N days/weeks/months ago.
Existing ?error=true redirect + body text patterns.
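The JSON-LD and inline `"datePosted"` layers above could be sketched roughly as follows. This is a minimal illustration, not the PR's `extractPostingDate()`: the helper names `findDatePosted` and `extractDatePostedFromJsonLd` and both regexes are assumptions made for the example.

```javascript
// Hypothetical sketch: pull datePosted out of raw HTML without executing JS.
// Helper names and regexes are illustrative, not taken from check-liveness.mjs.

// Walk a parsed JSON-LD value (object, array, or @graph) looking for datePosted.
function findDatePosted(node) {
  if (Array.isArray(node)) {
    for (const item of node) {
      const found = findDatePosted(item);
      if (found) return found;
    }
    return null;
  }
  if (node && typeof node === 'object') {
    if (typeof node.datePosted === 'string') return node.datePosted;
    if (node['@graph']) return findDatePosted(node['@graph']);
  }
  return null;
}

function extractDatePostedFromJsonLd(html) {
  if (typeof html !== 'string') return null;
  const blockRe =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(blockRe)) {
    try {
      const raw = findDatePosted(JSON.parse(match[1]));
      if (raw && !Number.isNaN(Date.parse(raw))) return new Date(raw);
    } catch {
      // Malformed JSON-LD: skip this block and try the next one.
    }
  }
  // Fallback for minified embeds that sit outside JSON-LD blocks.
  const inline = html.match(/"datePosted"\s*:\s*"([^"]+)"/);
  if (inline && !Number.isNaN(Date.parse(inline[1]))) return new Date(inline[1]);
  return null;
}
```

Because it only needs the response body as a string, a function of this shape works equally on `fetch()` output and on HTML rendered by Playwright.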

If JSON-LD returns a fresh datePosted, that short-circuits the "insufficient content" rejection rule — SPAs like Ashby have rich JSON-LD payloads but very little stripped bodyText, so the old check incorrectly flagged them expired.
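As a rough illustration of that short-circuit, the sketch below lets a positive freshness verdict override a body-length rule. The `MIN_BODY_CHARS` threshold and the `looksExpiredByContent` name are hypothetical; the actual rule in check-liveness.mjs may be shaped differently.

```javascript
// Assumed threshold for the example; the real check may use another value or rule.
const MIN_BODY_CHARS = 300;

// Returns true when the page should be flagged expired for having too little content.
// `freshness` is the verdict derived from datePosted, if any ('fresh', 'stale', ...).
function looksExpiredByContent({ bodyText, freshness }) {
  // A fresh date from JSON-LD is positive evidence the posting is live,
  // so the length rule is skipped for thin SPA shells.
  if (freshness === 'fresh') return false;
  return (bodyText || '').trim().length < MIN_BODY_CHARS;
}
```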

Changes

check-liveness.mjs — +254 lines

New --fetch-mode (HTTP-only, parallel-safe, for batch workers)
New --json output (machine-consumable)
New --classify output (just fresh|stale|expired|unverified)
New extractPostingDate(), linkedinIdToYear(), classifyFreshness(), loadFreshnessConfig() — all exported for tests
Existing Playwright path preserved and extended with the same date extraction

templates/portals.example.yml — new `freshness:` block:
```yaml
freshness:
  max_age_days: 60       # Hard skip
  warn_age_days: 30      # Evaluate but apply Red Flags penalty
  linkedin_suspect: true # Treat LinkedIn cache snippets as unverified
  require_date: false    # Strict mode opt-in
```
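To show how these thresholds might map onto the four states, here is a minimal sketch. It is not the PR's `classifyFreshness()`: the function name `classifyAge`, the exclusive boundaries (`>` rather than `>=`), and returning `unverified` for a missing date under `require_date: true` are all assumptions.

```javascript
// Defaults mirror the freshness: block in templates/portals.example.yml.
const DEFAULTS = { max_age_days: 60, warn_age_days: 30, linkedin_suspect: true, require_date: false };
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical classifier; boundary handling is an assumption, not the PR's code.
function classifyAge(postedDate, config = {}, now = new Date()) {
  const cfg = { ...DEFAULTS, ...config };
  const hasDate = postedDate instanceof Date && !Number.isNaN(postedDate.getTime());
  if (!hasDate) {
    // No date: permissive by default, strict only when the user opts in.
    return cfg.require_date ? 'unverified' : 'fresh';
  }
  const ageDays = Math.floor((now.getTime() - postedDate.getTime()) / MS_PER_DAY);
  if (ageDays > cfg.max_age_days) return 'expired'; // hard skip
  if (ageDays > cfg.warn_age_days) return 'stale';  // evaluate with a penalty
  return 'fresh';
}
```

Passing `now` explicitly keeps the function pure, which is what allows boundary tests to run without touching the clock.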

modes/scan.md — Step 7.5 reworked to call check-liveness (Playwright or --fetch-mode). New skipped_stale status in scan-history.tsv. Output summary includes stale count.

modes/pipeline.md — New Step 2c freshness pre-filter. Stale URLs get a minimal SKIPPED_STALE report without an A-F eval (saves tokens on dead links). **Posted:** is now a required field in all report headers.

modes/_shared.md — Codifies the automatic Red Flags penalty for stale postings (-0.5 for stale, pipeline.md skips expired entirely). Adds ALWAYS rule and Tools table entry.

docs/ARCHITECTURE.md + docs/CUSTOMIZATION.md — Full freshness section with detection pipeline diagram and per-market tuning guidance.

Tests

New test-freshness.mjs — 40 pure unit tests (no network, no Playwright):

JSON-LD extraction: top-level, @graph, multiple blocks, malformed handling, inline minified
Visible text patterns: ISO, long-form, days/weeks/months ago, null inputs
LinkedIn ID heuristic: 7 buckets + edge cases
Freshness classification: all 4 states + boundaries at warn_age_days/max_age_days
Config loader: default fallback + portals.yml override

Wired into test-all.mjs. Runs in well under a second.
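For readers unfamiliar with the visible-text layer those tests cover, a parser for the three pattern families might look like the sketch below. The function name `parseVisiblePostedDate`, the regexes, and the 30-day month approximation are assumptions for illustration only.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const UNIT_DAYS = { day: 1, week: 7, month: 30 }; // month length is an approximation
const MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'];

// Hypothetical parser for "Posted ..." text; returns a Date or null.
function parseVisiblePostedDate(text, now = new Date()) {
  if (typeof text !== 'string') return null;

  // "Posted 2025-08-15"
  const iso = text.match(/Posted\s+(\d{4}-\d{2}-\d{2})/i);
  if (iso) {
    const d = new Date(iso[1]);
    if (!Number.isNaN(d.getTime())) return d;
  }

  // "Posted Aug 15, 2025" or "Posted August 15, 2025"
  const longForm = text.match(/Posted\s+([A-Za-z]{3,9})\.?\s+(\d{1,2}),\s+(\d{4})/i);
  if (longForm) {
    const month = MONTHS.indexOf(longForm[1].slice(0, 3).toLowerCase());
    if (month >= 0) {
      return new Date(Date.UTC(Number(longForm[3]), month, Number(longForm[2])));
    }
  }

  // "Posted 3 weeks ago"
  const rel = text.match(/Posted\s+(\d+)\s+(day|week|month)s?\s+ago/i);
  if (rel) {
    const days = Number(rel[1]) * UNIT_DAYS[rel[2].toLowerCase()];
    return new Date(now.getTime() - days * MS_PER_DAY);
  }
  return null;
}
```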

End-to-end verification

Ran against a curated corpus from a real first-time scan I did earlier today:

| Set | Result |
| --- | --- |
| 9 known-stale URLs (5 old LinkedIn, 2 Greenhouse redirects, 1 Ashby 138d old, 1 Lever 404) | 9/9 caught |
| 1 known-fresh URL (Suno DAW engineer, 8d old, JSON-LD confirmed) | 1/1 preserved |

The fresh one is particularly important because naive content-length heuristics would reject Ashby SPAs — the JSON-LD short-circuit fixes that.

Open questions for review

I put freshness: in portals.yml because it feels like scanner behavior, but a case exists for config/profile.yml since different users in different markets might want different thresholds. Happy to move it.
The LinkedIn ID→year table needs periodic recalibration. I added a comment noting this but no automation yet. Could add a --calibrate-linkedin mode later.
fetch-mode deliberately doesn't execute JS, so fully client-side SPAs (like Apple's careers site) return uncertain when they have no server-rendered JSON-LD. Current behavior: classify as fresh (no date + require_date: false). Open to a stricter default if you prefer.
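To make the recalibration concern concrete, the ID-to-year lookup could take its table as a parameter, as in the sketch below. Everything here is hypothetical: the function names, the URL regex, and especially the table values in the usage example, which are placeholders and not real LinkedIn calibration data.

```javascript
// Extract the numeric job ID from a LinkedIn job URL, with or without a title slug.
function linkedinJobId(url) {
  const m = String(url).match(/linkedin\.com\/jobs\/view\/(?:[^/?#]*-)?(\d+)/i);
  return m ? Number(m[1]) : null;
}

// `table` is an ascending list of [minId, year] pairs; returns the year of the
// highest bucket whose minId the ID reaches, or null when it is below all buckets.
function estimateYear(id, table) {
  let year = null;
  for (const [minId, bucketYear] of table) {
    if (id >= minId) year = bucketYear;
  }
  return year;
}

// True when the URL's ID maps to a year at least `maxYears` before the current one.
function isLinkedinTooOld(url, table, now = new Date(), maxYears = 2) {
  const id = linkedinJobId(url);
  if (id === null) return false; // not a recognizable LinkedIn job URL
  const year = estimateYear(id, table);
  if (year === null) return false; // unknown bucket: do not reject on a guess
  return now.getUTCFullYear() - year >= maxYears;
}

// Usage with PLACEHOLDER values only (not calibrated against real postings):
const PLACEHOLDER_TABLE = [[1000, 2020], [2000, 2022], [3000, 2024], [4000, 2026]];
const verdict = isLinkedinTooOld(
  'https://www.linkedin.com/jobs/view/some-role-at-acme-1500',
  PLACEHOLDER_TABLE,
  new Date('2026-04-11T00:00:00Z'),
);
```

Keeping the table outside the function means a future `--calibrate-linkedin` mode would only need to regenerate data, not touch the logic.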

Test plan

node test-freshness.mjs — all 40 unit tests pass
node test-all.mjs --quick — freshness suite green (pre-existing debate/ absolute path warnings unrelated)
End-to-end: 9/9 stale caught, 1/1 fresh preserved against real URLs
Verify Playwright path still works on active Greenhouse/Ashby/Lever URL
Verify --fetch-mode parallel execution is safe (no shared browser state)
Full scan + pipeline run on user's real portals.yml (deferred to post-merge)

Data contract

All changes land in the system layer. Zero touches to cv.md, config/profile.yml, data/*, or user's portals.yml. Users opt into freshness filtering by copying the block from portals.example.yml during their next update-system.mjs apply.

🤖 Generated with Claude Code

Adds freshness filtering to scan + pipeline so months-old cached search results are discarded before they consume evaluation tokens. Related to santifer#163.

Detection (priority order):
- LinkedIn URL ID heuristic catches 2y+ old postings with zero network
- JSON-LD datePosted from Greenhouse/Ashby/Lever/LinkedIn
- Inline "datePosted" patterns (minified embeds)
- Visible text patterns (ISO, long-form, "Posted N days/weeks/months ago")
- Existing Greenhouse ?error=true redirect + body text signals

check-liveness.mjs extensions:
- New --fetch-mode for batch workers (no Playwright dep)
- New --json output for machine consumption
- New --classify output ("fresh|stale|expired|unverified")
- Honors freshness: config in portals.yml (defaults: 60d max, 30d warn)
- Fallback chain + SPA short-circuit when JSON-LD provides positive date
- Pure functions exported for testing

Mode file changes:
- scan.md step 7.5 reworked to call check-liveness (Playwright or fetch)
- New skipped_stale status in scan-history.tsv
- pipeline.md adds pre-eval freshness check; stale URLs get minimal SKIPPED_STALE report instead of full A-F evaluation
- _shared.md codifies automatic Red Flags penalty for stale postings
- **Posted:** field now required in every report header

Config + docs:
- freshness: block added to templates/portals.example.yml
- ARCHITECTURE.md documents the detection pipeline
- CUSTOMIZATION.md explains tuning thresholds per market type

Tests:
- test-freshness.mjs: 40 unit tests covering JSON-LD, visible text, LinkedIn ID heuristic, boundaries, config loader
- Wired into test-all.mjs

End-to-end verification against a curated corpus from the user's first real scan: 9/9 known-stale URLs caught, 1/1 known-fresh preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Brings in santifer's v1.3.0 release including:
- liveness-core.mjs extraction (clean pure-function classifier)
- check-liveness.mjs refactor (visible apply control filtering, fixes Workday split-view false positives)
- scan.mjs zero-token portal scanner (Greenhouse/Ashby/Lever APIs)
- Block G "Posting Legitimacy" qualitative assessment in _shared.md
- followup-cadence.mjs follow-up tracker
- Japanese + Russian language modes
- Nix flake devshell, community files (CoC, governance, security)

Conflict resolution strategy:

liveness-core.mjs: Extended santifer's classifyLiveness with the freshness pure functions (extractPostingDate, linkedinIdToYear, ageInDays, classifyFreshness, loadFreshnessConfig, FRESHNESS_DEFAULTS). One module, two concerns, both pure; keeps the architecture clean.

check-liveness.mjs: Took santifer's improved Playwright shell verbatim (visible apply controls filtering nav/header/footer) and added a freshness layer on top: extracts datePosted from the rendered HTML and consults classifyFreshness. Added --fetch-mode for batch workers (no Playwright dep), --json/--classify CLI flags. Added LinkedIn ToS guard: fetch-mode never makes HTTP requests to linkedin.com (per CONTRIBUTING.md). Recent LinkedIn URLs return 'unverified' so the user can verify manually.

modes/_shared.md: Both my freshness penalty rules and santifer's Block G now coexist. Added a bridge sentence explaining the layering: freshness is the deterministic pre-filter, Block G is the qualitative assessment of what survives the filter. Both consume the same datePosted signal.

modes/scan.md: Took santifer's improved 'active' criterion (visible apply control in main content, not nav/footer) and merged with my freshness instructions. Added the LinkedIn ToS exception for batch mode.

test-all.mjs: Kept BOTH new test sections: the freshness unit tests (2b) and santifer's liveness classification tests (3). Updated freshness test description to reference liveness-core.mjs.

test-freshness.mjs: Updated imports to point at liveness-core.mjs (where the pure functions now live).

Verification after merge:
- node test-freshness.mjs → 40/40 pass
- node test-all.mjs --quick → 64 passed (was 59 pre-merge), 6 failures unrelated (pre-existing absolute paths in untracked debates/)
- Live smoke test: Suno active (8d JSON-LD), Bloomberg LinkedIn 2020 ID expired without fetch, recent LinkedIn ID returns unverified without fetch (ToS guard)
- Block G's "Posting age" signal now reads from the same datePosted field that the freshness filter consumes: one extraction, two consumers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
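The LinkedIn ToS guard described in this commit can be pictured as a host check that runs before any request is made. The sketch below is an assumption about its shape; `isLinkedinHost` and `planFetchMode` are hypothetical names, and the string results other than `unverified` are invented for the example.

```javascript
// True for linkedin.com and any subdomain; false for unparsable URLs.
function isLinkedinHost(url) {
  try {
    const host = new URL(url).hostname.toLowerCase();
    return host === 'linkedin.com' || host.endsWith('.linkedin.com');
  } catch {
    return false;
  }
}

// Decide what fetch-mode may do with a URL before touching the network.
// Hypothetical result shape: { fetch: boolean, status: string | null }.
function planFetchMode(url) {
  if (isLinkedinHost(url)) {
    // Never request linkedin.com in fetch-mode; leave the verdict to the user.
    return { fetch: false, status: 'unverified' };
  }
  return { fetch: true, status: null };
}
```

Matching on the parsed hostname rather than a substring avoids misclassifying URLs that merely mention linkedin.com in a path or query string.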

matkatmusic and others added 2 commits April 10, 2026 22:06

matkatmusic mentioned this pull request Apr 11, 2026

fix(test-all): scan only tracked files via git grep #185

matkatmusic wants to merge 2 commits into santifer:main from matkatmusic:filter-expired-results
