feat: filter expired and stale job postings by posting date #181
Open
matkatmusic wants to merge 2 commits into santifer:main from
Adds freshness filtering to scan + pipeline so months-old cached search
results are discarded before they consume evaluation tokens. Related to
santifer#163.
Detection (priority order):
- LinkedIn URL ID heuristic catches 2y+ old postings with zero network
- JSON-LD datePosted from Greenhouse/Ashby/Lever/LinkedIn
- Inline "datePosted" patterns (minified embeds)
- Visible text patterns (ISO, long-form, "Posted N days/weeks/months ago")
- Existing Greenhouse ?error=true redirect + body text signals
check-liveness.mjs extensions:
- New --fetch-mode for batch workers (no Playwright dep)
- New --json output for machine consumption
- New --classify output ("fresh|stale|expired|unverified")
- Honors freshness: config in portals.yml (defaults: 60d max, 30d warn)
- Fallback chain + SPA short-circuit when JSON-LD provides positive date
- Pure functions exported for testing
Mode file changes:
- scan.md step 7.5 reworked to call check-liveness (Playwright or fetch)
- New skipped_stale status in scan-history.tsv
- pipeline.md adds pre-eval freshness check; stale URLs get minimal
SKIPPED_STALE report instead of full A-F evaluation
- _shared.md codifies automatic Red Flags penalty for stale postings
- **Posted:** field now required in every report header
Config + docs:
- freshness: block added to templates/portals.example.yml
- ARCHITECTURE.md documents the detection pipeline
- CUSTOMIZATION.md explains tuning thresholds per market type
Tests:
- test-freshness.mjs: 40 unit tests covering JSON-LD, visible text,
LinkedIn ID heuristic, boundaries, config loader
- Wired into test-all.mjs
End-to-end verification against a curated corpus from the user's first
real scan: 9/9 known-stale URLs caught, 1/1 known-fresh preserved.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Brings in santifer's v1.3.0 release including:
- liveness-core.mjs extraction (clean pure-function classifier)
- check-liveness.mjs refactor (visible apply control filtering, fixes
Workday split-view false positives)
- scan.mjs zero-token portal scanner (Greenhouse/Ashby/Lever APIs)
- Block G "Posting Legitimacy" qualitative assessment in _shared.md
- followup-cadence.mjs follow-up tracker
- Japanese + Russian language modes
- Nix flake devshell, community files (CoC, governance, security)
Conflict resolution strategy:
liveness-core.mjs:
Extended santifer's classifyLiveness with the freshness pure functions
(extractPostingDate, linkedinIdToYear, ageInDays, classifyFreshness,
loadFreshnessConfig, FRESHNESS_DEFAULTS). One module, two concerns,
both pure — keeps the architecture clean.
check-liveness.mjs:
Took santifer's improved Playwright shell verbatim (visible apply
controls filtering nav/header/footer) and added a freshness layer
on top: extracts datePosted from the rendered HTML and consults
classifyFreshness. Added --fetch-mode for batch workers (no
Playwright dep), --json/--classify CLI flags. Added LinkedIn ToS
guard: fetch-mode never makes HTTP requests to linkedin.com (per
CONTRIBUTING.md). Recent LinkedIn URLs return 'unverified' so the
user can verify manually.
modes/_shared.md:
Both my freshness penalty rules and santifer's Block G now coexist.
Added a bridge sentence explaining the layering: freshness is the
deterministic pre-filter, Block G is the qualitative assessment of
what survives the filter. Both consume the same datePosted signal.
modes/scan.md:
Took santifer's improved 'active' criterion (visible apply control
in main content, not nav/footer) and merged with my freshness
instructions. Added the LinkedIn ToS exception for batch mode.
test-all.mjs:
Kept BOTH new test sections — the freshness unit tests (2b) and
santifer's liveness classification tests (3). Updated freshness
test description to reference liveness-core.mjs.
test-freshness.mjs:
Updated imports to point at liveness-core.mjs (where the pure
functions now live).
Verification after merge:
- node test-freshness.mjs → 40/40 pass
- node test-all.mjs --quick → 64 passed (was 59 pre-merge), 6 failures
unrelated (pre-existing absolute paths in untracked debates/)
- Live smoke test: Suno active (8d JSON-LD), Bloomberg LinkedIn 2020
ID expired without fetch, recent LinkedIn ID returns unverified
without fetch (ToS guard)
- Block G's "Posting age" signal now reads from the same datePosted
field that the freshness filter consumes — one extraction, two
consumers
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #163.
Summary
Adds freshness filtering to `scan` and `pipeline` so months-old cached search results are discarded before they consume evaluation tokens. The existing Step 7.5 liveness check in `scan.md` only runs via Playwright (unavailable in batch mode / `claude -p`), so in batch workflows nothing currently enforces freshness at all. On a first-time scan with 58 results, ~25% turned out to be 2-4 year old LinkedIn postings or already-filled Greenhouse reqs that still showed apply buttons.
Approach
Extended the existing `check-liveness.mjs` rather than adding a new script. Detection is layered by cost:
1. LinkedIn URL ID heuristic: catches 2y+ old postings with zero network.
2. JSON-LD `datePosted`: embedded server-side by Greenhouse, Ashby, Lever, and most LinkedIn postings. Works through `fetch()` (no JS execution needed).
3. Inline `"datePosted":"..."` patterns: for minified embeds outside JSON-LD blocks.
4. Visible text patterns: `Posted YYYY-MM-DD`, `Posted Aug 15, 2025`, `Posted N days/weeks/months ago`.
5. Existing Greenhouse `?error=true` redirect + body text patterns.
If JSON-LD returns a fresh `datePosted`, that short-circuits the "insufficient content" rejection rule. SPAs like Ashby have rich JSON-LD payloads but very little stripped bodyText, so the old check incorrectly flagged them expired.
Changes
- `check-liveness.mjs`: +254 lines
  - `--fetch-mode` (HTTP-only, parallel-safe, for batch workers)
  - `--json` output (machine-consumable)
  - `--classify` output (just `fresh|stale|expired|unverified`)
  - `extractPostingDate()`, `linkedinIdToYear()`, `classifyFreshness()`, `loadFreshnessConfig()`: all exported for tests
- `templates/portals.example.yml`: new `freshness:` block:
```yaml
freshness:
max_age_days: 60 # Hard skip
warn_age_days: 30 # Evaluate but apply Red Flags penalty
linkedin_suspect: true # Treat LinkedIn cache snippets as unverified
require_date: false # Strict mode opt-in
```
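As a rough illustration of how the thresholds in this block could drive classification, here is a minimal sketch. The function names `daysSince` and `classifyByAge` are hypothetical and are not the PR's `classifyFreshness()`. The mapping (older than `max_age_days` gives `expired`, older than `warn_age_days` gives `stale`), the strict `>` boundaries, and returning `unverified` when `require_date` is on and no date was found are all assumptions.

```javascript
// Hypothetical sketch, not the PR's actual classifyFreshness() implementation.
// Defaults mirror the example freshness: block above.
const DEFAULTS = { max_age_days: 60, warn_age_days: 30, require_date: false };

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days elapsed between `posted` and `now` (both Date objects).
function daysSince(posted, now = new Date()) {
  return Math.floor((now.getTime() - posted.getTime()) / MS_PER_DAY);
}

// Returns 'fresh' | 'stale' | 'expired' | 'unverified'.
// Assumption: boundaries are exclusive, so exactly warn_age_days old is still fresh.
function classifyByAge(posted, config = {}, now = new Date()) {
  const cfg = { ...DEFAULTS, ...config };
  if (!posted) {
    // No date found: only strict mode refuses to call it fresh (assumed behavior).
    return cfg.require_date ? 'unverified' : 'fresh';
  }
  const age = daysSince(posted, now);
  if (age > cfg.max_age_days) return 'expired'; // hard skip
  if (age > cfg.warn_age_days) return 'stale';  // evaluate with penalty
  return 'fresh';
}
```

Passing `now` explicitly keeps the function pure, which is what makes boundary cases easy to unit test without mocking the clock.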
- `modes/scan.md`: Step 7.5 reworked to call `check-liveness` (Playwright or `--fetch-mode`). New `skipped_stale` status in `scan-history.tsv`. Output summary includes stale count.
- `modes/pipeline.md`: New Step 2c freshness pre-filter. Stale URLs get a minimal `SKIPPED_STALE` report without an A-F eval (saves tokens on dead links). `**Posted:**` is now a required field in all report headers.
- `modes/_shared.md`: Codifies the automatic Red Flags penalty for stale postings (`-0.5` for `stale`, pipeline.md skips `expired` entirely). Adds ALWAYS rule and Tools table entry.
- `docs/ARCHITECTURE.md` + `docs/CUSTOMIZATION.md`: Full freshness section with detection pipeline diagram and per-market tuning guidance.
Tests
New `test-freshness.mjs`: 40 pure unit tests (no network, no Playwright) covering:
- JSON-LD: `@graph`, multiple blocks, malformed handling, inline minified
- Visible text patterns
- LinkedIn ID heuristic
- Boundaries at `warn_age_days` / `max_age_days`
- Config loader
Wired into `test-all.mjs`. Runs in well under a second.
End-to-end verification
Ran against a curated corpus from a real first-time scan I did earlier today: 9/9 known-stale URLs caught, 1/1 known-fresh preserved.
The fresh one is particularly important because naive content-length heuristics would reject Ashby SPAs — the JSON-LD short-circuit fixes that.
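To make the JSON-LD path concrete, here is a minimal sketch of pulling `datePosted` out of raw HTML, with a fallback to inline `"datePosted":"..."` patterns. It is an illustration under assumptions, not the PR's `extractPostingDate()`: the names `findDatePosted` and `extractDatePosted` and both regexes are hypothetical, and a real implementation may handle more shapes.

```javascript
// Hypothetical sketch: names and regexes are illustrative, not the PR's code.

// Walk a parsed JSON-LD value looking for a string datePosted,
// descending into arrays and @graph containers.
function findDatePosted(node) {
  if (Array.isArray(node)) {
    for (const item of node) {
      const found = findDatePosted(item);
      if (found) return found;
    }
    return null;
  }
  if (node && typeof node === 'object') {
    if (typeof node.datePosted === 'string') return node.datePosted;
    if (node['@graph']) return findDatePosted(node['@graph']);
  }
  return null;
}

// Returns a Date, or null when no parseable posting date is found.
function extractDatePosted(html) {
  const blockRe =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(blockRe)) {
    try {
      const raw = findDatePosted(JSON.parse(match[1]));
      const parsed = raw ? new Date(raw) : null;
      if (parsed && !Number.isNaN(parsed.getTime())) return parsed;
    } catch {
      // Malformed JSON-LD block: skip it and keep scanning.
    }
  }
  // Fallback: inline "datePosted":"..." in minified embeds outside JSON-LD.
  const inline = html.match(/"datePosted"\s*:\s*"([^"]+)"/);
  if (inline) {
    const parsed = new Date(inline[1]);
    if (!Number.isNaN(parsed.getTime())) return parsed;
  }
  return null;
}
```

Because this works on the raw HTML string, it needs no JS execution, which is why a date found this way can stand in for body text on SPA pages whose stripped content is nearly empty.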
Open questions for review
- I put `freshness:` in `portals.yml` because it feels like scanner behavior, but a case exists for `config/profile.yml` since different users in different markets might want different thresholds. Happy to move it.
- `--calibrate-linkedin` mode later.
- `fetch-mode` deliberately doesn't execute JS, so fully client-side SPAs (like Apple's careers site) return `uncertain` when they have no server-rendered JSON-LD. Current behavior: classify as `fresh` (no date + `require_date: false`). Open to stricter default if you prefer.
Test plan
- `node test-freshness.mjs`: all 40 unit tests pass
- `node test-all.mjs --quick`: freshness suite green (pre-existing debate/ absolute path warnings unrelated)
- `--fetch-mode` parallel execution is safe (no shared browser state)
Data contract
All changes land in the system layer. Zero touches to `cv.md`, `config/profile.yml`, `data/*`, or user's `portals.yml`. Users opt into freshness filtering by copying the block from `portals.example.yml` during their next `update-system.mjs apply`.
🤖 Generated with Claude Code