feat: filter expired and stale job postings by posting date #181

Open
matkatmusic wants to merge 2 commits into santifer:main from matkatmusic:filter-expired-results
@matkatmusic

Closes #163.

Summary

Adds freshness filtering to scan and pipeline so months-old cached search results are discarded before they consume evaluation tokens. The existing Step 7.5 liveness check in scan.md only runs via Playwright (unavailable in batch mode / claude -p), so in batch workflows nothing currently enforces freshness at all. On a first-time scan with 58 results, ~25% turned out to be 2-4 year old LinkedIn postings or already-filled Greenhouse reqs that still showed apply buttons.

Approach

Extended the existing check-liveness.mjs rather than adding a new script. Detection is layered by cost:

  1. LinkedIn URL ID heuristic — job IDs are sequential and leak posting year. Rejects LinkedIn URLs that are 2+ years old without making any network request.
  2. JSON-LD datePosted — embedded server-side by Greenhouse, Ashby, Lever, and most LinkedIn postings. Works through fetch() (no JS execution needed).
  3. Inline "datePosted":"..." patterns — for minified embeds outside JSON-LD blocks.
  4. Visible text patterns: Posted YYYY-MM-DD, Posted Aug 15, 2025, Posted N days/weeks/months ago.
  5. Existing ?error=true redirect + body text patterns.
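As a rough illustration of layers 2 and 3, the sketch below pulls `datePosted` out of JSON-LD blocks (top-level, arrays, and `@graph`), skips malformed blocks, and falls back to an inline `"datePosted":"..."` match. It is not the PR's `extractPostingDate()`: the names `extractPostingDateSketch`, `findDatePosted`, and `toValidDate` are hypothetical, and the real function may order or combine these checks differently.

```javascript
// Hypothetical sketch of layers 2 and 3; NOT the actual extractPostingDate().
function extractPostingDateSketch(html) {
  if (typeof html !== 'string' || html.length === 0) return null;

  // Layer 2: walk every <script type="application/ld+json"> block.
  const blockRe =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(blockRe)) {
    let data;
    try {
      data = JSON.parse(match[1]);
    } catch {
      continue; // Malformed block: skip it and try the next one.
    }
    const found = findDatePosted(data);
    if (found) return found;
  }

  // Layer 3: minified embeds outside JSON-LD blocks.
  const inline = html.match(/"datePosted"\s*:\s*"([^"]+)"/);
  return inline ? toValidDate(inline[1]) : null;
}

// Searches an object, an array of objects, or an @graph array for datePosted.
function findDatePosted(node) {
  if (Array.isArray(node)) {
    for (const item of node) {
      const date = findDatePosted(item);
      if (date) return date;
    }
    return null;
  }
  if (node === null || typeof node !== 'object') return null;
  if (typeof node.datePosted === 'string') {
    const date = toValidDate(node.datePosted);
    if (date) return date;
  }
  return Array.isArray(node['@graph']) ? findDatePosted(node['@graph']) : null;
}

// Returns a Date, or null when the string does not parse.
function toValidDate(value) {
  const date = new Date(value);
  return Number.isNaN(date.getTime()) ? null : date;
}
```

Because this only needs the raw HTML string, the same function could serve both a `fetch()` response body and the HTML rendered by Playwright.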

If JSON-LD returns a fresh datePosted, that short-circuits the "insufficient content" rejection rule — SPAs like Ashby have rich JSON-LD payloads but very little stripped bodyText, so the old check incorrectly flagged them expired.
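The short-circuit can be pictured as a guard in front of the content-length rule. This is a minimal sketch under assumptions: `hasInsufficientContent` and the `MIN_BODY_CHARS` threshold are hypothetical names and values, not taken from `check-liveness.mjs`.

```javascript
// Placeholder threshold for illustration; the real cutoff is not stated in this PR.
const MIN_BODY_CHARS = 300;

// Returns true when the page should be rejected for having too little content.
// `freshness` is the classification derived from a JSON-LD datePosted, if any.
function hasInsufficientContent(bodyText, freshness) {
  // A fresh datePosted is positive evidence the posting is live, so a thin
  // stripped body (typical of SPAs such as Ashby) must not trigger rejection.
  if (freshness === 'fresh') return false;
  return (bodyText || '').trim().length < MIN_BODY_CHARS;
}
```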

Changes

check-liveness.mjs — +254 lines

  • New --fetch-mode (HTTP-only, parallel-safe, for batch workers)
  • New --json output (machine-consumable)
  • New --classify output (just fresh|stale|expired|unverified)
  • New extractPostingDate(), linkedinIdToYear(), classifyFreshness(), loadFreshnessConfig() — all exported for tests
  • Existing Playwright path preserved and extended with the same date extraction

templates/portals.example.yml — new `freshness:` block:
```yaml
freshness:
  max_age_days: 60        # Hard skip
  warn_age_days: 30       # Evaluate but apply Red Flags penalty
  linkedin_suspect: true  # Treat LinkedIn cache snippets as unverified
  require_date: false     # Strict mode opt-in
```
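To make the thresholds concrete, here is one way the classification could work. It is a sketch, not the PR's `classifyFreshness()`: the `*Sketch` names are hypothetical, and three choices are assumptions rather than facts from the source: ages strictly above `max_age_days` map to `expired`, ages strictly above `warn_age_days` map to `stale`, and a missing date under `require_date: true` maps to `unverified`. The no-date, `require_date: false` case returning `fresh` follows the behavior described under "Open questions for review".

```javascript
// Defaults mirror the freshness: block above.
const FRESHNESS_DEFAULTS_SKETCH = {
  max_age_days: 60,
  warn_age_days: 30,
  linkedin_suspect: true,
  require_date: false,
};
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days elapsed between postedDate and now.
function ageInDaysSketch(postedDate, now = new Date()) {
  return Math.floor((now.getTime() - postedDate.getTime()) / MS_PER_DAY);
}

// Returns 'fresh' | 'stale' | 'expired' | 'unverified'.
function classifyFreshnessSketch(postedDate, config = {}, now = new Date()) {
  const cfg = { ...FRESHNESS_DEFAULTS_SKETCH, ...config };
  if (!postedDate) {
    // Assumption: strict mode reports a missing date as unverified.
    return cfg.require_date ? 'unverified' : 'fresh';
  }
  const age = ageInDaysSketch(postedDate, now);
  if (age > cfg.max_age_days) return 'expired'; // Assumed exclusive boundary.
  if (age > cfg.warn_age_days) return 'stale';  // Assumed exclusive boundary.
  return 'fresh';
}
```

Passing `now` explicitly keeps the function pure, which is what lets boundary cases be tested without mocking the clock.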

modes/scan.md — Step 7.5 reworked to call check-liveness (Playwright or --fetch-mode). New skipped_stale status in scan-history.tsv. Output summary includes stale count.

modes/pipeline.md — New Step 2c freshness pre-filter. Stale URLs get a minimal SKIPPED_STALE report without an A-F eval (saves tokens on dead links). **Posted:** is now a required field in all report headers.

modes/_shared.md — Codifies the automatic Red Flags penalty for stale postings (-0.5 for stale, pipeline.md skips expired entirely). Adds ALWAYS rule and Tools table entry.

docs/ARCHITECTURE.md + docs/CUSTOMIZATION.md — Full freshness section with detection pipeline diagram and per-market tuning guidance.

Tests

New test-freshness.mjs — 40 pure unit tests (no network, no Playwright):

  • JSON-LD extraction: top-level, @graph, multiple blocks, malformed handling, inline minified
  • Visible text patterns: ISO, long-form, days/weeks/months ago, null inputs
  • LinkedIn ID heuristic: 7 buckets + edge cases
  • Freshness classification: all 4 states + boundaries at warn_age_days/max_age_days
  • Config loader: default fallback + portals.yml override
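For a sense of what the visible-text cases exercise, the sketch below parses the three pattern families named in the approach (ISO, long-form, and relative "N days/weeks/months ago"). `parseVisiblePostedDate` and `utcDate` are hypothetical names, the 30-day month is an approximation chosen for the example, and the PR's actual patterns may be broader or stricter.

```javascript
const MONTH_INDEX = {
  jan: 0, feb: 1, mar: 2, apr: 3, may: 4, jun: 5,
  jul: 6, aug: 7, sep: 8, oct: 9, nov: 10, dec: 11,
};

// Returns a Date for the first recognized "Posted ..." pattern, else null.
function parseVisiblePostedDate(text, now = new Date()) {
  if (typeof text !== 'string') return null;

  // ISO form: "Posted 2025-08-15"
  const iso = text.match(/Posted\s+(\d{4})-(\d{2})-(\d{2})/i);
  if (iso) return utcDate(Number(iso[1]), Number(iso[2]) - 1, Number(iso[3]));

  // Long form: "Posted Aug 15, 2025" or "Posted August 15, 2025"
  const long = text.match(/Posted\s+([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2}),?\s+(\d{4})/i);
  if (long) {
    const month = MONTH_INDEX[long[1].toLowerCase()];
    if (month !== undefined) return utcDate(Number(long[3]), month, Number(long[2]));
  }

  // Relative form: "Posted 3 weeks ago" (a month is approximated as 30 days)
  const rel = text.match(/Posted\s+(\d+)\s+(day|week|month)s?\s+ago/i);
  if (rel) {
    const daysPerUnit = { day: 1, week: 7, month: 30 };
    const days = Number(rel[1]) * daysPerUnit[rel[2].toLowerCase()];
    return new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  }
  return null;
}

// Builds a UTC date and rejects rollovers such as month 13 or Feb 31.
function utcDate(year, month, day) {
  const date = new Date(Date.UTC(year, month, day));
  return date.getUTCMonth() === month && date.getUTCDate() === day ? date : null;
}
```

Like the other pure helpers, this takes `now` as a parameter so relative phrases can be tested against a fixed reference time with no network and no Playwright.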

Wired into test-all.mjs. Runs in well under a second.

End-to-end verification

Ran against a curated corpus from a real first-time scan I did earlier today:

| Set | Result |
|---|---|
| 9 known-stale URLs (5 old LinkedIn, 2 Greenhouse redirects, 1 Ashby 138d old, 1 Lever 404) | 9/9 caught |
| 1 known-fresh URL (Suno DAW engineer, 8d old, JSON-LD confirmed) | 1/1 preserved |

The fresh one is particularly important because naive content-length heuristics would reject Ashby SPAs — the JSON-LD short-circuit fixes that.

Open questions for review

  1. I put freshness: in portals.yml because it feels like scanner behavior, but a case exists for config/profile.yml since different users in different markets might want different thresholds. Happy to move it.
  2. The LinkedIn ID→year table needs periodic recalibration. I added a comment noting this but no automation yet. Could add a --calibrate-linkedin mode later.
  3. fetch-mode deliberately doesn't execute JS, so fully client-side SPAs (like Apple's careers site) can't be dated when they have no server-rendered JSON-LD. Current behavior: classify as fresh (no date + require_date: false). Open to a stricter default if you prefer.
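On question 2, one way to keep recalibration cheap is to treat the ID→year mapping as plain data passed into a lookup. The sketch below is illustrative only: `linkedinJobId`, `estimateYearFromId`, and `EXAMPLE_ID_TABLE` are hypothetical, the table values are made up and are not real LinkedIn ID ranges, and the URL shapes handled (`/jobs/view/<id>` and `/jobs/view/<slug>-<id>`) are an assumption about what the scanner encounters.

```javascript
// Made-up calibration points for illustration only; NOT real LinkedIn ID ranges.
// Rows must be sorted by ascending minId.
const EXAMPLE_ID_TABLE = [
  { minId: 1000, year: 2020 },
  { minId: 2000, year: 2022 },
  { minId: 3000, year: 2024 },
];

// Extracts the numeric job ID from a LinkedIn job URL, or returns null.
function linkedinJobId(url) {
  let parsed;
  try {
    parsed = new URL(url);
  } catch {
    return null; // Not a parseable absolute URL.
  }
  if (!/(^|\.)linkedin\.com$/i.test(parsed.hostname)) return null;
  const match = parsed.pathname.match(/\/jobs\/view\/(?:[^/]*-)?(\d+)\/?$/);
  return match ? Number(match[1]) : null;
}

// Returns the year of the highest bucket whose minId is <= id, else null.
function estimateYearFromId(id, table) {
  let year = null;
  for (const row of table) {
    if (id >= row.minId) year = row.year;
  }
  return year;
}
```

With the table held as data, a future `--calibrate-linkedin` mode would only need to regenerate the rows, and the lookup and its tests would stay unchanged.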

Test plan

  • node test-freshness.mjs — all 40 unit tests pass
  • node test-all.mjs --quick — freshness suite green (pre-existing debate/ absolute path warnings unrelated)
  • End-to-end: 9/9 stale caught, 1/1 fresh preserved against real URLs
  • Verify Playwright path still works on active Greenhouse/Ashby/Lever URL
  • Verify --fetch-mode parallel execution is safe (no shared browser state)
  • Full scan + pipeline run on user's real portals.yml (deferred to post-merge)

Data contract

All changes land in the system layer. Zero touches to cv.md, config/profile.yml, data/*, or user's portals.yml. Users opt into freshness filtering by copying the block from portals.example.yml during their next update-system.mjs apply.

🤖 Generated with Claude Code

matkatmusic and others added 2 commits April 10, 2026 22:06

Adds freshness filtering to scan + pipeline so months-old cached search
results are discarded before they consume evaluation tokens. Related to
santifer#163.

Detection (priority order):
- LinkedIn URL ID heuristic catches 2y+ old postings with zero network
- JSON-LD datePosted from Greenhouse/Ashby/Lever/LinkedIn
- Inline "datePosted" patterns (minified embeds)
- Visible text patterns (ISO, long-form, "Posted N days/weeks/months ago")
- Existing Greenhouse ?error=true redirect + body text signals

check-liveness.mjs extensions:
- New --fetch-mode for batch workers (no Playwright dep)
- New --json output for machine consumption
- New --classify output ("fresh|stale|expired|unverified")
- Honors freshness: config in portals.yml (defaults: 60d max, 30d warn)
- Fallback chain + SPA short-circuit when JSON-LD provides positive date
- Pure functions exported for testing

Mode file changes:
- scan.md step 7.5 reworked to call check-liveness (Playwright or fetch)
- New skipped_stale status in scan-history.tsv
- pipeline.md adds pre-eval freshness check; stale URLs get minimal
  SKIPPED_STALE report instead of full A-F evaluation
- _shared.md codifies automatic Red Flags penalty for stale postings
- **Posted:** field now required in every report header

Config + docs:
- freshness: block added to templates/portals.example.yml
- ARCHITECTURE.md documents the detection pipeline
- CUSTOMIZATION.md explains tuning thresholds per market type

Tests:
- test-freshness.mjs: 40 unit tests covering JSON-LD, visible text,
  LinkedIn ID heuristic, boundaries, config loader
- Wired into test-all.mjs

End-to-end verification against a curated corpus from the user's first
real scan: 9/9 known-stale URLs caught, 1/1 known-fresh preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Brings in santifer's v1.3.0 release including:
- liveness-core.mjs extraction (clean pure-function classifier)
- check-liveness.mjs refactor (visible apply control filtering, fixes
  Workday split-view false positives)
- scan.mjs zero-token portal scanner (Greenhouse/Ashby/Lever APIs)
- Block G "Posting Legitimacy" qualitative assessment in _shared.md
- followup-cadence.mjs follow-up tracker
- Japanese + Russian language modes
- Nix flake devshell, community files (CoC, governance, security)

Conflict resolution strategy:

  liveness-core.mjs:
    Extended santifer's classifyLiveness with the freshness pure functions
    (extractPostingDate, linkedinIdToYear, ageInDays, classifyFreshness,
    loadFreshnessConfig, FRESHNESS_DEFAULTS). One module, two concerns,
    both pure — keeps the architecture clean.

  check-liveness.mjs:
    Took santifer's improved Playwright shell verbatim (visible apply
    controls filtering nav/header/footer) and added a freshness layer
    on top: extracts datePosted from the rendered HTML and consults
    classifyFreshness. Added --fetch-mode for batch workers (no
    Playwright dep), --json/--classify CLI flags. Added LinkedIn ToS
    guard: fetch-mode never makes HTTP requests to linkedin.com (per
    CONTRIBUTING.md). Recent LinkedIn URLs return 'unverified' so the
    user can verify manually.

  modes/_shared.md:
    Both my freshness penalty rules and santifer's Block G now coexist.
    Added a bridge sentence explaining the layering: freshness is the
    deterministic pre-filter, Block G is the qualitative assessment of
    what survives the filter. Both consume the same datePosted signal.

  modes/scan.md:
    Took santifer's improved 'active' criterion (visible apply control
    in main content, not nav/footer) and merged with my freshness
    instructions. Added the LinkedIn ToS exception for batch mode.

  test-all.mjs:
    Kept BOTH new test sections — the freshness unit tests (2b) and
    santifer's liveness classification tests (3). Updated freshness
    test description to reference liveness-core.mjs.

  test-freshness.mjs:
    Updated imports to point at liveness-core.mjs (where the pure
    functions now live).

Verification after merge:
- node test-freshness.mjs → 40/40 pass
- node test-all.mjs --quick → 64 passed (was 59 pre-merge), 6 failures
  unrelated (pre-existing absolute paths in untracked debates/)
- Live smoke test: Suno active (8d JSON-LD), Bloomberg LinkedIn 2020
  ID expired without fetch, recent LinkedIn ID returns unverified
  without fetch (ToS guard)
- Block G's "Posting age" signal now reads from the same datePosted
  field that the freshness filter consumes — one extraction, two
  consumers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Feature: Detect and filter stale/expired job postings by posting date
