Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
302 changes: 264 additions & 38 deletions check-liveness.mjs
Original file line number Diff line number Diff line change
@@ -1,34 +1,76 @@
#!/usr/bin/env node

/**
* check-liveness.mjs — Playwright job link liveness checker
* check-liveness.mjs — Job posting liveness + freshness checker
*
* Tests whether job posting URLs are still active or have expired.
* Uses the same detection logic as scan.md step 7.5.
* Zero Claude API tokens — pure Playwright.
* Two execution modes:
* - default: Playwright (renders SPAs, follows redirects, sees innerText)
* - --fetch-mode: HTTP-only via fetch() (no JS, no browser; for batch workers)
*
* Liveness comes from `classifyLiveness` in liveness-core.mjs.
* Freshness comes from `classifyFreshness` in liveness-core.mjs.
*
* LinkedIn ToS: per CONTRIBUTING.md, fetch-mode never makes HTTP requests
* to linkedin.com. The LinkedIn URL ID heuristic catches old postings at
* zero network cost; recent LinkedIn URLs in fetch mode return `unverified`.
*
* Usage:
* node check-liveness.mjs <url1> [url2] ...
* node check-liveness.mjs --fetch-mode <url>
* node check-liveness.mjs --json <url>
* node check-liveness.mjs --classify <url>
* node check-liveness.mjs --file urls.txt
*
* Exit code: 0 if all active, 1 if any expired or uncertain
* Flags can combine: --fetch-mode --json --file urls.txt
*
* Exit code: 0 if all fresh+active, 1 if any stale/expired/uncertain
*/

import { chromium } from 'playwright';
import { readFile } from 'fs/promises';
import { classifyLiveness } from './liveness-core.mjs';
import {
classifyLiveness,
classifyFreshness,
extractPostingDate,
linkedinIdToYear,
ageInDays,
loadFreshnessConfig,
} from './liveness-core.mjs';

// ─────────────────────────────────────────────────────────────────────
// Playwright check (santifer's logic + freshness layer)
// ─────────────────────────────────────────────────────────────────────

async function checkUrlPlaywright(page, url, config) {
// LinkedIn ID heuristic — fast pre-filter, no network needed
const linkedinYear = linkedinIdToYear(url);
if (linkedinYear !== null) {
const ageYears = new Date().getFullYear() - linkedinYear;
if (ageYears >= 2) {
return {
url,
result: 'expired',
reason: `LinkedIn URL ID maps to ~${linkedinYear} (${ageYears}y old)`,
datePosted: null,
ageInDays: ageYears * 365,
freshness: 'expired',
};
}
}

async function checkUrl(page, url) {
try {
const response = await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 15000 });

const status = response?.status() ?? 0;

// Give SPAs (Ashby, Lever, Workday) time to hydrate
await page.waitForTimeout(2000);

const finalUrl = page.url();
const bodyText = await page.evaluate(() => document.body?.innerText ?? '');
const html = await page.content();

// Extract apply controls — santifer's improved logic that filters out
// nav/header/footer (fixes the Workday split-view false-positive bug)
const applyControls = await page.evaluate(() => {
const candidates = Array.from(
document.querySelectorAll('a, button, input[type="submit"], input[type="button"], [role="button"]')
Expand Down Expand Up @@ -62,55 +104,239 @@ async function checkUrl(page, url) {
.filter(Boolean);
});

return classifyLiveness({ status, finalUrl, bodyText, applyControls });
// Liveness from santifer's classifier
const liveness = classifyLiveness({ status, finalUrl, bodyText, applyControls });

// Freshness layer — extract date from rendered HTML
const datePosted = extractPostingDate(html);
const days = ageInDays(datePosted);
const freshness = classifyFreshness(datePosted, config);
const dateStr = datePosted?.toISOString().slice(0, 10) ?? null;

// If liveness says expired, that wins (URL is dead, age is moot)
if (liveness.result === 'expired') {
return { url, ...liveness, datePosted: dateStr, ageInDays: days, freshness: 'expired' };
}

// If freshness says expired, override active liveness — too old to bother
if (freshness === 'expired') {
return {
url,
result: 'expired',
reason: `posting is ${days}d old (max: ${config.max_age_days})`,
datePosted: dateStr,
ageInDays: days,
freshness,
};
}

return { url, ...liveness, datePosted: dateStr, ageInDays: days, freshness };

} catch (err) {
return { result: 'expired', reason: `navigation error: ${err.message.split('\n')[0]}` };
return {
url,
result: 'expired',
reason: `navigation error: ${err.message.split('\n')[0]}`,
datePosted: null,
ageInDays: null,
freshness: 'expired',
};
}
}

async function main() {
const args = process.argv.slice(2);
// ─────────────────────────────────────────────────────────────────────
// Fetch-mode check (no Playwright; for batch workers)
// ─────────────────────────────────────────────────────────────────────

async function checkUrlFetch(url, config) {
// LinkedIn ToS guard — never hit linkedin.com directly
if (/linkedin\.com\//i.test(url)) {
const linkedinYear = linkedinIdToYear(url);
if (linkedinYear !== null) {
const ageYears = new Date().getFullYear() - linkedinYear;
if (ageYears >= 2) {
return {
url,
result: 'expired',
reason: `LinkedIn URL ID maps to ~${linkedinYear} (${ageYears}y old)`,
datePosted: null,
ageInDays: ageYears * 365,
freshness: 'expired',
};
}
}
// Recent LinkedIn URL — heuristic doesn't catch it. Per CONTRIBUTING.md
// we don't fetch LinkedIn directly. Return uncertain so the caller can
// decide (Playwright path is OK if user runs it interactively).
return {
url,
result: 'uncertain',
reason: 'LinkedIn fetch blocked by ToS — use Playwright mode if needed',
datePosted: null,
ageInDays: null,
freshness: 'unverified',
};
}

try {
const response = await fetch(url, {
redirect: 'follow',
headers: {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
'Accept-Language': 'en-US,en;q=0.5',
},
signal: AbortSignal.timeout(15000),
});

const status = response.status;
const finalUrl = response.url;
const html = await response.text();

const datePosted = extractPostingDate(html);
const days = ageInDays(datePosted);
const freshness = classifyFreshness(datePosted, config);
const dateStr = datePosted?.toISOString().slice(0, 10) ?? null;

// Strip HTML tags for liveness pattern matching
const bodyText = html
.replace(/<script[\s\S]*?<\/script>/gi, ' ')
.replace(/<style[\s\S]*?<\/style>/gi, ' ')
.replace(/<[^>]+>/g, ' ')
.replace(/\s+/g, ' ')
.trim();

// Strongest positive: JSON-LD datePosted is present + fresh ──
// ATS platforms only embed JSON-LD when the job is live. SPAs (Ashby,
// Lever, Workday) have minimal stripped text but rich JSON-LD payloads,
// so this short-circuit avoids false "insufficient content" rejections.
if (datePosted) {
if (freshness === 'expired') {
return { url, result: 'expired', reason: `posting is ${days}d old (max: ${config.max_age_days})`, datePosted: dateStr, ageInDays: days, freshness };
}
return { url, result: 'active', reason: `JSON-LD datePosted (${days}d old)`, datePosted: dateStr, ageInDays: days, freshness };
}

// No date — fall back to liveness classifier with empty applyControls
// (fetch-mode can't reliably extract them without rendering)
const liveness = classifyLiveness({
status,
finalUrl,
bodyText,
applyControls: [], // fetch-mode has no DOM, so no apply controls
});

if (args.length === 0) {
console.error('Usage: node check-liveness.mjs <url1> [url2] ...');
console.error(' node check-liveness.mjs --file urls.txt');
return { url, ...liveness, datePosted: null, ageInDays: null, freshness };

} catch (err) {
return {
url,
result: 'expired',
reason: `fetch error: ${err.message.split('\n')[0]}`,
datePosted: null,
ageInDays: null,
freshness: 'expired',
};
}
}

// ─────────────────────────────────────────────────────────────────────
// Output formatters
// ─────────────────────────────────────────────────────────────────────

function formatHuman(r) {
const icon = { active: '✅', expired: '❌', uncertain: '⚠️' }[r.result] ?? '?';
const ageStr = r.ageInDays != null ? ` (${r.ageInDays}d old)` : '';
console.log(`${icon} ${r.result.padEnd(10)} ${r.url}${ageStr}`);
if (r.result !== 'active' || r.freshness === 'stale') {
console.log(` ${r.reason}`);
}
}

function formatJson(r) {
console.log(JSON.stringify(r));
}

function formatClassify(r) {
console.log(r.freshness);
}

// ─────────────────────────────────────────────────────────────────────
// CLI
// ─────────────────────────────────────────────────────────────────────

async function main() {
const argv = process.argv.slice(2);
if (argv.length === 0) {
console.error('Usage: node check-liveness.mjs [--fetch-mode] [--json|--classify] <url1> [url2] ...');
console.error(' node check-liveness.mjs [--fetch-mode] [--json|--classify] --file urls.txt');
process.exit(1);
}

const fetchMode = argv.includes('--fetch-mode');
const jsonOut = argv.includes('--json');
const classifyOut = argv.includes('--classify');
const fileIdx = argv.indexOf('--file');

let urls;
if (args[0] === '--file') {
const text = await readFile(args[1], 'utf-8');
if (fileIdx !== -1) {
const text = await readFile(argv[fileIdx + 1], 'utf-8');
urls = text.split('\n').map(l => l.trim()).filter(l => l && !l.startsWith('#'));
} else {
urls = args;
urls = argv.filter((a, i) => !a.startsWith('--') && i !== fileIdx + 1);
}

if (urls.length === 0) {
console.error('No URLs provided.');
process.exit(1);
}

console.log(`Checking ${urls.length} URL(s)...\n`);
const config = loadFreshnessConfig();
const formatter = jsonOut ? formatJson : classifyOut ? formatClassify : formatHuman;

const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
if (!jsonOut && !classifyOut) {
console.log(`Checking ${urls.length} URL(s) [${fetchMode ? 'fetch-mode' : 'playwright'}]...\n`);
}

let active = 0, expired = 0, uncertain = 0;
let active = 0, expired = 0, uncertain = 0, stale = 0;

// Sequential — project rule: never Playwright in parallel
for (const url of urls) {
const { result, reason } = await checkUrl(page, url);
const icon = { active: '✅', expired: '❌', uncertain: '⚠️' }[result];
console.log(`${icon} ${result.padEnd(10)} ${url}`);
if (result !== 'active') console.log(` ${reason}`);
if (result === 'active') active++;
else if (result === 'expired') expired++;
else uncertain++;
if (fetchMode) {
// Fetch mode — parallel-safe (no shared browser state)
const results = await Promise.all(urls.map(u => checkUrlFetch(u, config)));
for (const r of results) {
formatter(r);
if (r.result === 'active' && r.freshness !== 'stale') active++;
else if (r.result === 'expired' || r.freshness === 'expired') expired++;
else if (r.freshness === 'stale') stale++;
else uncertain++;
}
} else {
// Playwright mode — sequential per project rule
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
for (const url of urls) {
const r = await checkUrlPlaywright(page, url, config);
formatter(r);
if (r.result === 'active' && r.freshness !== 'stale') active++;
else if (r.result === 'expired' || r.freshness === 'expired') expired++;
else if (r.freshness === 'stale') stale++;
else uncertain++;
}
await browser.close();
}

await browser.close();
if (!jsonOut && !classifyOut) {
console.log(`\nResults: ${active} active ${stale} stale ${expired} expired ${uncertain} uncertain`);
console.log(`Freshness thresholds: warn=${config.warn_age_days}d, max=${config.max_age_days}d`);
}

console.log(`\nResults: ${active} active ${expired} expired ${uncertain} uncertain`);
if (expired > 0 || uncertain > 0) process.exit(1);
if (expired > 0 || uncertain > 0 || stale > 0) process.exit(1);
}

main().catch(err => {
console.error('Fatal:', err.message);
process.exit(1);
});
// Only run main() when invoked directly, not when imported by tests
if (import.meta.url === `file://${process.argv[1]}`) {
main().catch(err => {
console.error('Fatal:', err.message);
process.exit(1);
});
}
34 changes: 34 additions & 0 deletions docs/ARCHITECTURE.md
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,40 @@ Scripts maintain data consistency:
| `dedup-tracker.mjs` | Removes duplicate entries by company+role |
| `normalize-statuses.mjs` | Maps status aliases to canonical values |
| `cv-sync-check.mjs` | Validates setup consistency |
| `check-liveness.mjs` | Liveness + freshness check (see below) |

## Freshness Filtering

Both `scan` and `pipeline` modes call `check-liveness.mjs` to filter out stale or expired job postings before they consume evaluation tokens.

**Two execution modes:**
- **Playwright** (default): renders SPAs, follows redirects, sees `innerText`. Use when running scan interactively.
- **`--fetch-mode`**: HTTP-only via `fetch()`. No JS execution, no browser. Use in batch workers (`claude -p`) where Playwright is unavailable. JSON-LD payloads are embedded server-side on Greenhouse, Ashby, and Lever, so fetch-mode catches dates on those platforms.

**Detection signals (priority order):**
1. **LinkedIn URL ID heuristic** — sequential job IDs leak posting year. Catches 2-year-old postings at zero network cost.
2. **JSON-LD `datePosted`** — embedded by all major ATS platforms. Survives WebFetch summarization.
3. **Inline `"datePosted":"..."` patterns** — for minified embeds outside JSON-LD blocks.
4. **Visible text patterns** — `Posted on YYYY-MM-DD`, `Posted on Aug 15, 2025`, `Posted N days ago`, etc.
5. **Greenhouse `?error=true` redirect** — definitive closed-job signal.
6. **Body text patterns** — "no longer accepting applications", "position has been filled", etc.

**Classification:**
- `fresh`: age ≤ `warn_age_days` (default 30d)
- `stale`: `warn_age_days` < age ≤ `max_age_days` (default 60d) — still evaluated but Red Flags penalty applies
- `expired`: age > `max_age_days` — pipeline.md skips entirely with `SKIPPED_STALE` minimal report
- `unverified`: no date found AND `require_date: true` — treated as expired in strict mode

**Configuration:** `freshness:` block in `portals.yml`. See `docs/CUSTOMIZATION.md` for tuning guidance.

**CLI:**
```bash
node check-liveness.mjs --fetch-mode --json <url> # structured output
node check-liveness.mjs --fetch-mode --classify <url> # just "fresh|stale|expired"
node check-liveness.mjs <url> # interactive Playwright mode
```

**Tests:** `node test-freshness.mjs` (40 unit tests; runs as part of `node test-all.mjs`).

## Dashboard TUI

Expand Down
Loading
Loading