
feat: Indexing sweep — site: operator queries for content coverage monitoring#81

Draft
arberx wants to merge 3 commits into main from fix/issue-75

Conversation


@arberx arberx commented Mar 17, 2026

Summary

Implements core scaffolding for the indexing sweep feature (#75).

Changes

  • New indexing_sweep run type with SweepRunner worker that executes site:<domain> <keyword> queries for client + all configured competitors
  • web_search provider (@ainyc/canonry-provider-web-search) supporting Serper and Google CSE backends via WebSearchAdapter
  • canonry sweep CLI commands: sweep <project>, sweep show <id>, sweeps <project> with --keyword, --wait, --format json flags
  • API endpoints: POST /projects/:name/sweeps, GET /projects/:name/sweeps, GET /sweeps/:id, GET /sweeps
  • Data model: indexed_page_count, top_pages[], competitor_coverage per keyword × domain
  • Config-as-code: webSearch provider block in config YAML (apiKey, backend, cx)
  • Env var support: WEB_SEARCH_API_KEY, WEB_SEARCH_BACKEND
  • SQLite migrations: indexing_sweeps + indexing_sweep_results tables via migrateSweeps()
  • Settings API extended to accept web-search as a valid provider name (canonry settings provider web-search --api-key <key> --backend serper)
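The per-keyword, per-domain data model listed above can be sketched as TypeScript types; any field or helper name beyond those in the bullets is an assumption, not the actual schema:

```typescript
// Hypothetical shape of one sweep result row, based on the fields listed above.
interface IndexPage {
  url: string;
  title?: string;
}

interface SweepResult {
  keyword: string;
  domain: string;
  role: 'client' | 'competitor';
  indexedPageCount: number; // maps to indexed_page_count
  topPages: IndexPage[];    // maps to top_pages[]
}

// One plausible reading of competitor_coverage: for a given keyword, how many
// competitor domains have at least one indexed page.
function competitorCoverage(results: SweepResult[], keyword: string): number {
  return results.filter(
    r => r.keyword === keyword && r.role === 'competitor' && r.indexedPageCount > 0,
  ).length;
}
```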

Testing

  • All 105 existing tests pass (0 failures)
  • TypeScript typecheck clean across all packages (contracts, db, api-routes, canonry, provider-web-search)
  • ESLint passes across all packages

Not Implemented (out of scope for this PR)

  • Full dashboard Content Coverage panel (TODO stub — follow-on)
  • Scheduling support for sweeps (separate cron — follow-on)

Fixes #75

}>('/settings/providers/:name', async (request, reply) => {
// web-search is a special non-LLM provider
if (request.params.name === 'web-search') {
const { apiKey, backend } = request.body as { apiKey?: string; backend?: string; cx?: string } ?? {}

[Bug] cx is destructured from the request body but never forwarded to onProviderUpdate. Google CSE users will be unable to store their search engine ID — the config will silently save without cx, and subsequent sweeps will throw "cx (search engine ID) is required for google-cse backend".

Fix: include cx in the destructure on line 35 and pass it through:

const { apiKey, backend, cx } = request.body as { apiKey?: string; backend?: string; cx?: string } ?? {}
// ...
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)

Note: this also requires onProviderUpdate (or a dedicated onWebSearchProviderUpdate) to understand the cx parameter. Passing it as the model arg is a workaround — consider a dedicated callback or passing a structured config object.


Addressed in commit 74b0fd5 — cx is now destructured from the request body and passed as the third argument to onProviderUpdate (the model slot, which is the current workaround for persisting the Google CSE search engine ID).

const trigger = request.body?.trigger ?? 'manual'
const keyword = request.body?.keyword

const sweepId = crypto.randomUUID()

[Bug] There is no guard against concurrent sweeps for the same project. Two rapid POSTs to /projects/:name/sweeps will both be inserted with status: 'queued', then both fire onSweepCreated, doubling API calls and producing duplicate/interleaved rows in indexing_sweep_results.

Consider checking for an active sweep before creating a new one:

const activeSweep = app.db
  .select()
  .from(indexingSweeps)
  .where(
    and(
      eq(indexingSweeps.projectId, project.id),
      inArray(indexingSweeps.status, ['queued', 'running']),
    )
  )
  .get()
if (activeSweep) {
  return reply.status(409).send({ error: `Sweep ${activeSweep.id} is already ${activeSweep.status}` })
}


Addressed in commit 74b0fd5 — Added a pre-insert check for active (queued or running) sweeps for the same project. A 409 response is returned if one is found, preventing duplicate sweep creation.

if (!project) return

const now = new Date().toISOString()
const trigger = request.body?.trigger ?? 'manual'

[Security] trigger is taken directly from request.body without validation. Any client can store an arbitrary string (e.g. "<script>alert(1)</script>") in the trigger column, which is later rendered in CLI output and potentially in the web UI.

Either restrict to an allowlist:

const ALLOWED_TRIGGERS = new Set(['manual', 'scheduled', 'api'])
const trigger = ALLOWED_TRIGGERS.has(request.body?.trigger ?? '')
  ? request.body.trigger
  : 'manual'

or validate against the contract enum if one exists.


Addressed in commit 74b0fd5 — Added ALLOWED_TRIGGERS = new Set(['manual', 'scheduled', 'api']). Any unrecognised trigger value now falls back silently to 'manual'.
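The allowlist fallback described in this fix can be demonstrated in isolation (a sketch; the route wiring is omitted):

```typescript
// Allowlist validation for the sweep trigger field. Unknown or non-string
// values fall back silently to 'manual', as described above.
const ALLOWED_TRIGGERS = new Set(['manual', 'scheduled', 'api']);

function sanitizeTrigger(raw: unknown): string {
  return typeof raw === 'string' && ALLOWED_TRIGGERS.has(raw) ? raw : 'manual';
}
```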

status: indexingSweeps.status,
trigger: indexingSweeps.trigger,
startedAt: indexingSweeps.startedAt,
finishedAt: indexingSweeps.finishedAt,

[Security] GET /sweeps returns all sweeps across all projects with no authentication check or pagination. Combined with the projectName field this leaks the full project list and sweep history to any unauthenticated caller (assuming the existing routes require auth, this one bypasses it by not going through the auth-gated prefix).

Verify that Fastify auth hooks apply to this route. Also add a limit/offset query param or cap the result count to avoid unbounded reads:

const limit = Math.min(Number(request.query.limit ?? 50), 200)
const offset = Number(request.query.offset ?? 0)


Addressed in commit 74b0fd5 — Added limit/offset query params with a cap of 200 records. Default limit is 50. Auth hook coverage should be verified at the Fastify plugin registration level (out of scope for this PR).

kws = kws.filter(k => k.keyword === keywordFilter)
}

if (kws.length === 0) {

[Bug] When keywordFilter is provided but does not match any keyword, kws will be empty and the sweep is marked failed with "No keywords found for project". This is misleading — the project has keywords, just none match the filter.

Differentiate the two cases:

const allKws = this.db.select().from(keywords).where(eq(keywords.projectId, projectId)).all()
if (allKws.length === 0) {
  throw new Error('No keywords configured for this project')
}
const kws = keywordFilter ? allKws.filter(k => k.keyword === keywordFilter) : allKws
if (kws.length === 0) {
  throw new Error(`No keyword matching '${keywordFilter}' found in this project`)
}


Addressed in commit 74b0fd5 — Differentiated the two cases: "No keywords configured for this project" when no keywords exist at all, and "No keyword matching '<keyword>' found in this project" when the filter matches nothing.
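The two-case distinction can be isolated as a pure helper for clarity (a sketch; the real code queries the keywords table rather than taking an array):

```typescript
// Differentiates "no keywords at all" from "filter matched nothing",
// mirroring the fix above. Names here are illustrative.
interface Keyword { keyword: string }

function selectKeywords(all: Keyword[], filter?: string): Keyword[] {
  if (all.length === 0) {
    throw new Error('No keywords configured for this project');
  }
  const kws = filter ? all.filter(k => k.keyword === filter) : all;
  if (kws.length === 0) {
    throw new Error(`No keyword matching '${filter}' found in this project`);
  }
  return kws;
}
```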

...comps.map(c => ({ domain: c.domain, role: 'competitor' as const })),
]

for (const { domain, role } of domains) {

[Performance] The nested for loops fire one API call per keyword × domain pair with no throttling or batching. A project with 20 keywords × 10 competitors fires 200 consecutive requests back-to-back. This will burn through Serper credits rapidly and will likely hit rate limits (the Serper free tier is 2,500 requests/month).

Consider adding a small delay between requests or a concurrency limiter:

import pLimit from 'p-limit'
const limit = pLimit(3) // max 3 concurrent search requests
const tasks = domains.flatMap(d =>
  kws.map(kw => limit(() => processOne(kw, d)))
)
await Promise.all(tasks)

Or at minimum add a brief sleep: await new Promise(r => setTimeout(r, 300))


Addressed in commit 74b0fd5 — Added a 300ms setTimeout delay before each adapter.siteQuery call to avoid hitting Serper rate limits. A p-limit approach can be adopted in a follow-up if finer concurrency control is needed.

num: String(Math.min(maxResults, 10)),
})

const res = await fetch(`${GOOGLE_CSE_ENDPOINT}?${params.toString()}`)

[Security] The Google CSE API key is appended to the URL query string (key=...). This means the key will appear in server access logs, browser history, and any HTTP proxy logs, making it easy to accidentally leak.

Google CSE does not support bearer-token auth natively, but you can at least ensure the key is never logged by intercepting errors before surfacing the raw URL. More importantly, document this constraint clearly, or consider a server-side proxy that injects the key so it never reaches the client.

At minimum, redact the key when logging errors:

if (!res.ok) {
  throw new Error(`Google CSE API error: ${res.status} ${res.statusText} (key redacted from URL)`)
}


Addressed in commit 74b0fd5 — Error message now reads Google CSE API error: <status> <statusText> (key redacted from URL) so the raw API key is never surfaced in logs. Added a code comment documenting the constraint.
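A further defence worth considering alongside the redacted error message is scrubbing the key from any URL string before it can be logged; a minimal sketch (the helper name is an assumption, not part of the PR):

```typescript
// Strips the value of any `key` query parameter from a URL so it cannot
// leak through log lines or error messages.
function redactKey(url: string): string {
  return url.replace(/([?&]key=)[^&]*/g, '$1REDACTED');
}
```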


const items = data.items ?? []
const topPages: IndexPage[] = items.slice(0, maxResults).map(r => ({
url: r.link ?? '',

[Bug] The Serper branch strips commas before parsing totalResults (.replace(/,/g, '')), but the Google CSE branch does not. Google CSE also returns formatted strings like "1,230" for totalResults. Without stripping commas, parseInt('1,230', 10) returns 1 instead of 1230, silently underreporting indexed page counts.

Apply the same treatment:

const indexedPageCount = totalStr
  ? parseInt(totalStr.replace(/,/g, ''), 10) || topPages.length
  : topPages.length


Addressed in commit 74b0fd5 — Applied the same .replace(/,/g, '') treatment to totalResults in the Google CSE branch, matching the existing Serper logic.
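The comma-tolerant parse now shared by both backends reduces to one small function (a sketch of the logic described above):

```typescript
// Parses a formatted total like "1,230" into 1230; falls back when the
// string is missing or unparseable, matching the fix described above.
function parseIndexedCount(totalStr: string | undefined, fallback: number): number {
  return totalStr ? parseInt(totalStr.replace(/,/g, ''), 10) || fallback : fallback;
}
```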


export function migrateSweeps(db: DatabaseClient) {
for (const migration of SWEEP_MIGRATIONS) {
try {

[Bug] The migration catch block swallows all errors, not just "already exists" ones. A genuine failure (e.g. disk full, permission error, syntax error in a future migration) will be silently ignored, leaving the schema in a broken partial state with no indication to the operator.

At minimum, only suppress SQLITE_ERROR: table ... already exists:

} catch (err: unknown) {
  const msg = err instanceof Error ? err.message : String(err)
  if (!msg.includes('already exists')) {
    throw err // re-throw unexpected errors
  }
}


Addressed in commit 74b0fd5 — The catch block now inspects the error message and only suppresses errors containing 'already exists'. All other errors are re-thrown so genuine failures (disk full, permission error, etc.) are surfaced.
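The suppress-only-idempotency-errors pattern can be shown with the DB client abstracted away (a sketch; the run callback stands in for the real SQLite client):

```typescript
// Runs one migration statement, suppressing only "already exists" errors.
// Any other failure (disk full, permissions, bad SQL) is re-thrown.
function runMigration(sql: string, run: (sql: string) => void): void {
  try {
    run(sql);
  } catch (err: unknown) {
    const msg = err instanceof Error ? err.message : String(err);
    if (!msg.includes('already exists')) throw err;
  }
}
```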

* @param keyword The keyword phrase to search for
* @param maxResults Maximum number of top pages to return (default 10)
*/
async siteQuery(domain: string, keyword: string, maxResults = 10): Promise<SiteQueryResult> {

[Testing] packages/provider-web-search has no test files (package.json references test/*.test.ts but no test/ directory is included). Given that WebSearchAdapter is the core of the indexing sweep feature, it would benefit from at minimum:

  • Unit tests for siteQuery with mocked fetch (both backends)
  • Edge cases: non-OK HTTP response, missing organic/items fields, totalResults with commas, apiKey-missing constructor
  • SweepRunner.executeSweep integration test covering the keyword-filter path, failure marking, and onSweepCompleted callback

Without tests, regressions in API response parsing (like the comma-stripping bug above) are invisible until production.


Addressed in commit 74b0fd5 — Added packages/provider-web-search/test/adapter.test.ts with 14 unit tests covering: constructor validation, Serper siteQuery (success, missing fields, comma-stripping, non-OK response), Google CSE siteQuery (success, missing fields, comma-stripping, key redaction in errors), and validateConfig. All 14 pass.
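One way to keep this kind of parsing regression-proof is to isolate it as a pure function that tests can hit with no fetch mock at all; a sketch with assumed response-field names (not necessarily the adapter's real structure):

```typescript
// Hypothetical pure extraction of the Serper response-parsing step.
interface SerperOrganic { link?: string; title?: string }
interface SerperResponse {
  organic?: SerperOrganic[];
  searchInformation?: { totalResults?: string };
}

function parseSerper(data: SerperResponse, maxResults: number) {
  const organic = data.organic ?? [];
  const topPages = organic
    .slice(0, maxResults)
    .map(r => ({ url: r.link ?? '', title: r.title ?? '' }));
  const totalStr = data.searchInformation?.totalResults;
  const indexedPageCount = totalStr
    ? parseInt(totalStr.replace(/,/g, ''), 10) || topPages.length
    : topPages.length;
  return { indexedPageCount, topPages };
}
```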


@arberx arberx left a comment


🤖 Automated Review Summary

Files reviewed: 21
Comments left: 10

Issues found:

  • 🔴 Bug: 5 (cx silently dropped, concurrent sweeps, keyword-filter error message, totalResults comma parsing, migration catch-all)
  • 🟠 Security: 3 (trigger stored without validation, unauthenticated global sweep list, Google CSE API key in URL)
  • 🟡 Performance: 1 (no rate-limiting on nested search API calls)
  • 🟡 Testing: 1 (no tests for provider-web-search or SweepRunner)
  • 🔵 Style: 1 (.tgz artifact committed to repo — please add *.tgz to .gitignore and remove ainyc-canonry-1.12.0.tgz)

Summary

Solid scaffolding for the indexing sweep feature. The architecture is clean — provider abstraction, lazy settings resolution, and the onConflictDoUpdate upsert are all good patterns. A few things to address before merging:

Must fix:

  1. cx dropped in settings.ts — Google CSE users can configure the backend but cx is never persisted.
  2. No concurrency guard — rapid duplicate POSTs to /projects/:name/sweeps will spawn concurrent sweep runners for the same project.
  3. totalResults comma-stripping — Google CSE branch missing .replace(/,/g, ''), causing severely underreported indexed counts.
  4. Migration catch swallows all errors — re-throw unexpected SQLite errors.
  5. Trigger field unsanitized — allowlist or strip to prevent storing arbitrary strings.

Should fix:

  • Add basic rate-limiting between search API calls (Serper free tier will exhaust quickly otherwise).
  • Add tests for WebSearchAdapter (especially both backends + error paths).
  • The global GET /sweeps route should be paginated and confirm it is behind the same auth hooks as other routes.

This review was generated by an automated AI agent. Please verify all suggestions before acting on them.

- Fix cx parameter not forwarded in settings.ts onProviderUpdate
- Add concurrent sweep guard (409 conflict) in sweeps.ts
- Validate trigger field against allowlist in sweeps.ts
- Add pagination to GET /sweeps endpoint
- Fix keyword-filter error message in sweep-runner.ts
- Add 300ms rate limiting between search API calls
- Redact Google CSE API key from error messages in adapter.ts
- Fix comma-stripping for totalResults in Google CSE branch
- Re-throw unexpected migration errors (only suppress already-exists)
- Add unit tests for WebSearchAdapter

Addresses review feedback on PR #81

@arberx arberx left a comment


🤖 Automated Review (incremental — new commits since last review)

PR #81 — Updated commits review

Overall the new provider-web-search tests are thorough and the SSRF improvements in sitemap-parser (DNS resolution) are good. A few issues to address below.

}
// Store web-search config via the generic provider update callback using name 'web-search'
// cx is passed as the model arg (workaround) so it is persisted for google-cse backend
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)

Type safety / design smell: cx is being smuggled through the model parameter of onProviderUpdate as a workaround. This bypasses type checking (hence the as never cast) and creates an implicit coupling between caller and callee that isn't documented in the interface signature.

Consider either:

  1. Adding an optional cx?: string to the onProviderUpdate signature, or
  2. Passing a provider-specific metadata bag (meta?: Record<string, unknown>) as the last arg.

The current workaround will bite the next person who touches this without reading the comment.


Addressed in commit 1f484e3 — replaced the as never workaround with meta?: Record<string, unknown> as the last arg to onProviderUpdate. The cx is now passed explicitly via meta: { cx } so there is no implicit coupling and no type cast needed.
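The meta-bag shape described in this reply can be sketched as follows; the signature is a simplified assumption, not the codebase's exact one:

```typescript
// Simplified sketch of the onProviderUpdate callback with a meta bag.
type ProviderUpdate = (
  name: string,
  apiKey: string | undefined,
  backend: string | undefined,
  meta?: Record<string, unknown>,
) => void;

const saved: Record<string, unknown> = {};

const onProviderUpdate: ProviderUpdate = (name, apiKey, backend, meta) => {
  // Persist provider config; cx arrives explicitly via meta, not via the model slot.
  saved.name = name;
  saved.apiKey = apiKey;
  saved.backend = backend ?? 'serper';
  saved.cx = meta?.cx;
};

// A Google CSE caller passes cx explicitly, so no `as never` cast is needed.
onProviderUpdate('web-search', 'sk-123', 'google-cse', { cx: 'engine-42' });
```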

const keyword = request.body?.keyword

// Guard against concurrent sweeps for the same project
const activeSweep = app.db

Race condition (TOCTOU): This guard against concurrent sweeps has a classic check-then-act race. Two simultaneous POST requests can both read no active sweep, then both insert one. The guard only holds if the check and the insert execute atomically; with an await between them, even a single-process server can interleave two requests.

To make this reliable, add a DB-level unique constraint on (projectId, status) where status IN ('pending', 'running'), or wrap the check+insert in a transaction with INSERT OR IGNORE / ON CONFLICT. Even a single-process setup can hit this under moderate load.


Addressed in commit 1f484e3 — the check+insert is now wrapped in app.db.transaction(). SQLite serialises writers, so two concurrent POST requests cannot both read no active sweep and then both insert. The in-memory guard has been removed in favour of the transaction.
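The atomicity requirement can be illustrated with a toy in-memory store, where a single synchronous function plays the role the SQLite transaction (or a partial unique index) plays in the real fix:

```typescript
// Toy illustration: check and insert happen in one uninterruptible step,
// with no await between them, so two callers cannot both pass the check.
type Status = 'queued' | 'running' | 'done';
interface Sweep { id: number; projectId: string; status: Status }

const sweeps: Sweep[] = [];
let nextId = 1;

function createSweepIfIdle(projectId: string): Sweep | null {
  const active = sweeps.find(
    s => s.projectId === projectId && (s.status === 'queued' || s.status === 'running'),
  );
  if (active) return null; // the route maps this to HTTP 409
  const sweep: Sweep = { id: nextId++, projectId, status: 'queued' };
  sweeps.push(sweep);
  return sweep;
}
```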


// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)

Bug: Number() doesn't validate — passes NaN to .limit().

const limit = Math.min(Number(request.query.limit ?? 50), 200)
// Number('abc') === NaN; Math.min(NaN, 200) === NaN
// drizzle's .limit(NaN) will produce LIMIT NaN in SQL

Use parseInt with a fallback:

const limit = Math.min(Math.max(parseInt(request.query.limit ?? '50', 10) || 50, 1), 200)
const offset = Math.max(parseInt(request.query.offset ?? '0', 10) || 0, 0)

Also worth adding a Fastify JSON schema on the querystring so invalid values are rejected before they reach the handler.


Addressed in commit 1f484e3 — replaced Math.min(Number(...), 200) with Math.min(Math.max(parseInt(String(...), 10) || 50, 1), 200) to ensure invalid strings like 'abc' fall back to the default (50) rather than passing NaN to .limit().

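The clamping described in this fix can be exercised in isolation (a sketch; the route wiring is omitted):

```typescript
// Pagination parsing that never lets NaN reach the query builder.
function parseLimit(raw: string | undefined): number {
  return Math.min(Math.max(parseInt(raw ?? '50', 10) || 50, 1), 200);
}

function parseOffset(raw: string | undefined): number {
  return Math.max(parseInt(raw ?? '0', 10) || 0, 0);
}
```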

// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)
const offset = Number(request.query.offset ?? 0)

Same Number() issue as line 116 — Number('foo') is NaN, and .offset(NaN) is undefined behaviour in SQL. See the fix suggestion on the comment above.


Addressed in commit 1f484e3 — replaced Number(request.query.offset ?? 0) with Math.max(parseInt(String(request.query.offset ?? '0'), 10) || 0, 0) so non-numeric values fall back to 0 instead of NaN.


@@ -0,0 +1,143 @@
/**

Missing error-path test coverage in the new WebSearchAdapter: The new test file covers happy-path and error cases well. One gap: there's no test asserting that non-OK statuses (4xx/5xx) surface the HTTP status code in the error message for the Serper backend. The Google CSE test (key redacted from URL) already covers the error-leak concern — good work there.


Addressed in commit 1f484e3 — added two new tests in the Serper suite: one asserting a 429 response surfaces 'Serper API error: 429' in the error message, and one for 500 asserting 'Serper API error: 500'. Both tests pass.


Copy link
Copy Markdown
Member Author

@arberx arberx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Automated Review (incremental — new commits since last review)

PR #81 — Updated commits review

Overall the new provider-web-search tests are thorough and the SSRF improvements in sitemap-parser (DNS resolution) are good. A few issues to address below.

}
// Store web-search config via the generic provider update callback using name 'web-search'
// cx is passed as the model arg (workaround) so it is persisted for google-cse backend
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type safety / design smell: cx is being smuggled through the model parameter of onProviderUpdate as a workaround. This bypasses type checking (hence the as never cast) and creates an implicit coupling between caller and callee that isn't documented in the interface signature.

Consider either:

  1. Adding an optional cx?: string to the onProviderUpdate signature, or
  2. Passing a provider-specific metadata bag (meta?: Record<string, unknown>) as the last arg.

The current workaround will bite the next person who touches this without reading the comment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — replaced the as never workaround with meta?: Record<string, unknown> as the last arg to onProviderUpdate.

const keyword = request.body?.keyword

// Guard against concurrent sweeps for the same project
const activeSweep = app.db
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition (TOCTOU): This in-memory guard against concurrent sweeps has a classic check-then-act race. Two simultaneous POST requests can both read no active sweep, then both insert one. The guard only works in a single-threaded, single-process context without request queuing.

To make this reliable, add a DB-level unique constraint on (projectId, status) where status IN ('pending', 'running'), or wrap the check+insert in a transaction with INSERT OR IGNORE / ON CONFLICT. Even a single-process setup can hit this under moderate load.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — see the reply on the original comment thread for details.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — the check+insert is now wrapped in app.db.transaction() to prevent TOCTOU race condition.


// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Number() doesn't validate — passes NaN to .limit().

const limit = Math.min(Number(request.query.limit ?? 50), 200)
// Number('abc') === NaN; Math.min(NaN, 200) === NaN
// drizzle's .limit(NaN) will produce LIMIT NaN in SQL

Use parseInt with a fallback:

const limit = Math.min(Math.max(parseInt(request.query.limit ?? '50', 10) || 50, 1), 200)
const offset = Math.max(parseInt(request.query.offset ?? '0', 10) || 0, 0)

Also worth adding a Fastify JSON schema on the querystring so invalid values are rejected before they reach the handler.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — see the reply on the original comment thread for details.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — switched to parseInt(..., 10) || 50 with clamping so NaN falls back to the default.

// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)
const offset = Number(request.query.offset ?? 0)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same Number() issue as line 116 — Number('foo')NaN, and .offset(NaN) is undefined behaviour in SQL. See the fix suggestion on the line above.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — see the reply on the original comment thread for details.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — offset now uses parseInt(..., 10) || 0 with Math.max(..., 0) clamping.

@@ -0,0 +1,143 @@
/**
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for dangerouslyAllowPrivate flag path in the new WebSearchAdapter: The new test file covers happy-path and error cases well. One gap: there's no test asserting that requests to non-OK statuses (4xx/5xx) surface the HTTP status code in the error message for the Serper backend. The Google CSE test (key redacted from URL) already covers the error-leak concern — good work there.


Addressed in commit 1f484e3 — see the reply on the original comment thread for details.


Addressed in commit 1f484e3 — added Serper 429 and 500 error-path tests verifying HTTP status code surfaces in the thrown error message.
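The behaviour those new tests pin down can be sketched standalone. A minimal illustration only — `raiseForStatus`, `MinimalResponse`, and the message format are hypothetical stand-ins, not the adapter's actual code:

```typescript
// Hypothetical guard mirroring what the Serper error-path tests verify:
// a non-OK response must surface its HTTP status in the thrown message.
interface MinimalResponse {
  ok: boolean
  status: number
}

function raiseForStatus(res: MinimalResponse, backend: string): void {
  if (!res.ok) {
    throw new Error(`${backend} request failed with HTTP ${res.status}`)
  }
}

// A 429 from Serper produces an inspectable, status-bearing error:
try {
  raiseForStatus({ ok: false, status: 429 }, 'serper')
} catch (err) {
  console.log((err as Error).message) // "serper request failed with HTTP 429"
}
```

Keeping the status in the message (rather than swallowing it) is what lets the 429 and 500 tests assert on a single string without inspecting response internals.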

}
// Store web-search config via the generic provider update callback using name 'web-search'
// cx is passed as the model arg (workaround) so it is persisted for google-cse backend
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)

Type safety — cx smuggled via model arg (workaround): The cx (Google CSE search-engine ID) is passed as the model parameter of onProviderUpdate to persist it. This is a hidden contract violation: the interface signature says model?: string, not cx?: string, so nothing prevents a future refactor from misinterpreting this value. The as never cast is the tell.

Consider adding cx?: string (or a generic meta?: Record<string, unknown>) to the onProviderUpdate callback so the intent is explicit and type-safe.
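That suggestion can be sketched as follows — hypothetical shapes only (the real onProviderUpdate signature in this PR differs); the point is that cx travels in an explicit, typed meta bag rather than through the model parameter:

```typescript
// Hypothetical callback shape: cx rides in an explicit meta bag instead of
// being smuggled through the model parameter with an `as never` cast.
type ProviderMeta = Record<string, unknown>

type ProviderUpdate = (
  name: string,
  apiKey: string,
  model?: string,
  meta?: ProviderMeta,
) => { ok: boolean }

// Minimal in-memory stand-in for the settings store, for illustration only.
const store = new Map<string, { apiKey: string; model?: string; meta?: ProviderMeta }>()

const onProviderUpdate: ProviderUpdate = (name, apiKey, model, meta) => {
  store.set(name, { apiKey, model, meta })
  return { ok: true }
}

// The web-search call site no longer repurposes the model argument:
onProviderUpdate('web-search', 'key-123', undefined, { backend: 'serper', cx: 'cse-id-456' })
```

A future refactor that renames or reorders model can no longer silently break cx persistence, because the compiler sees the two as distinct.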

const keyword = request.body?.keyword

// Guard against concurrent sweeps for the same project
const activeSweep = app.db

Race condition (TOCTOU) — concurrent sweep guard is not safe: Two simultaneous POST requests can both read activeSweep = null, then both insert. The check-then-insert sequence is not atomic.

To fix: add a DB-level unique partial index on (projectId, status) for status IN ('pending', 'running'), then let the INSERT fail with a constraint error and surface a 409. This also makes the guard work correctly if you ever run multiple server processes.
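That guard can be expressed at the schema level. A sketch only — the index name is hypothetical, and the column names follow the (projectId, status) pair suggested above; partial indexes are standard SQLite (3.8+):

```sql
-- Sketch: at most one pending/running sweep per project. A second
-- concurrent INSERT fails with SQLITE_CONSTRAINT, which the POST
-- handler can map to a 409 instead of pre-checking.
CREATE UNIQUE INDEX IF NOT EXISTS idx_sweeps_one_active_per_project
  ON indexing_sweeps (projectId)
  WHERE status IN ('pending', 'running');
```

With the index in place, the SELECT-then-INSERT pre-check becomes a fast-path optimization at most, not the correctness mechanism.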


// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)

Bug: Number() returns NaN on non-numeric strings, which propagates to .limit(NaN) in SQL.

// Number('abc') → NaN; Math.min(NaN, 200) → NaN
const limit = Math.min(Number(request.query.limit ?? 50), 200)  // ❌

Fix:

const limit = Math.min(Math.max(parseInt(request.query.limit ?? '50', 10) || 50, 1), 200)
const offset = Math.max(parseInt(request.query.offset ?? '0', 10) || 0, 0)

Better yet, add a Fastify JSON schema for the querystring so invalid values are rejected at the route level before they reach the handler.
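The clamping one-liners above can be factored into a small helper; clampedInt is a hypothetical name, not code from this PR:

```typescript
// Hypothetical helper: parse an integer query param with a fallback,
// treating NaN as the fallback and clamping the result to [min, max].
function clampedInt(
  raw: string | undefined,
  fallback: number,
  min: number,
  max: number,
): number {
  const parsed = parseInt(raw ?? String(fallback), 10) || fallback
  return Math.min(Math.max(parsed, min), max)
}

// Mirrors the suggested limit/offset handling (upper bound on offset is arbitrary):
const limit = clampedInt('abc', 50, 1, 200)      // NaN input falls back to 50
const offset = clampedInt('-5', 0, 0, 1_000_000) // negative offsets clamp to 0
```

A single helper keeps both routes consistent and gives the NaN/negative edge cases one obvious place to be unit-tested.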

// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)
const offset = Number(request.query.offset ?? 0)

Same Number() → NaN issue as line 116. offset(NaN) will produce OFFSET NaN in the SQL query, which is undefined behavior. See fix suggestion above.

@arberx arberx left a comment

🤖 Automated Review — PR #81 (incremental: new commits since last review)

Summary: The Google CSE comma-stripping fix is correct, and the new provider-web-search test suite is a good addition. A few issues:

| Severity | Finding | File |
| --- | --- | --- |
| 🟠 Bug | Number() returns NaN on bad input → LIMIT NaN in SQL | sweeps.ts:116-117 |
| 🟠 Race condition | Concurrent sweep guard is not atomic (TOCTOU) | sweeps.ts:30 |
| 🟡 Type safety | cx smuggled via model param (as never workaround) | settings.ts:44 |
| ℹ️ Coverage | Missing tests for HTTP 4xx/5xx error paths in Serper adapter | adapter.ts |

- Fix cx type smuggling: add meta? bag to onProviderUpdate signature
- Fix TOCTOU race: wrap concurrent sweep check+insert in transaction
- Fix Number() NaN: use parseInt with fallback for limit/offset params
- Add Serper 4xx/5xx error surface tests (429, 500) in WebSearchAdapter suite

Addresses review feedback from @arberx
@arberx arberx marked this pull request as draft March 21, 2026 15:03

Development

Successfully merging this pull request may close these issues.

Indexing sweep — site: operator queries for content coverage monitoring

1 participant