
feat: Indexing sweep — site: operator queries for content coverage monitoring#81

Draft
arberx wants to merge 3 commits into main from fix/issue-75

Conversation


@arberx arberx commented Mar 17, 2026

Summary

Implements core scaffolding for the indexing sweep feature (#75).

Changes

  • New indexing_sweep run type with SweepRunner worker that executes site:<domain> <keyword> queries for client + all configured competitors
  • web_search provider (@ainyc/canonry-provider-web-search) supporting Serper and Google CSE backends via WebSearchAdapter
  • canonry sweep CLI commands: sweep <project>, sweep show <id>, sweeps <project> with --keyword, --wait, --format json flags
  • API endpoints: POST /projects/:name/sweeps, GET /projects/:name/sweeps, GET /sweeps/:id, GET /sweeps
  • Data model: indexed_page_count, top_pages[], competitor_coverage per keyword × domain
  • Config-as-code: webSearch provider block in config YAML (apiKey, backend, cx)
  • Env var support: WEB_SEARCH_API_KEY, WEB_SEARCH_BACKEND
  • SQLite migrations: indexing_sweeps + indexing_sweep_results tables via migrateSweeps()
  • Settings API extended to accept web-search as a valid provider name (canonry settings provider web-search --api-key <key> --backend serper)
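The per-keyword, per-domain data model listed above can be sketched as TypeScript types; any field or helper name beyond those in the bullets is an assumption, not the actual schema:

```typescript
// Hypothetical shape of one sweep result row, based on the fields listed above.
interface IndexPage {
  url: string;
  title?: string;
}

interface SweepResult {
  keyword: string;
  domain: string;
  role: 'client' | 'competitor';
  indexedPageCount: number; // maps to indexed_page_count
  topPages: IndexPage[];    // maps to top_pages[]
}

// One plausible reading of competitor_coverage: for a given keyword, how many
// competitor domains have at least one indexed page.
function competitorCoverage(results: SweepResult[], keyword: string): number {
  return results.filter(
    r => r.keyword === keyword && r.role === 'competitor' && r.indexedPageCount > 0,
  ).length;
}
```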

Testing

  • All 105 existing tests pass (0 failures)
  • TypeScript typecheck clean across all packages (contracts, db, api-routes, canonry, provider-web-search)
  • ESLint passes across all packages

Not Implemented (out of scope for this PR)

  • Full dashboard Content Coverage panel (TODO stub — follow-on)
  • Scheduling support for sweeps (separate cron — follow-on)

Fixes #75

}>('/settings/providers/:name', async (request, reply) => {
// web-search is a special non-LLM provider
if (request.params.name === 'web-search') {
const { apiKey, backend } = request.body as { apiKey?: string; backend?: string; cx?: string } ?? {}

[Bug] cx is destructured from the request body but never forwarded to onProviderUpdate. Google CSE users will be unable to store their search engine ID — the config will silently save without cx, and subsequent sweeps will throw "cx (search engine ID) is required for google-cse backend".

Fix: include cx in the destructure on line 35 and pass it through:

const { apiKey, backend, cx } = request.body as { apiKey?: string; backend?: string; cx?: string } ?? {}
// ...
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)

Note: this also requires onProviderUpdate (or a dedicated onWebSearchProviderUpdate) to understand the cx parameter. Passing it as the model arg is a workaround — consider a dedicated callback or passing a structured config object.


Addressed in commit 74b0fd5 — cx is now destructured from the request body and passed as the third argument to onProviderUpdate (the model slot, which is the current workaround for persisting the Google CSE search engine ID).

const trigger = request.body?.trigger ?? 'manual'
const keyword = request.body?.keyword

const sweepId = crypto.randomUUID()

[Bug] There is no guard against concurrent sweeps for the same project. Two rapid POSTs to /projects/:name/sweeps will both be inserted with status: 'queued', then both fire onSweepCreated, doubling API calls and producing duplicate/interleaved rows in indexing_sweep_results.

Consider checking for an active sweep before creating a new one:

const activeSweep = app.db
  .select()
  .from(indexingSweeps)
  .where(
    and(
      eq(indexingSweeps.projectId, project.id),
      inArray(indexingSweeps.status, ['queued', 'running']),
    )
  )
  .get()
if (activeSweep) {
  return reply.status(409).send({ error: `Sweep ${activeSweep.id} is already ${activeSweep.status}` })
}


Addressed in commit 74b0fd5 — Added a pre-insert check for active (queued or running) sweeps for the same project. A 409 response is returned if one is found, preventing duplicate sweep creation.

if (!project) return

const now = new Date().toISOString()
const trigger = request.body?.trigger ?? 'manual'

[Security] trigger is taken directly from request.body without validation. Any client can store an arbitrary string (e.g. "<script>alert(1)</script>") in the trigger column, which is later rendered in CLI output and potentially in the web UI.

Either restrict to an allowlist:

const ALLOWED_TRIGGERS = new Set(['manual', 'scheduled', 'api'])
const trigger = ALLOWED_TRIGGERS.has(request.body?.trigger ?? '')
  ? request.body.trigger
  : 'manual'

or validate against the contract enum if one exists.


Addressed in commit 74b0fd5 — Added ALLOWED_TRIGGERS = new Set(['manual', 'scheduled', 'api']). Any unrecognised trigger value now falls back silently to 'manual'.
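The allowlist fallback described in this fix can be demonstrated in isolation (a sketch; the route wiring is omitted):

```typescript
// Allowlist validation for the sweep trigger field. Unknown or non-string
// values fall back silently to 'manual', as described above.
const ALLOWED_TRIGGERS = new Set(['manual', 'scheduled', 'api']);

function sanitizeTrigger(raw: unknown): string {
  return typeof raw === 'string' && ALLOWED_TRIGGERS.has(raw) ? raw : 'manual';
}
```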

status: indexingSweeps.status,
trigger: indexingSweeps.trigger,
startedAt: indexingSweeps.startedAt,
finishedAt: indexingSweeps.finishedAt,

[Security] GET /sweeps returns all sweeps across all projects with no authentication check or pagination. Combined with the projectName field this leaks the full project list and sweep history to any unauthenticated caller (assuming the existing routes require auth, this one bypasses it by not going through the auth-gated prefix).

Verify that Fastify auth hooks apply to this route. Also add a limit/offset query param or cap the result count to avoid unbounded reads:

const limit = Math.min(Number(request.query.limit ?? 50), 200)
const offset = Number(request.query.offset ?? 0)


Addressed in commit 74b0fd5 — Added limit/offset query params with a cap of 200 records. Default limit is 50. Auth hook coverage should be verified at the Fastify plugin registration level (out of scope for this PR).

kws = kws.filter(k => k.keyword === keywordFilter)
}

if (kws.length === 0) {

[Bug] When keywordFilter is provided but does not match any keyword, kws will be empty and the sweep is marked failed with "No keywords found for project". This is misleading — the project has keywords, just none match the filter.

Differentiate the two cases:

const allKws = this.db.select().from(keywords).where(eq(keywords.projectId, projectId)).all()
if (allKws.length === 0) {
  throw new Error('No keywords configured for this project')
}
const kws = keywordFilter ? allKws.filter(k => k.keyword === keywordFilter) : allKws
if (kws.length === 0) {
  throw new Error(`No keyword matching '${keywordFilter}' found in this project`)
}


Addressed in commit 74b0fd5 — Differentiated the two cases: "No keywords configured for this project" when no keywords exist at all, and "No keyword matching '<keyword>' found in this project" when the filter matches nothing.
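The two-case distinction can be isolated as a pure helper for clarity (a sketch; the real code queries the keywords table rather than taking an array):

```typescript
// Differentiates "no keywords at all" from "filter matched nothing",
// mirroring the fix above. Names here are illustrative.
interface Keyword { keyword: string }

function selectKeywords(all: Keyword[], filter?: string): Keyword[] {
  if (all.length === 0) {
    throw new Error('No keywords configured for this project');
  }
  const kws = filter ? all.filter(k => k.keyword === filter) : all;
  if (kws.length === 0) {
    throw new Error(`No keyword matching '${filter}' found in this project`);
  }
  return kws;
}
```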

...comps.map(c => ({ domain: c.domain, role: 'competitor' as const })),
]

for (const { domain, role } of domains) {

[Performance] The nested for loops fire one API call per keyword × domain pair with no throttling or batching. A project with 20 keywords × 10 competitors fires 200 consecutive requests back-to-back. This will burn through Serper credits rapidly and will likely hit rate limits (the Serper free tier is 2,500 requests/month).

Consider adding a small delay between requests or a concurrency limiter:

import pLimit from 'p-limit'
const limit = pLimit(3) // max 3 concurrent search requests
const tasks = domains.flatMap(d =>
  kws.map(kw => limit(() => processOne(kw, d)))
)
await Promise.all(tasks)

Or at minimum add a brief sleep: await new Promise(r => setTimeout(r, 300))


Addressed in commit 74b0fd5 — Added a 300ms setTimeout delay before each adapter.siteQuery call to avoid hitting Serper rate limits. A p-limit approach can be adopted in a follow-up if finer concurrency control is needed.

num: String(Math.min(maxResults, 10)),
})

const res = await fetch(`${GOOGLE_CSE_ENDPOINT}?${params.toString()}`)

[Security] The Google CSE API key is appended to the URL query string (key=...). This means the key will appear in server access logs, browser history, and any HTTP proxy logs, making it easy to accidentally leak.

Google CSE does not support bearer-token auth natively, but you can at least ensure the key is never logged by intercepting errors before surfacing the raw URL. More importantly, document this constraint clearly, or consider a server-side proxy that injects the key so it never reaches the client.

At minimum, redact the key when logging errors:

if (!res.ok) {
  throw new Error(`Google CSE API error: ${res.status} ${res.statusText} (key redacted from URL)`)
}


Addressed in commit 74b0fd5 — Error message now reads Google CSE API error: <status> <statusText> (key redacted from URL) so the raw API key is never surfaced in logs. Added a code comment documenting the constraint.
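A further defence worth considering alongside the redacted error message is scrubbing the key from any URL string before it can be logged; a minimal sketch (the helper name is an assumption, not part of the PR):

```typescript
// Strips the value of any `key` query parameter from a URL so it cannot
// leak through log lines or error messages.
function redactKey(url: string): string {
  return url.replace(/([?&]key=)[^&]*/g, '$1REDACTED');
}
```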


const items = data.items ?? []
const topPages: IndexPage[] = items.slice(0, maxResults).map(r => ({
url: r.link ?? '',

[Bug] The Serper branch strips commas before parsing totalResults (.replace(/,/g, '')), but the Google CSE branch does not. Google CSE also returns formatted strings like "1,230" for totalResults. Without stripping commas, parseInt('1,230', 10) returns 1 instead of 1230, silently underreporting indexed page counts.

Apply the same treatment:

const indexedPageCount = totalStr
  ? parseInt(totalStr.replace(/,/g, ''), 10) || topPages.length
  : topPages.length


Addressed in commit 74b0fd5 — Applied the same .replace(/,/g, '') treatment to totalResults in the Google CSE branch, matching the existing Serper logic.
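The comma-tolerant parse now shared by both backends reduces to one small function (a sketch of the logic described above):

```typescript
// Parses a formatted total like "1,230" into 1230; falls back when the
// string is missing or unparseable, matching the fix described above.
function parseIndexedCount(totalStr: string | undefined, fallback: number): number {
  return totalStr ? parseInt(totalStr.replace(/,/g, ''), 10) || fallback : fallback;
}
```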


export function migrateSweeps(db: DatabaseClient) {
for (const migration of SWEEP_MIGRATIONS) {
try {

[Bug] The migration catch block swallows all errors, not just "already exists" ones. A genuine failure (e.g. disk full, permission error, syntax error in a future migration) will be silently ignored, leaving the schema in a broken partial state with no indication to the operator.

At minimum, only suppress SQLITE_ERROR: table ... already exists:

} catch (err: unknown) {
  const msg = err instanceof Error ? err.message : String(err)
  if (!msg.includes('already exists')) {
    throw err // re-throw unexpected errors
  }
}


Addressed in commit 74b0fd5 — The catch block now inspects the error message and only suppresses errors containing 'already exists'. All other errors are re-thrown so genuine failures (disk full, permission error, etc.) are surfaced.
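The suppress-only-idempotency-errors pattern can be shown with the DB client abstracted away (a sketch; the run callback stands in for the real SQLite client):

```typescript
// Runs one migration statement, suppressing only "already exists" errors.
// Any other failure (disk full, permissions, bad SQL) is re-thrown.
function runMigration(sql: string, run: (sql: string) => void): void {
  try {
    run(sql);
  } catch (err: unknown) {
    const msg = err instanceof Error ? err.message : String(err);
    if (!msg.includes('already exists')) throw err;
  }
}
```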

* @param keyword The keyword phrase to search for
* @param maxResults Maximum number of top pages to return (default 10)
*/
async siteQuery(domain: string, keyword: string, maxResults = 10): Promise<SiteQueryResult> {

[Testing] packages/provider-web-search has no test files (package.json references test/*.test.ts but no test/ directory is included). Given that WebSearchAdapter is the core of the indexing sweep feature, it would benefit from at minimum:

  • Unit tests for siteQuery with mocked fetch (both backends)
  • Edge cases: non-OK HTTP response, missing organic/items fields, totalResults with commas, apiKey-missing constructor
  • SweepRunner.executeSweep integration test covering the keyword-filter path, failure marking, and onSweepCompleted callback

Without tests, regressions in API response parsing (like the comma-stripping bug above) are invisible until production.


Addressed in commit 74b0fd5 — Added packages/provider-web-search/test/adapter.test.ts with 14 unit tests covering: constructor validation, Serper siteQuery (success, missing fields, comma-stripping, non-OK response), Google CSE siteQuery (success, missing fields, comma-stripping, key redaction in errors), and validateConfig. All 14 pass.
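One way to keep this kind of parsing regression-proof is to isolate it as a pure function that tests can hit with no fetch mock at all; a sketch with assumed response-field names (not necessarily the adapter's real structure):

```typescript
// Hypothetical pure extraction of the Serper response-parsing step.
interface SerperOrganic { link?: string; title?: string }
interface SerperResponse {
  organic?: SerperOrganic[];
  searchInformation?: { totalResults?: string };
}

function parseSerper(data: SerperResponse, maxResults: number) {
  const organic = data.organic ?? [];
  const topPages = organic
    .slice(0, maxResults)
    .map(r => ({ url: r.link ?? '', title: r.title ?? '' }));
  const totalStr = data.searchInformation?.totalResults;
  const indexedPageCount = totalStr
    ? parseInt(totalStr.replace(/,/g, ''), 10) || topPages.length
    : topPages.length;
  return { indexedPageCount, topPages };
}
```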


@arberx arberx left a comment


🤖 Automated Review Summary

Files reviewed: 21
Comments left: 10

Issues found:

  • 🔴 Bug: 5 (cx silently dropped, concurrent sweeps, keyword-filter error message, totalResults comma parsing, migration catch-all)
  • 🟠 Security: 3 (trigger stored without validation, unauthenticated global sweep list, Google CSE API key in URL)
  • 🟡 Performance: 1 (no rate-limiting on nested search API calls)
  • 🟡 Testing: 1 (no tests for provider-web-search or SweepRunner)
  • 🔵 Style: 1 (.tgz artifact committed to repo — please add *.tgz to .gitignore and remove ainyc-canonry-1.12.0.tgz)

Summary

Solid scaffolding for the indexing sweep feature. The architecture is clean — provider abstraction, lazy settings resolution, and the onConflictDoUpdate upsert are all good patterns. A few things to address before merging:

Must fix:

  1. cx dropped in settings.ts — Google CSE users can configure the backend but cx is never persisted.
  2. No concurrency guard — rapid duplicate POSTs to /projects/:name/sweeps will spawn concurrent sweep runners for the same project.
  3. totalResults comma-stripping — Google CSE branch missing .replace(/,/g, ''), causing severely underreported indexed counts.
  4. Migration catch swallows all errors — re-throw unexpected SQLite errors.
  5. Trigger field unsanitized — allowlist or strip to prevent storing arbitrary strings.

Should fix:

  • Add basic rate-limiting between search API calls (Serper free tier will exhaust quickly otherwise).
  • Add tests for WebSearchAdapter (especially both backends + error paths).
  • The global GET /sweeps route should be paginated and confirm it is behind the same auth hooks as other routes.

This review was generated by an automated AI agent. Please verify all suggestions before acting on them.

- Fix cx parameter not forwarded in settings.ts onProviderUpdate
- Add concurrent sweep guard (409 conflict) in sweeps.ts
- Validate trigger field against allowlist in sweeps.ts
- Add pagination to GET /sweeps endpoint
- Fix keyword-filter error message in sweep-runner.ts
- Add 300ms rate limiting between search API calls
- Redact Google CSE API key from error messages in adapter.ts
- Fix comma-stripping for totalResults in Google CSE branch
- Re-throw unexpected migration errors (only suppress already-exists)
- Add unit tests for WebSearchAdapter

Addresses review feedback on PR #81

@arberx arberx left a comment


🤖 Automated Review (incremental — new commits since last review)

PR #81 — Updated commits review

Overall the new provider-web-search tests are thorough and the SSRF improvements in sitemap-parser (DNS resolution) are good. A few issues to address below.

}
// Store web-search config via the generic provider update callback using name 'web-search'
// cx is passed as the model arg (workaround) so it is persisted for google-cse backend
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)

Type safety / design smell: cx is being smuggled through the model parameter of onProviderUpdate as a workaround. This bypasses type checking (hence the as never cast) and creates an implicit coupling between caller and callee that isn't documented in the interface signature.

Consider either:

  1. Adding an optional cx?: string to the onProviderUpdate signature, or
  2. Passing a provider-specific metadata bag (meta?: Record<string, unknown>) as the last arg.

The current workaround will bite the next person who touches this without reading the comment.


Addressed in commit 1f484e3 — replaced the as never workaround with meta?: Record<string, unknown> as the last arg to onProviderUpdate. The cx is now passed explicitly via meta: { cx } so there is no implicit coupling and no type cast needed.
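The meta-bag shape described in this reply can be sketched as follows; the signature is a simplified assumption, not the codebase's exact one:

```typescript
// Simplified sketch of the onProviderUpdate callback with a meta bag.
type ProviderUpdate = (
  name: string,
  apiKey: string | undefined,
  backend: string | undefined,
  meta?: Record<string, unknown>,
) => void;

const saved: Record<string, unknown> = {};

const onProviderUpdate: ProviderUpdate = (name, apiKey, backend, meta) => {
  // Persist provider config; cx arrives explicitly via meta, not via the model slot.
  saved.name = name;
  saved.apiKey = apiKey;
  saved.backend = backend ?? 'serper';
  saved.cx = meta?.cx;
};

// A Google CSE caller passes cx explicitly, so no `as never` cast is needed.
onProviderUpdate('web-search', 'sk-123', 'google-cse', { cx: 'engine-42' });
```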

const keyword = request.body?.keyword

// Guard against concurrent sweeps for the same project
const activeSweep = app.db

Race condition (TOCTOU): This guard against concurrent sweeps has a classic check-then-act race. Two simultaneous POST requests can both read no active sweep, then both insert one. The guard only holds if the check and the insert execute atomically; with an await between them, even a single-process server can interleave two requests.

To make this reliable, add a DB-level unique constraint on (projectId, status) where status IN ('pending', 'running'), or wrap the check+insert in a transaction with INSERT OR IGNORE / ON CONFLICT. Even a single-process setup can hit this under moderate load.


Addressed in commit 1f484e3 — the check+insert is now wrapped in app.db.transaction(). SQLite serialises writers, so two concurrent POST requests cannot both read no active sweep and then both insert. The in-memory guard has been removed in favour of the transaction.
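The atomicity requirement can be illustrated with a toy in-memory store, where a single synchronous function plays the role the SQLite transaction (or a partial unique index) plays in the real fix:

```typescript
// Toy illustration: check and insert happen in one uninterruptible step,
// with no await between them, so two callers cannot both pass the check.
type Status = 'queued' | 'running' | 'done';
interface Sweep { id: number; projectId: string; status: Status }

const sweeps: Sweep[] = [];
let nextId = 1;

function createSweepIfIdle(projectId: string): Sweep | null {
  const active = sweeps.find(
    s => s.projectId === projectId && (s.status === 'queued' || s.status === 'running'),
  );
  if (active) return null; // the route maps this to HTTP 409
  const sweep: Sweep = { id: nextId++, projectId, status: 'queued' };
  sweeps.push(sweep);
  return sweep;
}
```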


// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)

Bug: Number() doesn't validate — passes NaN to .limit().

const limit = Math.min(Number(request.query.limit ?? 50), 200)
// Number('abc') === NaN; Math.min(NaN, 200) === NaN
// drizzle's .limit(NaN) will produce LIMIT NaN in SQL

Use parseInt with a fallback:

const limit = Math.min(Math.max(parseInt(request.query.limit ?? '50', 10) || 50, 1), 200)
const offset = Math.max(parseInt(request.query.offset ?? '0', 10) || 0, 0)

Also worth adding a Fastify JSON schema on the querystring so invalid values are rejected before they reach the handler.


Addressed in commit 1f484e3 — replaced Math.min(Number(...), 200) with Math.min(Math.max(parseInt(String(...), 10) || 50, 1), 200) to ensure invalid strings like 'abc' fall back to the default (50) rather than passing NaN to .limit().

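The clamping described in this fix can be exercised in isolation (a sketch; the route wiring is omitted):

```typescript
// Pagination parsing that never lets NaN reach the query builder.
function parseLimit(raw: string | undefined): number {
  return Math.min(Math.max(parseInt(raw ?? '50', 10) || 50, 1), 200);
}

function parseOffset(raw: string | undefined): number {
  return Math.max(parseInt(raw ?? '0', 10) || 0, 0);
}
```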

// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)
const offset = Number(request.query.offset ?? 0)

Same Number() issue as line 116 — Number('foo') is NaN, and .offset(NaN) is undefined behaviour in SQL. See the fix suggestion on the comment above.


Addressed in commit 1f484e3 — replaced Number(request.query.offset ?? 0) with Math.max(parseInt(String(request.query.offset ?? '0'), 10) || 0, 0) so non-numeric values fall back to 0 instead of NaN.


@@ -0,0 +1,143 @@
/**

Missing error-path test coverage in the new WebSearchAdapter: The new test file covers happy-path and error cases well. One gap: there's no test asserting that non-OK statuses (4xx/5xx) surface the HTTP status code in the error message for the Serper backend. The Google CSE test (key redacted from URL) already covers the error-leak concern — good work there.


Addressed in commit 1f484e3 — added two new tests in the Serper suite: one asserting a 429 response surfaces 'Serper API error: 429' in the error message, and one for 500 asserting 'Serper API error: 500'. Both tests pass.


Copy link
Copy Markdown
Member Author

@arberx arberx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Automated Review (incremental — new commits since last review)

PR #81 — Updated commits review

Overall the new provider-web-search tests are thorough and the SSRF improvements in sitemap-parser (DNS resolution) are good. A few issues to address below.

}
// Store web-search config via the generic provider update callback using name 'web-search'
// cx is passed as the model arg (workaround) so it is persisted for google-cse backend
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type safety / design smell: cx is being smuggled through the model parameter of onProviderUpdate as a workaround. This bypasses type checking (hence the as never cast) and creates an implicit coupling between caller and callee that isn't documented in the interface signature.

Consider either:

  1. Adding an optional cx?: string to the onProviderUpdate signature, or
  2. Passing a provider-specific metadata bag (meta?: Record<string, unknown>) as the last arg.

The current workaround will bite the next person who touches this without reading the comment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — replaced the as never workaround with meta?: Record<string, unknown> as the last arg to onProviderUpdate.

const keyword = request.body?.keyword

// Guard against concurrent sweeps for the same project
const activeSweep = app.db
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition (TOCTOU): This in-memory guard against concurrent sweeps has a classic check-then-act race. Two simultaneous POST requests can both read no active sweep, then both insert one. The guard only works in a single-threaded, single-process context without request queuing.

To make this reliable, add a DB-level unique constraint on (projectId, status) where status IN ('pending', 'running'), or wrap the check+insert in a transaction with INSERT OR IGNORE / ON CONFLICT. Even a single-process setup can hit this under moderate load.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — see the reply on the original comment thread for details.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — the check+insert is now wrapped in app.db.transaction() to prevent TOCTOU race condition.


// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Number() doesn't validate — passes NaN to .limit().

const limit = Math.min(Number(request.query.limit ?? 50), 200)
// Number('abc') === NaN; Math.min(NaN, 200) === NaN
// drizzle's .limit(NaN) will produce LIMIT NaN in SQL

Use parseInt with a fallback:

const limit = Math.min(Math.max(parseInt(request.query.limit ?? '50', 10) || 50, 1), 200)
const offset = Math.max(parseInt(request.query.offset ?? '0', 10) || 0, 0)

Also worth adding a Fastify JSON schema on the querystring so invalid values are rejected before they reach the handler.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — see the reply on the original comment thread for details.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — switched to parseInt(..., 10) || 50 with clamping so NaN falls back to the default.

// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)
const offset = Number(request.query.offset ?? 0)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same Number() issue as line 116 — Number('foo')NaN, and .offset(NaN) is undefined behaviour in SQL. See the fix suggestion on the line above.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — see the reply on the original comment thread for details.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in commit 1f484e3 — offset now uses parseInt(..., 10) || 0 with Math.max(..., 0) clamping.

@@ -0,0 +1,143 @@
/**
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test coverage for dangerouslyAllowPrivate flag path in the new WebSearchAdapter: The new test file covers happy-path and error cases well. One gap: there's no test asserting that requests to non-OK statuses (4xx/5xx) surface the HTTP status code in the error message for the Serper backend. The Google CSE test (key redacted from URL) already covers the error-leak concern — good work there.


Addressed in commit 1f484e3 — see the reply on the original comment thread for details.


Addressed in commit 1f484e3 — added Serper 429 and 500 error-path tests verifying HTTP status code surfaces in the thrown error message.
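The behaviour those new tests pin down can be sketched standalone. A minimal illustration only — `raiseForStatus`, `MinimalResponse`, and the message format are hypothetical stand-ins, not the adapter's actual code:

```typescript
// Hypothetical guard mirroring what the Serper error-path tests verify:
// a non-OK response must surface its HTTP status in the thrown message.
interface MinimalResponse {
  ok: boolean
  status: number
}

function raiseForStatus(res: MinimalResponse, backend: string): void {
  if (!res.ok) {
    throw new Error(`${backend} request failed with HTTP ${res.status}`)
  }
}

// A 429 from Serper produces an inspectable, status-bearing error:
try {
  raiseForStatus({ ok: false, status: 429 }, 'serper')
} catch (err) {
  console.log((err as Error).message) // "serper request failed with HTTP 429"
}
```

Keeping the status in the message (rather than swallowing it) is what lets the 429 and 500 tests assert on a single string without inspecting response internals.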

}
// Store web-search config via the generic provider update callback using name 'web-search'
// cx is passed as the model arg (workaround) so it is persisted for google-cse backend
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)

Type safety — cx smuggled via model arg (workaround): The cx (Google CSE search-engine ID) is passed as the model parameter of onProviderUpdate to persist it. This is a hidden contract violation: the interface signature says model?: string, not cx?: string, so nothing prevents a future refactor from misinterpreting this value. The as never cast is the tell.

Consider adding cx?: string (or a generic meta?: Record<string, unknown>) to the onProviderUpdate callback so the intent is explicit and type-safe.
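That suggestion can be sketched as follows — hypothetical shapes only (the real onProviderUpdate signature in this PR differs); the point is that cx travels in an explicit, typed meta bag rather than through the model parameter:

```typescript
// Hypothetical callback shape: cx rides in an explicit meta bag instead of
// being smuggled through the model parameter with an `as never` cast.
type ProviderMeta = Record<string, unknown>

type ProviderUpdate = (
  name: string,
  apiKey: string,
  model?: string,
  meta?: ProviderMeta,
) => { ok: boolean }

// Minimal in-memory stand-in for the settings store, for illustration only.
const store = new Map<string, { apiKey: string; model?: string; meta?: ProviderMeta }>()

const onProviderUpdate: ProviderUpdate = (name, apiKey, model, meta) => {
  store.set(name, { apiKey, model, meta })
  return { ok: true }
}

// The web-search call site no longer repurposes the model argument:
onProviderUpdate('web-search', 'key-123', undefined, { backend: 'serper', cx: 'cse-id-456' })
```

A future refactor that renames or reorders model can no longer silently break cx persistence, because the compiler sees the two as distinct.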

const keyword = request.body?.keyword

// Guard against concurrent sweeps for the same project
const activeSweep = app.db

Race condition (TOCTOU) — concurrent sweep guard is not safe: Two simultaneous POST requests can both read activeSweep = null, then both insert. The check-then-insert sequence is not atomic.

To fix: add a DB-level unique partial index on (projectId, status) for status IN ('pending', 'running'), then let the INSERT fail with a constraint error and surface a 409. This also makes the guard work correctly if you ever run multiple server processes.
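That guard can be expressed at the schema level. A sketch only — the index name is hypothetical, and the column names follow the (projectId, status) pair suggested above; partial indexes are standard SQLite (3.8+):

```sql
-- Sketch: at most one pending/running sweep per project. A second
-- concurrent INSERT fails with SQLITE_CONSTRAINT, which the POST
-- handler can map to a 409 instead of pre-checking.
CREATE UNIQUE INDEX IF NOT EXISTS idx_sweeps_one_active_per_project
  ON indexing_sweeps (projectId)
  WHERE status IN ('pending', 'running');
```

With the index in place, the SELECT-then-INSERT pre-check becomes a fast-path optimization at most, not the correctness mechanism.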


// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)

Bug: Number() returns NaN on non-numeric strings, which propagates to .limit(NaN) in SQL.

// Number('abc') → NaN; Math.min(NaN, 200) → NaN
const limit = Math.min(Number(request.query.limit ?? 50), 200)  // ❌

Fix:

const limit = Math.min(Math.max(parseInt(request.query.limit ?? '50', 10) || 50, 1), 200)
const offset = Math.max(parseInt(request.query.offset ?? '0', 10) || 0, 0)

Better yet, add a Fastify JSON schema for the querystring so invalid values are rejected at the route level before they reach the handler.
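The clamping one-liners above can be factored into a small helper; clampedInt is a hypothetical name, not code from this PR:

```typescript
// Hypothetical helper: parse an integer query param with a fallback,
// treating NaN as the fallback and clamping the result to [min, max].
function clampedInt(
  raw: string | undefined,
  fallback: number,
  min: number,
  max: number,
): number {
  const parsed = parseInt(raw ?? String(fallback), 10) || fallback
  return Math.min(Math.max(parsed, min), max)
}

// Mirrors the suggested limit/offset handling (upper bound on offset is arbitrary):
const limit = clampedInt('abc', 50, 1, 200)      // NaN input falls back to 50
const offset = clampedInt('-5', 0, 0, 1_000_000) // negative offsets clamp to 0
```

A single helper keeps both routes consistent and gives the NaN/negative edge cases one obvious place to be unit-tested.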

// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
const limit = Math.min(Number(request.query.limit ?? 50), 200)
const offset = Number(request.query.offset ?? 0)

Same Number() → NaN issue as line 116. offset(NaN) will produce OFFSET NaN in the SQL query, which is undefined behavior. See fix suggestion above.

@arberx arberx left a comment

🤖 Automated Review — PR #81 (incremental: new commits since last review)

Summary: The Google CSE comma-stripping fix is correct, and the new provider-web-search test suite is a good addition. A few issues:

| Severity | Finding | File |
| --- | --- | --- |
| 🟠 Bug | Number() returns NaN on bad input → LIMIT NaN in SQL | sweeps.ts:116-117 |
| 🟠 Race condition | Concurrent sweep guard is not atomic (TOCTOU) | sweeps.ts:30 |
| 🟡 Type safety | cx smuggled via model param (as never workaround) | settings.ts:44 |
| ℹ️ Coverage | Missing tests for HTTP 4xx/5xx error paths in Serper adapter | adapter.ts |

- Fix cx type smuggling: add meta? bag to onProviderUpdate signature
- Fix TOCTOU race: wrap concurrent sweep check+insert in transaction
- Fix Number() NaN: use parseInt with fallback for limit/offset params
- Add Serper 4xx/5xx error surface tests (429, 500) in WebSearchAdapter suite

Addresses review feedback from @arberx
@arberx arberx marked this pull request as draft March 21, 2026 15:03

Development

Successfully merging this pull request may close these issues.

Indexing sweep — site: operator queries for content coverage monitoring

1 participant