feat: Indexing sweep — site: operator queries for content coverage monitoring #81
Conversation
Implements core scaffolding for the indexing sweep feature:

- New run type: `indexing_sweep` with `SweepRunner` worker
- `web_search` provider (Serper/Google CSE) via `WebSearchAdapter`
- `canonry sweep` CLI command (`sweep`, `sweeps`, `sweep show`)
- API endpoint stubs for sweep management (`/projects/:name/sweeps`, `/sweeps/:id`)
- Data model: `indexed_page_count`, `top_pages[]`, `competitor_coverage`
- Config-as-code support: `webSearch` provider block in project YAML
- SQLite schema migrations for `indexing_sweeps` + `indexing_sweep_results` tables
- All 105 existing tests pass, typecheck clean across all packages

Fixes #75
packages/api-routes/src/settings.ts
Outdated
```ts
}>('/settings/providers/:name', async (request, reply) => {
  // web-search is a special non-LLM provider
  if (request.params.name === 'web-search') {
    const { apiKey, backend } = request.body as { apiKey?: string; backend?: string; cx?: string } ?? {}
```
[Bug] cx is destructured from the request body but never forwarded to onProviderUpdate. Google CSE users will be unable to store their search engine ID — the config will silently save without cx, and subsequent sweeps will throw "cx (search engine ID) is required for google-cse backend".
Fix: include cx in the destructure on line 35 and pass it through:
```ts
const { apiKey, backend, cx } = request.body as { apiKey?: string; backend?: string; cx?: string } ?? {}
// ...
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)
```

Note: this also requires `onProviderUpdate` (or a dedicated `onWebSearchProviderUpdate`) to understand the `cx` parameter. Passing it as the `model` arg is a workaround — consider a dedicated callback or passing a structured config object.
Addressed in commit 74b0fd5 — cx is now destructured from the request body and passed as the third argument to onProviderUpdate (used as the model slot, which is the current workaround for persisting the Google CSE search engine ID).
```ts
const trigger = request.body?.trigger ?? 'manual'
const keyword = request.body?.keyword

const sweepId = crypto.randomUUID()
```
[Bug] There is no guard against concurrent sweeps for the same project. Two rapid POSTs to /projects/:name/sweeps will both be inserted with status: 'queued', then both fire onSweepCreated, doubling API calls and producing duplicate/interleaved rows in indexing_sweep_results.
Consider checking for an active sweep before creating a new one:
```ts
const activeSweep = app.db
  .select()
  .from(indexingSweeps)
  .where(
    and(
      eq(indexingSweeps.projectId, project.id),
      inArray(indexingSweeps.status, ['queued', 'running']),
    ),
  )
  .get()

if (activeSweep) {
  return reply.status(409).send({ error: `Sweep ${activeSweep.id} is already ${activeSweep.status}` })
}
```
Addressed in commit 74b0fd5 — Added a pre-insert check for active (queued or running) sweeps for the same project. A 409 response is returned if one is found, preventing duplicate sweep creation.
packages/api-routes/src/sweeps.ts
Outdated
```ts
if (!project) return

const now = new Date().toISOString()
const trigger = request.body?.trigger ?? 'manual'
```
[Security] trigger is taken directly from request.body without validation. Any client can store an arbitrary string (e.g. "<script>alert(1)</script>") in the trigger column, which is later rendered in CLI output and potentially in the web UI.
Either restrict to an allowlist:
```ts
const ALLOWED_TRIGGERS = new Set(['manual', 'scheduled', 'api'])
const trigger = ALLOWED_TRIGGERS.has(request.body?.trigger ?? '')
  ? request.body.trigger
  : 'manual'
```

Or validate against the contract enum if one exists.
Addressed in commit 74b0fd5 — Added ALLOWED_TRIGGERS = new Set(['manual', 'scheduled', 'api']). Any unrecognised trigger value now falls back silently to 'manual'.
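The allowlist check is small enough to verify in isolation. A standalone sketch — `sanitizeTrigger` is a hypothetical helper name, extracted from the route handler purely for illustration:

```typescript
// Allowlist of recognised sweep triggers; anything else falls back to 'manual'.
const ALLOWED_TRIGGERS = new Set(['manual', 'scheduled', 'api'])

// Hypothetical helper (not in the PR): illustrates the silent-fallback behaviour.
function sanitizeTrigger(raw: unknown): string {
  return typeof raw === 'string' && ALLOWED_TRIGGERS.has(raw) ? raw : 'manual'
}
```

Unknown or non-string values (including injection payloads) all collapse to `'manual'`, so nothing attacker-controlled reaches the `trigger` column.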
```ts
status: indexingSweeps.status,
trigger: indexingSweeps.trigger,
startedAt: indexingSweeps.startedAt,
finishedAt: indexingSweeps.finishedAt,
```
[Security] GET /sweeps returns all sweeps across all projects with no authentication check or pagination. Combined with the projectName field this leaks the full project list and sweep history to any unauthenticated caller (assuming the existing routes require auth, this one bypasses it by not going through the auth-gated prefix).
Verify that Fastify auth hooks apply to this route. Also add a limit/offset query param or cap the result count to avoid unbounded reads:
```ts
const limit = Math.min(Number(request.query.limit ?? 50), 200)
const offset = Number(request.query.offset ?? 0)
```
Addressed in commit 74b0fd5 — Added limit/offset query params with a cap of 200 records. Default limit is 50. Auth hook coverage should be verified at the Fastify plugin registration level (out of scope for this PR).
```ts
  kws = kws.filter(k => k.keyword === keywordFilter)
}

if (kws.length === 0) {
```
[Bug] When keywordFilter is provided but does not match any keyword, kws will be empty and the sweep is marked failed with "No keywords found for project". This is misleading — the project has keywords, just none match the filter.
Differentiate the two cases:
```ts
const allKws = this.db.select().from(keywords).where(eq(keywords.projectId, projectId)).all()
if (allKws.length === 0) {
  throw new Error('No keywords configured for this project')
}
const kws = keywordFilter ? allKws.filter(k => k.keyword === keywordFilter) : allKws
if (kws.length === 0) {
  throw new Error(`No keyword matching '${keywordFilter}' found in this project`)
}
```
Addressed in commit 74b0fd5 — Differentiated the two cases: "No keywords configured for this project" when no keywords exist at all, and "No keyword matching found in this project" when the filter matches nothing.
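The two error cases are easy to exercise without a database. A minimal sketch — the `Keyword` shape and `selectKeywords` helper are hypothetical stand-ins for the DB-backed code:

```typescript
interface Keyword { keyword: string }

// Distinguishes "project has no keywords at all" from "filter matched nothing".
function selectKeywords(allKws: Keyword[], keywordFilter?: string): Keyword[] {
  if (allKws.length === 0) {
    throw new Error('No keywords configured for this project')
  }
  const kws = keywordFilter ? allKws.filter(k => k.keyword === keywordFilter) : allKws
  if (kws.length === 0) {
    throw new Error(`No keyword matching '${keywordFilter}' found in this project`)
  }
  return kws
}
```

Keeping the selection logic pure like this also makes the error-message contract trivially unit-testable.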
```ts
  ...comps.map(c => ({ domain: c.domain, role: 'competitor' as const })),
]

for (const { domain, role } of domains) {
```
[Performance] The nested for loops fire one API call per keyword × domain pair with no throttling or batching. A project with 20 keywords × 10 competitors = 200 consecutive requests fired as fast as the DB allows. This will burn through Serper credits rapidly and will likely hit rate limits (Serper free tier is 2,500 req/month).
Consider adding a small delay between requests or a concurrency limiter:
```ts
import pLimit from 'p-limit'

const limit = pLimit(3) // max 3 concurrent search requests
const tasks = domains.flatMap(d =>
  kws.map(kw => limit(() => processOne(kw, d))),
)
await Promise.all(tasks)
```

Or at minimum add a brief sleep: `await new Promise(r => setTimeout(r, 300))`
Addressed in commit 74b0fd5 — Added a 300ms setTimeout delay before each adapter.siteQuery call to avoid hitting Serper rate limits. A p-limit approach can be adopted in a follow-up if finer concurrency control is needed.
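If adding `p-limit` as a dependency is undesirable, a dependency-free limiter is only a few lines. A sketch under that assumption — `createLimiter` is a hypothetical name, not part of the PR:

```typescript
// Minimal concurrency limiter: at most `max` tasks run at once, the rest queue.
function createLimiter(max: number) {
  let active = 0
  const waiters: Array<() => void> = []
  return async function limit<T>(fn: () => Promise<T>): Promise<T> {
    while (active >= max) {
      // Park until a running task finishes; the loop re-checks on wake.
      await new Promise<void>(resolve => waiters.push(resolve))
    }
    active++
    try {
      return await fn()
    } finally {
      active--
      waiters.shift()?.() // wake one queued task
    }
  }
}
```

Each `adapter.siteQuery` call would then be wrapped in `limit(() => ...)`; `p-limit` behaves the same way and is already battle-tested, so this is only worth it to avoid the dependency.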
```ts
  num: String(Math.min(maxResults, 10)),
})

const res = await fetch(`${GOOGLE_CSE_ENDPOINT}?${params.toString()}`)
```
[Security] The Google CSE API key is appended to the URL query string (key=...). This means the key will appear in server access logs, browser history, and any HTTP proxy logs, making it easy to accidentally leak.
Google CSE does not support bearer-token auth natively, but you can at least ensure the key is never logged by intercepting errors before surfacing the raw URL. More importantly, document this constraint clearly, or consider a server-side proxy that injects the key so it never reaches the client.
At minimum, redact the key when logging errors:
```ts
if (!res.ok) {
  throw new Error(`Google CSE API error: ${res.status} ${res.statusText} (key redacted from URL)`)
}
```
Addressed in commit 74b0fd5 — Error message now reads Google CSE API error: <status> <statusText> (key redacted from URL) so the raw API key is never surfaced in logs. Added a code comment documenting the constraint.
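A generic redaction helper can also be applied anywhere the request URL might leak into logs. A sketch — `redactQueryParam` is a hypothetical helper, not part of the PR, built on the standard `URL` API:

```typescript
// Replaces the value of a sensitive query parameter so URLs are safe to log.
function redactQueryParam(url: string, param: string): string {
  const u = new URL(url)
  if (u.searchParams.has(param)) {
    u.searchParams.set(param, 'REDACTED')
  }
  return u.toString()
}
```

Running error messages and debug logs through a helper like this is more robust than remembering to omit the URL at each call site.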
```ts
const items = data.items ?? []
const topPages: IndexPage[] = items.slice(0, maxResults).map(r => ({
  url: r.link ?? '',
```
[Bug] The Serper branch strips commas before parsing totalResults (.replace(/,/g, '')), but the Google CSE branch does not. Google CSE also returns formatted strings like "1,230" for totalResults. Without stripping commas, parseInt('1,230', 10) returns 1 instead of 1230, silently underreporting indexed page counts.
Apply the same treatment:
```ts
const indexedPageCount = totalStr
  ? parseInt(totalStr.replace(/,/g, ''), 10) || topPages.length
  : topPages.length
```
Addressed in commit 74b0fd5 — Applied the same .replace(/,/g, '') treatment to totalResults in the Google CSE branch, matching the existing Serper logic.
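The failure mode is easy to demonstrate: `parseInt` stops at the first non-digit character. A sketch of the parsing logic as one shared helper (`parseTotalResults` is a hypothetical name):

```typescript
// Parses API-formatted totals like "1,230"; falls back when absent or unparseable.
function parseTotalResults(totalStr: string | undefined, fallback: number): number {
  if (!totalStr) return fallback
  return parseInt(totalStr.replace(/,/g, ''), 10) || fallback
}
```

Extracting this into one helper used by both the Serper and Google CSE branches would also make the comma behaviour a single-point fix rather than something each branch must remember.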
```ts
export function migrateSweeps(db: DatabaseClient) {
  for (const migration of SWEEP_MIGRATIONS) {
    try {
```
[Bug] The migration catch block swallows all errors, not just "already exists" ones. A genuine failure (e.g. disk full, permission error, syntax error in a future migration) will be silently ignored, leaving the schema in a broken partial state with no indication to the operator.
At minimum, only suppress SQLITE_ERROR: table ... already exists:
```ts
} catch (err: unknown) {
  const msg = err instanceof Error ? err.message : String(err)
  if (!msg.includes('already exists')) {
    throw err // re-throw unexpected errors
  }
}
```
Addressed in commit 74b0fd5 — The catch block now inspects the error message and only suppresses errors containing 'already exists'. All other errors are re-thrown so genuine failures (disk full, permission error, etc.) are surfaced.
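The suppression logic is worth isolating so it can be unit-tested apart from SQLite. A sketch — `runIgnoringExisting` is a hypothetical wrapper, not part of the PR:

```typescript
// Runs one migration step, swallowing only "already exists" errors (idempotent DDL);
// everything else (disk full, permissions, SQL syntax) is re-thrown to the operator.
function runIgnoringExisting(step: () => void): void {
  try {
    step()
  } catch (err: unknown) {
    const msg = err instanceof Error ? err.message : String(err)
    if (!msg.includes('already exists')) {
      throw err
    }
  }
}
```

Matching on the message string is fragile across SQLite driver versions; if the driver exposes an error `code`, matching on that would be sturdier.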
```ts
 * @param keyword The keyword phrase to search for
 * @param maxResults Maximum number of top pages to return (default 10)
 */
async siteQuery(domain: string, keyword: string, maxResults = 10): Promise<SiteQueryResult> {
```
[Testing] `packages/provider-web-search` has no test files (`package.json` references `test/*.test.ts` but no `test/` directory is included). Given that `WebSearchAdapter` is the core of the indexing sweep feature, it would benefit from at minimum:

- Unit tests for `siteQuery` with mocked `fetch` (both backends)
- Edge cases: non-OK HTTP response, missing `organic`/`items` fields, `totalResults` with commas, `apiKey`-missing constructor
- A `SweepRunner.executeSweep` integration test covering the keyword-filter path, failure marking, and the `onSweepCompleted` callback

Without tests, regressions in API response parsing (like the comma-stripping bug above) are invisible until production.
Addressed in commit 74b0fd5 — Added packages/provider-web-search/test/adapter.test.ts with 14 unit tests covering: constructor validation, Serper siteQuery (success, missing fields, comma-stripping, non-OK response), Google CSE siteQuery (success, missing fields, comma-stripping, key redaction in errors), and validateConfig. All 14 pass.
arberx
left a comment
🤖 Automated Review Summary
Files reviewed: 21
Comments left: 10
Issues found:

- 🔴 Bug: 5 (cx silently dropped, concurrent sweeps, keyword-filter error message, totalResults comma parsing, migration catch-all)
- 🟠 Security: 3 (trigger stored without validation, unauthenticated global sweep list, Google CSE API key in URL)
- 🟡 Performance: 1 (no rate-limiting on nested search API calls)
- 🟡 Testing: 1 (no tests for `provider-web-search` or `SweepRunner`)
- 🔵 Style: 1 (`.tgz` artifact committed to repo — please add `*.tgz` to `.gitignore` and remove `ainyc-canonry-1.12.0.tgz`)
Summary
Solid scaffolding for the indexing sweep feature. The architecture is clean — provider abstraction, lazy settings resolution, and the onConflictDoUpdate upsert are all good patterns. A few things to address before merging:
Must fix:

- `cx` dropped in `settings.ts` — Google CSE users can configure the backend but cx is never persisted.
- No concurrency guard — rapid duplicate POSTs to `/projects/:name/sweeps` will spawn concurrent sweep runners for the same project.
- `totalResults` comma-stripping — Google CSE branch missing `.replace(/,/g, '')`, causing severely underreported indexed counts.
- Migration catch swallows all errors — re-throw unexpected SQLite errors.
- Trigger field unsanitized — allowlist or strip to prevent storing arbitrary strings.

Should fix:

- Add basic rate-limiting between search API calls (Serper free tier will exhaust quickly otherwise).
- Add tests for `WebSearchAdapter` (especially both backends + error paths).
- The global `GET /sweeps` route should be paginated; confirm it is behind the same auth hooks as other routes.
This review was generated by an automated AI agent. Please verify all suggestions before acting on them.
- Fix cx parameter not forwarded in settings.ts onProviderUpdate
- Add concurrent sweep guard (409 conflict) in sweeps.ts
- Validate trigger field against allowlist in sweeps.ts
- Add pagination to GET /sweeps endpoint
- Fix keyword-filter error message in sweep-runner.ts
- Add 300ms rate limiting between search API calls
- Redact Google CSE API key from error messages in adapter.ts
- Fix comma-stripping for totalResults in Google CSE branch
- Re-throw unexpected migration errors (only suppress already-exists)
- Add unit tests for WebSearchAdapter

Addresses review feedback on PR #81
arberx
left a comment
🤖 Automated Review (incremental — new commits since last review)
PR #81 — Updated commits review
Overall the new provider-web-search tests are thorough and the SSRF improvements in sitemap-parser (DNS resolution) are good. A few issues to address below.
packages/api-routes/src/settings.ts
Outdated
```ts
}
// Store web-search config via the generic provider update callback using name 'web-search'
// cx is passed as the model arg (workaround) so it is persisted for google-cse backend
const result = opts.onProviderUpdate('web-search' as never, apiKey, cx, backend ?? 'serper', undefined)
```
Type safety / design smell: cx is being smuggled through the model parameter of onProviderUpdate as a workaround. This bypasses type checking (hence the as never cast) and creates an implicit coupling between caller and callee that isn't documented in the interface signature.
Consider either:

- Adding an optional `cx?: string` to the `onProviderUpdate` signature, or
- Passing a provider-specific metadata bag (`meta?: Record<string, unknown>`) as the last arg.

The current workaround will bite the next person who touches this without reading the comment.
Addressed in commit 1f484e3 — replaced the as never workaround with meta?: Record<string, unknown> as the last arg to onProviderUpdate. The cx is now passed explicitly via meta: { cx } so there is no implicit coupling and no type cast needed.
packages/api-routes/src/sweeps.ts
Outdated
```ts
const keyword = request.body?.keyword

// Guard against concurrent sweeps for the same project
const activeSweep = app.db
```
Race condition (TOCTOU): This in-memory guard against concurrent sweeps has a classic check-then-act race. Two simultaneous POST requests can both read no active sweep, then both insert one. The guard only works in a single-threaded, single-process context without request queuing.
To make this reliable, add a DB-level unique constraint on (projectId, status) where status IN ('pending', 'running'), or wrap the check+insert in a transaction with INSERT OR IGNORE / ON CONFLICT. Even a single-process setup can hit this under moderate load.
Addressed in commit 1f484e3 — the check+insert is now wrapped in app.db.transaction(). SQLite serialises writers, so two concurrent POST requests cannot both read no active sweep and then both insert. The in-memory guard has been removed in favour of the transaction.
packages/api-routes/src/sweeps.ts
Outdated
```ts
// GET /sweeps — list all sweeps across all projects (paginated)
app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => {
  const limit = Math.min(Number(request.query.limit ?? 50), 200)
```
Bug: Number() doesn't validate — passes NaN to .limit().
```ts
const limit = Math.min(Number(request.query.limit ?? 50), 200)
// Number('abc') === NaN; Math.min(NaN, 200) === NaN
// drizzle's .limit(NaN) will produce LIMIT NaN in SQL
```

Use `parseInt` with a fallback:

```ts
const limit = Math.min(Math.max(parseInt(request.query.limit ?? '50', 10) || 50, 1), 200)
const offset = Math.max(parseInt(request.query.offset ?? '0', 10) || 0, 0)
```

Also worth adding a Fastify JSON schema on the querystring so invalid values are rejected before they reach the handler.
Addressed in commit 1f484e3 — replaced Math.min(Number(...), 200) with Math.min(Math.max(parseInt(String(...), 10) || 50, 1), 200) to ensure invalid strings like 'abc' fall back to the default (50) rather than passing NaN to .limit().
packages/api-routes/src/sweeps.ts
Outdated
| // GET /sweeps — list all sweeps across all projects (paginated) | ||
| app.get<{ Querystring: { limit?: string; offset?: string } }>('/sweeps', async (request, reply) => { | ||
| const limit = Math.min(Number(request.query.limit ?? 50), 200) | ||
| const offset = Number(request.query.offset ?? 0) |
Same Number() issue as line 116 — Number('foo') → NaN, and .offset(NaN) is undefined behaviour in SQL. See the fix suggestion on the line above.
Addressed in commit 1f484e3 — replaced Number(request.query.offset ?? 0) with Math.max(parseInt(String(request.query.offset ?? '0'), 10) || 0, 0) so non-numeric values fall back to 0 instead of NaN.
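The clamping expressions are pure functions, so they are easy to verify in isolation. A sketch as standalone helpers (`clampLimit` and `clampOffset` are hypothetical names; the route inlines the expressions):

```typescript
// Clamp pagination params: non-numeric input falls back to the defaults,
// limit is bounded to [1, 200], offset to [0, +inf).
function clampLimit(raw: string | undefined): number {
  return Math.min(Math.max(parseInt(raw ?? '50', 10) || 50, 1), 200)
}

function clampOffset(raw: string | undefined): number {
  return Math.max(parseInt(raw ?? '0', 10) || 0, 0)
}
```

Note the `|| 50` fallback also catches `parseInt` returning `0`, so a client cannot request a zero-row page; whether that is desirable is a product decision.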
```diff
@@ -0,0 +1,143 @@
+/**
```
Missing test coverage for dangerouslyAllowPrivate flag path in the new WebSearchAdapter: The new test file covers happy-path and error cases well. One gap: there's no test asserting that requests to non-OK statuses (4xx/5xx) surface the HTTP status code in the error message for the Serper backend. The Google CSE test (key redacted from URL) already covers the error-leak concern — good work there.
Addressed in commit 1f484e3 — added two new tests in the Serper suite: one asserting a 429 response surfaces 'Serper API error: 429' in the error message, and one for 500 asserting 'Serper API error: 500'. Both tests pass.
arberx
left a comment
🤖 Automated Review — PR #81 (incremental: new commits since last review)
Summary: The Google CSE comma-stripping fix is correct, and the new provider-web-search test suite is a good addition. A few issues:
| Severity | Finding | File |
|---|---|---|
| 🟠 Bug | `Number()` returns NaN on bad input → `LIMIT NaN` in SQL | `sweeps.ts:116-117` |
| 🟠 Race condition | Concurrent sweep guard is not atomic (TOCTOU) | `sweeps.ts:30` |
| 🟡 Type safety | `cx` smuggled via model param (`as never` workaround) | `settings.ts:44` |
| ℹ️ Coverage | Missing tests for HTTP 4xx/5xx error paths in Serper adapter | `adapter.ts` |
- Fix cx type smuggling: add meta? bag to onProviderUpdate signature
- Fix TOCTOU race: wrap concurrent sweep check+insert in transaction
- Fix Number() NaN: use parseInt with fallback for limit/offset params
- Add Serper 4xx/5xx error surface tests (429, 500) in WebSearchAdapter suite

Addresses review feedback from @arberx
Summary
Implements core scaffolding for the indexing sweep feature (#75).
Changes
- `indexing_sweep` run type with `SweepRunner` worker that executes `site:<domain> <keyword>` queries for client + all configured competitors
- `web_search` provider (`@ainyc/canonry-provider-web-search`) supporting Serper and Google CSE backends via `WebSearchAdapter`
- `canonry sweep` CLI commands: `sweep <project>`, `sweep show <id>`, `sweeps <project>` with `--keyword`, `--wait`, `--format json` flags
- API endpoints: `POST /projects/:name/sweeps`, `GET /projects/:name/sweeps`, `GET /sweeps/:id`, `GET /sweeps`
- Data model: `indexed_page_count`, `top_pages[]`, `competitor_coverage` per keyword × domain
- Config-as-code: `webSearch` provider block in config YAML (`apiKey`, `backend`, `cx`)
- Environment variables: `WEB_SEARCH_API_KEY`, `WEB_SEARCH_BACKEND`
- SQLite: `indexing_sweeps` + `indexing_sweep_results` tables via `migrateSweeps()`
- Settings: `web-search` as a valid provider name (`canonry settings provider web-search --api-key <key> --backend serper`)

Testing
All tests pass and typecheck is clean across all packages (`contracts`, `db`, `api-routes`, `canonry`, `provider-web-search`)
Fixes #75