Summary
Connect a GitHub repo to a Canonry project so that site scans can cross-reference citation data with actual page content — producing file-level, citation-correlated recommendations and (in a later phase) opening PRs with fixes.
The adapter model is generic so WordPress, Shopify, and GitLab integrations slot in later without architectural changes.
From "you lost a citation" → "you lost a citation, and here's what to fix in src/app/blog/[slug]/page.tsx to win it back."
Scope: 4 Phases
Each phase is independently shippable.
| Phase | Description |
|---|---|
| 1 | GitHub PAT read-only repo scan + site-scan run kind |
| 2 | Page-keyword matching and citation-correlated recommendations |
| 3 | apply-fix for one narrow fix class (JSON-LD injection via PR) |
| 4 | GitHub App, draft-PR analysis, non-Git adapters (future, not scoped) |
Architecture
Package: packages/integration-github/
Follows the existing packages/integration-google/, packages/integration-bing/ convention. Depends on @ainyc/canonry-contracts only.
packages/integration-github/
  package.json          # @ainyc/canonry-integration-github
  src/
    index.ts            # public exports
    adapter.ts          # GitHubSiteSourceAdapter
    framework.ts        # framework detection + file→URL routing
    extract.ts          # structured data / heading / meta extraction from source files
    types.ts            # adapter-specific types (NOT shared DTOs; those go in contracts)
  test/
    framework.test.ts
    extract.test.ts
Shared types in packages/contracts/
// packages/contracts/src/site-source.ts

/** Generic adapter interface: GitHub is the first impl, WordPress/Shopify/GitLab follow the same shape */
interface SiteSourceAdapter {
  name: string
  displayName: string
  healthcheck(config: SiteSourceConfig): Promise<{ ok: boolean; message: string }>
  listPages(config: SiteSourceConfig): Promise<SitePage[]>
  getPage(config: SiteSourceConfig, urlOrPath: string): Promise<SitePage | null>
}

/** Remediation is a separate, opt-in interface; not all adapters support it */
interface SiteSourceRemediator {
  applyFix(config: SiteSourceConfig, recommendation: SiteRecommendationIntent): Promise<SiteFixResult>
}

interface SitePage {
  url: string
  path: string
  title: string | null
  content: string | null
  structuredData: StructuredDataItem[]
  headings: HeadingNode[]
  metaDescription: string | null
  lastModified: string | null
  sourceRef: SourceReference | null
}

/** Discriminated by adapter type: Git has file paths, CMS has post IDs */
type SourceReference =
  | { type: 'git-file'; filePath: string; lineStart?: number; lineEnd?: number }
  | { type: 'cms-post'; identifier: string; editUrl?: string }
  | { type: 'cms-product'; identifier: string; editUrl?: string }

/** Recommendation intent: enough to regenerate the fix at apply-time, not a stale patch */
interface SiteRecommendationIntent {
  type: string
  targetFile: string
  schemaType?: string
  content?: Record<string, unknown>
}

type SiteFixResult =
  | { applied: true; prUrl: string }
  | { applied: true; editUrl: string }
  | { applied: false; error: string }

Key design decisions
- Scan vs remediation are separate interfaces. SiteSourceAdapter handles read-only scanning. SiteSourceRemediator handles fix application. CMS adapters won't produce unified diffs; they'll produce content updates via their own API. This avoids Git-centric fields in the base interface.
- Store intent, not patches. Recommendations store SiteRecommendationIntent (what to fix, where, what schema type) plus the commitSha the scan was against. At apply-fix time, the adapter generates the fix against the current repo state. This avoids stale patches in the DB as the repo moves.
- Dedicated endpoint, not the runs route. The current POST /projects/:name/runs hard-gates on kind !== 'answer-visibility'. Rather than plumbing a new callback through RunRoutesOptions, create POST /projects/:name/site-scan, mirroring how GSC sync and sitemap inspection have their own trigger endpoints with dedicated callbacks.
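A minimal sketch of how a caller might treat remediation as opt-in, assuming the shapes from the contracts sketch above. The names isRemediator, tryApplyFix, and describeFixResult are hypothetical helpers for illustration, and the SiteSourceConfig fields shown are an assumed shape, not a confirmed one.

```typescript
// Trimmed local copies of the contract shapes, so the example is self-contained.
interface SiteSourceConfig { repo: string; branch: string } // assumed fields
interface SiteSourceAdapter { name: string }
interface SiteRecommendationIntent {
  type: string
  targetFile: string
  schemaType?: string
  content?: Record<string, unknown>
}
type SiteFixResult =
  | { applied: true; prUrl: string }
  | { applied: true; editUrl: string }
  | { applied: false; error: string }
interface SiteSourceRemediator {
  applyFix(config: SiteSourceConfig, recommendation: SiteRecommendationIntent): Promise<SiteFixResult>
}

/** Hypothetical type guard: does this adapter also implement the remediation interface? */
function isRemediator(adapter: SiteSourceAdapter): adapter is SiteSourceAdapter & SiteSourceRemediator {
  const candidate = adapter as unknown as { applyFix?: unknown }
  return typeof candidate.applyFix === 'function'
}

/** Read-only adapters yield a structured failure instead of throwing. */
async function tryApplyFix(
  adapter: SiteSourceAdapter,
  config: SiteSourceConfig,
  intent: SiteRecommendationIntent,
): Promise<SiteFixResult> {
  if (!isRemediator(adapter)) {
    return { applied: false, error: `${adapter.name} does not support remediation` }
  }
  // The adapter regenerates the fix from the stored intent against current repo state.
  return adapter.applyFix(config, intent)
}

/** Narrow the result union: Git adapters return prUrl, CMS adapters return editUrl. */
function describeFixResult(result: SiteFixResult): string {
  if (!result.applied) return `not applied: ${result.error}`
  return 'prUrl' in result ? `PR opened: ${result.prUrl}` : `edit at: ${result.editUrl}`
}
```

Because the base adapter type carries no Git-specific fields, a CMS adapter can satisfy SiteSourceAdapter alone and still be scanned.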
Run Kind Wiring: site-scan
This is not a drop-in. The current run route (packages/api-routes/src/runs.ts:26) hard-gates on kind !== 'answer-visibility'. The job runner in server.ts:375 only handles answer-visibility. Cloud API (apps/api/src/app.ts) doesn't wire run callbacks at all.
Approach: Follow the GSC sync pattern (google.ts → server.ts:333):
- Route handler creates a run record with kind: 'site-scan'
- Calls opts.onSiteScanRequested(runId, projectId) callback
- server.ts wires that callback to executeSiteScan()
Files:
- packages/contracts/src/run.ts:7: add 'site-scan' to runKindSchema
- packages/api-routes/src/site-source.ts (new): route plugin with callback
- packages/api-routes/src/index.ts:143+: register plugin
- packages/canonry/src/server.ts:325+: wire callback
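The callback approach above can be sketched without any HTTP framework. This is an illustration only: the in-memory runs map stands in for the real run table, the status values are assumed, and handleSiteScanRequest is a hypothetical stand-in for the route handler. Only onSiteScanRequested and executeSiteScan are names taken from the plan.

```typescript
type RunKind = 'answer-visibility' | 'site-scan'

interface RunRecord {
  id: string
  projectId: string
  kind: RunKind
  status: 'queued' | 'running' | 'completed' | 'failed' // assumed status values
}

/** Options passed to the route plugin; the server supplies the callback. */
interface SiteSourceRoutesOptions {
  onSiteScanRequested?: (runId: string, projectId: string) => void
}

// Stand-in for the runs table.
const runs = new Map<string, RunRecord>()
let nextRunNumber = 1

/** Route-handler side: create the run record, then notify the server. */
function handleSiteScanRequest(projectId: string, opts: SiteSourceRoutesOptions): RunRecord {
  const record: RunRecord = {
    id: `run-${nextRunNumber++}`,
    projectId,
    kind: 'site-scan',
    status: 'queued',
  }
  runs.set(record.id, record)
  opts.onSiteScanRequested?.(record.id, projectId)
  return record
}

// Server side: the callback is wired to the execution function.
const executed: string[] = []
function executeSiteScan(runId: string, projectId: string): void {
  const record = runs.get(runId)
  if (!record) return
  record.status = 'running'
  executed.push(`${projectId}:${runId}`)
}

const run = handleSiteScanRequest('proj-1', { onSiteScanRequested: executeSiteScan })
```

Keeping the callback optional means a deployment that does not wire it (as with the cloud API today) still records the run but never executes it.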
Auth & Storage Model
Secrets in ~/.canonry/config.yaml, non-secret config in DB. Follows existing Google/Bing pattern.
# ~/.canonry/config.yaml
github:
  token: ghp_xxxxxxxxxxxx # PAT with repo read access

The siteConnections DB table stores only non-secret config (adapter name, repo, branch). The connection store follows the Bing pattern in server.ts:194-235.
No split-brain:
- Auth credentials: ~/.canonry/config.yaml (source of truth)
- Connection config (which repo, which branch): siteConnections table
- Scan results: sitePages, siteRecommendations tables
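A sketch of how the two sources might be combined at scan time, assuming the YAML file has already been parsed into an object. The type and function names here (CanonryFileConfig, SiteConnectionRow, resolveGitHubSource, toPublicConnection) are hypothetical; only the github.token key and the repo/branch fields come from the text above.

```typescript
/** Parsed ~/.canonry/config.yaml, reduced to the part this example needs. */
interface CanonryFileConfig {
  github?: { token?: string }
}

/** One siteConnections row: non-secret config only. */
interface SiteConnectionRow {
  projectId: string
  adapter: string
  config: { repo: string; branch: string }
}

interface ResolvedGitHubSource {
  repo: string
  branch: string
  token: string
}

/** Join the DB row with the secret from the config file just before a scan runs. */
function resolveGitHubSource(file: CanonryFileConfig, row: SiteConnectionRow): ResolvedGitHubSource {
  if (row.adapter !== 'github') {
    throw new Error(`unsupported adapter: ${row.adapter}`)
  }
  const token = file.github?.token
  if (!token) {
    throw new Error('github.token is missing from ~/.canonry/config.yaml')
  }
  return { repo: row.config.repo, branch: row.config.branch, token }
}

/** What an API response or log line may contain: never the token. */
function toPublicConnection(row: SiteConnectionRow): { adapter: string; repo: string; branch: string } {
  return { adapter: row.adapter, repo: row.config.repo, branch: row.config.branch }
}
```

Since the token never enters the row type, it cannot be written to the database or returned from a GET endpoint by accident.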
Database Schema
Three new tables in packages/db/src/schema.ts with corresponding migrations in migrate.ts:
- site_connections: 1 per project. Columns: adapter, config JSON (repo, branch), timestamps. Unique index on project_id.
- site_pages: per-scan page snapshots. Columns: url, path, title, structured_data_types, headings, meta_description, has_schema, source_ref, commit_sha. Indexed on project_id and scan_run_id.
- site_recommendations: per-scan recommendations. Columns: page_url, keyword_id, type, severity, title, description, source_ref, intent JSON, commit_sha, status, pr_url. Indexed on project_id, scan_run_id, and (project_id, status).
GitHub Adapter: Framework Detection
This is the hard part. Repo file trees don't reliably map to live URLs for dynamic routes, rewrites, or locales.
MVP: narrow framework support + explicit mapping fallback
| Framework | Detection | Content paths | Route mapping |
|---|---|---|---|
| Next.js (app) | next.config.* + app/ | app/**/page.{tsx,jsx,mdx} | Dir path = URL |
| Next.js (pages) | next.config.* + pages/ | pages/**/*.{tsx,jsx} | File path = URL |
| Hugo | hugo.toml/config.toml | content/**/*.md | Front matter slug or dir path |
| Astro | astro.config.* | src/pages/**/*.{astro,md,mdx} | File path = URL |
| Plain HTML | Fallback | **/*.html | File path = URL |
Explicitly NOT handled in MVP: dynamic routes with data fetching, rewrites/redirects, i18n routing, generateStaticParams.
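The table above can be sketched as two pure functions over a list of repo-relative paths. This is an illustration under stated assumptions, not the planned framework.ts: detectFramework and fileToUrl are hypothetical names, dynamic segments return null rather than a guessed URL, the Hugo branch uses the path only (front matter slug needs file content), and Next.js route groups are not treated specially.

```typescript
type Framework = 'next-app' | 'next-pages' | 'hugo' | 'astro' | 'html'

/** Pick a framework from marker files, in the table's order; plain HTML is the fallback. */
function detectFramework(paths: string[]): Framework {
  const has = (re: RegExp): boolean => paths.some((p) => re.test(p))
  if (has(/^next\.config\.[^/]+$/) && has(/^app\//)) return 'next-app'
  if (has(/^next\.config\.[^/]+$/) && has(/^pages\//)) return 'next-pages'
  if (has(/^(hugo|config)\.toml$/)) return 'hugo'
  if (has(/^astro\.config\.[^/]+$/)) return 'astro'
  return 'html'
}

/** 'blog/index' -> 'blog', 'index' -> '' */
function stripIndex(route: string): string {
  return route === 'index' ? '' : route.replace(/\/index$/, '')
}

/** Map a content file to a URL path, or null when the file is not a page in MVP scope. */
function fileToUrl(framework: Framework, filePath: string): string | null {
  if (filePath.includes('[')) return null // dynamic segment: not handled in MVP

  if (framework === 'next-app') {
    // Directory path is the URL: app/blog/page.tsx -> /blog
    const m = /^app\/(?:(.+)\/)?page\.(?:tsx|jsx|mdx)$/.exec(filePath)
    return m ? `/${m[1] ?? ''}` : null
  }
  if (framework === 'next-pages') {
    const m = /^pages\/(.+)\.(?:tsx|jsx)$/.exec(filePath)
    // Skip API routes and special files such as _app and _document.
    if (!m || m[1].startsWith('api/') || /(^|\/)_/.test(m[1])) return null
    return `/${stripIndex(m[1])}`
  }
  if (framework === 'astro') {
    const m = /^src\/pages\/(.+)\.(?:astro|md|mdx)$/.exec(filePath)
    return m ? `/${stripIndex(m[1])}` : null
  }
  if (framework === 'hugo') {
    const m = /^content\/(.+)\.md$/.exec(filePath)
    if (!m) return null
    // Treat _index.md (section page) like index.md; keep Hugo's trailing slash.
    const route = stripIndex(m[1].replace(/(^|\/)_index$/, '$1index'))
    return route ? `/${route}/` : '/'
  }
  // Plain HTML: the file path is the URL, with index.html collapsing to its directory.
  return /\.html$/.test(filePath) ? `/${filePath.replace(/(^|\/)index\.html$/, '$1')}` : null
}
```

Returning null for anything ambiguous keeps the scan honest: an unmapped file is simply absent from the page list instead of being attached to the wrong URL.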
Explicit mapping config for when auto-detection fails:
spec:
  siteSource:
    adapter: github
    repo: myorg/my-site
    contentPaths:
      - glob: "content/blog/**/*.md"
        urlPrefix: "/blog/"

Structured data extraction (pattern-match only, never execute):
- Regex <script type="application/ld+json"> blocks
- YAML front matter in MD/MDX
- Heading tags / # headings
- <meta name="description"> tags
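A minimal sketch of the four patterns above, assuming a source file is available as a string. extractFromSource and ExtractedSource are hypothetical names. The front matter is returned as raw text because parsing it would need a YAML library, and the meta pattern only handles the name-before-content attribute order.

```typescript
interface ExtractedSource {
  jsonLd: unknown[]
  frontMatter: string | null // raw YAML text; parsing is left to a YAML library
  headings: { level: number; text: string }[]
  metaDescription: string | null
}

function extractFromSource(source: string): ExtractedSource {
  // YAML front matter: a leading block delimited by --- lines.
  const fm = /^---\r?\n([\s\S]*?)\r?\n---[ \t]*(?:\r?\n|$)/.exec(source)
  const frontMatter = fm ? fm[1] : null
  const body = fm ? source.slice(fm[0].length) : source

  // JSON-LD: the script body is parsed as data, never evaluated.
  const jsonLd: unknown[] = []
  const scriptRe = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  let m: RegExpExecArray | null
  while ((m = scriptRe.exec(body)) !== null) {
    try {
      jsonLd.push(JSON.parse(m[1]))
    } catch {
      // Not literal JSON (for example a JSX expression); skip it.
    }
  }

  // Headings in document order: HTML <h1>..<h6> or Markdown # lines.
  const headings: { level: number; text: string }[] = []
  const headingRe = /<h([1-6])[^>]*>([\s\S]*?)<\/h\1>|^(#{1,6})[ \t]+(.+)$/gim
  while ((m = headingRe.exec(body)) !== null) {
    const level = m[1] ? Number(m[1]) : m[3].length
    const raw = m[1] ? m[2] : m[4]
    headings.push({ level, text: raw.replace(/<[^>]+>/g, '').trim() })
  }

  // Meta description: assumes the name attribute appears before content.
  const meta = /<meta[^>]*\bname=["']description["'][^>]*\bcontent=(["'])(.*?)\1/i.exec(body)

  return { jsonLd, frontMatter, headings, metaDescription: meta ? meta[2] : null }
}
```

Regex extraction will miss schema that is built at runtime, which matches the "never execute" constraint: the scan reports what is literally present in the source.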
Recommendation Engine (Phase 2)
Available data
querySnapshots stores: citationState, citedDomains, competitorOverlap, answerText, groundingSources. It does not store competitor page content or structured data.
Correlation logic
For each keyword:
- Find best-matching page (URL path terms, title/h1 match)
- If not cited → check what the page is missing (schema, meta, headings)
- If no matching page → content gap
- Competitor comparison limited to "competitor X is cited, you are not"; structured data comparison requires optional page fetch via existing site-fetch.ts
Recommendation types
| Type | Trigger | Severity |
|---|---|---|
| content-gap | Not-cited keyword, no matching page | High |
| missing-schema | Page targets keyword but lacks relevant schema | High |
| competitor-advantage | Competitor cited, yours isn't | High |
| weak-headings | Keyword missing from h1/h2 | Medium |
| no-meta-description | Page has no meta description | Medium |
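The correlation logic and the type table can be sketched together as one function per keyword. This is an illustration with simplifying assumptions: PageSummary and KeywordState are hypothetical reduced shapes, "best match" is scored as the fraction of keyword terms found in the URL, title, and h1, the 0.5 threshold is arbitrary, and "relevant schema" is reduced to a single hasSchema flag.

```typescript
type RecommendationType =
  | 'content-gap'
  | 'missing-schema'
  | 'competitor-advantage'
  | 'weak-headings'
  | 'no-meta-description'

interface PageSummary {
  url: string
  title: string | null
  h1: string | null
  h2: string[]
  hasSchema: boolean
  metaDescription: string | null
}
interface KeywordState { keyword: string; cited: boolean; competitorCited: boolean }
interface Recommendation {
  type: RecommendationType
  severity: 'high' | 'medium'
  keyword: string
  pageUrl: string | null
}

const terms = (text: string): string[] => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)

/** Fraction of the keyword's terms that appear in the given texts. */
function coverage(keyword: string, texts: (string | null)[]): number {
  const wanted = terms(keyword)
  if (wanted.length === 0) return 0
  const present = new Set(terms(texts.filter(Boolean).join(' ')))
  return wanted.filter((t) => present.has(t)).length / wanted.length
}

/** Best-matching page by URL path terms and title/h1, or null below the threshold. */
function findBestPage(keyword: string, pages: PageSummary[], minCoverage = 0.5): PageSummary | null {
  let best: PageSummary | null = null
  let bestScore = 0
  for (const page of pages) {
    const score = coverage(keyword, [page.url, page.title, page.h1])
    if (score >= minCoverage && score > bestScore) {
      best = page
      bestScore = score
    }
  }
  return best
}

function recommend(state: KeywordState, pages: PageSummary[]): Recommendation[] {
  if (state.cited) return [] // nothing to fix for this keyword
  const page = findBestPage(state.keyword, pages)
  const pageUrl = page ? page.url : null
  const rec = (type: RecommendationType, severity: 'high' | 'medium'): Recommendation => ({
    type,
    severity,
    keyword: state.keyword,
    pageUrl,
  })
  const out: Recommendation[] = []
  if (state.competitorCited) out.push(rec('competitor-advantage', 'high'))
  if (!page) return [...out, rec('content-gap', 'high')]
  if (!page.hasSchema) out.push(rec('missing-schema', 'high'))
  if (coverage(state.keyword, [page.h1, ...page.h2]) < 1) out.push(rec('weak-headings', 'medium'))
  if (!page.metaDescription) out.push(rec('no-meta-description', 'medium'))
  return out
}
```

A cited keyword produces no recommendations here, which mirrors the "if not cited" branch in the list above.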
API Endpoints
| Method | Path | Phase |
|---|---|---|
| PUT | /projects/:name/site-source | 1 |
| GET | /projects/:name/site-source | 1 |
| DELETE | /projects/:name/site-source | 1 |
| POST | /projects/:name/site-source/healthcheck | 1 |
| POST | /projects/:name/site-scan | 1 |
| GET | /projects/:name/pages | 1 |
| GET | /projects/:name/recommendations | 2 |
| PATCH | /projects/:name/recommendations/:id | 2 |
| POST | /projects/:name/recommendations/:id/apply | 3 |
Plus: OpenAPI catalog entries, API client methods, route index registration.
CLI Commands
# Phase 1
canonry site connect <project> --adapter github --repo owner/repo [--branch main]
canonry site disconnect <project>
canonry site status <project>
canonry site scan <project> [--wait] [--format json]
canonry site pages <project> [--format json]
# Phase 2
canonry site recommendations <project> [--severity high] [--status open] [--format json]
# Phase 3
canonry site apply-fix <project> <recommendation-id>

New files: commands/site.ts, cli-commands/site.ts. Register in cli-commands.ts.
Config-as-Code Round-Trip
Adding siteSource to configSpecSchema also requires changes to:
- packages/api-routes/src/apply.ts: handle in apply body, upsert siteConnections
- packages/api-routes/src/projects.ts: include in export response
- packages/contracts/src/project.ts: add to projectDtoSchema
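A dependency-free sketch of the round trip these three files implement: validate the siteSource block on apply, store it as a connection row, and export it back in the same shape. The hand-written validator stands in for whatever schema library the contracts package uses. parseSiteSource, toConnectionRow, and toExportedSpec are hypothetical names, and both the 'main' branch default and storing contentPaths in the row's config JSON are assumptions.

```typescript
interface ContentPathMapping { glob: string; urlPrefix: string }
interface SiteSourceSpec {
  adapter: 'github'
  repo: string
  branch: string
  contentPaths: ContentPathMapping[]
}
type ParseResult = { ok: true; value: SiteSourceSpec } | { ok: false; error: string }

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v)
}

/** Validate and normalize a siteSource block from an applied config. */
function parseSiteSource(input: unknown): ParseResult {
  if (!isRecord(input)) return { ok: false, error: 'siteSource must be an object' }
  if (input.adapter !== 'github') return { ok: false, error: 'only the github adapter is supported' }
  const repo = input.repo
  if (typeof repo !== 'string' || !/^[^/\s]+\/[^/\s]+$/.test(repo)) {
    return { ok: false, error: 'repo must look like owner/repo' }
  }
  const branch = typeof input.branch === 'string' ? input.branch : 'main' // assumed default
  const contentPaths: ContentPathMapping[] = []
  const rawPaths: unknown[] = Array.isArray(input.contentPaths) ? input.contentPaths : []
  for (const entry of rawPaths) {
    if (!isRecord(entry) || typeof entry.glob !== 'string' || typeof entry.urlPrefix !== 'string') {
      return { ok: false, error: 'each contentPaths entry needs glob and urlPrefix strings' }
    }
    contentPaths.push({ glob: entry.glob, urlPrefix: entry.urlPrefix })
  }
  return { ok: true, value: { adapter: 'github', repo, branch, contentPaths } }
}

interface ConnectionRow {
  projectId: string
  adapter: string
  config: { repo: string; branch: string; contentPaths: ContentPathMapping[] }
}

/** Apply side: the shape upserted into siteConnections (one row per project). */
function toConnectionRow(projectId: string, spec: SiteSourceSpec): ConnectionRow {
  return {
    projectId,
    adapter: spec.adapter,
    config: { repo: spec.repo, branch: spec.branch, contentPaths: spec.contentPaths },
  }
}

/** Export side: rebuild the spec block so that re-applying it is a no-op. */
function toExportedSpec(row: ConnectionRow): Record<string, unknown> {
  return { adapter: row.adapter, ...row.config }
}
```

The property worth testing in the real implementation is idempotence: exporting a project and applying the result should leave the connection row unchanged.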
What's Explicitly Out of Scope
- Pre-publish analysis on draft PRs: requires ref-specific scanning + GitHub App/check-run integration
- Competitor structured data comparison: optional enrichment using site-fetch.ts, not a requirement
- Cloud mode wiring: apps/api/ and apps/worker/ don't wire run callbacks today
- Non-Git adapters: interface supports them, no code written
Implementation Steps
Phase 1: Read-only GitHub scan (19 steps)
- packages/contracts/src/site-source.ts (new): shared types
- packages/contracts/src/run.ts: add 'site-scan'
- packages/contracts/src/config-schema.ts: add siteSource
- packages/contracts/src/project.ts: add to DTO
- packages/integration-github/ (new package): adapter
- packages/db/src/schema.ts: siteConnections, sitePages tables
- packages/db/src/migrate.ts: migrations
- packages/api-routes/src/site-source.ts (new): route plugin
- packages/api-routes/src/index.ts: register routes
- packages/api-routes/src/openapi.ts: endpoint docs
- packages/api-routes/src/apply.ts: handle siteSource
- packages/api-routes/src/projects.ts: export siteSource
- packages/canonry/src/client.ts: API client methods
- packages/canonry/src/site-scan.ts (new): execution function
- packages/canonry/src/server.ts: connection store + callback
- packages/canonry/src/commands/site.ts (new): CLI handler
- packages/canonry/src/cli-commands/site.ts (new): CLI specs
- packages/canonry/src/cli-commands.ts: register
- Tests: framework detection, file→URL mapping, extraction, API endpoints
Phase 2: Recommendations (7 steps)
- packages/db/src/schema.ts: siteRecommendations table + migration
- packages/canonry/src/site-scan.ts: recommendation engine
- packages/api-routes/src/site-source.ts: recommendation endpoints
- packages/canonry/src/client.ts: recommendation methods
- packages/canonry/src/commands/site.ts: recommendations subcommand
- Tests: recommendation generation with mock data
Phase 3: Apply-fix (4 steps)
- packages/integration-github/src/remediate.ts: SiteSourceRemediator
- packages/api-routes/src/site-source.ts: apply endpoint
- packages/canonry/src/commands/site.ts: apply-fix subcommand
- Tests: patch generation
Reusable Existing Code
- packages/canonry/src/site-fetch.ts: SSRF-safe HTML fetch
- packages/api-routes/src/run-queue.ts: queueRunIfProjectIdle()
- packages/api-routes/src/helpers.ts: resolveProject(), writeAuditLog()
- server.ts:194-235: Bing connection store pattern
- server.ts:333-358: GSC sync callback wiring pattern