feat: GitHub site integration with citation-correlated recommendations #152

@arberx

Summary

Connect a GitHub repo to a Canonry project so that site scans can cross-reference citation data with actual page content — producing file-level, citation-correlated recommendations and (in a later phase) opening PRs with fixes.

The adapter model is generic so WordPress, Shopify, and GitLab integrations slot in later without architectural changes.

From "you lost a citation" → "you lost a citation, and here's what to fix in src/app/blog/[slug]/page.tsx to win it back."


Scope: 4 Phases

Each phase is independently shippable.

| Phase | Description |
|---|---|
| 1 | GitHub PAT read-only repo scan + site-scan run kind |
| 2 | Page-keyword matching and citation-correlated recommendations |
| 3 | apply-fix for one narrow fix class (JSON-LD injection via PR) |
| 4 | GitHub App, draft-PR analysis, non-Git adapters (future, not scoped) |

Architecture

Package: packages/integration-github/

Follows the existing packages/integration-google/, packages/integration-bing/ convention. Depends on @ainyc/canonry-contracts only.

packages/integration-github/
  package.json          # @ainyc/canonry-integration-github
  src/
    index.ts            # public exports
    adapter.ts          # GitHubSiteSourceAdapter
    framework.ts        # framework detection + file→URL routing
    extract.ts          # structured data / heading / meta extraction from source files
    types.ts            # adapter-specific types (NOT shared DTOs — those go in contracts)
  test/
    framework.test.ts
    extract.test.ts

Shared types in packages/contracts/

// packages/contracts/src/site-source.ts

/** Generic adapter interface — GitHub is first impl, WordPress/Shopify/GitLab follow same shape */
interface SiteSourceAdapter {
  name: string
  displayName: string
  healthcheck(config: SiteSourceConfig): Promise<{ ok: boolean; message: string }>
  listPages(config: SiteSourceConfig): Promise<SitePage[]>
  getPage(config: SiteSourceConfig, urlOrPath: string): Promise<SitePage | null>
}

/** Remediation is a separate, opt-in interface — not all adapters support it */
interface SiteSourceRemediator {
  applyFix(config: SiteSourceConfig, recommendation: SiteRecommendationIntent): Promise<SiteFixResult>
}

interface SitePage {
  url: string
  path: string
  title: string | null
  content: string | null
  structuredData: StructuredDataItem[]
  headings: HeadingNode[]
  metaDescription: string | null
  lastModified: string | null
  sourceRef: SourceReference | null
}

/** Discriminated by adapter type — Git has file paths, CMS has post IDs */
type SourceReference =
  | { type: 'git-file'; filePath: string; lineStart?: number; lineEnd?: number }
  | { type: 'cms-post'; identifier: string; editUrl?: string }
  | { type: 'cms-product'; identifier: string; editUrl?: string }

/** Recommendation intent — enough to regenerate the fix at apply-time, not a stale patch */
interface SiteRecommendationIntent {
  type: string
  targetFile: string
  schemaType?: string
  content?: Record<string, unknown>
}

type SiteFixResult =
  | { applied: true; prUrl: string }
  | { applied: true; editUrl: string }
  | { applied: false; error: string }

Key design decisions

  1. Scan vs remediation are separate interfaces. SiteSourceAdapter handles read-only scanning. SiteSourceRemediator handles fix application. CMS adapters won't produce unified diffs — they'll produce content updates via their own API. This avoids Git-centric fields in the base interface.

  2. Store intent, not patches. Recommendations store SiteRecommendationIntent (what to fix, where, what schema type) plus the commitSha the scan was against. At apply-fix time, the adapter generates the fix against the current repo state. This avoids stale patches in the DB as the repo moves.

  3. Dedicated endpoint, not the runs route. The current POST /projects/:name/runs hard-gates on kind !== 'answer-visibility'. Rather than plumbing a new callback through RunRoutesOptions, create POST /projects/:name/site-scan — mirroring how GSC sync and sitemap inspection have their own trigger endpoints with dedicated callbacks.
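
Decision 1 can be illustrated with a minimal sketch of how calling code might detect the opt-in remediation capability and summarize a SiteFixResult. The types are trimmed copies of the shared types above; `supportsRemediation`, `describeFixResult`, and the trimmed `SiteSourceConfig` shape are hypothetical and not part of the proposal.

```typescript
// Trimmed local copies of the shared types, so the sketch is self-contained.
interface SiteSourceConfig { adapter: string; repo: string; branch: string } // assumed shape

interface SiteRecommendationIntent {
  type: string
  targetFile: string
  schemaType?: string
  content?: Record<string, unknown>
}

type SiteFixResult =
  | { applied: true; prUrl: string }
  | { applied: true; editUrl: string }
  | { applied: false; error: string }

interface SiteSourceAdapter { name: string; displayName: string }

interface SiteSourceRemediator {
  applyFix(config: SiteSourceConfig, recommendation: SiteRecommendationIntent): Promise<SiteFixResult>
}

/** Hypothetical helper: true when an adapter also implements the remediation interface. */
function supportsRemediation(
  adapter: SiteSourceAdapter & Partial<SiteSourceRemediator>,
): adapter is SiteSourceAdapter & SiteSourceRemediator {
  return typeof adapter.applyFix === 'function'
}

/** Hypothetical helper: narrows the result union into a one-line summary. */
function describeFixResult(result: SiteFixResult): string {
  if (!result.applied) return `not applied: ${result.error}`
  return 'prUrl' in result ? `PR opened: ${result.prUrl}` : `edit at: ${result.editUrl}`
}
```

A read-only adapter simply omits `applyFix`, so the apply endpoint can reject the request before any Git-specific code runs.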


Run Kind Wiring: site-scan

This is not a drop-in. The current run route (packages/api-routes/src/runs.ts:26) hard-gates on kind !== 'answer-visibility'. The job runner in server.ts:375 only handles answer-visibility. Cloud API (apps/api/src/app.ts) doesn't wire run callbacks at all.

Approach: Follow the GSC sync pattern (google.ts, server.ts:333):

  1. Route handler creates a run record with kind: 'site-scan'
  2. Calls opts.onSiteScanRequested(runId, projectId) callback
  3. server.ts wires that callback to executeSiteScan()
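
The three steps above can be sketched without the web framework. This is a rough illustration under stated assumptions: `createSiteScanTrigger`, `SiteScanRouteOptions`, and the synchronous `createRun` hook are hypothetical names, and the body of `executeSiteScan` is a stand-in. Only `onSiteScanRequested` and `executeSiteScan` are names taken from this issue.

```typescript
interface SiteScanRouteOptions {
  /** Assumed persistence hook: inserts a run row with kind 'site-scan' and returns its id. */
  createRun(projectId: string, kind: 'site-scan'): string
  /** Left optional in this sketch so a host that wires nothing still type-checks. */
  onSiteScanRequested?: (runId: string, projectId: string) => void
}

/** Steps 1 and 2: create the run record, then hand off to the host callback. */
function createSiteScanTrigger(opts: SiteScanRouteOptions) {
  return (projectId: string): { runId: string; queued: boolean } => {
    const runId = opts.createRun(projectId, 'site-scan')
    if (!opts.onSiteScanRequested) return { runId, queued: false }
    opts.onSiteScanRequested(runId, projectId)
    return { runId, queued: true }
  }
}

// Step 3: the host (server.ts in this issue) wires the callback to the executor.
const executedScans: string[] = []
async function executeSiteScan(runId: string, projectId: string): Promise<void> {
  executedScans.push(`${projectId}:${runId}`) // stand-in for the real scan
}

let runCounter = 0
const triggerSiteScan = createSiteScanTrigger({
  createRun: () => `run-${++runCounter}`,
  onSiteScanRequested: (runId, projectId) => {
    // Fire and forget: the HTTP response should not wait for the scan to finish.
    executeSiteScan(runId, projectId).catch(() => {
      /* a real handler would mark the run as failed here */
    })
  },
})
```

Keeping the callback on the route options means the route plugin never imports the adapter package, which matches how the other trigger endpoints are described.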

Files:

  • packages/contracts/src/run.ts:7 — add 'site-scan' to runKindSchema
  • packages/api-routes/src/site-source.ts (new) — route plugin with callback
  • packages/api-routes/src/index.ts:143+ — register plugin
  • packages/canonry/src/server.ts:325+ — wire callback

Auth & Storage Model

Secrets in ~/.canonry/config.yaml, non-secret config in DB. Follows existing Google/Bing pattern.

# ~/.canonry/config.yaml
github:
  token: ghp_xxxxxxxxxxxx    # PAT with repo read access

siteConnections DB table stores only non-secret config (adapter name, repo, branch). Connection store follows the Bing pattern in server.ts:194-235.

No split-brain:

  • Auth credentials: ~/.canonry/config.yaml (source of truth)
  • Connection config (which repo, which branch): siteConnections table
  • Scan results: sitePages, siteRecommendations tables
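
As a rough sketch of how the two sources could be joined at call time, the function below merges the non-secret DB row with the token read from the config file. `CanonryConfigFile`, `SiteConnectionRow`, `ResolvedConnection`, and `resolveConnection` are hypothetical names, and the row shape is an assumption based on the description above.

```typescript
// Parsed shape of ~/.canonry/config.yaml (only the key this sketch needs).
interface CanonryConfigFile {
  github?: { token?: string }
}

// Assumed row shape for the siteConnections table: non-secret fields only.
interface SiteConnectionRow {
  projectId: string
  adapter: string
  config: { repo: string; branch: string }
}

type ResolvedConnection =
  | { ok: true; adapter: string; repo: string; branch: string; token: string }
  | { ok: false; message: string }

/** Joins the DB row with the file-based secret on each call; the token is never persisted. */
function resolveConnection(
  file: CanonryConfigFile,
  row: SiteConnectionRow | undefined,
): ResolvedConnection {
  if (!row) return { ok: false, message: 'No site source connected for this project' }
  if (row.adapter !== 'github') return { ok: false, message: `Unsupported adapter: ${row.adapter}` }
  const token = file.github?.token
  if (!token) return { ok: false, message: 'github.token is missing from ~/.canonry/config.yaml' }
  return { ok: true, adapter: row.adapter, ...row.config, token }
}
```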

Database Schema

Three new tables in packages/db/src/schema.ts with corresponding migrations in migrate.ts:

  • site_connections — 1 per project: adapter, config JSON (repo, branch), timestamps. Unique index on project_id.
  • site_pages — per-scan page snapshots: url, path, title, structured_data_types, headings, meta_description, has_schema, source_ref, commit_sha. Indexed on project_id and scan_run_id.
  • site_recommendations — per-scan recommendations: page_url, keyword_id, type, severity, title, description, source_ref, intent JSON, commit_sha, status, pr_url. Indexed on project_id, scan_run_id, and (project_id, status).

GitHub Adapter: Framework Detection

This is the hard part. Repo file trees don't reliably map to live URLs for dynamic routes, rewrites, or locales.

MVP: narrow framework support + explicit mapping fallback

| Framework | Detection | Content paths | Route mapping |
|---|---|---|---|
| Next.js (app) | next.config.* + app/ | app/**/page.{tsx,jsx,mdx} | Dir path = URL |
| Next.js (pages) | next.config.* + pages/ | pages/**/*.{tsx,jsx} | File path = URL |
| Hugo | hugo.toml/config.toml | content/**/*.md | Front matter slug or dir path |
| Astro | astro.config.* | src/pages/**/*.{astro,md,mdx} | File path = URL |
| Plain HTML | Fallback | **/*.html | File path = URL |

Explicitly NOT handled in MVP: dynamic routes with data fetching, rewrites/redirects, i18n routing, generateStaticParams.
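
A minimal sketch of the file-to-URL routing for four of the rows above. Hugo is left out because its mapping depends on front matter. `Framework` and `filePathToUrl` are hypothetical names, and the rules are not exhaustive: the sketch assumes dynamic segments return null (per the MVP exclusions) and that Next.js route groups, `_app`/`_document`, and `pages/api` are filtered out.

```typescript
type Framework = 'next-app' | 'next-pages' | 'astro' | 'html'

const toUrl = (segments: string[]): string => '/' + segments.join('/')

/** `index` files map to their directory; any other name becomes the last URL segment. */
const fromFile = (dirs: string[], name: string): string =>
  toUrl(name === 'index' ? dirs : [...dirs, name])

/** Returns the URL path for a repo file, or null when the file is not a routable page. */
function filePathToUrl(framework: Framework, filePath: string): string | null {
  // Dynamic segments such as [slug] are out of MVP scope, so they are skipped, not guessed.
  if (filePath.includes('[')) return null

  const dirs = filePath.split('/')
  const file = dirs.pop() ?? ''
  // File name without its extension, or null when the extension is not in the allowed set.
  const stem = (extensions: string): string | null => {
    const match = new RegExp(`^(.+)\\.(?:${extensions})$`).exec(file)
    return match?.[1] ?? null
  }

  switch (framework) {
    case 'next-app': {
      if (dirs[0] !== 'app' || stem('tsx|jsx|mdx') !== 'page') return null
      // Route groups like (marketing) organize files without adding a URL segment.
      return toUrl(dirs.slice(1).filter((d) => !/^\(.+\)$/.test(d)))
    }
    case 'next-pages': {
      const name = stem('tsx|jsx')
      // _app/_document are framework files and pages/api holds API routes, not pages.
      if (dirs[0] !== 'pages' || dirs[1] === 'api' || !name || name.startsWith('_')) return null
      return fromFile(dirs.slice(1), name)
    }
    case 'astro': {
      const name = stem('astro|md|mdx')
      if (dirs[0] !== 'src' || dirs[1] !== 'pages' || !name) return null
      return fromFile(dirs.slice(2), name)
    }
    case 'html': {
      if (!stem('html')) return null
      // Plain HTML keeps the file name in the URL, except for index.html.
      return file === 'index.html' ? toUrl(dirs) : toUrl([...dirs, file])
    }
  }
}
```

Returning null rather than a best guess keeps unmapped files visible, so they can be routed through the explicit mapping config instead.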

Explicit mapping config for when auto-detection fails:

spec:
  siteSource:
    adapter: github
    repo: myorg/my-site
    contentPaths:
      - glob: "content/blog/**/*.md"
        urlPrefix: "/blog/"
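
One possible reading of the contentPaths mapping, sketched below: a file matching `glob` has the glob's static directory prefix and its extension stripped, and the remainder is appended to `urlPrefix`. That interpretation is an assumption, since this issue does not define the exact semantics. `ContentPathRule`, `globToRegExp`, and `mapExplicit` are hypothetical names, and the glob support covers only `*` and `**/`.

```typescript
interface ContentPathRule { glob: string; urlPrefix: string }

/** Minimal glob support for this sketch: `**/` (any depth) and `*` (within one segment). */
function globToRegExp(glob: string): RegExp {
  const pattern = glob
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters, leaving * alone
    .replace(/\*\*\//g, '\u0000')          // park `**/` behind a placeholder
    .replace(/\*/g, '[^/]*')               // `*` stays inside one path segment
    .replace(/\u0000/g, '(?:.*/)?')        // `**/` matches zero or more directories
  return new RegExp(`^${pattern}$`)
}

/** Directory portion of the glob that precedes the first wildcard. */
function staticPrefix(glob: string): string {
  const wildcard = glob.indexOf('*')
  const head = wildcard === -1 ? glob : glob.slice(0, wildcard)
  return head.slice(0, head.lastIndexOf('/') + 1)
}

/** Returns the mapped URL for the first matching rule, or null when no rule applies. */
function mapExplicit(filePath: string, rules: ContentPathRule[]): string | null {
  for (const rule of rules) {
    if (!globToRegExp(rule.glob).test(filePath)) continue
    const rest = filePath
      .slice(staticPrefix(rule.glob).length)
      .replace(/\.[^./]+$/, '')    // drop the extension
      .replace(/(^|\/)index$/, '') // index files map to their directory
    const base = rule.urlPrefix.replace(/\/+$/, '')
    return rest ? `${base}/${rest}` : `${base}/`
  }
  return null
}
```

With the rule from the config above, `content/blog/2024/launch.md` would map to `/blog/2024/launch` under this reading.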

Structured data extraction — pattern-match only, never execute:

  • Regex <script type="application/ld+json"> blocks
  • YAML front matter in MD/MDX
  • Heading tags / # headings
  • <meta name="description"> tags
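
A minimal sketch of pattern-match extraction for three of the items above (YAML front matter is omitted because it would need a YAML parser). `ExtractedPage` and `extractFromSource` are hypothetical names. The sketch only reads JSON-LD that is literal JSON in the source; templated blocks, such as a JSX expression that builds the JSON at runtime, fail to parse and are skipped rather than evaluated.

```typescript
interface ExtractedPage {
  jsonLdTypes: string[]
  headings: { level: number; text: string }[]
  metaDescription: string | null
}

/** Pulls schema types, headings, and the meta description out of raw source text. */
function extractFromSource(source: string): ExtractedPage {
  let m: RegExpExecArray | null

  const jsonLdTypes: string[] = []
  const scriptRe = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  while ((m = scriptRe.exec(source)) !== null) {
    try {
      const parsed: unknown = JSON.parse(m[1] ?? '')
      const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed]
      for (const item of items) {
        const type = (item as { '@type'?: unknown } | null)?.['@type']
        if (typeof type === 'string') jsonLdTypes.push(type)
      }
    } catch {
      // Templated or malformed JSON-LD: skip it, never execute it.
    }
  }

  // HTML headings are collected first, then Markdown ones, so order is not document order.
  const headings: { level: number; text: string }[] = []
  const htmlHeadingRe = /<h([1-6])[^>]*>([\s\S]*?)<\/h\1>/gi
  while ((m = htmlHeadingRe.exec(source)) !== null) {
    headings.push({ level: Number(m[1]), text: (m[2] ?? '').replace(/<[^>]+>/g, '').trim() })
  }
  const mdHeadingRe = /^(#{1,6})[ \t]+(.+)$/gm
  while ((m = mdHeadingRe.exec(source)) !== null) {
    headings.push({ level: (m[1] ?? '').length, text: (m[2] ?? '').trim() })
  }

  const metaTag = /<meta\b[^>]*\bname=["']description["'][^>]*>/i.exec(source)
  const content = metaTag ? /\bcontent=(["'])([\s\S]*?)\1/i.exec(metaTag[0]) : null

  return { jsonLdTypes, headings, metaDescription: content?.[2] ?? null }
}
```

The Markdown heading pattern would also match `#` comment lines inside code fences or front matter, so a real extractor would likely need to strip those regions first.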

Recommendation Engine (Phase 2)

Available data

querySnapshots stores: citationState, citedDomains, competitorOverlap, answerText, groundingSources. It does not store competitor page content or structured data.

Correlation logic

For each keyword:

  1. Find best-matching page (URL path terms, title/h1 match)
  2. If not cited → check what the page is missing (schema, meta, headings)
  3. If no matching page → content gap
  4. Competitor comparison limited to "competitor X is cited, you are not" — structured data comparison requires optional page fetch via existing site-fetch.ts
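
Step 1 could be approximated with simple term overlap, as in the sketch below. The scoring rule (fraction of keyword terms found in the URL path, title, or h1) and the 0.5 threshold are assumptions chosen for illustration, not something this issue specifies. `PageSummary`, `matchScore`, and `findBestPage` are hypothetical names.

```typescript
interface PageSummary { url: string; title: string | null; h1: string | null }

/** Lowercases and splits on non-alphanumerics, dropping single-character tokens. */
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 1)
}

/** Fraction of keyword terms found in the page's URL path, title, or h1 (0 to 1). */
function matchScore(keyword: string, page: PageSummary): number {
  const terms = tokenize(keyword)
  if (terms.length === 0) return 0
  const haystack = new Set(tokenize(`${page.url} ${page.title ?? ''} ${page.h1 ?? ''}`))
  return terms.filter((t) => haystack.has(t)).length / terms.length
}

/** Returns the highest-scoring page, or null when nothing clears the threshold (content gap). */
function findBestPage(keyword: string, pages: PageSummary[], minScore = 0.5): PageSummary | null {
  let best: PageSummary | null = null
  let bestScore = 0
  for (const page of pages) {
    const score = matchScore(keyword, page)
    if (score > bestScore) {
      best = page
      bestScore = score
    }
  }
  return bestScore >= minScore ? best : null
}
```

A null result corresponds to step 3: no page targets the keyword, so the gap is in content rather than markup.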

Recommendation types

| Type | Trigger | Severity |
|---|---|---|
| content-gap | Not-cited keyword, no matching page | High |
| missing-schema | Page targets keyword but lacks relevant schema | High |
| competitor-advantage | Competitor cited, yours isn't | High |
| weak-headings | Keyword missing from h1/h2 | Medium |
| no-meta-description | Page has no meta description | Medium |
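
The triggers in the table above could be evaluated per keyword roughly as follows. This sketch makes simplifying assumptions: "lacks relevant schema" is reduced to a single `hasSchema` flag, and "keyword missing from h1/h2" means at least one keyword term appears in no h1 or h2. `KeywordObservation`, `MatchedPage`, `Recommendation`, and `recommend` are hypothetical names.

```typescript
type RecommendationType =
  | 'content-gap'
  | 'missing-schema'
  | 'competitor-advantage'
  | 'weak-headings'
  | 'no-meta-description'

interface KeywordObservation { keyword: string; cited: boolean; competitorCited: boolean }

interface MatchedPage {
  url: string
  hasSchema: boolean
  headings: { level: number; text: string }[]
  metaDescription: string | null
}

interface Recommendation {
  type: RecommendationType
  severity: 'high' | 'medium'
  keyword: string
  pageUrl: string | null
}

/** Applies the trigger table to one keyword and its best-matching page (null = no match). */
function recommend(obs: KeywordObservation, page: MatchedPage | null): Recommendation[] {
  if (obs.cited) return [] // already cited: nothing to fix for this keyword
  const out: Recommendation[] = []
  const add = (type: RecommendationType, severity: 'high' | 'medium') =>
    out.push({ type, severity, keyword: obs.keyword, pageUrl: page ? page.url : null })

  if (obs.competitorCited) add('competitor-advantage', 'high')
  if (!page) {
    add('content-gap', 'high')
    return out
  }
  if (!page.hasSchema) add('missing-schema', 'high')

  const topHeadings = page.headings
    .filter((h) => h.level <= 2)
    .map((h) => h.text.toLowerCase())
    .join(' ')
  const terms = obs.keyword.toLowerCase().split(/\s+/).filter(Boolean)
  if (!terms.every((t) => topHeadings.includes(t))) add('weak-headings', 'medium')

  if (!page.metaDescription) add('no-meta-description', 'medium')
  return out
}
```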

API Endpoints

| Method | Path | Phase |
|---|---|---|
| PUT | /projects/:name/site-source | 1 |
| GET | /projects/:name/site-source | 1 |
| DELETE | /projects/:name/site-source | 1 |
| POST | /projects/:name/site-source/healthcheck | 1 |
| POST | /projects/:name/site-scan | 1 |
| GET | /projects/:name/pages | 1 |
| GET | /projects/:name/recommendations | 2 |
| PATCH | /projects/:name/recommendations/:id | 2 |
| POST | /projects/:name/recommendations/:id/apply | 3 |

Plus: OpenAPI catalog entries, API client methods, route index registration.


CLI Commands

# Phase 1
canonry site connect <project> --adapter github --repo owner/repo [--branch main]
canonry site disconnect <project>
canonry site status <project>
canonry site scan <project> [--wait] [--format json]
canonry site pages <project> [--format json]

# Phase 2
canonry site recommendations <project> [--severity high] [--status open] [--format json]

# Phase 3
canonry site apply-fix <project> <recommendation-id>

New files: commands/site.ts, cli-commands/site.ts. Register in cli-commands.ts.


Config-as-Code Round-Trip

Adding siteSource to configSpecSchema also requires changes to:

  • packages/api-routes/src/apply.ts — handle in apply body, upsert siteConnections
  • packages/api-routes/src/projects.ts — include in export response
  • packages/contracts/src/project.ts — add to projectDtoSchema

What's Explicitly Out of Scope

  • Pre-publish analysis on draft PRs — requires ref-specific scanning + GitHub App/check-run integration
  • Competitor structured data comparison — optional enrichment using site-fetch.ts, not a requirement
  • Cloud mode wiring: apps/api/ and apps/worker/ don't wire run callbacks today
  • Non-Git adapters — interface supports them, no code written

Implementation Steps

Phase 1: Read-only GitHub scan (19 steps)

  1. packages/contracts/src/site-source.ts (new) — shared types
  2. packages/contracts/src/run.ts — add 'site-scan'
  3. packages/contracts/src/config-schema.ts — add siteSource
  4. packages/contracts/src/project.ts — add to DTO
  5. packages/integration-github/ (new package) — adapter
  6. packages/db/src/schema.ts: siteConnections, sitePages tables
  7. packages/db/src/migrate.ts — migrations
  8. packages/api-routes/src/site-source.ts (new) — route plugin
  9. packages/api-routes/src/index.ts — register routes
  10. packages/api-routes/src/openapi.ts — endpoint docs
  11. packages/api-routes/src/apply.ts — handle siteSource
  12. packages/api-routes/src/projects.ts — export siteSource
  13. packages/canonry/src/client.ts — API client methods
  14. packages/canonry/src/site-scan.ts (new) — execution function
  15. packages/canonry/src/server.ts — connection store + callback
  16. packages/canonry/src/commands/site.ts (new) — CLI handler
  17. packages/canonry/src/cli-commands/site.ts (new) — CLI specs
  18. packages/canonry/src/cli-commands.ts — register
  19. Tests: framework detection, file→URL mapping, extraction, API endpoints

Phase 2: Recommendations (6 steps)

  1. packages/db/src/schema.ts: siteRecommendations table + migration
  2. packages/canonry/src/site-scan.ts — recommendation engine
  3. packages/api-routes/src/site-source.ts — recommendation endpoints
  4. packages/canonry/src/client.ts — recommendation methods
  5. packages/canonry/src/commands/site.ts: recommendations subcommand
  6. Tests: recommendation generation with mock data

Phase 3: Apply-fix (4 steps)

  1. packages/integration-github/src/remediate.ts: SiteSourceRemediator
  2. packages/api-routes/src/site-source.ts — apply endpoint
  3. packages/canonry/src/commands/site.ts: apply-fix subcommand
  4. Tests: patch generation

Reusable Existing Code

  • packages/canonry/src/site-fetch.ts — SSRF-safe HTML fetch
  • packages/api-routes/src/run-queue.ts: queueRunIfProjectIdle()
  • packages/api-routes/src/helpers.ts: resolveProject(), writeAuditLog()
  • server.ts:194-235 — Bing connection store pattern
  • server.ts:333-358 — GSC sync callback wiring pattern
