feat: GitHub site integration with citation-correlated recommendations #152

## Summary

Connect a GitHub repo to a Canonry project so that site scans can cross-reference citation data with actual page content — producing file-level, citation-correlated recommendations and (in a later phase) opening PRs with fixes.

The adapter model is generic so WordPress, Shopify, and GitLab integrations slot in later without architectural changes.

**From "you lost a citation" → "you lost a citation, and here's what to fix in `src/app/blog/[slug]/page.tsx` to win it back."**

---

## Scope: 4 Phases

Each phase is independently shippable.

| Phase | Description |
|-------|-------------|
| **1** | GitHub PAT read-only repo scan + `site-scan` run kind |
| **2** | Page-keyword matching and citation-correlated recommendations |
| **3** | `apply-fix` for one narrow fix class (JSON-LD injection via PR) |
| **4** | GitHub App, draft-PR analysis, non-Git adapters *(future, not scoped)* |

---

## Architecture

### Package: `packages/integration-github/`

Follows the convention of the existing `packages/integration-google/` and `packages/integration-bing/` packages. Depends on `@ainyc/canonry-contracts` only.

```
packages/integration-github/
  package.json          # @ainyc/canonry-integration-github
  src/
    index.ts            # public exports
    adapter.ts          # GitHubSiteSourceAdapter
    framework.ts        # framework detection + file→URL routing
    extract.ts          # structured data / heading / meta extraction from source files
    types.ts            # adapter-specific types (NOT shared DTOs — those go in contracts)
  test/
    framework.test.ts
    extract.test.ts
```

### Shared types in `packages/contracts/`

```typescript
// packages/contracts/src/site-source.ts

/** Generic adapter interface — GitHub is first impl, WordPress/Shopify/GitLab follow same shape */
interface SiteSourceAdapter {
  name: string
  displayName: string
  healthcheck(config: SiteSourceConfig): Promise<{ ok: boolean; message: string }>
  listPages(config: SiteSourceConfig): Promise<SitePage[]>
  getPage(config: SiteSourceConfig, urlOrPath: string): Promise<SitePage | null>
}

/** Remediation is a separate, opt-in interface — not all adapters support it */
interface SiteSourceRemediator {
  applyFix(config: SiteSourceConfig, recommendation: SiteRecommendationIntent): Promise<SiteFixResult>
}

interface SitePage {
  url: string
  path: string
  title: string | null
  content: string | null
  structuredData: StructuredDataItem[]
  headings: HeadingNode[]
  metaDescription: string | null
  lastModified: string | null
  sourceRef: SourceReference | null
}

/** Discriminated by adapter type — Git has file paths, CMS has post IDs */
type SourceReference =
  | { type: 'git-file'; filePath: string; lineStart?: number; lineEnd?: number }
  | { type: 'cms-post'; identifier: string; editUrl?: string }
  | { type: 'cms-product'; identifier: string; editUrl?: string }

/** Recommendation intent — enough to regenerate the fix at apply-time, not a stale patch */
interface SiteRecommendationIntent {
  type: string
  targetFile: string
  schemaType?: string
  content?: Record<string, unknown>
}

type SiteFixResult =
  | { applied: true; prUrl: string }
  | { applied: true; editUrl: string }
  | { applied: false; error: string }
```

### Key design decisions

1. **Scan vs remediation are separate interfaces.** `SiteSourceAdapter` handles read-only scanning. `SiteSourceRemediator` handles fix application. CMS adapters won't produce unified diffs — they'll produce content updates via their own API. This avoids Git-centric fields in the base interface.

2. **Store intent, not patches.** Recommendations store `SiteRecommendationIntent` (what to fix, where, what schema type) plus the `commitSha` the scan was against. At `apply-fix` time, the adapter generates the fix against the current repo state. This avoids stale patches in the DB as the repo moves.

3. **Dedicated endpoint, not the runs route.** The current `POST /projects/:name/runs` hard-gates on `kind !== 'answer-visibility'`. Rather than plumbing a new callback through `RunRoutesOptions`, create `POST /projects/:name/site-scan` — mirroring how GSC sync and sitemap inspection have their own trigger endpoints with dedicated callbacks.
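Decisions 1 and 2 can be sketched together. This is a minimal illustration, not the planned implementation: the types are simplified stand-ins for the contract types above, and `isRemediator`, `applyRecommendation`, and `StoredRecommendation` are hypothetical names introduced here.

```typescript
// Simplified stand-ins for the contract types above (assumptions for illustration).
interface SiteSourceConfig { repo: string; branch: string }
interface SiteRecommendationIntent { type: string; targetFile: string; schemaType?: string }
type SiteFixResult =
  | { applied: true; prUrl: string }
  | { applied: false; error: string }

interface SiteSourceAdapter { name: string }
interface SiteSourceRemediator {
  applyFix(config: SiteSourceConfig, intent: SiteRecommendationIntent): Promise<SiteFixResult>
}

// Hypothetical stored row: the intent plus the commit the scan ran against.
interface StoredRecommendation { intent: SiteRecommendationIntent; commitSha: string }

// Type guard: remediation is opt-in, so callers check before applying.
function isRemediator(a: SiteSourceAdapter): a is SiteSourceAdapter & SiteSourceRemediator {
  const candidate = a as unknown as Partial<SiteSourceRemediator>
  return typeof candidate.applyFix === 'function'
}

async function applyRecommendation(
  adapter: SiteSourceAdapter,
  config: SiteSourceConfig,
  rec: StoredRecommendation,
): Promise<SiteFixResult> {
  if (!isRemediator(adapter)) {
    return { applied: false, error: `adapter "${adapter.name}" does not support apply-fix` }
  }
  // Only the intent is passed on; the adapter regenerates the change against
  // the current repo state instead of replaying a stored patch.
  return adapter.applyFix(config, rec.intent)
}
```

The stored `commitSha` is not used to build the fix in this sketch; it could be compared with the current head to warn that the repo has moved since the scan.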

---

## Run Kind Wiring: `site-scan`

**This is not a drop-in.** The current run route (`packages/api-routes/src/runs.ts:26`) hard-gates on `kind !== 'answer-visibility'`. The job runner in `server.ts:375` only handles answer-visibility. Cloud API (`apps/api/src/app.ts`) doesn't wire run callbacks at all.

**Approach:** Follow the GSC sync pattern (`google.ts` → `server.ts:333`):
1. Route handler creates a run record with `kind: 'site-scan'`
2. Calls `opts.onSiteScanRequested(runId, projectId)` callback
3. `server.ts` wires that callback to `executeSiteScan()`

Files:
- `packages/contracts/src/run.ts:7` — add `'site-scan'` to `runKindSchema`
- `packages/api-routes/src/site-source.ts` (new) — route plugin with callback
- `packages/api-routes/src/index.ts:143+` — register plugin
- `packages/canonry/src/server.ts:325+` — wire callback
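
The three-step approach above could look roughly like the following. It is a framework-free sketch: `SiteSourceRoutesOptions`, `createRun`, and `handleSiteScanRequest` are hypothetical names, and only `onSiteScanRequested` comes from the plan. The real route plugin would sit behind the HTTP layer and persist the run record.

```typescript
// Hypothetical option bag for the new route plugin.
interface SiteSourceRoutesOptions {
  // Stand-in for inserting a run record; the real code would write to the DB.
  createRun(projectId: string, kind: 'site-scan'): { id: string }
  // Callback wired by server.ts; optional so hosts without a runner can omit it.
  onSiteScanRequested?: (runId: string, projectId: string) => void
}

// Body of the POST /projects/:name/site-scan handler, minus HTTP concerns.
function handleSiteScanRequest(opts: SiteSourceRoutesOptions, projectId: string) {
  const run = opts.createRun(projectId, 'site-scan') // step 1: create the run record
  opts.onSiteScanRequested?.(run.id, projectId)      // step 2: notify the host
  return { runId: run.id, kind: 'site-scan' as const }
}

// Step 3: the host wires the callback to its executor (recorded here instead of executed).
const executed: string[] = []
const opts: SiteSourceRoutesOptions = {
  createRun: (projectId, kind) => ({ id: `run-${projectId}-${kind}` }),
  onSiteScanRequested: (runId, projectId) => {
    executed.push(`${projectId}:${runId}`)
  },
}
const response = handleSiteScanRequest(opts, 'p1')
```

Keeping the callback optional means a host that does not wire it still gets a run record, which matches the note that the cloud API does not wire run callbacks today.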

---

## Auth & Storage Model

**Secrets in `~/.canonry/config.yaml`, non-secret config in DB.** Follows existing Google/Bing pattern.

```yaml
# ~/.canonry/config.yaml
github:
  token: ghp_xxxxxxxxxxxx    # PAT with repo read access
```

The `siteConnections` DB table stores only non-secret config (adapter name, repo, branch). The connection store follows the Bing pattern in `server.ts:194-235`.

No split-brain:
- **Auth credentials**: `~/.canonry/config.yaml` (source of truth)
- **Connection config** (which repo, which branch): `siteConnections` table
- **Scan results**: `sitePages`, `siteRecommendations` tables
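
As an illustration of that split, a scan could assemble its effective connection from the two sources at run time. The shapes and the `resolveGitHubConnection` helper below are assumptions made for this sketch, as is the `main` default branch (taken from the CLI usage shown later).

```typescript
// Hypothetical shapes: the parsed ~/.canonry/config.yaml and one siteConnections row.
interface CanonryConfigFile { github?: { token?: string } }
interface SiteConnectionRow {
  projectId: string
  adapter: string
  config: { repo: string; branch?: string }
}
interface ResolvedGitHubConnection { repo: string; branch: string; token: string }

type ResolveResult =
  | { ok: true; connection: ResolvedGitHubConnection }
  | { ok: false; message: string }

// The token comes only from the config file; repo and branch come only from the DB row.
function resolveGitHubConnection(
  file: CanonryConfigFile,
  row: SiteConnectionRow | null,
): ResolveResult {
  if (row === null) return { ok: false, message: 'no site source connected for this project' }
  if (row.adapter !== 'github') return { ok: false, message: `unsupported adapter: ${row.adapter}` }
  const token = file.github?.token
  if (!token) return { ok: false, message: 'github.token missing from ~/.canonry/config.yaml' }
  return {
    ok: true,
    connection: { repo: row.config.repo, branch: row.config.branch ?? 'main', token },
  }
}
```

Because the token never enters the row, exporting or logging connection config cannot leak it.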

---

## Database Schema

Three new tables in `packages/db/src/schema.ts` with corresponding migrations in `migrate.ts`:

- **`site_connections`** — 1 per project: adapter, config JSON (repo, branch), timestamps. Unique index on `project_id`.
- **`site_pages`** — per-scan page snapshots: url, path, title, structured_data_types, headings, meta_description, has_schema, source_ref, commit_sha. Indexed on project_id and scan_run_id.
- **`site_recommendations`** — per-scan recommendations: page_url, keyword_id, type, severity, title, description, source_ref, intent JSON, commit_sha, status, pr_url. Indexed on project_id, scan_run_id, and (project_id, status).

---

## GitHub Adapter: Framework Detection

**This is the hard part.** Repo file trees don't reliably map to live URLs for dynamic routes, rewrites, or locales.

### MVP: narrow framework support + explicit mapping fallback

| Framework | Detection | Content paths | Route mapping |
|-----------|-----------|---------------|---------------|
| Next.js (app) | `next.config.*` + `app/` | `app/**/page.{tsx,jsx,mdx}` | Dir path = URL |
| Next.js (pages) | `next.config.*` + `pages/` | `pages/**/*.{tsx,jsx}` | File path = URL |
| Hugo | `hugo.toml`/`config.toml` | `content/**/*.md` | Front matter slug or dir path |
| Astro | `astro.config.*` | `src/pages/**/*.{astro,md,mdx}` | File path = URL |
| Plain HTML | Fallback | `**/*.html` | File path = URL |

**Explicitly NOT handled in MVP:** dynamic routes with data fetching, rewrites/redirects, i18n routing, `generateStaticParams`.
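
A minimal sketch of the table above, assuming the adapter already has a flat list of repo-relative file paths. `detectFramework`, `fileToUrl`, and the `Framework` union are hypothetical names. Hugo front matter slugs are not handled here, and any path with a dynamic segment returns `null`, in line with the MVP exclusions.

```typescript
type Framework = 'nextjs-app' | 'nextjs-pages' | 'hugo' | 'astro' | 'plain-html'

// Detect the framework from repo-relative file paths, checked in the table's order.
function detectFramework(files: string[]): Framework {
  const has = (re: RegExp): boolean => files.some((f) => re.test(f))
  const hasNextConfig = has(/^next\.config\.[a-z]+$/)
  if (hasNextConfig && has(/^app\//)) return 'nextjs-app'
  if (hasNextConfig && has(/^pages\//)) return 'nextjs-pages'
  if (has(/^(hugo|config)\.toml$/)) return 'hugo'
  if (has(/^astro\.config\.[a-z]+$/)) return 'astro'
  return 'plain-html'
}

// Strip a trailing index segment and normalise to a leading-slash URL path.
function toUrl(route: string): string {
  const trimmed = route.replace(/(^|\/)_?index$/, '').replace(/\/+$/, '')
  return trimmed === '' ? '/' : `/${trimmed}`
}

// Content-path patterns per framework; capture group 1 is the route.
const ROUTE_PATTERNS: Record<Framework, RegExp> = {
  'nextjs-app': /^app\/(?:(.*)\/)?page\.(?:tsx|jsx|mdx)$/,
  'nextjs-pages': /^pages\/(.+)\.(?:tsx|jsx)$/,
  hugo: /^content\/(.+)\.md$/,
  astro: /^src\/pages\/(.+)\.(?:astro|md|mdx)$/,
  'plain-html': /^(.+)\.html$/,
}

function fileToUrl(framework: Framework, filePath: string): string | null {
  if (/[\[\]]/.test(filePath)) return null // dynamic segments: out of MVP scope
  const m = ROUTE_PATTERNS[framework].exec(filePath)
  if (m === null) return null
  const route = m[1] ?? ''
  // Next.js pages router: skip API routes and special files such as _app.
  if (framework === 'nextjs-pages' && /^(api\/|_)/.test(route)) return null
  return toUrl(route)
}
```

Returning `null` rather than guessing keeps unmapped files out of the scan, so they can be picked up by the explicit mapping fallback instead.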

**Explicit mapping config** for when auto-detection fails:
```yaml
spec:
  siteSource:
    adapter: github
    repo: myorg/my-site
    contentPaths:
      - glob: "content/blog/**/*.md"
        urlPrefix: "/blog/"
```
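
One way the `contentPaths` rules might be applied is shown below. The matcher is hand-rolled and supports only `**/` and `*`, which is an assumption of this sketch; a real implementation would more likely use an established glob library. `ContentPathRule`, `globToRegExp`, and `mapExplicit` are hypothetical names.

```typescript
interface ContentPathRule { glob: string; urlPrefix: string }

// Minimal glob support: `**/` (any directory depth) and `*` (within one segment).
function globToRegExp(glob: string): RegExp {
  let out = ''
  let i = 0
  while (i < glob.length) {
    if (glob.slice(i, i + 3) === '**/') {
      out += '(?:.*/)?'
      i += 3
    } else if (glob.charAt(i) === '*') {
      out += '[^/]*'
      i += 1
    } else {
      // Escape regex metacharacters in literal path characters.
      out += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, '\\$&')
      i += 1
    }
  }
  return new RegExp(`^${out}$`)
}

// The first matching rule wins; the static directory prefix of the glob is
// replaced by urlPrefix, and the file extension and a trailing index are dropped.
function mapExplicit(rules: ContentPathRule[], filePath: string): string | null {
  for (const rule of rules) {
    if (!globToRegExp(rule.glob).test(filePath)) continue
    const star = rule.glob.indexOf('*')
    const cut = star === -1 ? rule.glob.length : star
    const base = rule.glob.slice(0, rule.glob.lastIndexOf('/', cut) + 1)
    const rest = filePath
      .slice(base.length)
      .replace(/\.[^./]+$/, '')
      .replace(/(^|\/)index$/, '')
    const prefix = /\/$/.test(rule.urlPrefix) ? rule.urlPrefix : `${rule.urlPrefix}/`
    return `${prefix}${rest}`
  }
  return null
}
```

With the rule from the YAML above, `content/blog/2024/hello.md` maps to `/blog/2024/hello`, and files outside `content/blog/` are left unmapped.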

**Structured data extraction** — pattern-match only, never execute:
- Regex-match `<script type="application/ld+json">` blocks
- YAML front matter in MD/MDX
- Heading tags / `#` headings
- `<meta name="description">` tags
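
A pattern-match-only extractor along these lines might look like the sketch below. `extractFromSource` and `ExtractedPage` are hypothetical names. YAML front matter is left out because it needs a real parser, and the Markdown heading pattern would also match `#` lines inside code fences or front matter. JSON-LD that a component builds at runtime (for example via `JSON.stringify` in JSX) is not visible to this kind of matching.

```typescript
interface ExtractedPage {
  jsonLd: unknown[]
  headings: { level: number; text: string }[]
  metaDescription: string | null
}

function extractFromSource(source: string): ExtractedPage {
  let m: RegExpExecArray | null

  // JSON-LD blocks: parse the literal JSON, skip anything that does not parse.
  const jsonLd: unknown[] = []
  const scriptRe = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  while ((m = scriptRe.exec(source)) !== null) {
    try {
      jsonLd.push(JSON.parse(m[1] ?? ''))
    } catch {
      // Not literal JSON (for example a template expression); ignore it.
    }
  }

  // HTML headings first, then Markdown `#` headings.
  const headings: { level: number; text: string }[] = []
  const htmlHeadingRe = /<h([1-6])[^>]*>([\s\S]*?)<\/h\1>/gi
  while ((m = htmlHeadingRe.exec(source)) !== null) {
    headings.push({ level: Number(m[1]), text: (m[2] ?? '').replace(/<[^>]+>/g, '').trim() })
  }
  const mdHeadingRe = /^(#{1,6})[ \t]+(.+)$/gm
  while ((m = mdHeadingRe.exec(source)) !== null) {
    headings.push({ level: (m[1] ?? '').length, text: (m[2] ?? '').trim() })
  }

  // Meta description: the backreference keeps an apostrophe inside a
  // double-quoted value from ending the match early.
  const meta = /<meta[^>]*name=["']description["'][^>]*content=(["'])([\s\S]*?)\1/i.exec(source)

  return { jsonLd, headings, metaDescription: meta ? (meta[2] ?? null) : null }
}
```

Nothing in the scanned source is evaluated or imported; every field comes from string matching plus `JSON.parse`.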

---

## Recommendation Engine (Phase 2)

### Available data

`querySnapshots` stores: `citationState`, `citedDomains`, `competitorOverlap`, `answerText`, `groundingSources`. It does **not** store competitor page content or structured data.

### Correlation logic

For each keyword:
1. Find best-matching page (URL path terms, title/h1 match)
2. If not cited → check what the page is missing (schema, meta, headings)
3. If no matching page → content gap
4. Competitor comparison limited to "competitor X is cited, you are not" — structured data comparison requires optional page fetch via existing `site-fetch.ts`
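
Step 1 leaves the matching heuristic open. One simple option is token overlap between the keyword and the page's URL path, title, and h1, sketched below. The scoring formula, the `0.5` threshold, and the names `scorePage` and `findBestPage` are assumptions of this sketch, not part of the plan.

```typescript
interface PageSummary { path: string; title: string | null; h1: string | null }

// Lowercase and split on anything that is not a letter or digit.
function tokenize(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0)
}

// Score in [0, 1]: each keyword term can hit once in the path and once in title/h1.
function scorePage(keyword: string, page: PageSummary): number {
  const terms = tokenize(keyword)
  if (terms.length === 0) return 0
  const pathTokens = tokenize(page.path)
  const titleTokens = tokenize(`${page.title ?? ''} ${page.h1 ?? ''}`)
  let hits = 0
  for (const term of terms) {
    if (pathTokens.indexOf(term) !== -1) hits += 1
    if (titleTokens.indexOf(term) !== -1) hits += 1
  }
  return hits / (terms.length * 2)
}

// Return the highest-scoring page, or null when nothing clears the threshold
// (the "no matching page" case that leads to a content gap).
function findBestPage(keyword: string, pages: PageSummary[], minScore = 0.5): PageSummary | null {
  let best: PageSummary | null = null
  let bestScore = 0
  for (const page of pages) {
    const score = scorePage(keyword, page)
    if (score > bestScore) {
      best = page
      bestScore = score
    }
  }
  return bestScore >= minScore ? best : null
}
```

Exact token matching misses plurals and synonyms, so the threshold would need tuning against real keyword and page data.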

### Recommendation types

| Type | Trigger | Severity |
|------|---------|----------|
| `content-gap` | Not-cited keyword, no matching page | High |
| `missing-schema` | Page targets keyword but lacks relevant schema | High |
| `competitor-advantage` | Competitor cited, yours isn't | High |
| `weak-headings` | Keyword missing from h1/h2 | Medium |
| `no-meta-description` | Page has no meta description | Medium |
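
The table can be read as a small rule set applied per keyword. The sketch below is one possible reading: `KeywordObservation`, `MatchedPage`, and `recommend` are hypothetical names, and "lacks relevant schema" is simplified to "has no structured data at all" because schema relevance is not defined yet.

```typescript
type RecommendationType =
  | 'content-gap'
  | 'missing-schema'
  | 'competitor-advantage'
  | 'weak-headings'
  | 'no-meta-description'
type Severity = 'high' | 'medium'

// Severity per type, as listed in the table above.
const SEVERITY: Record<RecommendationType, Severity> = {
  'content-gap': 'high',
  'missing-schema': 'high',
  'competitor-advantage': 'high',
  'weak-headings': 'medium',
  'no-meta-description': 'medium',
}

// Hypothetical inputs: one keyword's citation state and its best-matching page, if any.
interface KeywordObservation { keyword: string; cited: boolean; competitorCited: boolean }
interface MatchedPage {
  url: string
  schemaTypes: string[]
  headings: string[] // h1/h2 text
  metaDescription: string | null
}
interface Recommendation { type: RecommendationType; severity: Severity; pageUrl: string | null }

function recommend(obs: KeywordObservation, page: MatchedPage | null): Recommendation[] {
  if (obs.cited) return [] // already cited: nothing to recommend
  const types: RecommendationType[] = []
  if (page === null) {
    types.push('content-gap')
  } else {
    if (page.schemaTypes.length === 0) types.push('missing-schema')
    const keyword = obs.keyword.toLowerCase()
    const inHeadings = page.headings.some((h) => h.toLowerCase().indexOf(keyword) !== -1)
    if (!inHeadings) types.push('weak-headings')
    if (!page.metaDescription) types.push('no-meta-description')
  }
  if (obs.competitorCited) types.push('competitor-advantage')
  return types.map((type) => ({ type, severity: SEVERITY[type], pageUrl: page ? page.url : null }))
}
```

Keeping severity in a single lookup table means the `--severity` filter in the CLI and the stored `severity` column cannot drift apart.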

---

## API Endpoints

| Method | Path | Phase |
|--------|------|-------|
| `PUT` | `/projects/:name/site-source` | 1 |
| `GET` | `/projects/:name/site-source` | 1 |
| `DELETE` | `/projects/:name/site-source` | 1 |
| `POST` | `/projects/:name/site-source/healthcheck` | 1 |
| `POST` | `/projects/:name/site-scan` | 1 |
| `GET` | `/projects/:name/pages` | 1 |
| `GET` | `/projects/:name/recommendations` | 2 |
| `PATCH` | `/projects/:name/recommendations/:id` | 2 |
| `POST` | `/projects/:name/recommendations/:id/apply` | 3 |

Plus: OpenAPI catalog entries, API client methods, route index registration.

---

## CLI Commands

```bash
# Phase 1
canonry site connect <project> --adapter github --repo owner/repo [--branch main]
canonry site disconnect <project>
canonry site status <project>
canonry site scan <project> [--wait] [--format json]
canonry site pages <project> [--format json]

# Phase 2
canonry site recommendations <project> [--severity high] [--status open] [--format json]

# Phase 3
canonry site apply-fix <project> <recommendation-id>
```

New files: `commands/site.ts`, `cli-commands/site.ts`. Register in `cli-commands.ts`.

---

## Config-as-Code Round-Trip

Adding `siteSource` to `configSpecSchema` also requires changes to:
- `packages/api-routes/src/apply.ts` — handle in apply body, upsert `siteConnections`
- `packages/api-routes/src/projects.ts` — include in export response
- `packages/contracts/src/project.ts` — add to `projectDtoSchema`

---

## What's Explicitly Out of Scope

- **Pre-publish analysis on draft PRs** — requires ref-specific scanning + GitHub App/check-run integration
- **Competitor structured data comparison** — optional enrichment using `site-fetch.ts`, not a requirement
- **Cloud mode wiring** — `apps/api/` and `apps/worker/` don't wire run callbacks today
- **Non-Git adapters** — interface supports them, no code written

---

## Implementation Steps

### Phase 1: Read-only GitHub scan (19 steps)

1. `packages/contracts/src/site-source.ts` (new) — shared types
2. `packages/contracts/src/run.ts` — add `'site-scan'`
3. `packages/contracts/src/config-schema.ts` — add `siteSource`
4. `packages/contracts/src/project.ts` — add to DTO
5. `packages/integration-github/` (new package) — adapter
6. `packages/db/src/schema.ts` — `siteConnections`, `sitePages` tables
7. `packages/db/src/migrate.ts` — migrations
8. `packages/api-routes/src/site-source.ts` (new) — route plugin
9. `packages/api-routes/src/index.ts` — register routes
10. `packages/api-routes/src/openapi.ts` — endpoint docs
11. `packages/api-routes/src/apply.ts` — handle `siteSource`
12. `packages/api-routes/src/projects.ts` — export `siteSource`
13. `packages/canonry/src/client.ts` — API client methods
14. `packages/canonry/src/site-scan.ts` (new) — execution function
15. `packages/canonry/src/server.ts` — connection store + callback
16. `packages/canonry/src/commands/site.ts` (new) — CLI handler
17. `packages/canonry/src/cli-commands/site.ts` (new) — CLI specs
18. `packages/canonry/src/cli-commands.ts` — register
19. Tests: framework detection, file→URL mapping, extraction, API endpoints

### Phase 2: Recommendations (6 steps)

1. `packages/db/src/schema.ts` — `siteRecommendations` table + migration
2. `packages/canonry/src/site-scan.ts` — recommendation engine
3. `packages/api-routes/src/site-source.ts` — recommendation endpoints
4. `packages/canonry/src/client.ts` — recommendation methods
5. `packages/canonry/src/commands/site.ts` — `recommendations` subcommand
6. Tests: recommendation generation with mock data

### Phase 3: Apply-fix (4 steps)

1. `packages/integration-github/src/remediate.ts` — `SiteSourceRemediator`
2. `packages/api-routes/src/site-source.ts` — apply endpoint
3. `packages/canonry/src/commands/site.ts` — `apply-fix` subcommand
4. Tests: patch generation

---

## Reusable Existing Code

- `packages/canonry/src/site-fetch.ts` — SSRF-safe HTML fetch
- `packages/api-routes/src/run-queue.ts` — `queueRunIfProjectIdle()`
- `packages/api-routes/src/helpers.ts` — `resolveProject()`, `writeAuditLog()`
- `server.ts:194-235` — Bing connection store pattern
- `server.ts:333-358` — GSC sync callback wiring pattern
