Implement GitHub Informer for cache-based issue watching #418

@gjkim42

Description

Problem

The current GitHubSource in internal/source/github.go is stateless — every poll cycle does a full re-fetch from the GitHub API:

  1. List all matching issues → 1-10 API calls (paginated, up to maxPages=10)
  2. For each matched issue, fetch comments → N API calls
  3. Total: ~10 + N calls per cycle

With a pollInterval: 1m, this works fine when labels do server-side filtering (N is small). But if we move to comment-based or other client-side filtering (see related issue), N becomes ALL open issues, potentially hitting 60K+ calls/hour on active repos — far exceeding GitHub's 5K/hr rate limit.

Proposed Solution: GitHub Informer

Implement a cache-based GitHub watcher, analogous to the Kubernetes informer pattern:

| K8s Informer | GitHub Informer |
| --- | --- |
| Initial List of all resources | Initial List of all issues + comments |
| Watch stream for changes | Poll with `since` param / ETags for deltas |
| Local cache (store) serves reads | Local cache serves `Discover()` |
| Reflector keeps cache in sync | Syncer keeps cache in sync |

Architecture

                    ┌──────────────────────────┐
                    │     GitHub Informer      │
                    │                          │
  GitHub API ──────→│   Syncer (ETag/since)    │
  (delta updates)   │           │              │
                    │           ▼              │
                    │      Local Cache         │
                    │   (issues + comments     │
                    │   + labels + reactions)  │
                    └────────────┬─────────────┘
                                 │
                    ┌────────────▼─────────────┐
                    │  GitHubSource.Discover() │
                    │                          │
                    │  Reads from cache, not   │
                    │  from API. Can filter by │
                    │  labels OR comments OR   │
                    │  anything — all free.    │
                    └──────────────────────────┘

How It Works

Initial sync (once):

  • List all open issues + their comments → expensive but one-time

Subsequent cycles:

  • GET /issues?since=<last_sync_time> → only changed issues (usually 0-5)
  • GET /issues/{n}/comments?since=<last_sync_time> → only new comments
  • GitHub supports ETags (If-None-Match header) — a 304 Not Modified response does not count against rate limits
  • Cost drops from O(issues) to O(changes) per cycle

Key Benefits

  1. Decouples data fetching from state evaluation. Once the cache is warm, filtering by labels, comments, reactions, or any future signal mechanism is free — just read from cache.
  2. Makes the label vs. comment debate a UX decision, not a technical constraint. The rate limit concern (biggest objection to comment-based filtering) is eliminated.
  3. Familiar pattern. The team already works with K8s informers. Same mental model.
  4. Backward compatible. GitHubSource.Discover() returns the same []WorkItem — the interface doesn't change, only the implementation.

Implementation Sketch

// GitHubInformer maintains a local cache of GitHub issues and comments.
type GitHubInformer struct {
    owner    string
    repo     string
    token    string
    baseURL  string
    client   *http.Client

    // Cache
    mu       sync.RWMutex
    issues   map[int]*CachedIssue  // issue number → cached data
    lastSync time.Time             // for `since` parameter
    etag     string                // for conditional requests
}

type CachedIssue struct {
    Issue    githubIssue
    Comments []githubComment
    LastSeen time.Time
}

// Sync fetches changes since the last sync and updates the cache.
func (i *GitHubInformer) Sync(ctx context.Context) error {
    // Use ?since=<lastSync> to get only changed issues.
    // Use If-None-Match for ETag-based conditional requests.
    // Update the cache incrementally and evict closed issues.
    return nil // sketch: delta-fetch logic goes here
}

// GitHubSource.Discover() reads from the informer cache instead of making API calls.
func (s *GitHubSource) Discover(ctx context.Context) ([]WorkItem, error) {
    if err := s.informer.Sync(ctx); err != nil {
        return nil, err
    }
    // Read from cache — filter by labels, comments, whatever is configured.
    // Zero additional API calls.
    return nil, nil // sketch: cache read + filtering goes here
}

Considerations

  • Memory usage: A repo with 10K open issues plus comments occupies meaningful memory (roughly a few KB of JSON per issue, so tens of MB at that scale). Mitigate by caching only open issues and evicting closed ones on sync.
  • Cold start: First sync is expensive. Could persist cache to ConfigMap/PV for fast restarts, or accept one slow cycle.
  • Consistency: Cache is at most pollInterval stale — same as current system (it's polling too).
  • Multiple spawners: If multiple TaskSpawners watch the same repo, consider a shared informer (like K8s SharedInformerFactory) to avoid duplicate API calls.
  • Webhook upgrade path: The informer architecture naturally supports adding GitHub webhooks later for real-time cache invalidation, replacing polling entirely.

Files That Would Change

  • internal/source/github.go — Split into informer + source; Discover() becomes cache reader
  • internal/source/github_informer.go — New file for the informer implementation
  • cmd/axon-spawner/main.go — Initialize informer, pass to source
  • internal/source/github_test.go — Tests for delta sync, ETag handling, cache behavior

Related

  • Comment-based workflow control proposal: Moving beyond label-based workflow control #417 (depends on this for viable API rate limits)
  • Current polling implementation: internal/source/github.go
  • K8s informer pattern: k8s.io/client-go/tools/cache
