Implement git clone caching for fetcher #17

@johnlanda

Summary

The GitHub fetcher clones repositories from scratch on every mctl up invocation, even when the ref hasn't changed. Implement a persistent clone cache to avoid redundant network I/O, saving 5-30s per sync for large repositories.

Context

The GitHub fetcher in internal/fetchers/github.go creates a temp directory (line ~40), clones into it, extracts files, then deletes the temp dir (line ~44 via defer). There is no reuse of previously cloned data between syncs. For large repos (e.g., envoyproxy/gateway), this means re-downloading hundreds of MB on every sync even when nothing has changed.

Key files:

  • internal/fetchers/github.go: Fetch() method, os.MkdirTemp() + os.RemoveAll() pattern
  • internal/pipeline/pipeline.go: calls fetcher.Fetch() per dependency

Acceptance Criteria

  • Cloned repos are cached in a persistent directory (e.g., ~/.mycelium/cache/clones/<source-hash>/)
  • Cache key includes repository URL and git ref (tag/branch/commit)
  • Cache hit: reuse existing clone directory, skip git clone entirely
  • Cache miss: clone as before, persist to cache dir
  • Stale cache entries are cleaned up (TTL-based or LRU with max size)
  • mctl up with a warm cache is measurably faster (no network I/O for cached deps)
  • Tests cover cache hit, cache miss, and cache invalidation scenarios

Technical Approach

  1. Define cache directory structure: ~/.mycelium/cache/clones/<sha256(repo+ref)>/
  2. Before cloning, check if cache dir exists and is valid (contains .git/)
  3. On cache hit: use cached directory directly for file extraction
  4. On cache miss: clone into cache dir instead of temp dir, skip RemoveAll
  5. Add a mctl cache clean command or automatic TTL-based cleanup
  6. For branch refs (not pinned commits), consider git fetch to update rather than full re-clone

Dependencies

None — standalone improvement.

Out of Scope

  • Concurrent fetching of multiple deps (separate issue)
  • Shallow clone optimization (could be a follow-up)
