Parallelize dependency processing in sync pipeline #16

@johnlanda

Summary

The mctl up sync pipeline processes dependencies strictly sequentially: each dependency must complete its full fetch→chunk→embed→upsert cycle before the next begins. The goal is to parallelize dependency processing with a bounded worker pool, targeting a 3-10x speedup on multi-dependency projects.

Context

The main sync loop in internal/pipeline/pipeline.go (line ~168) iterates over dependencies with a plain for loop. Each dependency goes through: artifact check → git clone → file extraction → chunking → embedding API call → LanceDB upsert. These steps are independent across dependencies, so they can run in parallel as long as the shared embedder and store are safe for concurrent use.

Key files:

  • internal/pipeline/pipeline.go: Sync() function, main for _, dep := range m.Dependencies loop (~line 168)
  • cmd/up.go: CLI entry point that calls pipeline.Sync() and aggregates results

Acceptance Criteria

  • Dependencies are processed concurrently using a bounded worker pool (default concurrency: min(len(deps), runtime.GOMAXPROCS(0)), overridable via a config option)
  • Results (synced/skipped/failed) are collected safely via channels or mutex
  • Errors in one dependency do not block or cancel others (existing behavior preserved)
  • Lockfile is still written atomically after all dependencies complete
  • go test ./internal/pipeline/... passes with race detector (-race)
  • Observable speedup on projects with 3+ dependencies (measure with time mctl up)

Technical Approach

  1. Add a concurrency option to the pipeline config (default: runtime.GOMAXPROCS(0))
  2. Replace the sequential loop with an errgroup.Group bounded by SetLimit(concurrency), or with a semaphore-based worker pool
  3. Collect SyncResult values via a thread-safe slice or channel
  4. Ensure the embedder and store are safe for concurrent use (LanceDB connection is thread-safe; embedders are stateless per-call)
  5. Write lockfile only after all goroutines complete

Dependencies

None — standalone improvement.

Out of Scope

  • Parallel chunking within a single dependency (separate issue)
  • Concurrent embedding batches within a single dependency (separate issue)
  • Progress bar / per-dependency status reporting
