Summary
ProcessFiles() chunks files one at a time. Because chunking is CPU-bound and independent per file, parallelize it with a goroutine pool. The target is a 5-10x speedup on dependencies with many files; the actual gain is bounded by the number of available cores.
Context
The ProcessFiles() function in internal/pipeline/pipeline.go (line ~415) loops through files with for _, f := range files, calling Chunk() on each file one at a time. Both the markdown chunker (internal/chunker/markdown.go) and the code chunker (internal/chunker/code.go) are stateless per-file and safe to parallelize. Tree-sitter parsing in the code chunker is the most CPU-intensive operation.
Key files:
- internal/pipeline/pipeline.go: ProcessFiles() function (lines ~415-460)
- internal/chunker/markdown.go: Chunk() method, heading-aware splitting
- internal/chunker/code.go: Chunk() method, tree-sitter AST parsing
Acceptance Criteria
- embedder.Embed() call after parallel chunking
- go test -race ./internal/pipeline/... passes
Technical Approach
- In ProcessFiles(), replace the sequential loop with a worker pool (e.g., errgroup with limit)
- Each worker receives a file, calls the appropriate chunker, returns []Chunk
- Collect results into a thread-safe accumulator, preserving file order for determinism
- After all workers complete, proceed to the existing embedding step unchanged
- Default concurrency: runtime.GOMAXPROCS(0)
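The approach above could be sketched roughly as follows. This is a minimal illustration, not the project's code: File, Chunk, chunkFile, and chunkAll are hypothetical stand-ins for the real pipeline types and chunker dispatch, and chunkFile just splits on blank lines. To stay runnable with the standard library alone, it uses a sync.WaitGroup plus a buffered channel as a semaphore. An errgroup.Group with SetLimit, as the issue suggests, would play the same role.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// File and Chunk are hypothetical stand-ins for the pipeline's real types.
type File struct {
	Path    string
	Content string
}

type Chunk struct {
	Path string
	Text string
}

// chunkFile is a placeholder for dispatching to the markdown or code
// chunker. Here it only splits the content on blank lines.
func chunkFile(f File) ([]Chunk, error) {
	if f.Content == "" {
		return nil, fmt.Errorf("chunk %s: empty file", f.Path)
	}
	var out []Chunk
	for _, part := range strings.Split(f.Content, "\n\n") {
		out = append(out, Chunk{Path: f.Path, Text: part})
	}
	return out, nil
}

// chunkAll chunks files with at most limit goroutines running at once.
// Each worker writes only to its own index in perFile and errs, so the
// results need no mutex, and flattening by index preserves input order.
func chunkAll(files []File, limit int) ([]Chunk, error) {
	if limit < 1 {
		limit = runtime.GOMAXPROCS(0) // default concurrency
	}
	perFile := make([][]Chunk, len(files))
	errs := make([]error, len(files))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup

	for i, f := range files {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are busy
		go func(i int, f File) {
			defer wg.Done()
			defer func() { <-sem }()
			perFile[i], errs[i] = chunkFile(f)
		}(i, f)
	}
	wg.Wait()

	var all []Chunk
	for i := range files {
		if errs[i] != nil {
			return nil, errs[i] // first error in file order
		}
		all = append(all, perFile[i]...)
	}
	return all, nil
}

func main() {
	files := []File{
		{Path: "a.md", Content: "one\n\ntwo"},
		{Path: "b.go", Content: "three"},
	}
	chunks, err := chunkAll(files, 0)
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Printf("%s: %q\n", c.Path, c.Text)
	}
}
```

Writing into a pre-sized slice by index is one way to get a thread-safe accumulator with deterministic ordering without sorting afterwards. Unlike errgroup, this sketch does not stop remaining workers after the first failure; it lets them finish and then reports the first error in file order.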
Dependencies
None — can be done independently of parallel dependency processing.
Out of Scope
- Parallel embedding batches (separate issue)
- Changes to chunker implementations themselves