Parallelize file chunking within dependencies #18

@johnlanda

Summary

File chunking in ProcessFiles() iterates over files sequentially. Since chunking is CPU-bound and independent per file, it can be parallelized with a bounded goroutine pool, with an expected 5-10x speedup on dependencies with many files.

Context

The ProcessFiles() function in internal/pipeline/pipeline.go (line ~415) loops through files with for _, f := range files, calling Chunk() on each file one at a time. Both the markdown chunker (internal/chunker/markdown.go) and the code chunker (internal/chunker/code.go) are stateless per-file and safe to parallelize. Tree-sitter parsing in the code chunker is the most CPU-intensive operation.

Key files:

  • internal/pipeline/pipeline.go: ProcessFiles() function (lines ~415-460)
  • internal/chunker/markdown.go: Chunk() method, heading-aware splitting
  • internal/chunker/code.go: Chunk() method, tree-sitter AST parsing

Acceptance Criteria

  • Files are chunked concurrently using a bounded goroutine pool
  • Chunk results are collected and ordered deterministically (same output as sequential)
  • All chunks are still batched into a single embedder.Embed() call after parallel chunking
  • go test -race ./internal/pipeline/... passes
  • Measurable speedup on dependencies with 50+ files

Technical Approach

  1. In ProcessFiles(), replace the sequential loop with a worker pool (e.g., errgroup with limit)
  2. Each worker receives a file, calls the appropriate chunker, returns []Chunk
  3. Collect results into a thread-safe accumulator, preserving file order for determinism
  4. After all workers complete, proceed to the existing embedding step unchanged
  5. Default concurrency: runtime.GOMAXPROCS(0)
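
The steps above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the actual ProcessFiles() code: the File and Chunk types, chunkFile(), and chunkFilesParallel() are hypothetical stand-ins, and the real chunkers may have different signatures. It uses a semaphore channel plus sync.WaitGroup to bound concurrency; errgroup with SetLimit from golang.org/x/sync would be an alternative that also cancels on the first error. Each worker writes only to its own index of a pre-sized slice, so no mutex is needed and file order is preserved.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// File and Chunk are hypothetical stand-ins for the pipeline's real types.
type File struct {
	Path    string
	Content string
}

type Chunk struct {
	Path string
	Text string
}

// chunkFile is a placeholder for the markdown/code chunkers.
// Here it simply splits content on blank lines.
func chunkFile(f File) ([]Chunk, error) {
	if f.Content == "" {
		return nil, fmt.Errorf("chunk %s: empty file", f.Path)
	}
	var out []Chunk
	for _, part := range strings.Split(f.Content, "\n\n") {
		out = append(out, Chunk{Path: f.Path, Text: part})
	}
	return out, nil
}

// chunkFilesParallel chunks files with at most limit goroutines in flight
// and returns all chunks in input file order. A limit below 1 falls back
// to runtime.GOMAXPROCS(0).
func chunkFilesParallel(files []File, limit int) ([]Chunk, error) {
	if limit < 1 {
		limit = runtime.GOMAXPROCS(0)
	}
	perFile := make([][]Chunk, len(files)) // one slot per file, indexed by position
	errs := make([]error, len(files))
	sem := make(chan struct{}, limit) // counting semaphore bounding concurrency
	var wg sync.WaitGroup

	for i, f := range files {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are already running
		go func(i int, f File) {
			defer wg.Done()
			defer func() { <-sem }()
			// Each goroutine writes only to its own index, so no lock is needed.
			perFile[i], errs[i] = chunkFile(f)
		}(i, f)
	}
	wg.Wait()

	// Flatten in file order; report the first error in file order.
	var all []Chunk
	for i := range files {
		if errs[i] != nil {
			return nil, errs[i]
		}
		all = append(all, perFile[i]...)
	}
	return all, nil
}

func main() {
	files := []File{
		{Path: "a.md", Content: "one\n\ntwo"},
		{Path: "b.go", Content: "three"},
	}
	chunks, err := chunkFilesParallel(files, 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range chunks {
		fmt.Println(c.Path, c.Text)
	}
}
```

The flattened slice can then be passed to the existing single embedder.Embed() call unchanged. Note that this sketch lets all workers finish even after one fails; whether early cancellation is worth the extra dependency is a design choice left open here.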

Dependencies

None. This can be done independently of parallel dependency processing.

Out of Scope

  • Parallel embedding batches (separate issue)
  • Changes to chunker implementations themselves
