gitindex: optimize git index time by ~21%#1036
Open
clemlesne wants to merge 1 commit intosourcegraph:mainfrom
Open
gitindex: optimize git index time by ~21%#1036clemlesne wants to merge 1 commit intosourcegraph:mainfrom
clemlesne wants to merge 1 commit intosourcegraph:mainfrom
Conversation
…etes) Reduce wall-clock indexing time by eliminating redundant work, reducing heap allocations, and tuning GC for batch workloads. Measured with hyperfine (15 runs, 3 warmup) indexing kubernetes/kubernetes (29,226 files, ~230 MB, ctags disabled): | Command | Mean [s] | Min [s] | Max [s] | Relative | |:-----------------|:-------------:|--------:|--------:|:------------:| | baseline (main) | 5.987 ± 0.419 | 5.590 | 7.001 | 1.24 ± 0.09 | | optimized (PR) | 4.813 ± 0.100 | 4.690 | 5.058 | 1.00 | Key changes: - rank(): replace redundant enry.IsGenerated/IsVendor/IsTest calls with the already-computed doc.Category from DetermineFileCategory. Move DetermineFileCategory into Builder.Add so it overlaps with catfile I/O and is skipped in ShardBuilder.Add when already set. - ShardBuilder.Add: skip the second bytes.IndexByte binary scan when DocChecker.Check already ran (doc.Category != FileCategoryMissing). - newSearchableString: add single-byte varint fast path — ~80 % of posting-list deltas are < 128 and append(s, byte) avoids the slice header setup of append(s, slice...). - indexCatfileBlobs: replace ~29 K individual make([]byte, size) with a contentSlab that sub-slices from 16 MB buffers (3-index slices to prevent cross-file corruption). - zoekt-git-index main: set GOGC=-1 + GOMEMLIMIT=2 GiB for batch indexing to cut madvise syscall overhead. Both honour the corresponding env vars when the user overrides them.
89a951b to
b2b2e85
Compare
Member
|
Thanks, I'm out for easter weekend so will take a look Tuesday. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rank()was callingenry.IsGenerated/IsVendor/IsTestper document, duplicating work already done byDetermineFileCategory. Now uses cacheddoc.Category. MovedDetermineFileCategorytoBuilder.Add()to overlap with catfile I/O.ShardBuilder.Add()re-scanned all content for null bytes even thoughDocChecker.Checkalready did this. Now skipped when Category is already set.append(slice, byte)is cheaper thanappend(slice, slice...).make([]byte, size)with 16MB slab sub-slicing (3-index slices for safety), reducing GC pressure.GOGC=-1+GOMEMLIMIT=2GiBminimizes madvise syscall overhead. Both respect user env var overrides.Benchmark (hyperfine)
Measured with
hyperfineonkubernetes/kubernetes(29,226 files, ~230MB, no ctags):baseline (main)optimized (PR)1.24x faster (mean), with notably lower variance (σ=0.100 vs σ=0.419).
Behavioral notes
rank()now usesDetermineFileCategory's mutually exclusive categories instead of independentenry.IsGenerated/IsVendor/IsTestchecks. A file matching multiple categories (e.g. generated test file) previously received multiple rank penalties; now it receives only the highest-priority one. This changes intra-shard document ordering slightly (<0.02% shard size difference) but does not affect search correctness.SkipReasonTooManyTrigramsfiles,DetermineFileCategorynow receives the original content (previously it received the marker string due to content replacement inShardBuilder.Add). This is more correct —DetermineFileCategorywas already designed to handle this case.Test plan
go test ./...full suite passesShardBuilder.Addstill performs binary check +DetermineFileCategoryfor direct callers (merge path,AddFile) whereCategory == FileCategoryMissingGOGCandGOMEMLIMITenv vars are respected when set by the user🤖 Generated with Claude Code