fix: prevent OOM kill by skipping excluded dirs before traversal #196
jimmy-infinex wants to merge 1 commit into tobi:main from
Conversation
Bun's Glob.scan() traverses into node_modules and other excluded directories before the post-hoc filter can reject them. On large monorepos (e.g. 8+ sub-repos with massive node_modules), this causes the process to consume gigabytes of memory and get SIGKILL'd by the OS. Replace Glob.scan() with a manual recursive directory walker that checks excludeDirs and hidden directories BEFORE descending. Files are still matched against the glob pattern via Glob.match(). This is a drop-in behavioral replacement — same files indexed, same exclusions applied, but memory usage stays constant regardless of how large the excluded directories are.
Pull request overview
This PR fixes an OOM (Out of Memory) issue that occurs when indexing large monorepos. The problem was that Glob.scan() from Bun traverses into excluded directories like node_modules before filtering can reject them, causing memory exhaustion on repositories with millions of files.
Changes:
- Replaced `Glob.scan()` with a manual recursive directory traversal using `readdirSync` that checks exclusion criteria before descending into directories
- Added a `readdirSync` import from the `fs` module
- Implemented a `walkDir()` helper function with proper symlink handling and early directory exclusion
```ts
progress.indeterminate();
const glob = new Glob(globPattern);
const excludeSet = new Set([...excludeDirs, "node_modules"]);
```
The excludeSet includes "node_modules" twice: once via the spread of the excludeDirs array (which already contains "node_modules" at line 1354) and once as an explicit addition. Remove the redundancy by dropping the explicit addition here, or by removing the entry from the excludeDirs array definition.
Suggested change:

```ts
const excludeSet = new Set(excludeDirs);
```
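As a quick standalone check (not from the PR) that the two forms produce the same set whenever excludeDirs already contains "node_modules":

```typescript
// Standalone check: Set deduplicates, so the explicit "node_modules" entry
// is redundant whenever excludeDirs already contains it.
const excludeDirs = ["node_modules", ".git", "dist"];
const withExplicit = new Set([...excludeDirs, "node_modules"]);
const withoutExplicit = new Set(excludeDirs);

console.log(withExplicit.size);    // 3
console.log(withoutExplicit.size); // 3
```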
```ts
if (entry.isSymbolicLink()) {
  try {
    const target = statSync(resolve(resolvedPwd, dir, name));
    if (!target.isDirectory()) {
      // Symlink to a file — check if it matches
      const relPath = relPrefix ? `${relPrefix}/${name}` : name;
      if (glob.match(relPath)) files.push(relPath);
      continue;
    }
  } catch {
    continue; // Broken symlink
  }
}
walkDir(dir ? `${dir}/${name}` : name, relPrefix ? `${relPrefix}/${name}` : name);
```
The symlink handling could lead to infinite recursion if there are circular symlinks in the directory tree. While this is also an issue with the previous Glob.scan({followSymlinks: true}) implementation, consider adding cycle detection by tracking visited real paths (using realpathSync) to prevent potential stack overflow errors. This would make the implementation more robust than the original.
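A minimal sketch of the suggested cycle detection, assuming a visited set consulted before each descent (the names here are illustrative, not from the PR):

```typescript
import { realpathSync } from "fs";

// Illustrative sketch: resolve each directory to its canonical path and
// refuse to descend into any real path already visited, which breaks
// symlink cycles before they can recurse.
const visited = new Set<string>();

function shouldDescend(absDir: string): boolean {
  let real: string;
  try {
    real = realpathSync(absDir); // canonical path with symlinks resolved
  } catch {
    return false; // broken symlink or unreadable path
  }
  if (visited.has(real)) return false; // cycle: this tree was already walked
  visited.add(real);
  return true;
}
```

The walker would gate each recursive call on shouldDescend(...); a second visit to the same real path returns false instead of recursing.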
Problem

`Glob.scan()` from Bun traverses into `node_modules`, `.git`, and other excluded directories before the post-hoc filter can reject matching files. On large monorepos (e.g. 8+ JavaScript sub-repos with massive `node_modules` trees), this causes the indexing process to buffer millions of directory entries in memory and get SIGKILL'd by the OS.

Reproducing is straightforward: run `qmd collection add` on any directory containing a large JS monorepo with `**/*.md` as the glob pattern.

Fix

Replace `Glob.scan()` with a manual recursive directory walker (`readdirSync` + `withFileTypes`) that checks `excludeDirs` and hidden directories (`.*`) before descending into them. Files are still matched against the glob pattern via `Glob.match()`.

This is a drop-in behavioral replacement: the same files are indexed and the same exclusions are applied, but memory usage stays constant regardless of how large the excluded directories are.
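The walker shape described above can be sketched as follows (a condensed sketch, not the PR's exact code: the matcher is passed as a plain function so the sketch stays runtime-agnostic, whereas the PR uses Bun's `glob.match`, and the real `walkDir` also handles symlinks as shown in the diff):

```typescript
import { readdirSync } from "fs";

// Sketch of early-exclusion traversal: excluded and hidden directories are
// pruned BEFORE readdir ever descends into them, so their contents are
// never buffered in memory.
function collectFiles(
  root: string,
  matches: (rel: string) => boolean, // e.g. (p) => glob.match(p) in the PR
  excludeSet: Set<string>,
): string[] {
  const files: string[] = [];
  const walkDir = (dir: string, relPrefix: string): void => {
    for (const entry of readdirSync(dir, { withFileTypes: true })) {
      const rel = relPrefix ? `${relPrefix}/${entry.name}` : entry.name;
      if (entry.isDirectory()) {
        // Key change vs Glob.scan(): the exclusion check happens here,
        // before descending, not after the subtree has been enumerated.
        if (excludeSet.has(entry.name) || entry.name.startsWith(".")) continue;
        walkDir(`${dir}/${entry.name}`, rel);
      } else if (entry.isFile() && matches(rel)) {
        files.push(rel);
      }
    }
  };
  walkDir(root, "");
  return files;
}
```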
Testing

- A `node_modules/` directory with files is correctly skipped (not traversed), while sibling directories are indexed normally