[wip] experiment with caching directory entries to avoid filepath.Glob for opening idx#18939

Open
sudeepdino008 wants to merge 6 commits into main from idx_name_cache

Conversation

@sudeepdino008 (Member) commented Feb 3, 2026

Summary

  • Added MatchVersionedFile to search pre-scanned directory entries instead of per-file filepath.Glob calls
  • Updated Domain.OpenList to accept ScanDirsResult struct instead of individual arrays
  • Pre-scans directory entries once upfront to avoid repeated filesystem calls when opening dirty files
  • CaplinSnapshots.OpenList now uses snaptype.IdxFiles instead of os.ReadDir for efficiency
  • Added test for MatchVersionedFile handling seg/idx with different base names (blobsidecars.seg → blocksidecars.idx)
  • SnapshotRepo.openDirtyFiles now uses MatchVersionedFile with pre-scanned entries
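As a rough illustration of the pre-scan approach described above: the directory is read once, and each per-file lookup becomes an in-memory match instead of a `filepath.Glob` call. This is a hedged sketch only — the function name echoes the PR's `MatchVersionedFile`, but the signature and the `v<major>.<minor>-<base>` naming assumption here are mine, not the PR's:

```go
package main

import (
	"fmt"
	"strings"
)

// matchVersionedFile scans a pre-read slice of directory entry names for a
// file of the form "v<version>-<base>", e.g. "v1.0-accounts.0-64.kvi".
// It replaces a per-file filepath.Glob("v*-<base>") with an in-memory scan.
// (Hypothetical sketch; the PR's actual MatchVersionedFile may differ.)
func matchVersionedFile(entries []string, base string) (string, bool) {
	for _, name := range entries {
		rest, ok := strings.CutPrefix(name, "v")
		if !ok {
			continue
		}
		dash := strings.IndexByte(rest, '-')
		if dash < 0 {
			continue
		}
		if rest[dash+1:] == base {
			return name, true
		}
	}
	return "", false
}

func main() {
	// entries would come from a single os.ReadDir done once up front
	entries := []string{"v1.0-accounts.0-64.kv", "v1.0-accounts.0-64.kvi"}
	name, ok := matchVersionedFile(entries, "accounts.0-64.kvi")
	fmt.Println(name, ok)
}
```

The point is that the directory listing is produced once per folder, while every subsequent lookup is a cheap slice scan, which is what makes opening many idx files fast.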

In Future PRs

  • fileItemsWithMissedAccessors in db/state/dirty_files.go:777 - uses dir.FileExist per accessor, could use pre-scanned entries instead
  • If FindFilesWithVersionsByPattern can be fully replaced, move supported version check into MatchVersionedFile
  • Let openFolder/openList in InvertedIndex/Domain/History accept ScanDirsResult directly
  • Build missed accessors can use MatchVersionedFile
  • Rename BuildMissingIndices to BuildMissedAccessors in caplin/RoSnapshots for consistency with rest of codebase
  • Snaptype operations: Index.HasFile, SnapType.FileExist, ParseFromFile in db/snaptype/type.go
  • Block types: body/tx path resolution in db/snaptype2/block_types.go

@AskAlexSharov
Collaborator

Thing is: dirtyFiles is already a cache of file metadata, and we already invalidate it via the OpenFolder method. And it already works, even with an external RPCD.

If somewhere we are using an unnecessary glob instead of a dirty-files lookup, then maybe just switch.

cc: @JkLondon

@sudeepdino008
Member Author

> Thing is: dirtyFiles is already a cache of file metadata, and we already invalidate it via the OpenFolder method. And it already works, even with an external RPCD.
>
> If somewhere we are using an unnecessary glob instead of a dirty-files lookup, then maybe just switch.
>
> cc: @JkLondon

But to build the dirty-files set we need to open the seg file plus its index.
Right now the open-index step is slow because of the glob (which we do for every data file), and that slows down offline command startup significantly, especially on Gnosis, which has lots of files; it would likely be worse for Polygon if we were working on it.

@AskAlexSharov
Collaborator

I mean, you would be invalidating one cache at the same time as the other cache.

@AskAlexSharov
Collaborator

  1. We are currently calling FindFilesWithVersionsByPattern for each file. Just don't call it so often; getting all the file names in the directory once is enough:
```go
func (h *History) openDirtyFiles() error {
	invalidFilesMu := sync.Mutex{}
	invalidFileItems := make([]*FilesItem, 0)
	h.dirtyFiles.Walk(func(items []*FilesItem) bool {
		for _, item := range items {
			fromStep, toStep := item.StepRange(h.stepSize)
			if item.decompressor == nil {
				fPathMask := h.vFilePathMask(fromStep, toStep)
				fPath, fileVer, ok, err := version.FindFilesWithVersionsByPattern(fPathMask)
				// ... (rest of the method elided in the quote)
```
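The "get all file names in the directory once" suggestion might be sketched like this (a hedged illustration only: the helper names and the `v`-prefix/suffix matching are assumptions, not erigon's actual API):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// scanDirOnce reads a directory a single time and returns its file names,
// so later per-file lookups become in-memory scans instead of repeated
// filepath.Glob calls that each hit the filesystem.
func scanDirOnce(dir string) ([]string, error) {
	des, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(des))
	for _, de := range des {
		if !de.IsDir() {
			names = append(names, de.Name())
		}
	}
	return names, nil
}

// findBySuffix replaces a glob like "v*-accounts.0-64.v" with an
// in-memory match over the pre-scanned names.
func findBySuffix(names []string, suffix string) (string, bool) {
	for _, n := range names {
		if strings.HasPrefix(n, "v") && strings.HasSuffix(n, suffix) {
			return n, true
		}
	}
	return "", false
}

func main() {
	dir, _ := os.MkdirTemp("", "snap")
	defer os.RemoveAll(dir)
	os.WriteFile(filepath.Join(dir, "v1.0-accounts.0-64.v"), nil, 0o644)

	names, _ := scanDirOnce(dir) // one ReadDir for the whole folder
	if n, ok := findBySuffix(names, "-accounts.0-64.v"); ok {
		fmt.Println("found:", n)
	}
}
```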

@sudeepdino008 sudeepdino008 marked this pull request as draft February 7, 2026 15:54
@sudeepdino008 sudeepdino008 marked this pull request as ready for review February 9, 2026 08:14
