Variant indexer performance improvements #1551
- Add sortSmartAlpha() and sortLowercase() to FieldBuilder, replacing the single sort() method. sortSmartAlpha uses the regex-based smart_alpha_sort normalizer (zero-pads numbers for natural sort). sortLowercase uses the simple lowercase normalizer (no regex).
- Switch 12 of 17 variant index sort fields to sortLowercase where field values are pure text (vepImpact, siftPrediction, polyphenPrediction, transcriptType, variantType, vepConsequences, alterationTypeSortOrder, associatedPhenotype, diseaseTerms, alleleSynonyms). Only 5 fields with meaningful numbers (symbol, hgvs, variantTranscript.name, geneSymbol) remain on sortSmartAlpha. This eliminates ~7 billion regex evaluations during variant indexing (600M+ docs). The smart_alpha_sort regex (PatternReplaceCharFilter) was consuming 95% of ES write thread CPU per hot thread analysis.
- Add catch-all Exception handler in RoutedBulkIndexer to log uncaught RuntimeExceptions that silently kill BP threads, then System.exit(-1) to fail fast. Investigation found 7 of 32 MGD BP threads dying silently during indexing with no stack trace.
- All site_index sort() calls renamed to sortSmartAlpha() (no behavior change).
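To illustrate why the split is safe for pure-text fields, here is a minimal Java sketch of the two sort-key strategies described above. It is not the Elasticsearch normalizer configuration from this PR: the class name, method names, and the pad width of 10 are all assumptions made for the example. It only shows that zero-padding changes the key when a value contains digits, and produces the same key as plain lowercasing when it does not.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; names and pad width are not taken from the PR.
class SortKeySketch {
    // Assumed pad width; the real smart_alpha_sort width is not stated here.
    private static final int PAD_WIDTH = 10;
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Regex-based key: lowercase, then left-pad every digit run with zeros
    // so that numeric parts compare naturally as strings.
    static String smartAlphaKey(String value) {
        Matcher m = DIGITS.matcher(value.toLowerCase(Locale.ROOT));
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String digits = m.group();
            int pad = Math.max(0, PAD_WIDTH - digits.length());
            m.appendReplacement(out, Matcher.quoteReplacement("0".repeat(pad) + digits));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Cheap key: lowercase only, no regex work per value.
    static String lowercaseKey(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> symbols = new ArrayList<>(List.of("Hdac10", "Hdac9", "BRCA1"));

        // Natural order: numeric parts are compared by value.
        symbols.sort(Comparator.comparing(SortKeySketch::smartAlphaKey));
        System.out.println("smart alpha: " + symbols);

        // Plain lowercase order: digits are compared character by character.
        symbols.sort(Comparator.comparing(SortKeySketch::lowercaseKey));
        System.out.println("lowercase:   " + symbols);
    }
}
```

In this sketch a value such as MODERATE yields the same key from both methods, so the regex pass only costs time for such fields, while values like Hdac9 and Hdac10 sort differently under the two keys.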
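The fail-fast handler described above can be sketched as a wrapper around a worker body. This is an assumption-laden illustration, not the RoutedBulkIndexer code: the class name, the failFast helper, and the injectable exit hook are hypothetical and exist only so the behavior can be exercised without terminating the JVM. In a real indexer the hook would presumably be System::exit.

```java
import java.util.function.IntConsumer;

// Hypothetical illustration; not the actual RoutedBulkIndexer implementation.
class FailFastWorkerSketch {

    // Wraps a worker body so that any uncaught exception is logged with its
    // stack trace and then handed to exitHook with status -1, instead of
    // letting the thread die silently.
    static Runnable failFast(String name, Runnable body, IntConsumer exitHook) {
        return () -> {
            try {
                body.run();
            } catch (Exception e) {
                System.err.println("Uncaught exception in " + name + ": " + e);
                e.printStackTrace();
                exitHook.accept(-1);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // For the demo the hook only reports the status; passing System::exit
        // here instead would terminate the process.
        IntConsumer demoHook = status ->
                System.err.println("would exit with status " + status);

        Runnable task = failFast("worker-1", () -> {
            throw new IllegalStateException("simulated indexing failure");
        }, demoHook);

        Thread t = new Thread(task, "worker-1");
        t.start();
        t.join();
    }
}
```

Passing the exit action as a parameter is a design choice of this sketch only; it keeps the wrapper testable while preserving the log-then-exit order described in the PR.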
Code Review
Overall: Changes look correct. The sort strategy split is well-reasoned and the field-level choices in
One issue to consider
Mapping changes look correct
LGTM with the minor
Summary
- Replace the single sort() method with explicit sortSmartAlpha() and sortLowercase() on FieldBuilder. Switch 12 of 17 variant index sort fields to sortLowercase where values are pure text, which eliminates ~7 billion regex evaluations during indexing (600M+ docs). Hot thread analysis showed the PatternReplaceCharFilter regex consuming 95% of ES write thread CPU.
- Add a catch-all exception handler that logs the failure and then calls System.exit(-1). Investigation found 7 of 32 MGD BP threads dying silently during indexing with no diagnostics.

Fields switched to sortLowercase (variant index only)
- alterationTypeSortOrder
- consequence.vepImpact.name
- consequence.vepConsequences.name
- consequence.variantTranscript.transcriptType.name
- consequence.siftPrediction.name
- consequence.polyphenPrediction.name
- variant.variantType.name / variants.variantType.name
- variants.*.vepConsequences.name
- associatedPhenotype, diseaseTerms.name, alleleSynonyms.displayText

Fields kept on sortSmartAlpha
- symbol (e.g. NC_000078.7:g.34620827C>T)
- *.hgvs
- consequence.variantTranscript.name (e.g. ENSEMBL:ENSMUST00000094140)
- geneSymbol.displayText (e.g. Hdac9, BRCA1)

Test plan