-
-
Notifications
You must be signed in to change notification settings - Fork 5
Make headword matches trump fts rank #2112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
It doesn't make any sense to sort bad matches first
because that's what we primarily display - it has a special status not just a higher weight
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughMultiple backend files refactored to standardize writing system handling by replacing WritingSystem object references with WritingSystemId identifiers throughout the sorting and filtering pipeline. MiniLcmRepository ensures default writing systems are populated during entry retrieval, while public APIs simplified by removing ordering directives from method signatures. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
The latest updates on your projects. Learn more about Argos notifications ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
backend/FwLite/LcmCrdt/FullTextSearch/EntrySearchService.cs (1)
47-49: Consider extracting the magic number10_000as a named constant.The ranking logic correctly prioritizes headword matches (sorted by length) above non-headword matches. However, the magic number
10_000could be more self-documenting:+private const int NonHeadwordMatchPenalty = 10_000; + if (rankResults) { filtered = filtered - .OrderBy(t => SqlHelpers.ContainsIgnoreCaseAccents(t.searchRecord.Headword, query) ? t.searchRecord.Headword.Length : 10_000) + .OrderBy(t => SqlHelpers.ContainsIgnoreCaseAccents(t.searchRecord.Headword, query) ? t.searchRecord.Headword.Length : NonHeadwordMatchPenalty) .ThenBy(t => Sql.Ext.SQLite().Rank(t.searchRecord)).ThenBy(t => t.entry.Id); }backend/FwLite/LcmCrdt/Data/MiniLcmRepository.cs (1)
208-211: Defensive validation is good, but consider the exception type.This validation throws
ArgumentExceptionif the writing system is missing afterEnsureWritingSystemIsPopulatedwas called. SinceGetEntriesalways callsEnsureWritingSystemIsPopulatedfirst, this acts as a defensive safeguard for directApplySortingcalls.Consider whether
InvalidOperationExceptionmight be more appropriate here, as the issue would be a programming error (caller forgot to populate) rather than an invalid argument per se. This is a minor observation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
backend/FwLite/LcmCrdt/Data/MiniLcmRepository.cs(4 hunks)backend/FwLite/LcmCrdt/Data/SetupCollationInterceptor.cs(1 hunks)backend/FwLite/LcmCrdt/Data/SqlSortingExtensions.cs(1 hunks)backend/FwLite/LcmCrdt/FullTextSearch/EntrySearchService.cs(2 hunks)backend/FwLite/MiniLcm.Tests/QueryEntryTestsBase.cs(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: imnasnainaec
Repo: sillsdev/languageforge-lexbox PR: 1867
File: platform.bible-extension/src/main.ts:239-246
Timestamp: 2025-07-31T19:10:41.178Z
Learning: In the sillsdev/languageforge-lexbox repository, user imnasnainaec prefers to defer code improvements when there are related TODO comments indicating planned refactoring work, choosing to bundle related changes together rather than making incremental improvements that would need to be modified again during the larger refactoring.
📚 Learning: 2025-06-13T09:25:37.958Z
Learnt from: hahn-kev
Repo: sillsdev/languageforge-lexbox PR: 1698
File: backend/FwLite/LcmCrdt/Data/Filtering.cs:25-35
Timestamp: 2025-06-13T09:25:37.958Z
Learning: In backend/FwLite/LcmCrdt/Data/Filtering.cs `FtsFilter`, the `&&` combination between the FTS `MATCH` result and the `SearchValue` fallback is intentional to maintain existing search behavior; any future change to use `||` (or another approach) will be considered later.
Applied to files:
backend/FwLite/LcmCrdt/FullTextSearch/EntrySearchService.cs
🧬 Code graph analysis (3)
backend/FwLite/LcmCrdt/FullTextSearch/EntrySearchService.cs (2)
backend/FwLite/LcmCrdt/SqlHelpers.cs (2)
SqlHelpers(10-36)Sql(34-35)backend/FwLite/MiniLcm/Models/Entry.cs (1)
Headword(34-42)
backend/FwLite/LcmCrdt/Data/SetupCollationInterceptor.cs (1)
backend/FwLite/LcmCrdt/Data/SqlSortingExtensions.cs (2)
SqlSortingExtensions(6-29)CollationName(23-28)
backend/FwLite/LcmCrdt/Data/MiniLcmRepository.cs (3)
backend/FwLite/FwDataMiniLcmBridge/Api/FwDataMiniLcmApi.cs (2)
Order(742-760)WritingSystem(100-116)backend/FwLite/MiniLcm/Exceptions/NotFoundException.cs (6)
NotFoundException(6-44)NotFoundException(8-10)NotFoundException(23-26)NotFoundException(28-31)NotFoundException(33-36)NotFoundException(38-41)backend/FwLite/MiniLcm/Models/Entry.cs (1)
Headword(34-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build FwHeadless / publish-fw-headless
- GitHub Check: Analyze (csharp)
- GitHub Check: frontend-component-unit-tests
- GitHub Check: frontend
- GitHub Check: Build FW Lite and run tests
🔇 Additional comments (9)
backend/FwLite/LcmCrdt/Data/SetupCollationInterceptor.cs (1)
125-127: LGTM!The change correctly passes
writingSystem.WsIdto match the updatedCollationName(WritingSystemId wsId)signature, maintaining consistency with the refactor to useWritingSystemIdthroughout the sorting/collation pipeline.backend/FwLite/MiniLcm.Tests/QueryEntryTestsBase.cs (1)
389-389: Good test case for validating the headword ranking fix.This test case directly validates the PR objective: the entry "aap" with a very long gloss that would penalize BM25 ranking now correctly appears first as an exact headword match, followed by shorter headword matches. The expected order
"aap,maap,liaap,maap-aap"confirms headword matches are prioritized and sorted by length.backend/FwLite/LcmCrdt/FullTextSearch/EntrySearchService.cs (1)
32-34: LGTM - Simplified signature.Removing the
orderAscendingparameter simplifies the API since ranking now follows a fixed priority order (headword matches first, then BM25 rank).backend/FwLite/LcmCrdt/Data/MiniLcmRepository.cs (3)
131-132: LGTM - Writing system is now populated early in the flow.This ensures the writing system is resolved before filtering and sorting operations, aligning with the PR objective to populate the sort writing system earlier.
158-166: LGTM - Clean helper for ensuring writing system is populated.The method correctly:
- Short-circuits if a writing system is already specified
- Retrieves the default vernacular writing system
- Throws a descriptive exception if none exists
- Returns a new
QueryOptionswith the populated writing system
214-218: LGTM - Sorting logic correctly useswsIdfor collation.The headword sorting now uses the local
wsIdfor both theHeadword()call andCollateUnicode(), ensuring consistent writing system handling throughout the sorting pipeline.backend/FwLite/LcmCrdt/Data/SqlSortingExtensions.cs (3)
11-15: LGTM - Simplified parameter type.Using
WritingSystemIdinstead of the fullWritingSystemobject is a cleaner API design since only the identifier is needed for collation purposes.
17-21: LGTM - Expression updated consistently.The expression signature and lambda correctly use
WritingSystemIdthroughout.
23-28: LGTM - Direct access towsId.Codeis cleaner.The change from
ws.WsId.CodetowsId.Codesimplifies the code while maintaining the same collation name format.
In a real project, an exact headword match was show as the approx. 10th item. The reason ended up being that the entry had more gloss text than the entries above it, which causes it to be penalized. I added a test that demonstrates that and was initially failing.
I think it makes sense to give the headword a special status rather than just a higher weight, because the headword is what we show front and center as if it's the primary thing we match against, so I think it should be the primary thing we match against.
So, this PR always puts headword matches above non-headword matches and sorts the headword matches by their length.