refactor SearchService to optimize candidate note retrieval#33
PrivateGER merged 1 commit into develop
Walkthrough
This PR refactors note search in the search service from a single monolithic query into a two-phase approach: the first query selects note IDs using indexed conditions and text filters, and the second query fetches the complete note records with visibility, blocking, and muting enforcement applied.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 0
🧹 Nitpick comments (2)
packages/backend/src/core/SearchService.ts (2)
328-334: Consider adding basic visibility filtering to reduce false candidates.
The 5x multiplier accounts for filtering, but if the database has many follower-only or specified-visibility notes, this could still result in underfetching. Adding a simple pre-filter on the candidate query could improve hit rates without adding complex joins:
🔎 Proposed enhancement
```diff
 if (opts.filetype) {
 	candidateQuery.andWhere('note."attachedFileTypes" && :types', { types: fileTypes[opts.filetype] });
 }
+
+// Pre-filter to searchable visibility levels to reduce false candidates
+candidateQuery.andWhere('note.visibility IN (:...visibilities)', {
+	visibilities: ['public', 'home']
+});

 // Fetch more candidates than needed since some will likely be filtered by visibility checks
 const candidateRows = await candidateQuery.limit(pagination.limit * 5).getRawMany();
```
This is safe because `generateVisibilityQuery` in the second phase already filters to these visibility levels for non-authenticated or non-follower users, and notes with restricted visibility won't match text search indexes anyway in most configurations.
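The underfetching concern can be illustrated with a small in-memory sketch. Everything here is hypothetical: `NoteRow`, `selectCandidates`, `applyVisibility`, and the sample data are stand-ins invented for illustration and do not come from `SearchService.ts`. Only the 5x factor is taken from the review.

```typescript
// Hypothetical in-memory stand-ins; none of these names come from SearchService.ts.
type Visibility = 'public' | 'home' | 'followers' | 'specified';
interface NoteRow { id: string; visibility: Visibility; }

const OVERFETCH_FACTOR = 5; // the multiplier mentioned in the review

// Phase 1: take up to limit * factor candidate rows, newest (largest id) first.
function selectCandidates(rows: NoteRow[], limit: number): NoteRow[] {
	return rows
		.slice()
		.sort((a, b) => (a.id > b.id ? -1 : a.id < b.id ? 1 : 0))
		.slice(0, limit * OVERFETCH_FACTOR);
}

// Phase 2: drop candidates the viewer may not see, then trim to the page size.
function applyVisibility(candidates: NoteRow[], visible: Visibility[], limit: number): NoteRow[] {
	return candidates.filter(n => visible.indexOf(n.visibility) !== -1).slice(0, limit);
}

// 60 notes with ids 000..059; only every 20th is public, the rest are followers-only.
const demoNotes: NoteRow[] = [];
for (let i = 0; i < 60; i++) {
	demoNotes.push({
		id: ('00' + i).slice(-3),
		visibility: i % 20 === 0 ? 'public' : 'followers',
	});
}

const pageLimit = 3;
const anonymousVisible: Visibility[] = ['public', 'home'];

// Without a pre-filter, the 15 newest candidates (059..045) are all followers-only.
const withoutPrefilter = applyVisibility(selectCandidates(demoNotes, pageLimit), anonymousVisible, pageLimit);

// With a visibility pre-filter, phase 1 only ever sees rows that can survive phase 2.
const prefiltered = demoNotes.filter(n => anonymousVisible.indexOf(n.visibility) !== -1);
const withPrefilter = applyVisibility(selectCandidates(prefiltered, pageLimit), anonymousVisible, pageLimit);

console.log(withoutPrefilter.length, withPrefilter.map(n => n.id));
```

In this toy data set the unfiltered path returns an empty page even though three matching public notes exist, which is the failure mode the nitpick describes. Whether a hard-coded `['public', 'home']` filter is appropriate for authenticated viewers is a separate question this sketch does not address.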
425-428: Pre-existing: MeiliSearch result ordering doesn't match SQL path behavior.
Not introduced by this PR, but worth noting: the MeiliSearch path hardcodes a descending sort (`a.id > b.id ? -1 : 1`) regardless of pagination direction, while the refactored SQL path correctly uses `sortOrder` based on `sinceId`/`untilId`. Consider aligning this behavior for consistency in a follow-up.
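One way the follow-up could look is a comparator that takes the pagination direction into account. This is a sketch under assumptions: `resolveSortOrder` and `compareIds` are hypothetical helpers, and the rule "ascending only when `sinceId` is given without `untilId`" is an assumed reading of how the SQL path picks `sortOrder`, not something confirmed by the diff.

```typescript
type SortOrder = 'ASC' | 'DESC';
interface PaginationArgs { sinceId?: string; untilId?: string; }

// Assumed rule: ascending only when paging forward from sinceId alone; otherwise descending.
function resolveSortOrder(p: PaginationArgs): SortOrder {
	return p.sinceId && !p.untilId ? 'ASC' : 'DESC';
}

// Build a comparator for the chosen direction; equal ids compare as 0.
function compareIds(order: SortOrder) {
	return (a: { id: string }, b: { id: string }): number => {
		if (a.id === b.id) return 0;
		const ascending = a.id < b.id ? -1 : 1;
		return order === 'ASC' ? ascending : -ascending;
	};
}

const hits = [{ id: 'b' }, { id: 'c' }, { id: 'a' }];

// Paging forward from a sinceId: oldest first.
const forward = hits.slice().sort(compareIds(resolveSortOrder({ sinceId: '0' })));

// Paging backward from an untilId: newest first.
const backward = hits.slice().sort(compareIds(resolveSortOrder({ untilId: 'z' })));

console.log(forward.map(h => h.id), backward.map(h => h.id));
```

Unlike the hardcoded `a.id > b.id ? -1 : 1`, this comparator also returns 0 for equal ids, which keeps the sort well-defined if duplicates ever appear.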
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
packages/backend/src/core/SearchService.ts
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-19T14:14:22.397Z
Learnt from: CR
Repo: PrivateGER/hydrus-nextbooru PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-19T14:14:22.397Z
Learning: Applies to src/app/api/tags/search/**/*.{ts,tsx} : Implement progressive tag search filtering - only show tags that co-occur with already-selected tags
Applied to files:
packages/backend/src/core/SearchService.ts
🔍 Remote MCP
Summary of Relevant Context for PR Review
Project Context
Sharkey is a Misskey fork that follows upstream changes when possible while adding its own features. It is ActivityPub software that can interface with the fediverse—an interconnected social network connecting with other software such as Akkoma, Mastodon, and Pixelfed. Sharkey uses TypeORM as a database tool to make database migrations easier, and the PR targets the SearchService component in the backend.
Database Search Architecture Context
This PR optimizes note search through a two-phase query approach. The refactor is strategically aligned with PostgreSQL full-text search best practices:
Full-Text Search Technologies Referenced:
- PostgreSQL uses tsvector and tsquery data types for full text searches, and GIN indexes are recommended for full-text search vectors (tsvector)
- In PostgreSQL, a tsvector is a special data type used to represent text in a form optimized for full-text search. A tsvector is a vector of lexemes (the basic units of text)—it simplifies language for search operations
- The combination of tsvector and tsquery provides dynamic search capabilities, while the strategic use of GIN indexes on generated tsvector columns significantly enhances search performance
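To make the lexeme and index ideas above more concrete, here is a toy model in TypeScript. It is not how PostgreSQL implements `tsvector` or GIN: the stop-word list is made up, there is no stemming, and `toLexemes`, `buildInvertedIndex`, and `searchIndex` are hypothetical names. It only shows why mapping each lexeme to the documents that contain it makes term lookups cheap.

```typescript
// Toy normalisation: lowercase, split on non-alphanumerics, drop a tiny stop-word list.
// PostgreSQL's text search configurations also stem words; this sketch does not.
const STOP_WORDS = new Set<string>(['a', 'an', 'the', 'is', 'of', 'and']);

function toLexemes(text: string): string[] {
	const unique = new Set<string>();
	for (const word of text.toLowerCase().split(/[^a-z0-9]+/)) {
		if (word.length > 0 && !STOP_WORDS.has(word)) unique.add(word);
	}
	return Array.from(unique);
}

// Inverted index: lexeme -> ids of the documents containing it (the idea behind a GIN index).
function buildInvertedIndex(docs: { id: string; text: string }[]): Map<string, Set<string>> {
	const index = new Map<string, Set<string>>();
	for (const doc of docs) {
		for (const lexeme of toLexemes(doc.text)) {
			const ids = index.get(lexeme) ?? new Set<string>();
			ids.add(doc.id);
			index.set(lexeme, ids);
		}
	}
	return index;
}

// AND-match: ids of documents that contain every lexeme of the query.
function searchIndex(index: Map<string, Set<string>>, query: string): string[] {
	const lexemes = toLexemes(query);
	if (lexemes.length === 0) return [];
	let result = Array.from(index.get(lexemes[0]) ?? new Set<string>());
	for (const lexeme of lexemes.slice(1)) {
		const ids = index.get(lexeme) ?? new Set<string>();
		result = result.filter(id => ids.has(id));
	}
	return result.sort();
}

const toyIndex = buildInvertedIndex([
	{ id: 'n1', text: 'The quick brown fox' },
	{ id: 'n2', text: 'A quick search of the index' },
	{ id: 'n3', text: 'Brown bears and foxes' },
]);

console.log(searchIndex(toyIndex, 'quick'), searchIndex(toyIndex, 'quick brown'));
```

Because the toy has no stemming, a query for `fox` does not match `foxes`; a real text search configuration would typically reduce both to the same lexeme.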
Query Optimization Principles
The PR's two-phase candidate selection pattern aligns with recognized PostgreSQL optimization techniques:
- PostgreSQL generates various execution plans and the optimizer evaluates these with the intention of choosing the most efficient plan, considering factors like available indexes, join strategies, and data distribution
- The planner/optimizer's task is to create an optimal execution plan. A given SQL query can be executed in many different ways, and if computationally feasible, the query optimizer will examine each possible execution plan, selecting the one expected to run fastest
- Reducing the amount of data to be sorted is vital to performance and important to the user experience
Relevant Indexing Considerations
For Sharkey/Akkoma instances, additional indexes can be created on note and user tables to provide much faster search at the cost of additional disk space, with the PostgreSQL pg_trgm extension enabled to support queries using LIKE and ILIKE operators.
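As a rough illustration of what a trigram index works with, the sketch below extracts trigrams the way the pg_trgm documentation describes (each word is treated as having two spaces prefixed and one space suffixed) and computes a shared-over-union similarity. It is a simplification: it only handles ASCII letters and digits, and `trigrams` and `trigramSimilarity` are names chosen for this example, not part of any library.

```typescript
// Extract the set of trigrams for a string, padding each word with
// two leading spaces and one trailing space.
function trigrams(text: string): Set<string> {
	const out = new Set<string>();
	const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 0);
	for (const word of words) {
		const padded = '  ' + word + ' ';
		for (let i = 0; i + 3 <= padded.length; i++) {
			out.add(padded.slice(i, i + 3));
		}
	}
	return out;
}

// Similarity as shared trigrams divided by the size of the union of both sets.
function trigramSimilarity(a: string, b: string): number {
	const ta = trigrams(a);
	const tb = trigrams(b);
	let shared = 0;
	ta.forEach(t => { if (tb.has(t)) shared++; });
	const union = ta.size + tb.size - shared;
	return union === 0 ? 0 : shared / union;
}

console.log(Array.from(trigrams('cat')), trigramSimilarity('word', 'two words'));
```

A trigram index can answer `LIKE`/`ILIKE` patterns by first finding rows that share the pattern's trigrams and then rechecking the actual pattern, which is why it trades extra disk space for faster substring search.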
Key Review Points for This PR
The refactor decouples candidate selection (using indexed conditions with full-text filters) from full data retrieval (with visibility/blocking/muting enforcement), which reduces query planner complexity by narrowing the candidate set before expensive joins are performed. This approach leverages indexed access patterns identified as performance optimizations in PostgreSQL documentation and aligns with best practices for full-text search implementations using tsvector and GIN indexes.
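The decoupling described above can be sketched without a database. The types and functions below (`FullNote`, `Viewer`, `selectCandidateIds`, `fetchVisibleNotes`) are hypothetical and operate on an in-memory array; the visibility rule is deliberately reduced to "public or home" and is only an assumption for illustration, not the logic of `generateVisibilityQuery`.

```typescript
interface FullNote {
	id: string;
	userId: string;
	text: string;
	visibility: 'public' | 'home' | 'followers' | 'specified';
}
interface Viewer { blockedBy: string[]; muted: string[]; }

// Phase 1: cheap selection that returns ids only (text filter, newest first, capped).
function selectCandidateIds(store: FullNote[], term: string, fetchLimit: number): string[] {
	const needle = term.toLowerCase();
	return store
		.filter(n => n.text.toLowerCase().indexOf(needle) !== -1)
		.map(n => n.id)
		.sort()
		.reverse()
		.slice(0, fetchLimit);
}

// Phase 2: load full rows for those ids and enforce visibility, blocking, and muting.
function fetchVisibleNotes(store: FullNote[], ids: string[], viewer: Viewer, limit: number): FullNote[] {
	return store
		.filter(n => ids.indexOf(n.id) !== -1)
		.filter(n => n.visibility === 'public' || n.visibility === 'home')
		.filter(n => viewer.blockedBy.indexOf(n.userId) === -1 && viewer.muted.indexOf(n.userId) === -1)
		.sort((a, b) => (a.id > b.id ? -1 : a.id < b.id ? 1 : 0)) // same order as phase 1
		.slice(0, limit);
}

const store: FullNote[] = [
	{ id: '1', userId: 'alice', text: 'Hello search', visibility: 'public' },
	{ id: '2', userId: 'bob', text: 'search tips', visibility: 'public' },
	{ id: '3', userId: 'carol', text: 'private search', visibility: 'followers' },
	{ id: '4', userId: 'dave', text: 'unrelated', visibility: 'public' },
	{ id: '5', userId: 'erin', text: 'Search again', visibility: 'home' },
];
const viewer: Viewer = { blockedBy: [], muted: ['bob'] };

const candidateIds = selectCandidateIds(store, 'search', 10);
const page = fetchVisibleNotes(store, candidateIds, viewer, 2);

console.log(candidateIds, page.map(n => n.id));
```

The expensive checks only run over the handful of rows named by phase 1, which is the point of the split: the first step stays simple enough to be served from indexes, and the second step pays for joins on a small set.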
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Analyze (javascript-typescript)
- GitHub Check: semgrep-cloud-platform/scan
🔇 Additional comments (2)
packages/backend/src/core/SearchService.ts (2)
279-308: Candidate query setup looks solid.
The two-phase approach correctly separates ID selection from full data retrieval, allowing PostgreSQL to utilize indexes more effectively. The pagination logic properly mirrors `makePaginationQuery` behavior, and the provider-specific full-text search filters are appropriate.
336-354: Full notes retrieval with visibility checks is well-structured.
The second query correctly fetches complete note data with all necessary joins and applies comprehensive visibility, blocking, and muting checks. The sort order is preserved via the `orderBy` clause matching the candidate query's order.
…ly use indexes
What
Why
Additional info (optional)
Checklist