Add community-based diversity for search results #2

@colek42

Summary

Implement community-based diversification using Leiden clustering to ensure search results span different semantic modules of the codebase.

Background

The Python REFRAG implementation uses the Leiden algorithm to cluster code into semantic communities, then enforces a diversity constraint (max 5 results per community). This prevents all results from coming from a single module.

Reference: refrag_ollama.py:2370-2400

Current State

We have basic file/folder diversity (max 2 per file, max 6 per folder), but this is structural, not semantic.

Features

Community Detection

  • Use the Leiden algorithm to cluster code into semantic communities
  • Communities are based on semantic similarity (embedding proximity)
  • Provides better module boundaries than file/folder structure

Diversity Enforcement

  • Enforce max_per_community constraint (default: 5)
  • Ensures results span multiple semantic modules
  • More effective than file/folder diversity for large codebases
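
A minimal sketch of the max_per_community cap, assuming results arrive already ranked by score and already carry a community ID. The `Result` type and `EnforceCommunityDiversity` function are hypothetical names for illustration, not existing code in this repo or in the Python implementation:

```go
package main

import "fmt"

// Result is a hypothetical, trimmed-down stand-in for SearchResult.
type Result struct {
	Path        string
	Score       float64
	CommunityID int
}

// EnforceCommunityDiversity walks results in their existing rank order and
// keeps at most maxPerCommunity results from each community.
// A non-positive maxPerCommunity disables the cap (an assumption of this sketch).
func EnforceCommunityDiversity(ranked []Result, maxPerCommunity int) []Result {
	if maxPerCommunity <= 0 {
		return ranked
	}
	seen := make(map[int]int) // community ID -> results kept so far
	kept := make([]Result, 0, len(ranked))
	for _, r := range ranked {
		if seen[r.CommunityID] >= maxPerCommunity {
			continue // this community already hit its cap
		}
		seen[r.CommunityID]++
		kept = append(kept, r)
	}
	return kept
}

func main() {
	ranked := []Result{
		{"auth/login.go", 0.95, 1},
		{"auth/token.go", 0.93, 1},
		{"auth/session.go", 0.90, 1},
		{"db/query.go", 0.88, 2},
		{"auth/oauth.go", 0.85, 1},
	}
	for _, r := range EnforceCommunityDiversity(ranked, 2) {
		fmt.Println(r.CommunityID, r.Path)
	}
}
```

With a cap of 2, the example keeps auth/login.go and auth/token.go from community 1 and db/query.go from community 2. A single greedy pass preserves the original ranking; whether it should run before or after the existing file/folder caps is left open here.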

Community Visualization

  • Print a community map showing the top modules
  • Display the community distribution in search results
  • Help users understand the codebase structure
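
One possible shape for the distribution summary, assuming each result exposes its community ID. `CommunityCount` and `CommunityDistribution` are hypothetical names, and the output format is only illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// CommunityCount pairs a community ID with the number of results in it.
type CommunityCount struct {
	CommunityID int
	Count       int
}

// CommunityDistribution counts results per community and returns the counts
// sorted by size (largest first), breaking ties by community ID.
func CommunityDistribution(communityIDs []int) []CommunityCount {
	counts := make(map[int]int)
	for _, id := range communityIDs {
		counts[id]++
	}
	out := make([]CommunityCount, 0, len(counts))
	for id, n := range counts {
		out = append(out, CommunityCount{CommunityID: id, Count: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].CommunityID < out[j].CommunityID
	})
	return out
}

func main() {
	// Community IDs of six hypothetical search results.
	ids := []int{3, 1, 3, 2, 3, 1}
	for _, c := range CommunityDistribution(ids) {
		fmt.Printf("community %d: %d result(s)\n", c.CommunityID, c.Count)
	}
}
```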

Implementation Tasks

  • Implement or integrate Leiden clustering algorithm
  • Build community graph from chunk embeddings
  • Add community metadata to chunks
  • Implement diversity enforcement in search
  • Add community visualization/summary
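
For the graph-building task, a rough sketch of turning chunk embeddings into a weighted edge list that a Leiden implementation could consume. This does not implement Leiden itself. The `Edge` type, `BuildSimilarityGraph`, and the `minSim` threshold are assumptions for illustration, and the pairwise loop is O(n²), so a large index would likely need a k-nearest-neighbour approach instead:

```go
package main

import (
	"fmt"
	"math"
)

// Edge is a weighted, undirected link between two chunk indices.
type Edge struct {
	A, B   int
	Weight float64
}

// cosine returns the cosine similarity of two vectors, or 0 when the
// lengths differ or either vector has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// BuildSimilarityGraph connects every pair of chunks whose embeddings have
// cosine similarity of at least minSim. Each pair is visited once (i < j).
func BuildSimilarityGraph(embeddings [][]float32, minSim float64) []Edge {
	var edges []Edge
	for i := 0; i < len(embeddings); i++ {
		for j := i + 1; j < len(embeddings); j++ {
			if s := cosine(embeddings[i], embeddings[j]); s >= minSim {
				edges = append(edges, Edge{A: i, B: j, Weight: s})
			}
		}
	}
	return edges
}

func main() {
	embeddings := [][]float32{
		{1, 0},     // chunk 0
		{0.9, 0.1}, // chunk 1, close to chunk 0
		{0, 1},     // chunk 2, orthogonal to chunk 0
	}
	for _, e := range BuildSimilarityGraph(embeddings, 0.8) {
		fmt.Printf("chunk %d <-> chunk %d (weight %.2f)\n", e.A, e.B, e.Weight)
	}
}
```

In this toy input only chunks 0 and 1 are similar enough to be linked. The resulting edge list would be the input to whichever Leiden implementation is chosen or written.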

API Design

type SearchOptions struct {
    // ... existing fields ...
    
    // Community diversity
    UseCommunity    bool // Enable community-based diversity (default: true)
    MaxPerCommunity int  // Max results per community (default: 5)
}

type SearchResult struct {
    // ... existing fields ...
    
    // Community metadata
    CommunityID   int    // Semantic community ID
    CommunityInfo string // Community description
}
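
One thing to settle for the defaults above: Go's zero value for `bool` is `false`, so a zero-valued `SearchOptions{}` would not enable community diversity on its own. A possible approach, sketched with a hypothetical `DefaultSearchOptions` constructor and only the two new fields shown:

```go
package main

import "fmt"

// SearchOptions here contains only the proposed community fields;
// the existing fields are omitted from this sketch.
type SearchOptions struct {
	UseCommunity    bool // Enable community-based diversity
	MaxPerCommunity int  // Max results per community
}

// DefaultSearchOptions returns options with the defaults proposed in this
// issue: community diversity on, at most 5 results per community.
func DefaultSearchOptions() SearchOptions {
	return SearchOptions{
		UseCommunity:    true,
		MaxPerCommunity: 5,
	}
}

func main() {
	opts := DefaultSearchOptions()
	opts.MaxPerCommunity = 3 // callers override only what they need
	fmt.Printf("%+v\n", opts)
}
```

An alternative would be to invert the flag (for example a `DisableCommunity` field) so the zero value matches the intended default. Either choice works; the point is only that the default cannot come from the struct literal alone.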

Benefits

  • Better cross-module coverage in search results
  • Results are grouped by semantic boundaries rather than structural (file/folder) boundaries
  • Helps users discover related code in different modules
  • Proven effective in the Python implementation
