Implement proper diff algorithm for accurate line change calculation #43

@sanggggg

Description

Background

PR #41 introduced unified tool operations tracking with file change metrics. The system currently derives line changes from a simple heuristic: it counts the lines in old_string and new_string of each Edit operation, then subtracts one count from the other.

Current Implementation (src/tools/parsers/edit.rs:104-132):

pub fn lines_before(&self) -> Option<i32> {
    self.old_string.as_ref().map(|s| {
        if s.is_empty() { return 0; }
        let newline_count = s.chars().filter(|&c| c == '\n').count();
        if s.ends_with('\n') { newline_count as i32 }
        else { (newline_count + 1) as i32 }
    })
}

Current Calculation Logic (src/models/tool_operation.rs:145-169):

pub fn with_line_metrics(mut self, lines_before: Option<i32>, lines_after: Option<i32>) -> Self {
    if let Some(meta) = &mut self.file_metadata {
        meta.lines_before = lines_before;
        meta.lines_after = lines_after;
        
        // Simple subtraction - not an actual diff! Equal counts leave both fields untouched.
        if let (Some(before), Some(after)) = (lines_before, lines_after) {
            if after > before {
                meta.lines_added = Some(after - before);
                meta.lines_removed = Some(0);
            } else if before > after {
                meta.lines_added = Some(0);
                meta.lines_removed = Some(before - after);
            }
        }
    }
    self
}

Problem

The current approach has several accuracy issues:

  1. Inaccurate Change Detection: When old_string has 10 lines and new_string has 15 lines, the system reports "5 lines added, 0 removed", even though some of the 10 original lines may have been modified or deleted.
  2. No Context Awareness: The system cannot distinguish between:
    • Adding 5 new lines to the existing 10 lines (actual: +5, -0)
    • Replacing 10 lines with 15 completely different lines (actual: +15, -10)
    • Adding 5 lines while modifying k of the original 10 lines (actual: +(5+k), -k)
  3. Misleading Metrics: The total_line_changes() and net_line_change() methods rely on these inaccurate numbers, so any analysis of code modification scope built on them is incorrect.
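
The ambiguity described in the list above can be reproduced in isolation. The following is a minimal sketch that mirrors the counting and subtraction logic quoted earlier; the function names `count_lines`, `subtraction_metrics`, and `numbered` are hypothetical helpers introduced only for this illustration and do not exist in the codebase.

```rust
/// Counts lines the same way the quoted `lines_before` does.
pub fn count_lines(s: &str) -> i32 {
    if s.is_empty() {
        return 0;
    }
    let newlines = s.chars().filter(|&c| c == '\n').count() as i32;
    if s.ends_with('\n') { newlines } else { newlines + 1 }
}

/// Mirrors the subtraction in the quoted `with_line_metrics`.
/// Returns (lines_added, lines_removed); `None` means the field is left untouched.
pub fn subtraction_metrics(before: i32, after: i32) -> (Option<i32>, Option<i32>) {
    if after > before {
        (Some(after - before), Some(0))
    } else if before > after {
        (Some(0), Some(before - after))
    } else {
        // Neither branch of the quoted code fires when the counts are equal.
        (None, None)
    }
}

/// Hypothetical helper: builds `n` distinct lines such as "old 1\nold 2\n...".
pub fn numbered(prefix: &str, n: usize) -> String {
    (1..=n).map(|i| format!("{} {}\n", prefix, i)).collect()
}
```

With `old = numbered("old", 10)`, appending `numbered("new", 5)` and replacing everything with `numbered("other", 15)` both produce `(Some(5), Some(0))`, although the second edit touches every line. A same-length rewrite such as `numbered("other", 10)` produces `(None, None)`.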

Proposed Solution

Implement a proper diff algorithm to calculate actual line additions and deletions:

Phase 1: Myers Diff Algorithm Integration

  1. Add dependency for diff calculation:

    [dependencies]
    similar = "2.3"  # or another diff library like `diff` or `dissimilar`
  2. Enhance EditData parser (src/tools/parsers/edit.rs):

    use similar::{ChangeTag, TextDiff};

    /// Line-level counts produced by diffing old_string against new_string.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct DiffMetrics {
        pub lines_added: i32,
        pub lines_removed: i32,
        pub lines_unchanged: i32,
    }

    impl EditData {
        /// Calculate actual line changes using the Myers diff algorithm
        /// (the default algorithm in `similar`).
        /// Returns None when either string is missing.
        pub fn calculate_diff_metrics(&self) -> Option<DiffMetrics> {
            // Assumes old_string and new_string are Option<String>.
            let old = self.old_string.as_deref()?;
            let new = self.new_string.as_deref()?;
            let diff = TextDiff::from_lines(old, new);

            let mut metrics = DiffMetrics { lines_added: 0, lines_removed: 0, lines_unchanged: 0 };
            for change in diff.iter_all_changes() {
                match change.tag() {
                    ChangeTag::Insert => metrics.lines_added += 1,
                    ChangeTag::Delete => metrics.lines_removed += 1,
                    ChangeTag::Equal => metrics.lines_unchanged += 1,
                }
            }

            Some(metrics)
        }
    }
  3. Update ToolOperation builder (src/models/tool_operation.rs):

    pub fn with_accurate_diff_metrics(mut self, metrics: DiffMetrics) -> Self {
        if let Some(meta) = &mut self.file_metadata {
            meta.lines_added = Some(metrics.lines_added);
            meta.lines_removed = Some(metrics.lines_removed);
            // Keep before/after for total line counts
            meta.lines_before = Some(metrics.lines_removed + metrics.lines_unchanged);
            meta.lines_after = Some(metrics.lines_added + metrics.lines_unchanged);
        }
        self
    }
  4. Update ImportService (src/services/import_service.rs:514-527):

    "Edit" => {
        let parser = EditParser;
        if let Ok(parsed) = parser.parse(tool_use) {
            if let ToolData::Edit(data) = parsed.data {
                operation = operation
                    .with_file_path(data.file_path.clone())
                    .with_file_type(data.is_code_file(), data.is_config_file())
                    .with_edit_flags(data.is_bulk_replacement(), data.is_refactoring());
                
                // Use accurate diff metrics instead of simple line counting
                if let Some(metrics) = data.calculate_diff_metrics() {
                    operation = operation.with_accurate_diff_metrics(metrics);
                }
            }
        }
    }
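
As a dependency-free illustration of what the added/removed/unchanged counts in step 2 represent, the sketch below computes them from the longest common subsequence (LCS) of lines. This is a swapped-in technique for illustration, not the Myers implementation in `similar`, and `LineCounts` and `lcs_line_counts` are hypothetical names. It also assumes `str::lines()` semantics, which drop line terminators, so a missing trailing newline is not counted as a change here; a library diff that keeps terminators in each line may count it differently.

```rust
/// Hypothetical stand-in for the `DiffMetrics` struct proposed in step 2.
#[derive(Debug, PartialEq, Eq)]
pub struct LineCounts {
    pub added: i32,
    pub removed: i32,
    pub unchanged: i32,
}

/// Counts line changes via the longest common subsequence of lines.
/// O(n * m) time and O(m) memory: fine for short edit strings, not for huge files.
pub fn lcs_line_counts(old: &str, new: &str) -> LineCounts {
    let a: Vec<&str> = old.lines().collect();
    let b: Vec<&str> = new.lines().collect();

    // prev[j] holds the LCS length of the old lines seen so far against b[..j].
    let mut prev = vec![0usize; b.len() + 1];
    for line_a in &a {
        let mut curr = vec![0usize; b.len() + 1];
        for (j, line_b) in b.iter().enumerate() {
            curr[j + 1] = if line_a == line_b {
                prev[j] + 1
            } else {
                prev[j + 1].max(curr[j])
            };
        }
        prev = curr;
    }
    let common = prev[b.len()];

    // Lines outside the common subsequence are either removed (old) or added (new).
    LineCounts {
        added: (b.len() - common) as i32,
        removed: (a.len() - common) as i32,
        unchanged: common as i32,
    }
}
```

For example, `lcs_line_counts("a\nb\nc\n", "a\nx\nc\nd\n")` keeps `a` and `c`, so it reports 2 added, 1 removed, and 2 unchanged, whereas the subtraction approach would report 1 added and 0 removed.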

Phase 2: Enhanced Analytics

Once accurate diff metrics are available, add repository queries for:

  • Code churn analysis (lines added + removed per file)
  • Refactoring detection improvements (high churn but similar structure)
  • Modification patterns (mostly additions vs. mostly deletions vs. balanced edits)
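
One possible shape for the churn and modification-pattern analytics is sketched below, operating on in-memory tuples instead of repository queries. Everything here is an assumption made for illustration: the `EditPattern` enum, the `classify` and `churn_per_file` functions, the use of `u32` counts, and the 75% threshold separating "mostly" from "balanced" are not part of the existing code.

```rust
use std::collections::HashMap;

/// Hypothetical classification of a single edit by its add/remove ratio.
#[derive(Debug, PartialEq, Eq)]
pub enum EditPattern {
    MostlyAdditions,
    MostlyDeletions,
    Balanced,
    NoChange,
}

/// Classifies an edit; the 75% threshold is an arbitrary choice for this sketch.
pub fn classify(added: u32, removed: u32) -> EditPattern {
    let total = added + removed;
    if total == 0 {
        return EditPattern::NoChange;
    }
    // Integer form of `added / total >= 0.75`, avoiding floating point.
    if added * 4 >= total * 3 {
        EditPattern::MostlyAdditions
    } else if removed * 4 >= total * 3 {
        EditPattern::MostlyDeletions
    } else {
        EditPattern::Balanced
    }
}

/// Sums churn (lines added + lines removed) per file path.
/// Each input tuple is (file_path, lines_added, lines_removed).
pub fn churn_per_file<'a>(edits: &[(&'a str, u32, u32)]) -> HashMap<&'a str, u32> {
    let mut churn = HashMap::new();
    for &(path, added, removed) in edits {
        *churn.entry(path).or_insert(0) += added + removed;
    }
    churn
}
```

In a real implementation these aggregations would more likely live in repository queries, as the list above suggests; the in-memory version only shows which quantities the accurate diff metrics make available.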

Benefits

  1. Accurate Metrics: Precise tracking of actual code changes
  2. Better Analysis: Improved refactoring detection, code review insights, and productivity metrics
  3. Foundation for Future Features:
    • Time-series analysis of code evolution
    • Identification of frequently modified code sections
    • Better integration with retrospection analysis (see issue #TBD)

Testing Considerations

  • Add unit tests with various edit scenarios:
    • Pure addition (append to file)
    • Pure deletion (remove lines)
    • Mixed edits (add some, remove some, keep some)
    • Complete replacement (no common lines)
    • Whitespace-only changes
  • Verify backward compatibility with existing data (no migration is needed, because only future calculations change)

Related

Priority

Medium - Improves data accuracy, but the system remains functional with the current approach
