Stats failed: From UTF-8 error: invalid utf-8 sequence of 1 bytes from index 81909 #503

@ysjemmm

I am not a Rust developer. I consulted an AI and it provided some solutions.

Problem Analysis

Error Message

Stats failed: From UTF-8 error: invalid utf-8 sequence of 1 bytes from index 81909

Root Cause

When executing the git-ai stats command, the code attempts to convert Git diff output to a UTF-8 string at the following location:

Location: src/git/repository.rs line 1773

let diff_output = String::from_utf8(output.stdout)?;

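This error is triggered when the diff output contains bytes that are not valid UTF-8. Likely sources in the repository:

  1. Binary files (images, compiled files, archives, etc.) that Git does not detect as binary and therefore diffs as text
  2. Non-UTF-8 encoded text files (such as GBK, Latin-1, etc.)
  3. Files containing invalid UTF-8 sequences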
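To make the failure mode concrete, here is a minimal sketch of the strict conversion rejecting a diff line that carries GBK-encoded text. The helper name `strict_decode` and the sample bytes are made up for illustration; only `String::from_utf8` comes from the code referenced in this issue.

```rust
use std::string::FromUtf8Error;

// Hypothetical helper: mirrors the strict conversion used on the diff output.
fn strict_decode(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

fn main() {
    // A "+" diff line whose content is GBK-encoded text (made-up sample bytes).
    let gbk_line = b"+\xD6\xD0\xCE\xC4\n".to_vec();

    match strict_decode(gbk_line) {
        Ok(text) => println!("decoded: {text}"),
        Err(err) => {
            // The Display output has the same shape as the message in this issue.
            println!("Stats failed: From UTF-8 error: {err}");
            // valid_up_to() is the byte offset of the first invalid sequence.
            println!("first invalid byte at index {}", err.utf8_error().valid_up_to());
        }
    }
}
```

The index in the error message is a byte offset into the whole diff output, so index 81909 points at the first non-UTF-8 byte in the combined diff, not at a position inside a particular file.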

In your case, commit e022db36e2d16b63d8477439451b05599c3da117 likely contains:

  • iOS project binary resource files (.png, .jpg, .xcassets, etc.)
  • Build artifacts or dependency libraries
  • Files with special encodings

Why This Affects the Stats Command

The git-ai stats command execution flow:

  1. Get commit diff statistics
  2. Call diff_added_lines() to parse added line numbers
  3. Fails here: Attempts to convert Git diff output to UTF-8 string
  4. Run git-ai blame on each added line to determine AI attribution
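As a rough illustration of step 2, the sketch below collects added line numbers from the hunk headers of a `git diff -U0` output. The function `parse_added_lines` is a hypothetical stand-in, not the project's actual `parse_diff_added_lines()`; it only shows that the parsing depends on the ASCII header lines and not on the content of the changed lines.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real parser: maps each file to the new-side
/// line numbers announced by its hunk headers ("@@ -a[,b] +c[,d] @@").
/// Sketch only: an added line whose text begins with "++ b/" would be
/// mistaken for a file header here.
fn parse_added_lines(diff: &str) -> HashMap<String, Vec<u32>> {
    let mut result: HashMap<String, Vec<u32>> = HashMap::new();
    let mut current: Option<String> = None;

    for line in diff.lines() {
        if let Some(path) = line.strip_prefix("+++ b/") {
            current = Some(path.to_string());
        } else if line.starts_with("+++ ") {
            // e.g. "+++ /dev/null" for a deleted file
            current = None;
        } else if let Some(rest) = line.strip_prefix("@@ ") {
            let Some(file) = current.as_ref() else { continue };
            let Some(plus) = rest.split(' ').find(|t| t.starts_with('+')) else { continue };
            let mut parts = plus[1..].splitn(2, ',');
            let start: u32 = match parts.next().and_then(|s| s.parse().ok()) {
                Some(v) => v,
                None => continue,
            };
            // A missing count means exactly one line; a count of 0 is a pure deletion.
            let count: u32 = parts.next().map_or(Some(1), |s| s.parse().ok()).unwrap_or(0);
            if count == 0 {
                continue;
            }
            result.entry(file.clone()).or_default().extend(start..start + count);
        }
    }
    result
}

fn main() {
    let diff = "--- a/src/a.rs\n+++ b/src/a.rs\n@@ -3,0 +4,2 @@\n+one\n+two\n";
    println!("{:?}", parse_added_lines(diff));
}
```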

Solutions

Solution 1: Use UTF-8 Lossy Conversion (Recommended)

Change strict UTF-8 conversion to lossy conversion, which replaces each invalid byte sequence with U+FFFD (the Unicode replacement character) instead of returning an error:

// Before (src/git/repository.rs:1773)
let diff_output = String::from_utf8(output.stdout)?;

// After
let diff_output = String::from_utf8_lossy(&output.stdout).into_owned();

Pros:

  • Won't fail due to non-UTF-8 content
  • For diff parsing, replacing invalid bytes should not affect results: only line numbers and filenames are parsed, and the hunk headers that carry the line numbers are plain ASCII
  • Simple and straightforward

Cons:

  • May lose some special character information (but minimal impact on stats command)
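A small sketch of what the lossy conversion does to a diff that contains non-UTF-8 bytes. The helper name `lossy_decode` and the sample diff are assumptions for illustration; `String::from_utf8_lossy` is the standard library call proposed above.

```rust
use std::borrow::Cow;

// Hypothetical helper wrapping the lossy conversion proposed in Solution 1.
fn lossy_decode(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    // Made-up diff fragment: ASCII headers plus one GBK-encoded added line.
    let raw: &[u8] = b"+++ b/notes.txt\n@@ -0,0 +1 @@\n+\xD6\xD0\xCE\xC4\n";
    let text = lossy_decode(raw);

    // Header lines come through unchanged; only the invalid bytes in the
    // content line are replaced with U+FFFD.
    for line in text.lines() {
        println!("{line}");
    }
}
```

When the input is already valid UTF-8, `from_utf8_lossy` returns a borrowed `Cow` and does not copy, so the cost of switching is small for the common case.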

Solution 2: Add Binary File Filtering

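Skip binary files in the Git diff command. Caution: --no-binary does not appear to be a valid git diff option (--binary exists, but it does not seem to have a negated form), so pushing it would most likely make the command fail. Without --binary or --text, git diff already replaces the content of files it detects as binary with a "Binary files ... differ" line, so no extra flag should be needed for those files:

// src/git/repository.rs:1745
// args.push("--no-binary".to_string());  // Not a documented git diff flag; leave it out
args.push("-U0".to_string());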

Pros:

  • Avoids processing binary files at the source
  • Maintains UTF-8 strictness

Cons:

  • Still can't handle non-UTF-8 encoded text files
  • May miss some files that need to be counted

Solution 3: Combined Approach (Best)

Combine Solutions 1 and 2:

// src/git/repository.rs
pub fn diff_added_lines(
    &self,
    from_ref: &str,
    to_ref: &str,
    pathspecs: Option<&HashSet<String>>,
) -> Result<HashMap<String, Vec<u32>>, GitAiError> {
    let mut args = self.global_args_for_exec();
    args.push("diff".to_string());
    args.push("-U0".to_string());
    args.push("--no-color".to_string());
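    // Note: "--no-binary" does not appear to be a valid git diff option, so it is
    // not pushed here. By default git diff already summarizes detected binary
    // files as "Binary files ... differ" instead of emitting their bytes.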
    args.push(from_ref.to_string());
    args.push(to_ref.to_string());

    // ... pathspecs handling ...

    let output = exec_git(&args)?;
    // Use lossy conversion instead of strict conversion
    let diff_output = String::from_utf8_lossy(&output.stdout).into_owned();

    let mut result = parse_diff_added_lines(&diff_output)?;

    // ... subsequent processing ...
}
