-
Notifications
You must be signed in to change notification settings - Fork 0
Mismatch between gut and git
#202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Investigation: Why
|
Root Cause: Unicode Normalization Conflict (NFD vs NFC)This analysis was done by Claude. The Real ProblemThe issue was not a corrupted git index, but rather a Unicode normalization conflict between macOS and Git history. The filename
What Was Happening
VerificationHex dump of git index confirmed the duplicate: Solution AppliedCreated a new commit that removes the NFC duplicate, keeping only the NFD version (which matches macOS filesystem behavior): # Created a corrected tree with only NFD version
printf "100755 blob e4e141d4cfb60eb9f01b3308e90dfd92697356df\tUiT-templat-nynorsk-bla\xcc\x8att.kth\n" | git mktree
# Created new commit with fixed tree
git commit-tree <tree-hash> -p HEAD -m "Fix: Remove NFC duplicate of filename with NFD normalization"
# Moved HEAD to new commit
git reset --hard <commit-hash>
# Force-pushed to remote
git push --force-with-leaseResultAfter the fix:
ConclusionThis was not a bug in |
Proposals to Prevent Unicode Normalization Issues in the FutureSuggestions by Claude. Now that we understand the root cause (NFD vs NFC normalization conflicts), here are six proposals to prevent similar issues going forward: 1. Git Configuration:
|
|
@dylanhand wdyt? @albbas? @Phaqui? @flammie? |
Update: Corrected Fix AppliedUpdate by Claude. Issue with Previous FixMy previous fix was incorrect. I created a commit with NFD normalization, but when What went wrong:
Corrected FixCreated a new commit with NFC normalization instead: # Create tree with NFC-normalized filename
printf "100755 blob e4e141d4cfb60eb9f01b3308e90dfd92697356df\tUiT-templat-nynorsk-bl\xc3\xa5tt.kth\n" | git mktree
# Create new commit based on HEAD^ (before wrong NFD commit)
git commit-tree <tree-hash> -p HEAD^ -m "Fix: Use NFC normalization for filename (compatible with core.precomposeUnicode)"
git reset --hard <commit-hash>
git push --force-with-leaseWhy NFC is CorrectWhen
Current StateAfter the corrected fix:
Lesson LearnedThe correct approach is:
|
Scope of the Problem: Widespread NFD Normalization IssuesAnalysis by Claude. DiscoveryAfter investigating other repositories in the organization, we've discovered that the Unicode normalization issue is much more widespread than initially thought. Scale:
ImpactWith
Proposed SolutionsWe need tooling to detect and fix these issues. Here are the options: Option 1: Integrate into
|
|
I agree on using NFC. If we can setup something server side that requires no user setup it can be useful, client side commit hooks and extra programs tend to have limited success and probably block innocent committers from working, Gut doing sanity checks on config values and such sounds like a good idea, if it gets implemented it should just as well be ran unconditionally on every invocation. Some doctoring commands sound useful too. |
|
NFC means "maximally precomposed letters", which is not the same as precomposed letters only. As we know very well there are several languages in our infra that use letters that can only be represented as base+combining diacritic. The question is then how do we most efficiently ensure that NFC is what is stored internally in the git database, so that all OSs can work transparently with all filenames? As
(from: https://www.git-tower.com/help/guides/faq-and-tips/faq/unicode-filenames/mac) So, as Flammie and Claude suggest, we continue with:
@dylanhand @albbas @Phaqui WDYT? |
|
Agree it'd be a good idea to have detection built into Also reasonable to have a separate tool for fixing this. It seems unlikely that we're the first to encounter this issue so I wonder if something already exists. I had a quick search now but so far haven't found anything that looks excellent. |
|
Ah here's a library written in rust for normalizing to NFC: https://github.com/cr0sh/jaso |
|
It is for normalising filenames, not fix the issue in the git index/database. But it is good to know, and can surely be used. |
- Scans all files in git tree for NFD-normalized filenames - Reports files with Unicode decomposed characters - Provides recommendations for fixing issues - Uses unicode-normalization crate to detect non-NFC filenames
- Added progress bar during file checking (using process_with_progress) - Per-owner summaries when using --all-owners - 'All files are correctly encoded' message when no issues found - List issues per owner before final summary - Final summary table with separate Owner and Repository columns - Issues grouped by repo with file counts in per-owner reports
It's the filenames that are checked, not the file contents
Consistent terminology: it's the filenames being checked, not file contents
Now displays which owner was checked, even when using default owner
- health_check: Remove debug code and add comprehensive documentation explaining Unicode normalization limitations (Cyrillic + combining macron has no NFC precomposed form) - git/status: Fix libgit2 status misclassification for NFD filenames * libgit2 reports NFD files as 'deleted' when core.precomposeUnicode=true * Added post-processing to reclassify files that exist on disk as 'modified' * This matches git CLI behavior and fixes the gut/git status mismatch Fixes #202 - gut status now correctly reports 14 modified files instead of 16 deleted + 2 untracked, matching git status output exactly.
Extends health check to detect files with identical names when compared case-insensitively. Critical for case-insensitive filesystems (macOS/Windows) where git-lfs gets confused. Successfully tested on audio-sjd repository - detected all 14 case-duplicate groups.
The command now detects both NFD normalization issues and case-duplicate files, so the help text should reflect both capabilities.
Changed from generic 'filesystem compatibility issues' to explicitly stating the command checks for NFD normalization issues and case-duplicate filenames.
Marked NFD/NFC normalization check and case-duplicate detection as completed. Added new item for case-duplicate filenames check.
Implements check #1 from the TODO list: verify that Git LFS is installed. This warning is shown whenever git-lfs is not available, helping users understand why LFS repositories may not work correctly.
The system health check section now provides: 1. Clear status for each test (✓ passed / ✗ failed) 2. List of all tests performed: - core.precomposeUnicode setting (macOS only) - Git LFS installation 3. Detailed remediation steps only when issues are found This makes it easy to see at a glance which tests passed and which need attention, with clear instructions on how to fix any problems.
Added features: 1. Git version check: Verifies Git >= 1.7.10 is installed 2. Improved core.precomposeUnicode check: Only warns if explicitly set to false (empty/unset is OK as default is true since Git 1.7.10) 3. Display actual values in system health report: - Git version (e.g., 2.52.0) - core.precomposeUnicode value (true, false, or '(default: true)') - Git LFS status (installed / not installed) This makes it easy to see not just pass/fail status but also what is actually configured on the system.
Done, see work in this issue (which has been turned into a PR).
Also done and tested. Seems to work correctly on the cases I tested it on. See this repo: https://github.com/divvun/nfd-fixer |
- Modified print_recommendations() to accept parameters indicating which issues were found - Only display NFD normalization recommendations when NFD issues are detected - Only display case-duplicate recommendations when case-duplicate issues are detected - Updated recommendation to reference nfd-fixer tool instead of jaso - Includes link to https://github.com/divvun/nfd-fixer
- Add owner name to progress message (e.g. 'Checking snomos' instead of just 'Checking') - Add blank line between owners when checking all owners with -a flag - Remove duplicate newline from print_owner_summary to avoid double spacing - Makes output clearer and easier to read when checking multiple owners
- Detect files larger than configurable threshold (default 50 MB) - Separate files matching .gitignore (should be removed) from regular large files (should use LFS) - Display threshold in all reports for transparency - Show distinct styling: red warnings for ignored files, yellow for LFS candidates - Add comprehensive recommendations for both scenarios - Configurable threshold via --large-file-mb parameter - Fix double slash in file paths by trimming trailing slashes
Remove inner parentheses from get_precompose_unicode_value() return values since outer parentheses are added when printing the system configuration.
- Warn if core.autocrlf is set to 'true' on Unix systems (macOS/Linux) - This setting can corrupt binary files and cause false change reports - Recommend using .gitattributes files instead for consistent team behavior - Display autocrlf value in system configuration report - Values 'false', 'input', or unset are all acceptable
- Add LongPathIssue struct to track files with long paths/names - Add configurable thresholds: filename_length_bytes (default 200) and path_length_bytes (default 400) - Integrate length checking into check_repo_for_large_files() for efficiency - Check both filename length and full path length - Return tuple of (large_files, large_ignored_files, long_paths) - Update all data structures to include long_path_issues - Sort long paths by path length (longest first) Next: Add display and recommendation logic
- Show long paths in all summary views (compact/detailed/final) - Display both filename bytes and path bytes separately - Add table view with repository grouping - Include thresholds in warning messages - Add recommendations for shortening paths and filenames - Note Windows compatibility issues and NFD impact - Mark punkt 6 complete in health.rs
|
This PR (was: issue) is now ready for review. What I have done (with a lot of help from Claude) is:
The various default thresholds can be changed with options to the new command. When issues are detected, recommended actions are suggested in the report printed. Comments and reviews welcome 🙂 @dylanhand @albbas @Phaqui @flammie When this is done and merged, the next step is to re-add |
- Describe all file content checks (NFD, case-duplicates, large files, long paths) - List all system configuration checks (Git version, precomposeUnicode, autocrlf, LFS) - Add blank lines between bullet points for better readability - Clarify that large file detection separates .gitignore matches - Note path length limits on Windows (260 char limit) - Emphasize that detailed recommendations are provided
Major improvements: - Add IssueDisplay trait for generic issue handling - Implement count_affected_repos() helper (eliminates 4 duplications) - Implement group_by_repo() helper (eliminates 5 duplications) - Refactor print_owner_summary() to use generic helpers - Refactor print_single_owner_summary() to use generic helpers Benefits: - ~80 lines of duplicated code removed - Easier to add new issue types in the future - More maintainable and consistent output - All tests passing, output unchanged Technical details: - IssueDisplay trait provides repo() and owner() methods - count_affected_repos() counts unique repos with issues - group_by_repo() creates sorted BTreeMap of issues by repo - Each issue type implements IssueDisplay trait
Major improvements: - Add IssueDisplay trait for generic issue handling - Implement count_affected_repos() helper (eliminates 4 duplications) - Implement group_by_repo() helper (eliminates 5 duplications) - Refactor print_owner_summary() to use generic helpers - Refactor print_single_owner_summary() to use generic helpers Benefits: - ~80 lines of duplicated code removed - Easier to add new issue types in the future - More maintainable and consistent output - All tests passing, output unchanged Technical details: - IssueDisplay trait provides repo() and owner() methods - count_affected_repos() counts unique repos with issues - group_by_repo() creates sorted BTreeMap of issues by repo - Each issue type implements IssueDisplay trait
Fixed issues: - Remove unused 'owner' field from all issue structs (5 warnings) - Remove unused 'owner()' method from IssueDisplay trait - Remove unused 'lfs_installed' variable - Fix irrefutable if let pattern (changed to direct conversion) - Remove unused is_git_lfs_installed() function All structs now only contain fields that are actually used. IssueDisplay trait simplified to only repo() method. All code compiles without any warnings. All tests passing.
gut pull -areports some dirty repos for me:In the
divvungiellateknoorg, the repotemplate-keynote-UiTis reported as dirty. Andgut status -o divvungiellateknoreports that one file is deleted:But when checking with
git, nothing seems to be missing or deleted:So what's going on?