Description
Problem
The `.git` directory weighs 177 MB despite roughly 72 MB of working tree content, making fresh clones significantly slower than necessary. A single 164 MB pack file containing 12,525 objects accounts for nearly all of this overhead, and `git count-objects -vH` reports 5,599 non-delta objects, indicating a substantial proportion of binary or incompressible content.
Investigation methodology
The analysis combined several approaches to identify where the bloat originates:
- Measured the `.git` directory size with `du -sh .git` and inspected `ls -lh .git/objects/pack/` to quantify the overhead at 177 MB total, dominated by a single 164 MB pack file.
- Ran `git count-objects -vH` for git's own size accounting, confirming 5,599 non-delta objects, a high count that points to many binary files.
- Enumerated the largest blobs across all history using `git rev-list --objects --all | git cat-file --batch-check` to surface objects that persist in the pack regardless of whether they still exist in the current working tree.
- Scanned the working tree for large files by extension (GIF, ZIP, PNG, tar.bz2, CSV, JSON) using `find` with size sorting to identify current binary bloat.
- Checked `git log --diff-filter=D --summary` to identify large files that were committed and later deleted but still consume space in the object database.
- Verified `.gitattributes` contents and checked `git lfs ls-files`; Git LFS is not configured.
- Measured per-directory sizes with `du -sh */` to understand which areas of the repo contribute most.
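
The blob enumeration step in the list above can be sketched as a small shell function. This is a minimal sketch rather than the exact commands used in the investigation: the function name `largest_blobs`, its two arguments, and the `--batch-check` format string are assumptions added for illustration.

```shell
# Hypothetical helper (name and arguments are illustrative, not from the issue).
# Lists the largest blobs reachable from any ref, including files that were
# deleted from the working tree but are still stored in history.
largest_blobs() {
  repo="${1:-.}"    # path to the repository (default: current directory)
  count="${2:-10}"  # number of entries to print (default: 10)

  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file \
      --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob" {
      size = $3
      # Drop the first three fields so paths containing spaces stay intact.
      sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")
      print size "\t" $0
    }' |
    sort -k1,1nr |
    head -n "$count"
}
```

Calling `largest_blobs . 20` from the repository root prints one line per blob, with the size in bytes followed by the path recorded in history.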
Root cause analysis
The following table lists the sources of bloat in descending order of impact:
| Source | Size | Notes |
|---|---|---|
| Deleted `miniconda.sh` at `3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh` | 95.8 MB | Added in commit d0227b2, deleted in commit 32b3313 (Oct 2023). Still stored in the git object database and downloaded by every clone. This single file accounts for ~54% of the `.git` directory size. |
| Demo GIF files (`automate-smhp-demo.gif`, `automate-smhp-eks-demo.gif`) | ~30 MB | Two animated GIFs used for documentation. |
| Lambda function ZIP (`grafana-service-token-lambda-function.zip`) | 15 MB | Binary archive checked into the tree. |
| PNG screenshots across `0.docs/` and other directories | ~10 MB | Dozens of image files spread across multiple directories. |
| Other binaries (`hwloc-2.9.2-h2bc3f7f_0.tar.bz2`, `slurm-esm1nv-train-102.out`, etc.) | ~7 MB | Miscellaneous archives and log files. |
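
As a rough cross-check of the table, the listed sizes can be summed and compared against the 177 MB `.git` directory. The sketch below hard-codes the figures from the table; the function name `bloat_shares` and the short row labels are illustrative. Because the table mixes exact and approximate sizes, and on-disk pack sizes can differ from object sizes, the percentages are indicative only.

```shell
# Hypothetical helper: prints each source's share of the 177 MB .git directory.
bloat_shares() {
  awk 'BEGIN {
    total = 177  # size of the .git directory in MB, as reported in the issue
    n = split("miniconda.sh=95.8 demo-gifs=30 lambda-zip=15 png-screenshots=10 other-binaries=7", rows, " ")
    for (i = 1; i <= n; i++) {
      split(rows[i], kv, "=")
      sum += kv[2]
      printf "%-16s %6.1f MB %5.1f%%\n", kv[1], kv[2], kv[2] / total * 100
    }
    printf "%-16s %6.1f MB %5.1f%%\n", "listed total", sum, sum / total * 100
  }'
}
```

The listed sources add up to about 158 MB, which is in the same range as the 164 MB pack file.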
Proposed next steps
The most impactful remediation is purging the deleted `miniconda.sh` from history, which alone would reclaim roughly 96 MB. Beyond that, migrating current large binaries to Git LFS and adding guardrails would prevent recurrence.
Specifically:
- Purge `miniconda.sh` from history using `git filter-repo --path 3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh --invert-paths`.
- Migrate large binaries (GIFs, ZIPs, archives) to Git LFS via `git lfs migrate import`.
- Add `.gitattributes` rules to enforce LFS tracking for `*.gif`, `*.zip`, `*.tar.bz2`, `*.png`, and similar extensions.
- The expected result is roughly a 60% reduction in clone size (from 177 MB down to approximately 60–70 MB).
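
The `.gitattributes` step could look like the sketch below. It assumes the four patterns named above and writes the same rule lines that `git lfs track "<pattern>"` produces; the function name `write_lfs_attributes` is hypothetical. The rules only affect files added from then on, so existing history still needs the `git lfs migrate import` step.

```shell
# Hypothetical helper: appends LFS tracking rules to .gitattributes in the
# current directory. The pattern list is taken from the proposal above.
write_lfs_attributes() {
  for pattern in '*.gif' '*.zip' '*.tar.bz2' '*.png'; do
    rule="$pattern filter=lfs diff=lfs merge=lfs -text"
    # Skip rules that are already present so the function can be re-run safely.
    grep -qxF -- "$rule" .gitattributes 2>/dev/null ||
      printf '%s\n' "$rule" >> .gitattributes
  done
}
```

After running it at the repository root, `git check-attr filter -- automate-smhp-demo.gif` can be used to confirm that a given path is covered by the rules.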
Rollout considerations
Because history rewriting is a breaking change for existing clones and forks, the following precautions are warranted:
- Announce to contributors before the history rewrite so they can prepare.
- Combine the `filter-repo` purge and LFS migration into a single coordinated force push to minimize disruption.
- Update contributing guidelines to mention `git lfs install` as a prerequisite for development.
- Provide re-sync instructions for existing forks (e.g., fresh clone or `git fetch --refetch`).
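
For contributors who prefer not to re-clone, the re-sync instructions might resemble the following sketch. It is an assumption-laden example, not a vetted procedure: the function name `resync_after_rewrite` is hypothetical, the remote and branch default to `origin` and `main`, and it uses a plain `git fetch` (the `git fetch --refetch` variant mentioned above can be substituted on Git versions that support it). It discards local commits and uncommitted changes on the target branch, so any local work should be backed up first.

```shell
# Hypothetical helper: realigns an existing clone with rewritten upstream
# history. DESTRUCTIVE: local commits and uncommitted changes on the target
# branch are discarded.
resync_after_rewrite() {
  remote="${1:-origin}"  # remote name (assumed default: origin)
  branch="${2:-main}"    # branch name (assumed default: main)

  # Pick up the force-pushed refs and move the local branch onto them.
  git fetch -q --prune "$remote" || return 1
  git checkout -q "$branch" || return 1
  git reset -q --hard "$remote/$branch" || return 1

  # Expire reflog entries and prune so pre-rewrite objects are dropped locally.
  git reflog expire --expire=now --all || return 1
  git gc -q --prune=now
}
```

Run from inside an existing clone, `resync_after_rewrite origin main` leaves the local branch pointing at the rewritten upstream commit.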