Repository clone size is unnecessarily large due to deleted binary in history and untracked large files #959

@KeitaW

Problem

The .git directory weighs 177 MB despite roughly 72 MB of working tree content, making fresh clones significantly slower than necessary. A single 164 MB pack file containing 12,525 objects accounts for nearly all of this overhead. Of those objects, 5,599 are stored non-delta (a statistic printed by git verify-pack -v rather than git count-objects -vH), indicating a substantial proportion of binary or incompressible content.

Investigation methodology

The analysis combined several approaches to identify where the bloat originates:

  • Measured the .git directory size with du -sh .git and inspected ls -lh .git/objects/pack/ to quantify the overhead at 177 MB total, dominated by a single 164 MB pack file.
  • Ran git count-objects -vH for git's own size accounting. The figure of 5,599 non-delta objects is the kind of statistic printed by git verify-pack -v (count-objects does not report it); it is a high count that points to many binary files.
  • Enumerated the largest blobs across all history using git rev-list --objects --all | git cat-file --batch-check to surface objects that persist in the pack regardless of whether they still exist in the current working tree.
  • Scanned the working tree for large files by extension (GIF, ZIP, PNG, tar.bz2, CSV, JSON) using find with size sorting to identify current binary bloat.
  • Checked git log --diff-filter=D --summary to identify large files that were committed and later deleted but still consume space in the object database.
  • Verified .gitattributes contents and checked git lfs ls-files — Git LFS is not configured.
  • Measured per-directory sizes with du -sh */ to understand which areas of the repo contribute most.
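
The blob enumeration step above can be sketched as follows. This is a minimal illustration, not the exact command used in the investigation: the function name largest_blobs and its defaults are hypothetical, and the --batch-check format string is an assumption about how the sizes were extracted.

```shell
# Sketch: list the N largest blobs reachable from any ref, largest first.
# Output columns: size in bytes, then the path reported by rev-list.
largest_blobs() {
    repo=${1:-.}      # repository path (assumed default: current directory)
    count=${2:-10}    # number of rows to print (arbitrary default)
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob" { $1 = ""; sub(/^ /, ""); print }' |
        sort -k1,1nr |
        head -n "$count"
}
```

Calling largest_blobs . 20 from the repository root prints the twenty largest blobs. Because rev-list walks all history, files that were deleted from the working tree still appear here as long as some commit references them.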

Root cause analysis

The following table lists the sources of bloat in descending order of impact:

Source | Size | Notes
Deleted miniconda.sh at 3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh | 95.8 MB | Added in commit d0227b2, deleted in commit 32b3313 (Oct 2023). Still stored in the git object database and downloaded by every clone. This single file accounts for ~54% of the .git directory size.
Demo GIF files (automate-smhp-demo.gif, automate-smhp-eks-demo.gif) | ~30 MB | Two animated GIFs used for documentation.
Lambda function ZIP (grafana-service-token-lambda-function.zip) | 15 MB | Binary archive checked into the tree.
PNG screenshots across 0.docs/ and other directories | ~10 MB | Dozens of image files spread across multiple directories.
Other binaries (hwloc-2.9.2-h2bc3f7f_0.tar.bz2, slurm-esm1nv-train-102.out, etc.) | ~7 MB | Miscellaneous archives and log files.
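
To confirm when a path entered and left the tree (as for miniconda.sh above), something like the following sketch may help. The function name path_add_delete_log is hypothetical; the git log options are standard, but the output format chosen here is only one possibility.

```shell
# Sketch: show the commits on any ref that added (A) or deleted (D) a path.
# Each commit prints "<short hash> <date> <subject>" followed by the status line.
path_add_delete_log() {
    repo=$1    # repository path
    path=$2    # file path relative to the repository root
    git -C "$repo" log --all --diff-filter=AD --name-status \
        --date=short --format='%h %ad %s' -- "$path"
}
```

For example, path_add_delete_log . 3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh should list the add and delete commits named in the table if the history matches the description.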

Proposed next steps

The most impactful remediation is purging the deleted miniconda.sh from history, which alone would reclaim roughly 96 MB. Beyond that, migrating current large binaries to Git LFS and adding guardrails would prevent recurrence.

Specifically:

  1. Purge miniconda.sh from history using git filter-repo --path 3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh --invert-paths.
  2. Migrate large binaries (GIFs, ZIPs, archives) to Git LFS via git lfs migrate import.
  3. Add .gitattributes rules to enforce LFS tracking for *.gif, *.zip, *.tar.bz2, *.png, and similar extensions.

Expected result: roughly a 60% reduction in clone size (from 177 MB down to approximately 60–70 MB).
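
Steps 1 to 3 can be sketched as a dry-run-by-default plan. This is an illustration under assumptions, not a tested migration script: the run helper, the DRY_RUN variable, and the function name purge_and_migrate are hypothetical, the LFS pattern list is copied from step 3, and the final gc and re-measure commands are an added suggestion. It assumes git-filter-repo and git-lfs are installed and that it is run in a fresh clone.

```shell
# Sketch: print (or, with DRY_RUN=0, execute) the history purge and LFS migration.
TARGET='3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh'
LFS_PATTERNS='*.gif,*.zip,*.tar.bz2,*.png'   # taken from step 3; adjust as needed

run() {
    # Only execute when DRY_RUN=0 is set explicitly; otherwise just print.
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; else printf '+ %s\n' "$*"; fi
}

purge_and_migrate() {
    # Step 1: drop the deleted installer from every commit.
    run git filter-repo --path "$TARGET" --invert-paths
    # Step 2: rewrite matching files on all refs into LFS pointers.
    run git lfs install
    run git lfs migrate import --everything --include="$LFS_PATTERNS"
    # Drop unreachable objects locally, then re-measure.
    run git reflog expire --expire=now --all
    run git gc --prune=now
    run git count-objects -vH
}
```

Running purge_and_migrate with no environment changes only prints the commands, which makes it easy to review the plan in the issue before anyone rewrites history. Note that git filter-repo removes the origin remote after rewriting, so the remote has to be re-added before the coordinated force push.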

Rollout considerations

Because history rewriting is a breaking change for existing clones and forks, the following precautions are warranted:

  • Announce to contributors before the history rewrite so they can prepare.
  • Combine the filter-repo purge and LFS migration into a single coordinated force push to minimize disruption.
  • Update contributing guidelines to mention git lfs install as a prerequisite for development.
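  • Provide re-sync instructions for existing clones and forks (e.g., a fresh clone, or git fetch followed by git reset --hard onto the rewritten branch; git fetch --refetch on its own re-downloads objects but does not move local branches off the old history).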
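
The re-sync instructions for an existing clone could look roughly like this. It is a sketch under assumptions: the function name resync_clone is hypothetical, and the remote name origin and branch name main are placeholders that may not match this repository. It discards local commits on the branch, so contributors should back up unpushed work first.

```shell
# Sketch: move an existing clone onto the rewritten history of one branch.
# WARNING: local commits on that branch are discarded by the hard reset.
resync_clone() {
    repo=$1               # path to the existing clone
    remote=${2:-origin}   # assumed remote name
    branch=${3:-main}     # assumed branch name
    git -C "$repo" fetch --prune "$remote" &&
    git -C "$repo" checkout -q "$branch" &&
    git -C "$repo" reset -q --hard "$remote/$branch"
}
```

After resync_clone . succeeds, the old objects remain in the local .git directory until they are pruned, so a contributor who wants the smaller footprint immediately may prefer a fresh clone.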
