Collaborator

@taha-yassine taha-yassine commented Dec 29, 2025

Small improvement to the repo cloning script to speed up the process and reduce the size of the local repos.

Requires git >=2.49

Previously, repo cloning was performed as part of the training job: each trajectory checked whether its repo was already cloned and, if not, cloned it. This slowed down training a lot.

The existing cloning script already allowed cloning the required repos as a separate, independent step, but it wasn't optimized. Most repos in the datasets are used at several different commits, yet the cloning process retrieved the whole repo history for each instance. In addition, git files weren't shared across instances of the same repo.

This PR refactors the cloning script so that each repo is first fetched, at only the commits that appear in the input dataset, into a shared bare repo. Each commit is then exported to a separate directory for the training run.
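For readers who want a feel for the flow, here is a minimal sketch of the idea. It is not the code in this PR. The function names, the `repo` / `base_commit` instance fields, the `owner__name.git` cache layout, and the use of a shallow `git fetch` followed by `git archive` are all assumptions made for illustration. In particular, the sketch does not rely on any Git 2.49 feature, so the actual script likely does this differently.

```python
import subprocess
import tarfile
import tempfile
from collections import defaultdict
from pathlib import Path


def group_commits_by_repo(instances):
    """Map each repo to the sorted, de-duplicated commits used by the dataset.

    Assumes each instance is a dict with "repo" (e.g. "owner/name") and
    "base_commit" (full SHA) keys; these field names are an assumption.
    """
    grouped = defaultdict(set)
    for instance in instances:
        grouped[instance["repo"]].add(instance["base_commit"])
    return {repo: sorted(commits) for repo, commits in grouped.items()}


def bare_repo_path(cache_dir, repo):
    """Hypothetical layout: one shared bare repo per upstream repo."""
    return Path(cache_dir) / (repo.replace("/", "__") + ".git")


def fetch_command(bare_dir, repo, commits):
    """Build a shallow fetch of only the requested commits (no full history)."""
    url = f"https://github.com/{repo}.git"  # assumes the repos live on GitHub
    return ["git", "-C", str(bare_dir), "fetch", "--depth=1", url, *commits]


def export_commit(bare_dir, commit, dest):
    """Write the tree of `commit` into `dest` without any .git metadata."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "snapshot.tar"
        subprocess.run(
            ["git", "-C", str(bare_dir), "archive", "--format=tar",
             f"--output={archive}", commit],
            check=True,
        )
        with tarfile.open(archive) as tar:
            tar.extractall(dest)


def prepare_repos(instances, cache_dir, export_dir):
    """Fetch every needed commit once per repo, then export each commit."""
    for repo, commits in group_commits_by_repo(instances).items():
        bare_dir = bare_repo_path(cache_dir, repo)
        if not bare_dir.exists():
            bare_dir.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(
                ["git", "init", "--bare", "--quiet", str(bare_dir)], check=True
            )
        subprocess.run(fetch_command(bare_dir, repo, commits), check=True)
        for commit in commits:
            target = Path(export_dir) / repo.replace("/", "__") / commit
            export_commit(bare_dir, commit, target)
```

Grouping first means each upstream repo is contacted once regardless of how many instances reference it, and the shallow fetch avoids downloading history that the training run never reads.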

With the SWE-bench Lite dataset, this PR reduces the total size of the cloned repos from ~75GB to ~12GB and cuts the download time from several minutes (just under an hour, if I remember correctly from previous testing) to just under a minute.
