Improve repo cloning #51
Open
Small improvement to the repo cloning script to speed up the process and reduce the size of the local repos.

Requires git >=2.49

Repo cloning was performed as part of the training job. Each trajectory would check whether a repo was already cloned and, if not, clone it. This slowed down training a lot.
The existing cloning script already allowed cloning the required repos as a separate, independent step, but it wasn't optimized. Most repos in the datasets are used at several different commits, yet the cloning process retrieved the whole repo history for each instance. In addition, git files weren't shared across instances of the same repo.
This PR refactors the cloning script so that each repo is first fetched into a shared bare repo, at the different commits found in the input dataset. Each commit is then exported to a separate directory for the training run.
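For reviewers who want a picture of the flow, here is a minimal sketch of the fetch-then-export idea. It is not the script in this PR. The field names `repo` and `base_commit`, the GitHub URL pattern, the `bare/` directory layout, the `refs/keep/` namespace, and every function name are assumptions made for illustration. It also assumes the remote allows fetching commits by hash, and it only uses long-standing git commands (`init --bare`, `fetch --depth`, `archive`), so it does not show whatever this PR needs git >=2.49 for.

```python
import io
import subprocess
import tarfile
from collections import defaultdict
from pathlib import Path


def plan_fetches(instances):
    """Group the commits needed per repo, dropping duplicates."""
    commits_by_repo = defaultdict(set)
    for inst in instances:
        # Assumed dataset fields: "repo" ("owner/name") and "base_commit".
        commits_by_repo[inst["repo"]].add(inst["base_commit"])
    return {repo: sorted(commits) for repo, commits in commits_by_repo.items()}


def bare_dir_for(root, repo):
    """Hypothetical layout: one shared bare repo per upstream repo."""
    return Path(root) / "bare" / (repo.replace("/", "__") + ".git")


def fetch_command(bare_dir, url, commits):
    """Build a shallow fetch of exactly the requested commits.

    Each commit is stored under refs/keep/<sha> (an arbitrary namespace
    chosen here) so that it stays reachable in the bare repo.
    """
    refspecs = [f"{c}:refs/keep/{c}" for c in commits]
    return ["git", "--git-dir", str(bare_dir), "fetch", "--depth", "1", url, *refspecs]


def fetch_repo(root, repo, commits):
    """Create the shared bare repo if needed, then fetch the commits into it."""
    bare = bare_dir_for(root, repo)
    if not bare.exists():
        bare.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["git", "init", "--bare", "--quiet", str(bare)], check=True)
    url = f"https://github.com/{repo}.git"  # assumed URL pattern
    subprocess.run(fetch_command(bare, url, commits), check=True)
    return bare


def export_commit(bare, commit, dest):
    """Export one commit's tree (no .git directory) into dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    archive = subprocess.run(
        ["git", "--git-dir", str(bare), "archive", "--format=tar", commit],
        check=True,
        stdout=subprocess.PIPE,
    ).stdout
    with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
        tar.extractall(dest)
```

A driver would call `plan_fetches` on the dataset rows, then `fetch_repo` once per repo and `export_commit` once per commit. Because the exported directories contain only the working tree files, the git objects live in one place per repo instead of being duplicated per instance.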
With the SWE-bench Lite dataset, this PR reduces the total size of the cloned repos from ~75GB to ~12GB, and cuts the download time from tens of minutes (just under an hour, if I remember correctly from previous testing) to just under a minute.