Improve repo cloning #51

taha-yassine · 2025-12-29T17:53:37Z

~~Small improvement to the repo cloning script to speed up the process and reduce the local repos sizes.~~

~~Requires git >=2.49~~

Repo cloning was performed as part of the training job. Each trajectory would check whether a repo was already cloned and, if not, clone it. This slowed down training a lot.

The existing cloning script allowed cloning the required repos as a separate, independent step, but it wasn't optimized. Most repos in the datasets are used at several different commits, yet the cloning process retrieved the whole repo history for each instance. In addition, git files weren't shared across instances of the same repo.

This PR refactors the cloning script so that each repo is first fetched, at the different commits found in the input dataset, into a shared bare repo. Each commit is then exported to a separate directory for the training run.
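For illustration, the approach described above could be sketched roughly as follows. This is not the code from this PR: the function names, the directory layout, the use of `git worktree` for the export step, and the shallow fetch are all assumptions, as are the `repo` and `base_commit` field names (chosen to match SWE-bench-style instances).

```python
import subprocess
from collections import defaultdict
from pathlib import Path


def group_commits(instances):
    """Map each repo to the sorted, de-duplicated commits the dataset needs."""
    commits = defaultdict(set)
    for inst in instances:
        # Field names are an assumption (SWE-bench-style records).
        commits[inst["repo"]].add(inst["base_commit"])
    return {repo: sorted(shas) for repo, shas in commits.items()}


def build_commands(repo, shas, root):
    """Return git commands for one repo: one shared bare repo, one export per commit.

    The directory layout under `root` is hypothetical.
    """
    root = Path(root)
    slug = repo.replace("/", "__")
    bare = root / "bare" / f"{slug}.git"
    url = f"https://github.com/{repo}.git"
    cmds = [
        ["git", "init", "--bare", str(bare)],
        # Fetch only the needed commits, shallowly; objects live once in the bare repo.
        ["git", "-C", str(bare), "fetch", "--depth", "1", url, *shas],
    ]
    for sha in shas:
        dest = root / "worktrees" / slug / sha
        # Assumed export mechanism: a detached worktree backed by the bare repo.
        cmds.append(
            ["git", "-C", str(bare), "worktree", "add", "--detach", str(dest), sha]
        )
    return cmds


def run(cmds):
    """Execute the commands in order, stopping on the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)
```

A caller would build the plan with `group_commits(dataset)` and then call `run(build_commands(repo, shas, root))` for each repo. Keeping command construction separate from execution makes the plan easy to inspect or dry-run before anything touches the network.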

With the SWE-bench Lite dataset, this PR reduces the total size of the cloned repos from ~75GB to ~12GB and cuts the download time from several minutes (just under an hour, if I remember correctly from previous testing) to just under a minute.

taha-yassine added 4 commits December 29, 2025 18:52

Improve repo cloning

ffa58c4

Improve repo cloning cont

d1098f7

Refactor repo cloning

13b2fd0

Adapt training to new cloned repos structure

21571c9

2 participants
