Skip to content

Replace Terrapin with BuildExecution to fix process deadlock when output to stderr is fills buffer#53

Merged
gyfelton merged 20 commits intomasterfrom
gyfelton/fix-read-stream-stuck
Mar 13, 2023
Merged

Replace Terrapin with BuildExecution to fix process deadlock when output to stderr is fills buffer#53
gyfelton merged 20 commits intomasterfrom
gyfelton/fix-read-stream-stuck

Conversation

@gyfelton
Copy link
Member

@gyfelton gyfelton commented Mar 8, 2023

Why this change

In our CI cluster, we run into fastclone got stuck because of this bug: thoughtbot/terrapin#5
We verified that after this change it run smoothly after consistently reproducing it.
Our codebase is so big that even git remote update --prune returns a huge output causing this problem to manifest
When we interrupt the stuck process, we also see it stuck at the read stream function as indicated by above PR
Also verbose mode does not redirect output from git operations inside Terrapin to stdout for some reason. Therefore we think this is a good fix.

Also add log when cleaning cache because when lots of CI machiens got cache cleaned, we will throttle the network by trying to populate the cache at the same time. So adding log to warn about this.

  • Fix spec, should have spec failing or need to add some
  • Update version file and Gemfile.lock

# the cache directory entirely. This may cause the current clone to fail, but if the
# underlying error from git is transient it will not affect future clones.
puts '[WARN] Removing the fastclone cache.'
FileUtils.remove_entry_secure(mirror, force: true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should be more thoughtful here and only clear the cache directory in the case of certain failures that we know aren't recoverable, given that removing the cache can lead to thundering herd issues for us.

For example, for the recent issue where we had two case insensitive matching branch names, I don't think clearing the cache helped resolve the issue, at least not until the upstream ref has been deleted. We'd have to experiment to see if clearing the cache was actually necessary to resolve the broken state once the bad ref was removed server side.

error: cannot lock ref 'refs/heads/<branch>': is at <sha1> but expected <sha2>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So right now we clear cache only if we consider it as a retriable error, so if we converge the logic here to the one at the end (rescue block for. with_git_mirror method) at least for above error we face, we won't clear cache and will just fail hard? I think it's totally reasonable as we have it better than before (clear cache blindly)

gyfelton added 2 commits March 9, 2023 16:18
Also use the same retriable logic around another git clone operation
@gyfelton
Copy link
Member Author

gyfelton commented Mar 9, 2023

@justinseanmartin Another look? In my manual test i have every call to fail_on_error/fail_pipe_on_error called with submodules and with/without -v set against git fastclone.

The output generated when no -v is still more than the original very silent operation. But
I also tested adding GIT_ALLOW_PROTOCOL=file at the beginning and indeed it failed with https protocol not supported error
The last thing I need to test is probably the two rescue blocks changed/added: plan is to manually raise error with/without the matching output and verify it indeed clear cache and retry

  • Do above manual raising of error

@gyfelton gyfelton changed the title Fix stuck at handling extra long stderr during clone or remote update Replace Terrapin with BuildExecution to fix process deadlock when output to stderr is fills buffer Mar 10, 2023
Copy link
Collaborator

@justinseanmartin justinseanmartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Few small things, but otherwise LGTM!

@gyfelton gyfelton merged commit 0ac13db into master Mar 13, 2023
@gyfelton gyfelton deleted the gyfelton/fix-read-stream-stuck branch March 13, 2023 16:34
gyfelton added a commit that referenced this pull request Aug 31, 2023
While cd will change where the current dir is, because it happens in a thread it should not impact the subsequent operations. This is also what used be before #53 is landed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants