Fix concurrent publish operations causing missing package files #1511
base: master
Conversation
Force-pushed from 76a7b8f to 8a47888
Sounds like my new test is timing out (it takes longer than 2 minutes, depending on the load of the machine). @iofq or @neolynx, would you mind taking a look? I can try to fix the timeout later, but an early review would be really appreciated. I found these bugs after some stress testing I did, prompted by production bugs we were hitting at random.
Force-pushed from 8a47888 to a9591c7
Force-pushed from a9591c7 to 47b362c
The diff under review (Makefile):

```diff
 	@mkdir -p /tmp/aptly-etcd-data; system/t13_etcd/start-etcd.sh > /tmp/aptly-etcd-data/etcd.log 2>&1 &
 	@echo "\e[33m\e[1mRunning go test ...\e[0m"
-	faketime "$(TEST_FAKETIME)" go test -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
+	faketime "$(TEST_FAKETIME)" go test -timeout 20m -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
```
Tests should be run with -race to detect race conditions
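Presumably that suggestion amounts to something like the following in the same recipe (a sketch, not part of the submitted diff; it keeps the `-timeout 20m` change from above):

```make
	faketime "$(TEST_FAKETIME)" go test -race -timeout 20m -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
```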
@agustinhenze I cannot get the unit tests to work... looks like there is a deadlock now, the evil twin of race conditions... I wonder whether such long-running tests would not be better implemented as system tests, exercising the API and the command line. Did I get this right: the tests should run things concurrently for a while and not lose files?
It hangs in TestIdenticalPackageRace:430, I believe. Here are the logs, and here is the backtrace after the timeout:
I think this assumption is wrong. The task list is unlocked while the task is still running, so multiple publish tasks acquire the database and run concurrently, creating chaos and deadlocks. A bit more logging reveals: apiReposPackageFromDir should not modify the database in parallel tasks... what do you think?
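For reference, a minimal sketch (hypothetical names, not aptly's actual task package) of the kind of resource-lock serialization being discussed: a task starts only once every resource it declares is free, and it takes all of them atomically.

```go
package main

import (
	"fmt"
	"sync"
)

// resourceLocks serializes tasks that declare overlapping resource sets.
// Hypothetical sketch; aptly's real task system differs in detail.
type resourceLocks struct {
	mu    sync.Mutex
	cond  *sync.Cond
	taken map[string]bool
}

func newResourceLocks() *resourceLocks {
	r := &resourceLocks{taken: map[string]bool{}}
	r.cond = sync.NewCond(&r.mu)
	return r
}

// acquire blocks until every named resource is free, then takes all of them
// atomically, so partially overlapping tasks cannot interleave.
func (r *resourceLocks) acquire(resources []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for r.anyTaken(resources) {
		r.cond.Wait() // releases mu while waiting, reacquires before returning
	}
	for _, res := range resources {
		r.taken[res] = true
	}
}

func (r *resourceLocks) anyTaken(resources []string) bool {
	for _, res := range resources {
		if r.taken[res] {
			return true
		}
	}
	return false
}

func (r *resourceLocks) release(resources []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, res := range resources {
		delete(r.taken, res)
	}
	r.cond.Broadcast()
}

func main() {
	locks := newResourceLocks()
	var wg sync.WaitGroup
	// Two publishes into the same prefix declare the same resource, so the
	// second waits instead of racing the first one's pool cleanup.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			res := []string{"publish:filesystem:debian"}
			locks.acquire(res)
			defer locks.release(res)
			fmt.Printf("publish %d holds the prefix exclusively\n", n)
		}(i)
	}
	wg.Wait()
}
```

With lock sets like these, two publish tasks touching the same prefix, or a publish and a database-modifying API call such as apiReposPackageFromDir, would be forced to run one after the other instead of interleaving.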
When multiple repository operations execute concurrently on shared pool directories, race conditions could cause .deb files to be deleted despite appearing in repository metadata, resulting in apt 404 errors.
Three distinct but related race conditions were identified and fixed:
1. Package addition vs publish race: When packages are added to a local repository that is already published, the publish operation could read stale package references before the add transaction commits. Fixed by locking all published repositories that reference the local repo during package addition.

2. Pool file deletion race: When multiple published repositories share the same pool directory (same storage+prefix) and publish concurrently, cleanup operations could delete each other's newly created files. The cleanup in thread B would:
   - query the database for referenced files (not seeing thread A's uncommitted files)
   - scan the pool directory (seeing thread A's files)
   - delete thread A's files as "orphaned"
   Fixed by implementing pool-sibling locking: acquire locks on ALL published repositories sharing the same storage and prefix before publish/cleanup.

3. Concurrent cleanup on same prefix: Multiple distributions publishing to the same prefix concurrently could have cleanup operations delete shared files. Fixed by:
   - adding prefix-level locking to serialize cleanup operations
   - removing ref subtraction that incorrectly marked shared files as orphaned
   - forcing a database reload before cleanup to see recent commits
The existing task system serializes operations based on resource locks, preventing these race conditions when proper lock sets are acquired.
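As a rough illustration of the locking described above (hypothetical types and helper names, not aptly's actual API), the lock sets could be assembled along these lines:

```go
package lockset

// PublishedRepo is a hypothetical, simplified view of a published repository:
// enough to decide which publishes share a pool and which read a local repo.
type PublishedRepo struct {
	Storage, Prefix, Distribution string
	Sources                       []string // local repos this publish reads from
}

func lockName(p PublishedRepo) string {
	return "publish:" + p.Storage + ":" + p.Prefix + ":" + p.Distribution
}

// publishLockSet returns the resources a publish/cleanup task must hold:
// a prefix-level lock to serialize cleanup, plus a lock for every published
// repository sharing the same storage and prefix (its pool siblings),
// including the target itself.
func publishLockSet(target PublishedRepo, all []PublishedRepo) []string {
	locks := []string{"prefix:" + target.Storage + ":" + target.Prefix}
	for _, p := range all {
		if p.Storage == target.Storage && p.Prefix == target.Prefix {
			locks = append(locks, lockName(p))
		}
	}
	return locks
}

// repoAddLockSet returns the resources a package-addition task must hold:
// the local repo itself plus every published repository that references it,
// so a concurrent publish cannot read stale package references mid-commit.
func repoAddLockSet(localRepo string, all []PublishedRepo) []string {
	locks := []string{"repo:" + localRepo}
	for _, p := range all {
		for _, src := range p.Sources {
			if src == localRepo {
				locks = append(locks, lockName(p))
				break
			}
		}
	}
	return locks
}
```

Acquiring the whole set before the task runs is what makes the cleanup safe: thread B cannot scan the pool and mark files as orphaned while thread A's files are still uncommitted.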
Test coverage includes concurrent publish scenarios that reliably reproduced all three bugs before the fixes.
Checklist
- AUTHORS