Conversation

@agustinhenze
Contributor

When multiple repository operations execute concurrently on shared pool directories, race conditions could cause .deb files to be deleted despite appearing in repository metadata, resulting in apt 404 errors.

Three distinct but related race conditions were identified and fixed:

  1. Package addition vs publish race: When packages are added to a local repository that is already published, the publish operation could read stale package references before the add transaction commits. Fixed by locking all published repositories that reference the local repo during package addition.

  2. Pool file deletion race: When multiple published repositories share the same pool directory (same storage+prefix) and publish concurrently, cleanup operations could delete each other's newly created files. The cleanup in thread B would:

    • Query database for referenced files (not seeing thread A's uncommitted files)
    • Scan pool directory (seeing thread A's files)
    • Delete thread A's files as "orphaned"

    Fixed by implementing pool-sibling locking: acquire locks on ALL published
    repositories sharing the same storage and prefix before publish/cleanup.

  3. Concurrent cleanup on same prefix: Multiple distributions publishing to the same prefix concurrently could have cleanup operations delete shared files. Fixed by:

    • Adding prefix-level locking to serialize cleanup operations
    • Removing ref subtraction that incorrectly marked shared files as orphaned
    • Forcing database reload before cleanup to see recent commits

The existing task system serializes operations based on resource locks, preventing these race conditions when proper lock sets are acquired.

Test coverage includes concurrent publish scenarios that reliably reproduced all three bugs before the fixes.

Checklist

  • unit-test added
  • functional test added/updated (if change is functional)
  • author name in AUTHORS

@agustinhenze agustinhenze force-pushed the fix-concurrent-publish-race-conditions branch 3 times, most recently from 76a7b8f to 8a47888 Compare December 4, 2025 18:15
@agustinhenze
Contributor Author

Sounds like my new test is timing out (it takes longer than 2 minutes, depending on the load of the machine). @iofq or @neolynx, would you mind taking a look? I can try to fix the timeout later, but an early review would be really appreciated.

I found these bugs through stress testing I did after we hit random production failures.

@agustinhenze agustinhenze force-pushed the fix-concurrent-publish-race-conditions branch from 8a47888 to a9591c7 Compare December 4, 2025 18:41
@agustinhenze agustinhenze force-pushed the fix-concurrent-publish-race-conditions branch from a9591c7 to 47b362c Compare December 4, 2025 19:18
@neolynx neolynx self-assigned this Dec 4, 2025
 	@mkdir -p /tmp/aptly-etcd-data; system/t13_etcd/start-etcd.sh > /tmp/aptly-etcd-data/etcd.log 2>&1 &
 	@echo "\e[33m\e[1mRunning go test ...\e[0m"
-	faketime "$(TEST_FAKETIME)" go test -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
+	faketime "$(TEST_FAKETIME)" go test -timeout 20m -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
Contributor


Tests should be run with -race to detect race conditions
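For reference, the race detector flag would slot into the existing rule like this (a sketch against the Make recipe quoted above; only -race is new, and the detector typically slows tests by several times, so the longer timeout matters even more):

```makefile
	faketime "$(TEST_FAKETIME)" go test -race -timeout 20m -v ./... -gocheck.v=true -check.f "$(TEST)" -coverprofile=unit.out; echo $$? > .unit-test.ret
```

Note that -race catches unsynchronized memory access within a process; the logical races this PR fixes (stale database reads, cross-process file deletion) still need the concurrency stress tests to reproduce.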

@neolynx
Member

neolynx commented Dec 28, 2025

@agustinhenze I cannot get the unit tests to work... looks like there is a deadlock now, the evil twin of race conditions...

I wonder whether such long-running tests would be better implemented as system tests, exercising the API and the command line.

Did I get this right: the tests should run things concurrently for a while and not lose files?

@neolynx
Member

neolynx commented Dec 29, 2025

It hangs in TestIdenticalPackageRace:430, I believe; here are the logs:

Iteration 2/4
[GIN] 2025/12/29 - 16:19:50 | 201 |   20.537467ms |                 | POST     "/api/repos"
[GIN] 2025/12/29 - 16:19:50 | 201 |   41.633654ms |                 | POST     "/api/publish/identical"
[GIN] 2025/12/29 - 16:19:50 | 201 |   14.738816ms |                 | POST     "/api/repos"
[GIN] 2025/12/29 - 16:19:50 | 201 |  124.126469ms |                 | POST     "/api/publish/identical"
[GIN] 2025/12/29 - 16:19:50 | 200 |   36.300477ms |                 | PUT      "/api/publish/identical/dist-a-1"
[iter 1] Publish A complete: 200
[GIN] 2025/12/29 - 16:19:50 | 200 |   47.507265ms |                 | PUT      "/api/publish/identical/dist-b-1"
[iter 1] Publish B complete: 200
[GIN] 2025/12/29 - 16:19:50 | 200 |   67.549527ms |                 | POST     "/api/repos/identical-b-1/file/identical-upload-2-1?noRemove=0"
[iter 1] Import B complete: 200
***hangs***

Here is the backtrace after the timeout:

coverage: 23.9% of statements
panic: test timed out after 2m0s
        running tests:
                Test (2m0s)

goroutine 1491 [running]:
testing.(*M).startAlarm.func1()
        /usr/lib/go-1.24/src/testing/testing.go:2484 +0x394
created by time.goFunc
        /usr/lib/go-1.24/src/time/sleep.go:215 +0x2d

goroutine 1 [chan receive, 2 minutes]:
testing.(*T).Run(0xc000602c40, {0x18aef75?, 0xc000051b30?}, 0x195d5f8)
        /usr/lib/go-1.24/src/testing/testing.go:1859 +0x431
testing.runTests.func1(0xc000602c40)
        /usr/lib/go-1.24/src/testing/testing.go:2279 +0x37
testing.tRunner(0xc000602c40, 0xc000051c70)
        /usr/lib/go-1.24/src/testing/testing.go:1792 +0xf4
testing.runTests(0xc000406060, {0x2f03430, 0x1, 0x1}, {0x2f57640?, 0x7?, 0x2f550e0?})
        /usr/lib/go-1.24/src/testing/testing.go:2277 +0x4b4
testing.(*M).Run(0xc0004fa000)
        /usr/lib/go-1.24/src/testing/testing.go:2142 +0x64a
main.main()
        _testmain.go:55 +0x9b

goroutine 68 [chan receive, 1 minutes]:
gopkg.in/check%2ev1.(*suiteRunner).runTest(...)
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/check.go:813
gopkg.in/check%2ev1.(*suiteRunner).run(0xc000313000)
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/check.go:618 +0x1ba
gopkg.in/check%2ev1.Run({0x17f1140?, 0xc000110180?}, 0x2332a10?)
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/run.go:92 +0x25
gopkg.in/check%2ev1.RunAll(0xc00008aef8)
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/run.go:84 +0x75
gopkg.in/check%2ev1.TestingT(0xc000603340)
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/run.go:72 +0x25e
github.com/aptly-dev/aptly/api.Test(0xc000603340?)
        /work/src/api/api_test.go:24 +0x13
testing.tRunner(0xc000603340, 0x195d5f8)
        /usr/lib/go-1.24/src/testing/testing.go:1792 +0xf4
created by testing.(*T).Run in goroutine 1
        /usr/lib/go-1.24/src/testing/testing.go:1851 +0x413

goroutine 196 [select, 1 minutes]:
gopkg.in/check%2ev1.(*resultTracker)._loopRoutine(0xc000458750)
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/check.go:464 +0x93
created by gopkg.in/check%2ev1.(*resultTracker).start in goroutine 68
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/check.go:444 +0x4f

goroutine 198 [chan receive]:
github.com/aptly-dev/aptly/api.acquireDatabase()
        /work/src/api/api.go:110 +0x65
created by github.com/aptly-dev/aptly/api.Router in goroutine 197
        /work/src/api/router.go:91 +0x679

goroutine 1401 [sync.WaitGroup.Wait, 1 minutes]:
sync.runtime_SemacquireWaitGroup(0xc00109d440?)
        /usr/lib/go-1.24/src/runtime/sema.go:110 +0x25
sync.(*WaitGroup).Wait(0xc001007b20?)
        /usr/lib/go-1.24/src/sync/waitgroup.go:118 +0x48
github.com/aptly-dev/aptly/api.(*PublishedFileMissingSuite).TestIdenticalPackageRace(0xc000110180, 0xc000e361e0)
        /work/src/api/published_file_missing_test.go:460 +0xb78
reflect.Value.call({0x17f1140?, 0xc000110180?, 0xc00008fe28?}, {0x18af145, 0x4}, {0xc00008fef8, 0x1, 0x671d40827aa18?})
        /usr/lib/go-1.24/src/reflect/value.go:584 +0xca6
reflect.Value.Call({0x17f1140?, 0xc000110180?, 0xc00004a940?}, {0xc00008fef8?, 0x0?, 0xc00008fe96?})
        /usr/lib/go-1.24/src/reflect/value.go:368 +0xb9
gopkg.in/check%2ev1.(*suiteRunner).forkTest.func1(0xc000e361e0)
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/check.go:775 +0x535
gopkg.in/check%2ev1.(*suiteRunner).forkCall.func1()
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/check.go:669 +0x88
created by gopkg.in/check%2ev1.(*suiteRunner).forkCall in goroutine 68
        /work/src/.go/pkg/mod/gopkg.in/check.v1@v1.0.0-20201130134442-10cb98267c6c/check.go:666 +0x235

goroutine 227 [select]:
github.com/aptly-dev/aptly/task.(*List).consumer(0xc000447c80)
        /work/src/task/list.go:45 +0x7a
created by github.com/aptly-dev/aptly/task.NewList in goroutine 199
        /work/src/task/list.go:38 +0x1ee

goroutine 1554 [select]:
github.com/syndtr/goleveldb/leveldb.(*DB).compactionError(0xc0004bb880)
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/db_compaction.go:92 +0x149
created by github.com/syndtr/goleveldb/leveldb.openDB in goroutine 198
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/db.go:148 +0x447

goroutine 1555 [select]:
github.com/syndtr/goleveldb/leveldb.(*DB).mpoolDrain(0xc0004bb880)
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/db_state.go:101 +0x9c
created by github.com/syndtr/goleveldb/leveldb.openDB in goroutine 198
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/db.go:149 +0x485

goroutine 1557 [select]:
github.com/syndtr/goleveldb/leveldb.(*DB).mCompaction(0xc0004bb880)
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/db_compaction.go:782 +0xf3
created by github.com/syndtr/goleveldb/leveldb.openDB in goroutine 198
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/db.go:158 +0x547

goroutine 1556 [select]:
github.com/syndtr/goleveldb/leveldb.(*DB).tCompaction(0xc0004bb880)
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/db_compaction.go:845 +0x6b7
created by github.com/syndtr/goleveldb/leveldb.openDB in goroutine 198
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/db.go:157 +0x50b

goroutine 1550 [sync.WaitGroup.Wait, 1 minutes]:
sync.runtime_SemacquireWaitGroup(0xc00067f3d8?)
        /usr/lib/go-1.24/src/runtime/sema.go:110 +0x25
sync.(*WaitGroup).Wait(0x16cc3e0?)
        /usr/lib/go-1.24/src/sync/waitgroup.go:118 +0x48
github.com/aptly-dev/aptly/task.(*List).WaitForTaskByID(0xc000447c80, 0x60)
        /work/src/task/list.go:242 +0xf6
github.com/aptly-dev/aptly/api.maybeRunTaskInBackground(0xc006a12800, {0xc000fb2480, 0x40}, {0xc0003f9000, 0x3, 0x4}, 0xc004e04d20)
        /work/src/api/api.go:217 +0x3a5
github.com/aptly-dev/aptly/api.apiReposPackageFromDir(0xc006a12800)
        /work/src/api/repos.go:540 +0xbac
github.com/gin-gonic/gin.(*Context).Next(...)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174
github.com/aptly-dev/aptly/api.Router.func2(0xc006a12800)
        /work/src/api/router.go:113 +0x13f
github.com/gin-gonic/gin.(*Context).Next(0xc006a12800)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 +0x2b
github.com/aptly-dev/aptly/api.Router.ErrorLogger.ErrorLoggerT.func8(0xc006a12800)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/logger.go:172 +0x25
github.com/gin-gonic/gin.(*Context).Next(...)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174
github.com/gin-gonic/gin.CustomRecoveryWithWriter.func1(0xc006a12800)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/recovery.go:102 +0x6f
github.com/gin-gonic/gin.(*Context).Next(...)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174
github.com/gin-gonic/gin.LoggerWithConfig.func1(0xc006a12800)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/logger.go:240 +0xdd
github.com/gin-gonic/gin.(*Context).Next(...)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174
github.com/gin-gonic/gin.(*Engine).handleHTTPRequest(0xc0000deea0, 0xc006a12800)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620 +0x64e
github.com/gin-gonic/gin.(*Engine).ServeHTTP(0xc0000deea0, {0x1b68140, 0xc0003acd00}, 0xc000141040)
        /work/src/.go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576 +0x1aa
github.com/aptly-dev/aptly/api.(*PublishedFileMissingSuite).httpRequest(0xc000110180, 0xc000e361e0, {0x18aeffd, 0x4}, {0xc000fb3d80, 0x3d}, {0x0, 0x0, 0x0})
        /work/src/api/published_file_missing_test.go:132 +0x2d1
github.com/aptly-dev/aptly/api.(*PublishedFileMissingSuite).TestIdenticalPackageRace.func1()
        /work/src/api/published_file_missing_test.go:433 +0x16a
created by github.com/aptly-dev/aptly/api.(*PublishedFileMissingSuite).TestIdenticalPackageRace in goroutine 1401
        /work/src/api/published_file_missing_test.go:430 +0x950

goroutine 1527 [select]:
github.com/syndtr/goleveldb/leveldb.(*session).refLoop(0xc0001e11d0)
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/session_util.go:189 +0x51c
created by github.com/syndtr/goleveldb/leveldb.newSession in goroutine 198
        /work/src/.go/pkg/mod/github.com/syndtr/goleveldb@v1.0.1-0.20220721030215-126854af5e6d/leveldb/session.go:93 +0x296
FAIL    github.com/aptly-dev/aptly/api  120.054s
FAIL

@neolynx
Member

neolynx commented Dec 30, 2025

The existing task system serializes operations based on resource locks, preventing these race conditions when proper lock sets are acquired.

I think this assumption is wrong. The task list is unlocked while the task is still running, so multiple publish tasks acquire the database and run concurrently, creating chaos and deadlocks.

A bit more logging reveals:

[iter 0] Import A
[iter 0] Import B
0xc00007e2b8: apiReposPackageFromDir 1
0xc000520258: apiReposPackageFromDir 1
0xc000520258: apiReposPackageFromDir 7
INF Executing task synchronously
runTaskInBackground
0xc000520440: tasklist::RunTaskInBackground lock
0xc000520440: tasklist::RunTaskInBackground locked
0xc000520440: tasklist::RunTaskInBackground unlock
acquireDatabaseConnection
apiReposPackageFromDir 8
0xc00007e2b8: apiReposPackageFromDir 7
INF Executing task synchronously
runTaskInBackground
0xc00007e370: tasklist::RunTaskInBackground lock
0xc00007e370: tasklist::RunTaskInBackground locked
0xc00007e370: tasklist::RunTaskInBackground unlock
acquireDatabaseConnection
apiReposPackageFromDir 8
0xc000520258: apiReposPackageFromDir 13
0xc00007e2b8: apiReposPackageFromDir 13
0xc000520258: apiReposPackageFromDir 14
0xc00007e2b8: apiReposPackageFromDir 14
0xc000520258: apiReposPackageFromDir 15
0xc000520258: apiReposPackageFromDir 16
releaseDatabaseConnection
0xc00007e2b8: apiReposPackageFromDir 15
0xc00007e2b8: apiReposPackageFromDir 16
releaseDatabaseConnection
0xc000520258: apiReposPackageFromDir done
0xc00007e2b8: apiReposPackageFromDir done

apiReposPackageFromDir should not modify the database in parallel tasks...

What do you think?
