tbs: Fix potential data race #19948
Conversation
This pull request does not have a backport label. Could you fix it @ericywl? 🙏
Thanks. The setup looks about right, but the specific case that needs to be tested isn't the one I have in mind.
In this PR, there are txn1 and txn2, both part of trace1, which doesn't make sense to me: a trace should only have one root transaction. In your current setup, what should happen to txn2 is undefined.
To be clearer, what I'd like to test is similar: txn1 is the root transaction of trace1, and txn2 is a child of txn1.
At t1: APM Server receives txn1
At t2: background sampling goroutine makes the sampling decision for txn1
At t2': APM Server receives txn2
At t3: background sampling goroutine marks trace1 as sampled
The above is a race: APM Server receives txn2 between t2 and t3, and as a result txn2 is lost forever. If txn2 arrives either before t2 or after t3, it is exported correctly. There's a rough sketch of the two events below.
It gets a bit theoretical, but I believe it is possible. Let me know if you have any questions.
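For illustration, here is how those two events might be modelled, using the modelpb shapes this PR's tests already use. This is a sketch only; delivering txn2 between t2 and t3 is the part a real test would have to arrange, and the helper name is hypothetical.

```go
// Sketch: txn1 is the root transaction of trace1, txn2 is its child.
func makeTrace1Events() (txn1, txn2 modelpb.Batch) {
	trace := modelpb.Trace{Id: "trace1"}
	txn1 = modelpb.Batch{{
		Trace:       &trace,
		Transaction: &modelpb.Transaction{Id: "txn1"}, // root: no ParentId
	}}
	txn2 = modelpb.Batch{{
		Trace:       &trace,
		Transaction: &modelpb.Transaction{Id: "txn2"},
		ParentId:    "txn1", // child of txn1, received at t2'
	}}
	return txn1, txn2
}
```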
carsonip
left a comment
qq: should the test pass or fail here? I assume when the test passes, it means a race happened, right? If so, I think the test is correctly validating the race in its current state
carsonip
left a comment
thanks, the approach looks good now
carsonip
left a comment
In the benchmarks, do you mind also running at a higher GOMAXPROCS, e.g. -cpu=14,100, and seeing if it makes any difference?
carsonip
left a comment
I'm terribly sorry. Yes, the PR in its existing state will address the new race conditions introduced in 9.0 due to the lack of DB transactions, but correct me if I'm wrong: theoretically there's another type of race that is inherent in the RW design and also present in 8.x, where:
- goroutine A ProcessBatch: IsTraceSampled(traceID) returns ErrNotFound
- background goroutine B responsible for publishing: WriteTraceSampled(traceID, true)
- background goroutine B responsible for publishing: ReadTraceEvents(traceID)
- goroutine A ProcessBatch: WriteTraceEvent(traceID, event1)
In this case, event1 will be written to the DB but silently dropped, never published.
Maybe we'll have to zoom out and rethink this. Either:
- we give up addressing flush-time races; or
- we introduce processor-level locking (roughly sketched after this comment); or
- we implement some less expensive handling that recovers events that fall victim to this race; or
- we merge this PR as is, as an improvement that resolves the first kind of race, but create a follow-up issue to address this newly identified kind of race. I don't want us to merge this PR with the impression that we've already fixed the publish-time race.
Thoughts?
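For context, the processor-level locking option amounts to serializing, per trace ID, the read-check-write sequences in ProcessBatch and in the publishing goroutine. Below is a minimal illustrative sketch of such a sharded lock; the type and function names are assumptions, not the PR's actual ShardLockReadWriter code.

```go
package sampling

import (
	"hash/fnv"
	"sync"
)

// shardedTraceLock maps a trace ID onto one of a fixed number of RWMutexes,
// so goroutines working on the same trace serialize against each other while
// unrelated traces rarely contend.
type shardedTraceLock struct {
	locks []sync.RWMutex
}

func newShardedTraceLock(numShards int) *shardedTraceLock {
	if numShards <= 0 {
		panic("numShards must be positive")
	}
	// The zero value of sync.RWMutex is ready to use, so no per-element
	// initialization is needed.
	return &shardedTraceLock{locks: make([]sync.RWMutex, numShards)}
}

func (s *shardedTraceLock) shard(traceID string) *sync.RWMutex {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return &s.locks[int(h.Sum32())%len(s.locks)]
}

func (s *shardedTraceLock) Lock(traceID string)    { s.shard(traceID).Lock() }
func (s *shardedTraceLock) Unlock(traceID string)  { s.shard(traceID).Unlock() }
func (s *shardedTraceLock) RLock(traceID string)   { s.shard(traceID).RLock() }
func (s *shardedTraceLock) RUnlock(traceID string) { s.shard(traceID).RUnlock() }
```

Ingesting goroutines would take the shard lock around the IsTraceSampled check and WriteTraceEvent, and the publishing goroutine around the sampling decision, so the interleaving above can no longer happen for a given trace.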
> In that case, my previous solution (on top of ShardLockRW) should be able to catch this. The publishing will be deferred to another goroutine that waits until event1 is written.
I studied f187e90. My understanding is that it increments and decrements a per-trace-ID counter before and after WriteTraceEvent in the ingest goroutine, and in the background publishing goroutine, between WriteTraceSampled and ReadTraceEvents, it checks whether the counter is 0; if it isn't, it retries later.
The issue is that this doesn't prevent the following sequence of events:
- t1: ingest goroutine A IsTraceSampled
- t2: background goroutine B WriteTraceSampled
- t3: background goroutine B performs counter==0 check
- t4: background goroutine B ReadTraceEvents
- t4': ingest goroutine A +1 counter, WriteTraceEvent, -1 counter
The race happens and there is data loss when t4 < t4'. Therefore, in terms of correctness, f187e90 isn't race-proof.
(On a side note, if the +1 on the counter happened before IsTraceSampled instead of before WriteTraceEvent, I think it might be correct. But even then I'd prefer a simpler design whose performance and memory implications are easier to reason about.)
I have some ideas on how to fix it; let's take it offline. In any case, we might want a more generic test (in addition to, or replacing, the existing one) that sends a lot of events around sampling-decision time. It may not be deterministic, but it would give us confidence that this class of race conditions is eliminated, without specifying the exact sequence.
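Such a test could be shaped roughly like the sketch below. This is purely illustrative: the processor setup, flush interval, and the final reported-vs-processed accounting are elided, and the modelpb import path is assumed from apm-data.

```go
package sampling_test

import (
	"context"
	"fmt"
	"sync"
	"testing"

	"github.com/elastic/apm-data/model/modelpb"
)

// hammerSharedTrace is a sketch of the "more generic" test shape suggested
// above: several goroutines keep sending spans for one shared trace ID while
// the background goroutine makes its sampling decision.
func hammerSharedTrace(ctx context.Context, t *testing.T, processBatch func(context.Context, *modelpb.Batch) error) {
	const workers, perWorker = 4, 1000
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				batch := modelpb.Batch{{
					Trace:    &modelpb.Trace{Id: "trace1"}, // shared trace ID maximizes contention
					Span:     &modelpb.Span{Id: fmt.Sprintf("span-%d-%d", w, i)},
					ParentId: "txn1",
				}}
				if err := processBatch(ctx, &batch); err != nil {
					t.Error(err) // t.Error is safe to call from goroutines
					return
				}
			}
		}(w)
	}
	wg.Wait()
}
```

The key point is that all goroutines write events for a single trace ID, so some of them are very likely to land around the moment the sampling decision for that trace is made.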
carsonip
left a comment
Thanks! Code looks good, but I have a comment about reducing the write-lock hold time without losing correctness. Other than that, mostly nits on maintainability.
Also, I've benchmarked the PR with a modified test:
```go
func BenchmarkProcess(b *testing.B) {
	b.SetParallelism(10)
	cfg := newTempdirConfigLogger(b, logp.NewNopLogger()).Config
	cfg.FlushInterval = 1 * time.Second
	cfg.Policies[0].SampleRate = 1
	processor, err := sampling.NewProcessor(cfg, logp.NewNopLogger())
	require.NoError(b, err)
	go processor.Run()
	b.Cleanup(func() { processor.Stop(context.Background()) })
	b.RunParallel(func(pb *testing.PB) {
		var seed int64
		err := binary.Read(cryptorand.Reader, binary.LittleEndian, &seed)
		assert.NoError(b, err)
		rng := rand.New(rand.NewSource(seed))
		var traceID [16]byte
		for pb.Next() {
			binary.LittleEndian.PutUint64(traceID[:8], rng.Uint64())
			binary.LittleEndian.PutUint64(traceID[8:], rng.Uint64())
			transactionID := traceID[:8]
			spanID := traceID[8:]
			trace := modelpb.Trace{Id: hex.EncodeToString(traceID[:])}
			transaction := &modelpb.Transaction{
				Id: hex.EncodeToString(transactionID),
			}
			span := &modelpb.Span{
				Id: hex.EncodeToString(spanID),
			}
			batch := modelpb.Batch{
				{Trace: &trace, Transaction: transaction},
				{Trace: &trace, Span: span, ParentId: transaction.Id},
				//{Trace: &trace, Span: span, ParentId: transaction.Id},
				//{Trace: &trace, Span: span, ParentId: transaction.Id},
			}
			if err := processor.ProcessBatch(context.Background(), &batch); err != nil {
				b.Fatal(err)
			}
		}
	})
}
```
```
goos: linux
goarch: amd64
pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
cpu: AMD Ryzen 7 PRO 8840HS w/ Radeon 780M Graphics
            │ main-100-2.bench │ tbs-potential-data-race-100.bench │
            │      sec/op      │    sec/op       vs base           │
Process-10      191.7n ± 11%      185.7n ± 16%   ~ (p=0.310 n=6)
```
and I tend to believe there is no perf regression, at least on a fast NVMe SSD. Therefore I think the PR is good to merge after polishing and updating the subject and description.
```go
	return m.next.DeleteTraceEvent(traceID, id)
}

func TestPotentialRaceCondition(t *testing.T) {
```
q: what are your thoughts on this? should we keep this test?
I will remove it since the concurrent one catches it too.
```go
}

func TestPotentialRaceConditionConcurrent(t *testing.T) {
	flushInterval := 5 * time.Second
```
it can be way shorter, in ms or max 1s, because this affects total test time
```go
reportedMu.Lock()
defer reportedMu.Unlock()
reportedPlusLateArrivals := int64(len(reported)) + lateArrivals.Load()
assert.Equal(t, reportedPlusLateArrivals, processed.Load())
```
```diff
-assert.Equal(t, reportedPlusLateArrivals, processed.Load())
+assert.Equal(t, processed.Load(), reportedPlusLateArrivals)
```
nit
```go
var reportedMu sync.Mutex
reported := map[string]struct{}{}
```
nit: simplify to atomic int?
It's to deduplicate the transaction IDs.
```go
first := true
index := i * 100000000

timer := time.NewTimer(flushInterval + 2*time.Second)
```
```diff
-timer := time.NewTimer(flushInterval + 2*time.Second)
+timer := time.NewTimer(flushInterval * 2)
```
```go
for i := 0; i < numShards; i++ {
	locks[i] = sync.RWMutex{}
}
```
```diff
-for i := 0; i < numShards; i++ {
-	locks[i] = sync.RWMutex{}
-}
```
nit: zero value is ready to use
```diff
 events = events[:0]
-if err := p.eventStore.ReadTraceEvents(traceID, &events); err != nil {
+err = p.eventStore.ReadTraceEvents(traceID, &events)
+p.shardLock.Unlock(traceID)
```
CMIIW, but I believe it is sufficient to lock only around WriteTraceSampled, without holding the lock over ReadTraceEvents, given that once IsTraceSampled returns true we short-circuit and write to the batch processor immediately. Holding the write lock for as short a time as possible would be good for perf.
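A rough sketch of that narrower critical section, as a fragment of the publishing loop (field names follow the snippet quoted above; that WriteTraceSampled lives on the same storage handle is an assumption here):

```go
// Hold the per-trace write lock only while recording the sampling decision.
// Any ProcessBatch call that runs after this sees IsTraceSampled == true and
// publishes its events directly, so ReadTraceEvents no longer needs to run
// under the lock to observe them.
p.shardLock.Lock(traceID)
err := p.eventStore.WriteTraceSampled(traceID, true)
p.shardLock.Unlock(traceID)
if err != nil {
	return err
}

// Events buffered before the decision are already in storage; read them
// outside the lock.
events = events[:0]
if err := p.eventStore.ReadTraceEvents(traceID, &events); err != nil {
	return err
}
```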
```go
	}
}

p.shardLock.Lock(traceID)
```
do you mind adding a comment about why we lock, and what scenario we are trying to prevent?
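For example, the comment could read something like the following (wording illustrative, not taken from the PR):

```go
// Serialize with ProcessBatch for this trace: we are about to write the
// sampling decision and then read the buffered trace events. Without this
// lock, an event whose IsTraceSampled check returned ErrNotFound in
// ProcessBatch could be written after our ReadTraceEvents below, and would
// then sit in storage without ever being published.
p.shardLock.Lock(traceID)
```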
```go
}

batch := modelpb.Batch{{
	Trace: &modelpb.Trace{Id: fmt.Sprintf("trace%d", i)},
```
It'll be way easier to hit the race if the goroutines all share the same trace ID, likely created once before the loop (remember to still increment the processed count); see the sketch below.
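A rough sketch of that change, reusing the names quoted above and assuming `processed` is an atomic counter as in the test:

```go
// Created once, before the goroutines are started, so that every goroutine's
// events contend on the same trace ID.
sharedTrace := &modelpb.Trace{Id: "trace-shared"}

// Inside each goroutine's loop: use the shared trace and keep counting every
// event that is sent.
batch := modelpb.Batch{{
	Trace:       sharedTrace,
	Transaction: &modelpb.Transaction{Id: fmt.Sprintf("txn%d", index)},
}}
processed.Add(1)
```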
💚 Build Succeeded
cc @ericywl
carsonip
left a comment
lgtm thanks
@Mergifyio backport 9.1 9.2 9.3
✅ Backports have been created
* Add test confirming the potential data race
* Remove unnecessary sleeps
* Add assertion for transaction ids at the end
* Add parent id to transaction2
* Update potential race condition test
* Try fixing race condition
* Fix bug where multiple ongoing transactions can race to delete first
* Add ShardLockReadWriter
* Panic if numShards <= 0
* Remove unnecessary code
* Use RWMutex
* Make fmt
* Add shard lock on processor level instead
* Make fmt update
* Revert "Make fmt update" (reverts commit b788c3f)
* Update based on review

(cherry picked from commit 67a5a2b)
Summary
Fix potential data race between WriteTraceEvent in ProcessBatch and ReadTraceEvents in the sampling goroutine. Closes #17772.
Performance
Benchmarked variants: Baseline, Single Mutex, ShardLockReadWriter, ShardLockReadWriter with RWMutex.