dekaf: Implement e2e testing framework #2566

jshearer · 2025-12-18T00:36:57Z

Summary

E2E test framework:
- publishes test specs into namespaced prefixes
- uses source-http-ingest to inject documents
- uses existing Dekaf Kafka API client to interact with low-level Kafka APIs
- uses rdkafka Rust crate to interact with Dekaf via high-level librdkafka consumer

During testing I found some poorly specified behavior around collections with missing journals, so I added CollectionStatus in order to return a retryable error to consumers and reduce connection churn.

I added --spec-ttl to allow tests to shorten the TaskManager spec refresh cycle and speed up tests that depend on changes to specs.

A few changes to KafkaApiClient:
- Plaintext connections are supported via tcp:// URL scheme
- SASL PLAIN authentication support
- --upstream-auth=none flag to skip authenticating to the upstream Kafka broker if it doesn't require auth (such as when running tests)

Added Dekaf to CI:
- dekaf-test job in platform-test.yaml
- Exclude Dekaf tests in the default nextest profile, as they require things like a local Kafka broker to be running as a prerequisite. Add a dekaf-e2e profile to run e2e tests explicitly
- Mise tasks (local:dekaf, local:dekaf-kafka, local:test-tenant) for running local Dekaf + Kafka services, provisioning the test/ tenant, etc.

Test scenarios

| Test file | Scenarios |
| --- | --- |
| `collection_reset.rs` | Collection-reset behaviors such as dealing with leader epochs in Metadata, ListOffsets, Fetch, `FENCED_LEADER_EPOCH`, `UNKNOWN_LEADER_EPOCH`, `OffsetForLeaderEpoch`, etc |
| `not_ready.rs` | `LeaderNotAvailable` when journals don't exist |
| `list_offsets.rs` | Earliest/latest offset queries, unknown partition handling |
| `empty_fetch.rs` | Empty fetch responses don't break subsequent fetches |
| `basic.rs` | Document roundtrip |

I intentionally left out more complex tests from here such as higher volume load testing, testing with other high-level consumers such as sarama, franz-go, etc. This PR is already getting fairly large, and I wanted to get it reviewed and out the door rather than add more heft to it.

Additional PRs adding more E2E tests based on top of this branch:

Note: Some of those PRs are targeting master instead of dekaf/collection_reset_with_e2e_tests in order to get CI to run the tests. Once this PR is merged I'll mark them ready for review.

crates/dekaf/tests/e2e/harness.rs

.cargo/config.toml

jgraettinger

This looks great! Nice job and great to see this level of test coverage. A few comments below for discussion but nothing big.

crates/dekaf/src/api_client.rs

jgraettinger · 2026-01-16T15:54:10Z

crates/dekaf/src/session.rs

+                        anyhow::bail!("Collection '{}' not found or not accessible", name)
+                    }
+                    CollectionStatus::NotReady => {
+                        // Collection exists but journals aren't available - return LeaderNotAvailable so clients will retry


I presume you thought about this carefully, but: the "obvious" approach would represent such a collection as a topic with zero partitions. That breaks at a protocol level?

Or, it looks like we previously returned a single partition. I'm guessing this was a placeholder which was semantically empty, and then popped into actual existence with the first created journal? It looks like you junked this tactic to make the E2E tests work better (or were there other reasons, too?).

FYI a sentinel partition approach would fit well with the tokens work and the task-manager sketch, which re-frames journal listing as a long-lived watch. Such a sentinel could block awaiting a journal listing update, be notified immediately when it's ready, and then transition to doing actual reads.

jshearer · 2026-01-21

I changed it from returning a single partition because consumers would then try to ListOffsets/Fetch from that partition and get UnknownTopicOrPartition, which is a permanent error, so they wouldn't retry. In practice they (usually) eventually do retry when the application logic times out and restarts the consumer, but it felt smelly to have that in the tests.

I tried returning no partitions instead, but the consumer also treated that as a "permanent" situation for the lifetime of the session. LeaderNotAvailable is supposed to be a transient condition and it causes the consumer to retry according to its retry policy.

jgraettinger · 2026-01-16T15:59:10Z

crates/dekaf/src/session.rs

-            return Ok(OffsetForLeaderTopicResult::default()
-                .with_topic(topic.topic)
-                .with_partitions(partitions));
+        let collection = match Collection::new(auth, &collection_name).await? {


The quantity of new code, here, matching on CollectionStatus is a little alarming / a smell. Do these all truly need to be such distinct cases? Is there no factoring that could be applied, to collapse some of the commonality of these paths?

Nothing here looks incorrect, per se, but the change is violating my mental yardstick re: how much churn such a change should result in.

jshearer · 2026-01-21

Yeah, agreed that the diff is big. Some of it is real behavior improvements that I noticed during testing. For example, metadata now returns topic-specific errors instead of failing the whole request, and the group management APIs now validate that collections exist and are available. Before, they would let you interact with groups (join/sync/commit offset etc) using topics that didn't map to any extant collection at all.

Still, I took another pass over it and ended up moving the Kafka error code matching logic to a couple of helpers on CollectionStatus. So for example instead of having the error codes spread throughout session, like this:

flow/crates/dekaf/src/session.rs

Lines 1659 to 1672 in 109edbf

    let error_code = match status {
        CollectionStatus::Ready(_) => continue,
        CollectionStatus::NotFound => {
            tracing::warn!(topic = ?topic_name, "Collection not found");
            ResponseError::UnknownTopicOrPartition.code()
        }
        CollectionStatus::NotReady => {
            tracing::warn!(
                topic = ?topic_name,
                "Collection exists but has no journals available"
            );
            ResponseError::LeaderNotAvailable.code()
        }
    };

You can instead just do this:

flow/crates/dekaf/src/session.rs

Lines 1636 to 1638 in 76e90af

    let Some(error_code) = status.error_code() else {
        continue;
    };
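For readers without the branch checked out, here is a minimal, self-contained sketch of what such a helper could look like. The types below are simplified stand-ins rather than the actual definitions in `crates/dekaf`: the generic parameter on `Ready`, the `i16` return type, and the two-variant `ResponseError` are assumptions made for illustration. The numeric values are the standard Kafka protocol codes for `UNKNOWN_TOPIC_OR_PARTITION` (3) and `LEADER_NOT_AVAILABLE` (5).

```rust
// Hypothetical stand-ins for the real types in crates/dekaf; names and
// signatures are assumptions for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResponseError {
    UnknownTopicOrPartition,
    LeaderNotAvailable,
}

impl ResponseError {
    /// Numeric Kafka protocol error code for this error.
    fn code(self) -> i16 {
        match self {
            ResponseError::UnknownTopicOrPartition => 3,
            ResponseError::LeaderNotAvailable => 5,
        }
    }
}

#[derive(Debug)]
enum CollectionStatus<T> {
    /// Binding exists and journals are available.
    Ready(T),
    /// Binding doesn't exist.
    NotFound,
    /// Binding exists but journals aren't available yet.
    NotReady,
}

impl<T> CollectionStatus<T> {
    /// Returns `None` when the collection is usable, otherwise the Kafka
    /// error code to report for the topic.
    fn error_code(&self) -> Option<i16> {
        match self {
            CollectionStatus::Ready(_) => None,
            CollectionStatus::NotFound => Some(ResponseError::UnknownTopicOrPartition.code()),
            CollectionStatus::NotReady => Some(ResponseError::LeaderNotAvailable.code()),
        }
    }
}

fn main() {
    let statuses = [
        ("ready-topic", CollectionStatus::Ready(())),
        ("missing-topic", CollectionStatus::NotFound),
        ("pending-topic", CollectionStatus::NotReady),
    ];
    for (topic, status) in &statuses {
        // Same shape as the call site above: skip topics that are ready.
        let Some(error_code) = status.error_code() else {
            continue;
        };
        println!("{topic}: error code {error_code}");
    }
}
```

Keeping the mapping in one place means each call site only decides what to do with a code, not which code applies; the real helper may also log, as the original match arms did.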

.cargo/config.toml

jgraettinger · 2026-01-16T16:24:53Z

.github/workflows/platform-test.yaml

+
+      - uses: mozilla-actions/sccache-action@v0.0.9
+      - run: echo 'SCCACHE_GHA_ENABLED=true' >> $GITHUB_ENV
+      - run: mise run build:rocksdb


Break this out into a separate workflow, dekaf-test.yaml ?

Also please add a mise task ci:dekaf-test paralleling ci:platform-test so that there's a one-shot way to run these in a dev VM.

jshearer · 2026-01-21

I broke the gh action into a separate file. WRT ci:dekaf-test, do you mean something other than ci:dekaf-e2e?

Dekaf previously required TLS and MSK IAM authentication for all upstream Kafka connections, making local development and testing difficult. This adds support for plaintext connections via URL scheme detection:

* `tcp://host:port` connects without TLS, `tls://host:port` uses TLS (default)
* `--upstream-auth=none` flag skips SASL authentication entirely
* `KafkaClientAuth::from_msk_region(None)` creates no-auth mode

Example local usage:

    dekaf --default-broker-urls tcp://localhost:29092 --upstream-auth=none ...
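The scheme detection described in this commit message can be sketched roughly as follows. This is not the code from `api_client.rs`: `Transport` and `parse_broker_url` are hypothetical names, and treating a URL with no scheme as TLS is an assumption based on the "(default)" note above.

```rust
/// Hypothetical transport selector; not the actual Dekaf type.
#[derive(Debug, PartialEq, Eq)]
enum Transport {
    Plaintext,
    Tls,
}

/// Splits a broker URL into the transport to use and the `host:port` part.
fn parse_broker_url(url: &str) -> Result<(Transport, &str), String> {
    if let Some(rest) = url.strip_prefix("tcp://") {
        Ok((Transport::Plaintext, rest))
    } else if let Some(rest) = url.strip_prefix("tls://") {
        Ok((Transport::Tls, rest))
    } else if url.contains("://") {
        Err(format!("unsupported scheme in broker URL: {url}"))
    } else {
        // Assumption: a bare `host:port` falls back to TLS.
        Ok((Transport::Tls, url))
    }
}

fn main() {
    for url in ["tcp://localhost:29092", "tls://broker.example:9092", "http://nope"] {
        match parse_broker_url(url) {
            Ok((transport, addr)) => println!("{url} -> {transport:?} to {addr}"),
            Err(err) => println!("{url} -> error: {err}"),
        }
    }
}
```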

…ka errors

It's possible for a collection to exist in the control plane without having any extant journals. This can happen either when the capture task is failing or hasn't emitted any documents, and more frequently during a collection reset. Previously, Dekaf treated this the same as a missing collection, causing consumers to receive non-retryable errors or inconsistent behavior.

Introduces `CollectionStatus` enum to distinguish three states:

* `Ready`: binding exists and journals are available
* `NotFound`: binding doesn't exist in the materialization spec
* `NotReady`: binding exists but journals aren't available yet

For `NotReady`, we'll use `LeaderNotAvailable` (a retryable error) to cause consumers to retry with backoff until the journals become available. They will eventually give up.
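To illustrate the consumer-side behavior this relies on, here is a small sketch of a retry loop that backs off on a retryable error, fails immediately on a permanent one, and gives up after a bounded number of attempts. It is a generic illustration, not librdkafka's actual retry policy; `FetchError`, `retry_with_backoff`, and the doubling backoff are all assumptions.

```rust
use std::time::Duration;

/// Hypothetical error type covering the two cases discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FetchError {
    LeaderNotAvailable,
    UnknownTopicOrPartition,
}

impl FetchError {
    /// Only the transient error is worth retrying.
    fn is_retryable(self) -> bool {
        matches!(self, FetchError::LeaderNotAvailable)
    }
}

/// Calls `op` until it succeeds, returns a permanent error, or
/// `max_attempts` is reached. The backoff doubles after each retry.
fn retry_with_backoff<T>(
    max_attempts: u32,
    initial_backoff: Duration,
    mut op: impl FnMut(u32) -> Result<T, FetchError>,
) -> Result<T, FetchError> {
    let mut backoff = initial_backoff;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) if err.is_retryable() && attempt < max_attempts => {
                std::thread::sleep(backoff);
                backoff = backoff.saturating_mul(2);
                attempt += 1;
            }
            // Permanent error, or retries exhausted: give up.
            Err(err) => return Err(err),
        }
    }
}

fn main() {
    // Simulate journals becoming available on the third attempt.
    let result = retry_with_backoff(5, Duration::from_millis(10), |attempt| {
        if attempt < 3 {
            Err(FetchError::LeaderNotAvailable)
        } else {
            Ok("fetched")
        }
    });
    println!("{result:?}");
}
```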

* Run Dekaf e2e tests as separate step because `nexttest-run` messes with local stack state
* Make `local:data-plane` idempotent
* `ci:dekaf-e2e` now assumes `local:stack` etc are up rather than explicitly depending on it
* mise: log systemd output on failure
* mise: also log agent logs on failure
* nextest: exclude e2e tests by default, and run them with `--profile dekaf-e2e` instead

psFried

LGTM 🎉 Nice work on the tests. This looks like a huge improvement

psFried · 2026-01-22T21:51:17Z

crates/dekaf/src/api_client.rs


    let mut state_buf = BufWriter::new(Vec::new());
    let mut state = session.step(None, &mut state_buf)?;
+    // Flush the BufWriter to ensure all data is written to the underlying Vec


nit: I'm supposing maybe into_inner() does a flush internally? Either way, this comment seems a bit confusing, as it says "flush" but then does not call flush
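For reference on the `into_inner()` question: in the Rust standard library, `BufWriter::into_inner` writes out any buffered bytes before returning the underlying writer, so a separate `flush()` call beforehand is not required. A minimal standalone sketch follows; the helper name `write_via_bufwriter` is hypothetical and unrelated to the Dekaf code.

```rust
use std::io::{BufWriter, Write};

/// Writes `payload` through a `BufWriter` and returns the underlying `Vec`.
fn write_via_bufwriter(payload: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut buf = BufWriter::new(Vec::new());
    buf.write_all(payload)?;
    // No explicit flush: `into_inner` writes out the buffered bytes before
    // handing back the inner writer, and reports any write failure.
    let inner = buf.into_inner().map_err(|e| e.into_error())?;
    Ok(inner)
}

fn main() -> std::io::Result<()> {
    let bytes = write_via_bufwriter(b"hello")?;
    println!("{} bytes reached the Vec", bytes.len());
    Ok(())
}
```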

psFried · 2026-01-22T21:55:42Z

crates/dekaf/src/api_client.rs

            )
        }
+
+        if !state.is_running() {


nit: Why do we need to repeat this check here? Seems maybe worthy of a comment

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch 4 times, most recently from 6490a0d to 8c53dd4 Compare December 19, 2025 16:06

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch 9 times, most recently from b44dbee to 0aabc2e Compare December 29, 2025 20:30

jshearer commented Dec 29, 2025

crates/dekaf/tests/e2e/harness.rs Outdated

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch from 3663f1c to d0c51a8 Compare December 29, 2025 22:00

github-advanced-security bot found potential problems Dec 29, 2025

crates/dekaf/tests/e2e/harness.rs Dismissed

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch 6 times, most recently from c291e83 to 134f474 Compare January 5, 2026 23:17

jshearer changed the title ~~WIP: Dekaf collection reset with e2e tests~~ dekaf: e2e testing Jan 6, 2026

jshearer self-assigned this Jan 6, 2026

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch 2 times, most recently from 6ae3375 to a06a0a3 Compare January 6, 2026 18:20

jshearer commented Jan 6, 2026

.cargo/config.toml Outdated

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch from a06a0a3 to d86adf3 Compare January 6, 2026 18:53

jshearer marked this pull request as ready for review January 6, 2026 18:57

jshearer requested a review from a team January 6, 2026 18:58

jshearer requested a review from jgraettinger January 6, 2026 21:08

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch 5 times, most recently from e1c6baf to e819733 Compare January 8, 2026 16:11

jshearer requested a review from psFried January 8, 2026 17:35

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch from e819733 to 7823aab Compare January 8, 2026 20:14

jshearer changed the title ~~dekaf: e2e testing~~ dekaf: Implement e2e testing framework Jan 8, 2026

jgraettinger reviewed Jan 16, 2026

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch 5 times, most recently from 8630748 to fe70d7b Compare January 21, 2026 22:35

jshearer added 4 commits January 21, 2026 18:00

dekaf: Add support for sasl PLAIN auth to KafkaApiClient

d2c0cb5

Used for testing

dekaf: make spec cache TTL configurable via --spec-ttl

4452e29

This is mainly for e2e tests so we can set a low TTL and avoid waiting around for too long for changes to propagate.

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch 2 times, most recently from 76e90af to 00c17dc Compare January 21, 2026 23:28

jshearer added 3 commits January 21, 2026 18:29

dekaf: remove redundant and broken/bitrotted integration test file, move couple of non-covered tests over

5ca430c

dekaf: Add migration to fix dekaf role

40d0bd5

jshearer force-pushed the dekaf/collection_reset_with_e2e_tests branch from 00c17dc to 40d0bd5 Compare January 21, 2026 23:29

jshearer requested a review from jgraettinger January 21, 2026 23:29

psFried approved these changes Jan 22, 2026

4 participants
