Contributor

Copilot AI commented Nov 13, 2025

Previously, data from different Kafka topics and Redis streams was written to the same directory, which made the output hard to organize and inefficient to query by source.

Changes

Sources

  • Kafka: Group messages by topic, add topic to RecordBatch schema metadata
  • Redis: Group messages by stream_key/list_key, add stream_key to schema metadata
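A minimal sketch of the grouping step described above, using plain Python stand-ins rather than the project's real types. `Message`, `Batch`, `group_into_batches`, and the `"topic"` metadata key are hypothetical names chosen for illustration; the actual sources attach the value to the RecordBatch schema metadata.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Message:
    source: str    # Kafka topic, or Redis stream_key/list_key
    payload: bytes


@dataclass
class Batch:
    rows: list
    # Stand-in for RecordBatch schema metadata (assumption).
    metadata: dict = field(default_factory=dict)


def group_into_batches(messages, metadata_key="topic"):
    """Return one Batch per source, tagged with that source in its metadata."""
    grouped = defaultdict(list)
    for msg in messages:
        grouped[msg.source].append(msg.payload)
    return [
        Batch(rows=rows, metadata={metadata_key: source})
        for source, rows in grouped.items()
    ]
```

For a Redis source, the same function could be called with `metadata_key="stream_key"`, so that each batch carries exactly one source identifier for the sink to read.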

Sinks (S3, HDFS, Local)

  • Extract topic/stream_key from batch metadata during flush
  • Create separate folders: prefix/{topic}/data_*.parquet
  • Maintain compatibility with partitioning: prefix/{topic}/{partition}/data_*.parquet
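The path layout above could be derived roughly as follows. This is a sketch under assumptions, not the merged implementation: `output_key` is a hypothetical helper, and the metadata keys `"topic"` and `"stream_key"` are taken from the names used in this description. A batch without either key falls through to the base prefix.

```python
def output_key(prefix, metadata, filename, partition=None):
    """Build prefix/{topic}/{partition}/filename, skipping absent parts."""
    # Assumed metadata keys; a batch carries at most one of them.
    source = metadata.get("topic") or metadata.get("stream_key")
    parts = [prefix.rstrip("/")]
    if source:
        parts.append(source)       # per-topic / per-stream folder
    if partition:
        parts.append(partition)    # e.g. "date=2024-01-01"
    parts.append(filename)
    return "/".join(parts)
```

Calling `output_key("s3://bucket/raw-data", {"topic": "events"}, "data_1.parquet", "date=2024-01-01")` yields `s3://bucket/raw-data/events/date=2024-01-01/data_1.parquet`, while an empty metadata dict yields a key directly under the prefix.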

Folder Structure

Before:

s3://bucket/raw-data/
  data_20241113_120000_000001.parquet  # mixed topics
  data_20241113_120001_000002.parquet

After:

s3://bucket/raw-data/
  events/
    date=2024-01-01/data_*.parquet
  logs/
    date=2024-01-01/data_*.parquet
  metrics/
    data_*.parquet

Backward Compatibility

Batches without topic/stream_key metadata are written to the base directory as before. No configuration changes are required.

Original prompt

This section details the original issue you should resolve

<issue_title>Sink should store data from different topics/stream_keys separately in different folders</issue_title>
<issue_description></issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 3 commits November 13, 2025 08:03
Co-authored-by: slhmy <31381093+slhmy@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add functionality to store data from different topics separately" to "Separate data by topic/stream_key into distinct folders" Nov 13, 2025
Copilot AI requested a review from slhmy November 13, 2025 08:12
@slhmy slhmy marked this pull request as ready for review November 13, 2025 08:16
@slhmy slhmy merged commit 3378b6f into main Nov 13, 2025
12 checks passed