Streaming sink #3
Merged
Implements the streaming part of transferia/transferia#199.
Utilizes apache/iceberg-go#339.
Iceberg Streaming Sink
Key Components
Data Flow
The workflow follows these steps:
1. Each worker receives streaming data and converts it to Arrow format.
2. The worker writes the batch to S3 as a Parquet file with a unique name.
3. The worker registers the new file with the coordinator under a per-table, per-worker key.
4. On a configured commit interval, a worker fetches all registered files from the coordinator, groups them by table, and commits them to the Iceberg catalog in one transaction per table.
5. Committed files are cleared from the coordinator, and streaming continues.
Streaming Process Diagram
Description of the Streaming Upload Process:
The streaming data upload process to an Iceberg table works as follows:
```mermaid
sequenceDiagram
    participant Coordinator
    participant Worker1
    participant Worker2
    participant S3
    participant IcebergCatalog
    note over Worker1,Worker2: Streaming data processing
    par Worker 1 processing
        Worker1->>Worker1: Receive streaming data
        Worker1->>Worker1: Convert to Arrow format
        Worker1->>S3: Write Parquet file 1-1
        Worker1->>Coordinator: Register file (key=streaming_files_tableA_1)
    and Worker 2 processing
        Worker2->>Worker2: Receive streaming data
        Worker2->>Worker2: Convert to Arrow format
        Worker2->>S3: Write Parquet file 2-1
        Worker2->>Coordinator: Register file (key=streaming_files_tableA_2)
    end
    note over Worker1: On schedule (commit interval)
    Worker1->>Coordinator: Fetch all files
    Worker1->>Worker1: Group files by table
    Worker1->>IcebergCatalog: Create transaction
    Worker1->>IcebergCatalog: Add tableA files to transaction
    Worker1->>IcebergCatalog: Commit transaction
    Worker1->>Coordinator: Clear committed files
    note over Worker1,Worker2: Continue processing
    par Worker 1 continuation
        Worker1->>Worker1: Receive new streaming data
        Worker1->>Worker1: Convert to Arrow format
        Worker1->>S3: Write Parquet file 1-2
        Worker1->>Coordinator: Register file (key=streaming_files_tableA_1)
    and Worker 2 continuation
        Worker2->>Worker2: Receive new streaming data
        Worker2->>Worker2: Convert to Arrow format
        Worker2->>S3: Write Parquet file 2-2
        Worker2->>Coordinator: Register file (key=streaming_files_tableA_2)
    end
```

Implementation Details
Worker Initialization
Each worker is initialized with an Iceberg catalog configuration (REST- or Glue-based), its worker number, and access to the shared coordinator.
When a worker starts, it creates a connection to the Iceberg catalog system (either REST-based or Glue-based) and prepares to handle incoming data.
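A minimal sketch of that start-up choice, assuming a hypothetical `CatalogConfig` and placeholder catalog types; the real sink would wire in the corresponding apache/iceberg-go clients here instead.

```go
package main

import (
	"context"
	"fmt"
)

// Catalog is a hypothetical stand-in for the Iceberg catalog client the sink talks to.
type Catalog interface {
	Name() string
}

// restCatalog and glueCatalog are placeholder implementations; the real worker
// would wrap the REST and Glue catalog clients from apache/iceberg-go.
type restCatalog struct{ uri string }

func (c restCatalog) Name() string { return "rest:" + c.uri }

type glueCatalog struct{}

func (glueCatalog) Name() string { return "glue" }

// CatalogConfig is a hypothetical per-worker configuration.
type CatalogConfig struct {
	Type string // "rest" or "glue"
	URI  string // REST endpoint; unused for Glue
}

// newCatalog mirrors the worker start-up choice described above:
// connect to either a REST-based or a Glue-based Iceberg catalog.
func newCatalog(_ context.Context, cfg CatalogConfig) (Catalog, error) {
	switch cfg.Type {
	case "rest":
		return restCatalog{uri: cfg.URI}, nil
	case "glue":
		return glueCatalog{}, nil
	default:
		return nil, fmt.Errorf("unsupported catalog type %q", cfg.Type)
	}
}

func main() {
	cat, err := newCatalog(context.Background(), CatalogConfig{Type: "rest", URI: "http://localhost:8181"})
	if err != nil {
		panic(err)
	}
	fmt.Println("connected to catalog:", cat.Name())
}
```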
Parquet File Creation
For each batch of data, the worker converts the records to Arrow format, writes them to S3 as a Parquet file with a unique name, and registers the new file with the coordinator.
The file naming system ensures uniqueness by incorporating the worker number and a per-worker file counter (the 1-1, 1-2, 2-1 files in the diagram above).
This prevents filename collisions even when multiple workers process data simultaneously.
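A sketch of the kind of collision-free naming the diagram implies (worker number plus a per-worker counter); the exact format string and the added timestamp component are assumptions, not the naming scheme used in this PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// fileNamer produces unique Parquet object keys for one worker. Combining the
// worker number with a monotonically increasing counter (and, as an assumed
// extra safeguard, a timestamp) avoids collisions between concurrent workers.
type fileNamer struct {
	workerNum int
	counter   atomic.Uint64
}

func (n *fileNamer) next(tableID string) string {
	seq := n.counter.Add(1)
	return fmt.Sprintf("%s/data/worker-%d-%d-%06d.parquet",
		tableID, n.workerNum, time.Now().UnixNano(), seq)
}

func main() {
	namer := &fileNamer{workerNum: 1}
	fmt.Println(namer.next("tableA")) // e.g. tableA/data/worker-1-<ts>-000001.parquet
	fmt.Println(namer.next("tableA"))
}
```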
File Tracking
Each worker maintains an in-memory map where the key is the table identifier and the value is a list of files created for that table. A mutex is used to ensure thread safety when adding to this list. Information about files is also passed to the coordinator with a key format of "streaming_files_{tableID}_{workerNum}".
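A sketch of that tracking structure; the `coordinator` interface and its `SetState` method are hypothetical stand-ins for the real coordinator API, and the in-memory implementation exists only to make the example runnable.

```go
package main

import (
	"fmt"
	"sync"
)

// coordinator is a hypothetical stand-in for the shared coordinator;
// SetState is an assumed method name, not the actual transferia API.
type coordinator interface {
	SetState(key string, files []string) error
}

// memCoordinator is an in-memory coordinator used only for this sketch.
type memCoordinator struct{ state map[string][]string }

func (c *memCoordinator) SetState(key string, files []string) error {
	c.state[key] = files
	return nil
}

// fileTracker keeps the per-table list of Parquet files a worker has written,
// guarded by a mutex, and mirrors it to the coordinator under the
// "streaming_files_{tableID}_{workerNum}" key described above.
type fileTracker struct {
	mu        sync.Mutex
	workerNum int
	coord     coordinator
	files     map[string][]string // tableID -> files written but not yet committed
}

func (t *fileTracker) add(tableID, file string) error {
	t.mu.Lock()
	t.files[tableID] = append(t.files[tableID], file)
	snapshot := append([]string(nil), t.files[tableID]...)
	t.mu.Unlock()

	// Mirror the state to the coordinator so a committing worker can find it later.
	key := fmt.Sprintf("streaming_files_%s_%d", tableID, t.workerNum)
	return t.coord.SetState(key, snapshot)
}

func main() {
	coord := &memCoordinator{state: map[string][]string{}}
	tr := &fileTracker{workerNum: 1, coord: coord, files: map[string][]string{}}
	_ = tr.add("tableA", "tableA/data/worker-1-000001.parquet")
	_ = tr.add("tableA", "tableA/data/worker-1-000002.parquet")
	fmt.Println(coord.state["streaming_files_tableA_1"])
}
```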
Periodic Commits
Unlike snapshot mode, where commits happen only after the entire load is complete, in streaming mode:
- Commits run on a fixed schedule (the commit interval).
- On each tick, a worker fetches all registered files from the coordinator and groups them by table.
- It opens an Iceberg transaction per table, adds that table's files, and commits the transaction.
- Successfully committed files are then cleared from the coordinator so they are not committed twice.
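A sketch of such a ticker-driven commit loop; `catalogTx` and `stateStore` are hypothetical interfaces standing in for the Iceberg transaction and coordinator APIs, not the actual ones used here, and the in-memory stubs exist only so the example runs.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// catalogTx is a hypothetical view of an Iceberg table transaction; AppendFiles
// and Commit are assumed names, not the actual apache/iceberg-go API.
type catalogTx interface {
	AppendFiles(tableID string, files []string) error
	Commit(ctx context.Context) error
}

// stateStore is a hypothetical view of the coordinator: list registered files
// grouped by table, and clear them once committed.
type stateStore interface {
	FilesByTable() map[string][]string
	Clear(tableID string)
}

// runCommitLoop mirrors the periodic-commit behaviour described above: on every
// tick, gather all registered files, commit them per table, then clear the state.
func runCommitLoop(ctx context.Context, interval time.Duration, store stateStore, newTx func() catalogTx) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for tableID, files := range store.FilesByTable() {
				if len(files) == 0 {
					continue
				}
				tx := newTx()
				if err := tx.AppendFiles(tableID, files); err != nil {
					fmt.Println("append failed, retrying next tick:", err)
					continue
				}
				if err := tx.Commit(ctx); err != nil {
					fmt.Println("commit failed, retrying next tick:", err)
					continue
				}
				store.Clear(tableID) // only drop files that were actually committed
			}
		}
	}
}

// In-memory stubs below make the sketch runnable.
type memStore struct {
	mu    sync.Mutex
	files map[string][]string
}

func (s *memStore) FilesByTable() map[string][]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := map[string][]string{}
	for k, v := range s.files {
		out[k] = append([]string(nil), v...)
	}
	return out
}

func (s *memStore) Clear(tableID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.files, tableID)
}

type noopTx struct{}

func (noopTx) AppendFiles(tableID string, files []string) error {
	fmt.Printf("appending %d files to %s\n", len(files), tableID)
	return nil
}
func (noopTx) Commit(context.Context) error { return nil }

func main() {
	store := &memStore{files: map[string][]string{"tableA": {"worker-1-000001.parquet"}}}
	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	runCommitLoop(ctx, 100*time.Millisecond, store, func() catalogTx { return noopTx{} })
}
```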
Table Management