Snapshot sink #2
Merged
Iceberg Snapshot Sink
Implements the snapshot part of transferia/transferia#199 and utilizes apache/iceberg-go#339.
Overview
The Iceberg Snapshot Sink is a component of the Iceberg Provider in the Transferia ecosystem. It handles data ingestion in snapshot mode, converting incoming data into Parquet files that are written in parallel across workers and later consolidated into a single Iceberg table.
Architecture
Key Components
Data Flow
The workflow follows these steps:
Sharded Sequence Diagram
The sharded data upload process to an Iceberg table works as follows:
This approach provides efficient parallel data processing with an atomic commit of the results, ensuring data integrity.
```mermaid
sequenceDiagram
    participant Coordinator
    participant Main as Worker Main
    participant Worker2
    participant Worker3
    participant S3
    participant IcebergCatalog
    Main->>Coordinator: Shard snapshot parts
    %% File creation phase - parallel operations
    par Worker 2 parallel processing
        Worker2->>Coordinator: Fetch parts to load
        Worker2->>Worker2: Convert data to Arrow format
        Worker2->>S3: Write Parquet file 2-1
        Worker2->>Worker2: Track file metadata in memory
    and Worker 3 parallel processing
        Worker3->>Coordinator: Fetch parts to load
        Worker3->>Worker3: Convert data to Arrow format
        Worker3->>S3: Write Parquet file 3-1
        Worker3->>Worker3: Track file metadata in memory
        Worker3->>S3: Write Parquet file 3-2
        Worker3->>Worker3: Track file metadata in memory
    end
    %% Completion phase
    Worker2->>Coordinator: Register files (key=files_for_2)
    Worker2-->>Coordinator: Signal completion
    Worker3->>Coordinator: Register files (key=files_for_3)
    Worker3-->>Coordinator: Signal completion
    %% Final commit phase - lead worker (Worker Main) handles this
    Coordinator->>Main: Wait all workers completed
    Main->>Coordinator: Fetch all file lists
    Main->>Main: Combine file lists
    Main->>IcebergCatalog: Create transaction
    Main->>IcebergCatalog: Add all files to transaction
    Main->>IcebergCatalog: Commit transaction
```
Implementation Details
Worker Initialization
Each worker is initialized with:
When a worker starts, it creates a connection to the Iceberg catalog system (either REST-based or Glue-based) and prepares to handle incoming data.
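The catalog selection at startup can be sketched as follows. This is a minimal illustration, not Transferia's actual code: the `CatalogConfig` type, the `CatalogKind` values, and the `connect` function are hypothetical names for the REST-or-Glue choice described above.

```go
package main

import (
	"errors"
	"fmt"
)

// CatalogKind selects which Iceberg catalog backend a worker connects to.
type CatalogKind string

const (
	CatalogREST CatalogKind = "rest"
	CatalogGlue CatalogKind = "glue"
)

// CatalogConfig is a hypothetical per-worker configuration.
type CatalogConfig struct {
	Kind CatalogKind
	URI  string // REST endpoint or AWS region, depending on Kind
}

// connect validates the config the way a worker might on startup and
// returns a description of the catalog it would attach to.
func connect(cfg CatalogConfig) (string, error) {
	switch cfg.Kind {
	case CatalogREST:
		return fmt.Sprintf("rest catalog at %s", cfg.URI), nil
	case CatalogGlue:
		return fmt.Sprintf("glue catalog in %s", cfg.URI), nil
	default:
		return "", errors.New("unsupported catalog kind")
	}
}
```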
Parquet File Creation
For each batch of data:
The file naming system ensures uniqueness by incorporating:
This prevents filename collisions even when multiple workers process data simultaneously.
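A naming scheme along these lines avoids collisions; the exact components used here (worker ID, part index, per-file sequence number) and the path layout are assumptions for illustration, not the sink's actual format.

```go
package main

import "fmt"

// parquetFileName builds an object key that is unique across workers.
// Components are illustrative: worker ID, part index, and a per-file
// sequence number together guarantee no two workers pick the same key.
func parquetFileName(workerID, partIndex, fileSeq int) string {
	return fmt.Sprintf("data/worker_%d/part_%d/file_%d.parquet",
		workerID, partIndex, fileSeq)
}
```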
File Tracking
Each worker maintains an in-memory list of all the files it has created. A mutex is used to ensure thread safety when appending to this list. This allows the worker to keep track of its contribution to the overall dataset.
Completion and Coordination
When a worker finishes processing its portion of the data, it receives a completion signal and then:
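The registration step can be modeled with an in-memory stand-in for the coordination store; the per-worker keys mirror the `files_for_N` keys in the sequence diagram, while the `coordinator` type and its methods are invented for this sketch.

```go
package main

import "sync"

// coordinator stands in for the shared coordination store.
type coordinator struct {
	mu    sync.Mutex
	files map[string][]string // per-worker file lists, keyed e.g. "files_for_2"
	done  map[string]bool     // completion signals per worker
}

func newCoordinator() *coordinator {
	return &coordinator{files: map[string][]string{}, done: map[string]bool{}}
}

// RegisterFiles stores a worker's file list under its key.
func (c *coordinator) RegisterFiles(key string, files []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.files[key] = files
}

// SignalDone marks a worker as finished.
func (c *coordinator) SignalDone(worker string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.done[worker] = true
}

// AllDone reports whether every listed worker has signaled completion.
func (c *coordinator) AllDone(workers []string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, w := range workers {
		if !c.done[w] {
			return false
		}
	}
	return true
}
```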
Final Commit
When all workers have completed (marked by a special completion event), one designated worker:
This final step ensures that all data becomes visible to readers in a single atomic operation, providing consistency guarantees.
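The merge step before the commit is simple to sketch: the lead worker flattens the per-worker lists into one slice, and that single list is what goes into one Iceberg transaction, so readers see all files or none. The function below is illustrative; the actual transaction API comes from apache/iceberg-go and is not reproduced here.

```go
package main

// combineFileLists merges the per-worker file lists fetched from the
// coordinator into the single list handed to one Iceberg transaction.
func combineFileLists(perWorker map[string][]string) []string {
	var all []string
	for _, files := range perWorker {
		all = append(all, files...)
	}
	return all
}
```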