An Apache Arrow data pipeline that reads data from Kafka or Redis and writes it to HDFS or S3 in Parquet format, enabling easy import into analytical databases such as ClickHouse.
- Multiple Sources: Read from Kafka topics or Redis streams/lists
- Multiple Sinks: Write to S3 or HDFS
- Apache Arrow: Native Arrow support for high-performance data processing
- Parquet Format: Efficient columnar storage with compression
- Batch Buffering: Accumulate multiple batches in memory and write them as a single Parquet file per flush
- Configurable Flushing: Flush based on time intervals or batch counts
- Partitioning: Support for data partitioning by columns
- OpenTelemetry: Built-in observability with traces and metrics
- Flexible Configuration: YAML files, environment variables, or CLI options
- Modern Python: Built with Python 3.9+ and modern tooling
```
┌─────────────────┐
│  Data Sources   │
├─────────────────┤
│ • Kafka Topics  │
│ • Redis Streams │
│ • Redis Lists   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  FS Data Sink   │
├─────────────────┤
│ • Arrow batches │
│ • Partitioning  │
│ • Compression   │
│ • Telemetry     │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────┐
│           Data Sinks            │
├─────────────────────────────────┤
│ • S3 (Parquet files)            │
│ • HDFS (Parquet files)          │
└────────────────┬────────────────┘
                 │
                 ▼
┌─────────────────────────────────┐
│      Analytics Databases        │
├─────────────────────────────────┤
│ • ClickHouse                    │
│ • Presto/Trino                  │
│ • Spark                         │
└─────────────────────────────────┘
```
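Conceptually, the pipeline is a loop that pulls Arrow record batches from a source and hands them to a sink. The sketch below is illustrative only: `read_batches`, `write_batch`, and `flush` are hypothetical stand-ins for the base interfaces defined in `src/fs_data_sink/types.py`, not the project's actual API.

```python
# Illustrative pipeline loop; method names are hypothetical stand-ins
# for the base interfaces in src/fs_data_sink/types.py.
def run_pipeline(source, sink, flush_interval_batches=100):
    buffered = 0
    for batch in source.read_batches():   # yields pyarrow.RecordBatch or None
        if batch is None:
            continue                      # non-blocking source: no data right now
        sink.write_batch(batch)           # buffered in memory, no I/O yet
        buffered += 1
        if buffered >= flush_interval_batches:
            sink.flush()                  # write one Parquet file per flush
            buffered = 0
    sink.flush()                          # final flush when the source is done
```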
Install from PyPI:

```bash
pip install fs-data-sink
```

Or install from source:

```bash
git clone https://github.com/poly-workshop/fs-data-sink.git
cd fs-data-sink
pip install -e .
```

For development dependencies:

```bash
pip install -e ".[dev]"
```

Create a configuration file `config.yaml`:
```yaml
source:
  type: kafka
  bootstrap_servers:
    - localhost:9092
  topics:
    - my-data-topic
  group_id: fs-data-sink-group
  value_format: json
  batch_size: 1000

sink:
  type: s3
  bucket: my-data-bucket
  prefix: raw-data
  region_name: us-east-1
  compression: snappy
  partition_by:
    - date

telemetry:
  log_level: INFO
  log_format: json
```

Run the pipeline:

```bash
fs-data-sink --config config.yaml
```

Or using environment variables:
```bash
export SOURCE_TYPE=kafka
export KAFKA_BOOTSTRAP_SERVERS=localhost:9092
export KAFKA_TOPICS=my-data-topic
export SINK_TYPE=s3
export S3_BUCKET=my-data-bucket
fs-data-sink
```

Configuration can be provided via:

- Command-line options
- Environment variables
- Configuration file (YAML)
```yaml
source:
  type: kafka
  bootstrap_servers:
    - broker1:9092
    - broker2:9092
  topics:
    - topic1
    - topic2
  group_id: my-consumer-group
  value_format: json # or arrow_ipc
  batch_size: 1000
  extra_config:
    auto_offset_reset: earliest
    enable_auto_commit: true
```

Environment variables:
- `KAFKA_BOOTSTRAP_SERVERS`: Comma-separated list of brokers
- `KAFKA_TOPICS`: Comma-separated list of topics
- `KAFKA_GROUP_ID`: Consumer group ID
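To make the JSON path concrete, here is a minimal sketch of what a JSON Kafka source does conceptually, using `confluent-kafka` and `pyarrow` directly. It is not the project's implementation; broker, topic, and group names are placeholders.

```python
# Sketch only: consume JSON messages and assemble one Arrow batch.
import json

import pyarrow as pa
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "group.id": "my-consumer-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["topic1", "topic2"])

records = []
while len(records) < 1000:        # batch_size; loops until enough messages arrive
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    records.append(json.loads(msg.value()))

table = pa.Table.from_pylist(records)  # one Arrow table per batch
consumer.close()
```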
```yaml
source:
  type: redis
  host: localhost
  port: 6379
  db: 0
  password: optional
  stream_keys:
    - stream1
    - stream2
  list_keys:
    - list1
  value_format: json # or arrow_ipc
  batch_size: 1000
  continuous: true # Continuously consume data (default: true)
  consumer_group: my-consumer-group # Required for Redis Streams
  consumer_name: optional # Defaults to hostname-id if not provided
```

The Redis source supports two consumption modes:
- Continuous mode (default): Continuously polls Redis for new data, similar to a Kafka consumer
- One-shot mode: Reads the available data once and stops
Redis Streams Consumer Groups:
When using Redis Streams (`stream_keys`), a `consumer_group` is required. The source uses Redis consumer groups to:
- Track message consumption across multiple consumers
- Enable parallel processing with multiple pipeline instances
- Automatically acknowledge messages after successful processing with `XACK`
- Provide at-least-once delivery semantics
If `consumer_name` is not provided, it defaults to `{hostname}-{id}` to ensure uniqueness.
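For reference, the consumer-group mechanics described above look roughly like this when driven with `redis-py` directly. This is a sketch, not the project's code; stream, group, and consumer names are placeholders.

```python
# Sketch of Redis Streams consumer-group semantics with redis-py.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
stream, group, consumer = "stream1", "my-consumer-group", "worker-1"

# Create the group at the start of the stream; ignore the BUSYGROUP
# error if the group already exists.
try:
    r.xgroup_create(stream, group, id="0", mkstream=True)
except redis.ResponseError:
    pass

# Read up to 1000 new entries for this consumer, then acknowledge them.
entries = r.xreadgroup(group, consumer, {stream: ">"}, count=1000, block=1000)
for _, messages in entries:
    ids = [msg_id for msg_id, _ in messages]
    if ids:
        r.xack(stream, group, *ids)  # at-least-once: ack after processing
```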
Environment variables:
- `REDIS_HOST`: Redis host
- `REDIS_PORT`: Redis port
- `REDIS_PASSWORD`: Redis password
- `REDIS_STREAM_KEYS`: Comma-separated list of stream keys
- `REDIS_CONTINUOUS`: Enable continuous consumption (true/false)
- `REDIS_CONSUMER_GROUP`: Consumer group name (required for streams)
- `REDIS_CONSUMER_NAME`: Consumer name within the group
The S3 sink uses the MinIO Python client, giving better compatibility with MinIO servers and improved reliability. It works with AWS S3, MinIO, and other S3-compatible storage services.
```yaml
sink:
  type: s3
  bucket: my-bucket
  prefix: data/raw
  aws_access_key_id: optional # Use IAM role if not provided
  aws_secret_access_key: optional
  region_name: us-east-1
  endpoint_url: optional # For MinIO or other S3-compatible services
  compression: snappy # snappy, gzip, brotli, zstd, none
  partition_by:
    - date
    - hour
```

Benefits of the MinIO client:
- Better compatibility with MinIO servers
- Simplified API with better error handling
- Enhanced streaming support for large files
- Built-in retry logic for improved resilience
- Cross-cloud compatibility (AWS S3, MinIO, GCS, etc.)
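As an illustration of the upload path, here is a minimal sketch that writes an in-memory Parquet buffer with the MinIO client. The endpoint, credentials, and object name are placeholders, and the project's actual sink logic may differ.

```python
# Sketch: write a Parquet buffer to S3-compatible storage via MinIO.
import io

import pyarrow as pa
import pyarrow.parquet as pq
from minio import Minio

client = Minio(
    "localhost:9000",            # endpoint without scheme
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

table = pa.table({"id": [1, 2], "value": [42.5, 7.0]})
buf = io.BytesIO()
pq.write_table(table, buf, compression="snappy")
buf.seek(0)

client.put_object(
    "my-bucket", "data/raw/part-0001.parquet",
    data=buf, length=buf.getbuffer().nbytes,
    content_type="application/octet-stream",
)
```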
Environment variables:
- `S3_BUCKET`: S3 bucket name
- `S3_PREFIX`: Prefix for objects
- `AWS_ACCESS_KEY_ID`: AWS access key
- `AWS_SECRET_ACCESS_KEY`: AWS secret key
- `AWS_REGION`: AWS region
- `S3_ENDPOINT_URL`: Custom endpoint for MinIO or other S3-compatible services
```yaml
sink:
  type: hdfs
  url: http://namenode:9870
  base_path: /data/raw
  user: hdfs
  compression: snappy
  partition_by:
    - date
```

Environment variables:

- `HDFS_URL`: HDFS NameNode URL
- `HDFS_BASE_PATH`: Base path for files
- `HDFS_USER`: HDFS user
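Assuming the configured `url` points at a WebHDFS endpoint (port 9870), writing a Parquet buffer with the `hdfs` Python package looks roughly like this. This is a sketch of the mechanism, not the project's implementation.

```python
# Sketch: write a Parquet buffer to HDFS over WebHDFS.
import io

import pyarrow as pa
import pyarrow.parquet as pq
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")

table = pa.table({"id": [1], "value": [42.5]})
buf = io.BytesIO()
pq.write_table(table, buf, compression="snappy")

client.write("/data/raw/part-0001.parquet", data=buf.getvalue(), overwrite=True)
```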
```yaml
telemetry:
  log_level: INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
  log_format: json # json or text
  enabled: true
  service_name: fs-data-sink
  otlp_endpoint: http://localhost:4317
  trace_enabled: true
  metrics_enabled: true
```

Environment variables:

- `LOG_LEVEL`: Logging level
- `OTEL_ENABLED`: Enable OpenTelemetry (true/false)
- `OTEL_SERVICE_NAME`: Service name for telemetry
- `OTEL_EXPORTER_OTLP_ENDPOINT`: OTLP endpoint URL
```yaml
pipeline:
  max_batches: null # null for unlimited
  batch_timeout_seconds: 30
  error_handling: log # log, raise, or ignore
  flush_interval_seconds: null # Flush interval in seconds (null = flush only at end)
  flush_interval_batches: null # Flush after N batches (null = flush only at end)
```

Flush Interval and Batching:
The pipeline uses an efficient batching strategy where data is buffered in memory and written to Parquet files only when flushed. This reduces the number of small files created and improves performance.
Flush Behavior:
- Batches are accumulated in memory via `write_batch()` calls
- Parquet files are written to disk/S3/HDFS only when `flush()` is called
- Multiple buffered batches are combined into a single Parquet file per flush (see the sketch after this section)
Flush Triggers:
By default, the sink is flushed only when the pipeline completes. You can configure periodic flushing based on:
- Time interval (`flush_interval_seconds`): Flush after the specified number of seconds
- Batch count (`flush_interval_batches`): Flush after processing the specified number of batches
If both are set, the sink will flush when either condition is met.
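The buffer-then-flush strategy can be sketched in a few lines of `pyarrow`; the names and thresholds here are illustrative, not the project's internals.

```python
# Sketch: buffer RecordBatches in memory, combine into one Parquet file
# per flush, triggered by time or batch count (whichever comes first).
import time

import pyarrow as pa
import pyarrow.parquet as pq

buffer, last_flush = [], time.monotonic()
FLUSH_SECONDS, FLUSH_BATCHES = 60, 100

def write_batch(batch: pa.RecordBatch) -> None:
    buffer.append(batch)  # no I/O yet, just buffering

def maybe_flush() -> None:
    global last_flush
    due = (time.monotonic() - last_flush >= FLUSH_SECONDS
           or len(buffer) >= FLUSH_BATCHES)
    if due and buffer:
        table = pa.Table.from_batches(buffer)  # combine all buffered batches
        pq.write_table(table, f"part-{int(time.time())}.parquet")
        buffer.clear()
        last_flush = time.monotonic()
```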
Non-blocking Behavior:
Sources are non-blocking so that time-based flushes can fire even when no data is available:
- When no data is available from the source, the generator yields `None`
- This allows the pipeline to check time-based flush conditions without blocking
- Previously, the pipeline would block waiting for data, preventing time-based flushes
Example with flush intervals:
```yaml
pipeline:
  flush_interval_seconds: 60 # Flush every 60 seconds
  flush_interval_batches: 100 # OR flush every 100 batches
```

Benefits:
- Fewer, larger Parquet files instead of many small files
- Better compression ratios
- Reduced I/O operations
- Improved query performance in analytics databases
Environment variables:
- `PIPELINE_FLUSH_INTERVAL_SECONDS`: Flush interval in seconds
- `PIPELINE_FLUSH_INTERVAL_BATCHES`: Flush interval in batches
The sink can automatically merge small Parquet files into larger consolidated files based on configured time periods (hourly, daily, weekly, monthly). This reduces the number of small files and improves query performance.
```yaml
sink:
  type: s3 # or hdfs, local
  bucket: my-bucket
  prefix: raw-data
  compression: snappy
  partition_by:
    - date
  # File merging configuration
  merge_enabled: true # Enable automatic file merging
  merge_period: hour # Time period for grouping files: 'hour', 'day', 'week', 'month'
  merge_min_files: 2 # Minimum number of files required to trigger a merge
  merge_on_flush: false # Whether to merge files during flush operations
```

Merge Behavior:
- Files are grouped by their timestamp (extracted from the filename) and the configured period (sketched after the triggers below)
- Only groups with at least `merge_min_files` files are merged
- Merged files are named with the pattern `merged_{period_key}.parquet` (e.g., `merged_20241113_06.parquet` for hourly)
- Original files are deleted after a successful merge
- Already merged files are skipped in subsequent merge operations
Merge Periods:
- `hour`: Merges files created within the same hour (e.g., all files from 2024-11-13 06:00-06:59)
- `day`: Merges files created within the same day (e.g., all files from 2024-11-13)
- `week`: Merges files created within the same ISO week (e.g., week 46 of 2024)
- `month`: Merges files created within the same month (e.g., November 2024)
Merge Triggers:
- Manual: Call `sink.merge_files()` explicitly
- On flush: Set `merge_on_flush: true` to merge after each flush operation
- Scheduled: Run merge as a separate scheduled job/cron
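To make the grouping concrete, here is a hedged sketch of period-based merging with `pyarrow`. The filename layout (a `part-YYYYMMDD_HHMMSS.parquet` timestamp) is assumed for illustration; the project's actual naming may differ.

```python
# Sketch: group timestamped Parquet files by hour and rewrite each
# qualifying group as a single merged file.
from collections import defaultdict
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

files = [p for p in Path("raw-data").glob("*.parquet")
         if not p.name.startswith("merged_")]   # skip already merged files

# Assumed filename form: part-20241113_063015.parquet; the hourly
# period key is then "20241113_06".
groups: dict[str, list[Path]] = defaultdict(list)
for p in files:
    period_key = p.stem.split("-")[-1][:11]     # YYYYMMDD_HH
    groups[period_key].append(p)

MERGE_MIN_FILES = 2
for key, members in groups.items():
    if len(members) < MERGE_MIN_FILES:
        continue
    merged = pa.concat_tables(pq.read_table(p) for p in members)
    pq.write_table(merged, Path("raw-data") / f"merged_{key}.parquet")
    for p in members:
        p.unlink()                              # delete originals after merging
```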
Example configurations:
```yaml
# Hourly merge (merge files from each hour)
sink:
  merge_enabled: true
  merge_period: hour
  merge_min_files: 5
  merge_on_flush: false
```

```yaml
# Daily merge (merge files from each day)
sink:
  merge_enabled: true
  merge_period: day
  merge_min_files: 10
  merge_on_flush: true
```

Benefits:
- Reduces the number of small files (Small File Problem)
- Improves query performance in analytics databases
- Better resource utilization in distributed systems (Spark, Presto, Hive)
- Reduced metadata overhead
Environment variables:
- `SINK_MERGE_ENABLED`: Enable file merging (true/false)
- `SINK_MERGE_PERIOD`: Merge period (hour/day/week/month)
- `SINK_MERGE_MIN_FILES`: Minimum files to trigger a merge
- `SINK_MERGE_ON_FLUSH`: Merge on flush (true/false)
Messages are expected to be JSON objects:
```json
{
  "id": 1,
  "name": "Example",
  "timestamp": "2024-01-01T00:00:00Z",
  "value": 42.5
}
```

Messages can also be in Apache Arrow IPC (inter-process communication) format for maximum performance. This is useful when producing data from systems that already use Arrow.
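For producers, an Arrow IPC message payload can be built with `pyarrow` like this (a minimal sketch; the field names are taken from the JSON example above):

```python
# Sketch: serialize one RecordBatch to Arrow IPC stream bytes, suitable
# as a message value, then read it back to verify the round trip.
import pyarrow as pa

batch = pa.table({"id": [1], "name": ["Example"], "value": [42.5]}).to_batches()[0]

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
payload = sink.getvalue().to_pybytes()   # bytes for the message value

reader = pa.ipc.open_stream(payload)     # consumer side
assert reader.read_next_batch().num_rows == 1
```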
Stream data from Kafka to S3/HDFS for long-term storage and analytics:
```bash
fs-data-sink \
  --source-type kafka \
  --sink-type s3 \
  --config production-config.yaml
```

Process accumulated data from Redis streams/lists:
```bash
fs-data-sink \
  --source-type redis \
  --sink-type hdfs \
  --max-batches 100
```

After data is written to S3/HDFS, import it into ClickHouse:
```sql
-- Create table in ClickHouse
CREATE TABLE my_table
(
    id UInt32,
    name String,
    timestamp DateTime,
    value Float64
)
ENGINE = MergeTree()
ORDER BY (timestamp, id);

-- Import from S3
INSERT INTO my_table
SELECT * FROM s3(
    'https://my-bucket.s3.amazonaws.com/raw-data/*.parquet',
    'Parquet'
);
```

```bash
# Clone repository
git clone https://github.com/poly-workshop/fs-data-sink.git
cd fs-data-sink

# Install with development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linters
ruff check .
black --check .

# Format code
black .
```

```
fs-data-sink/
├── src/fs_data_sink/
│   ├── __init__.py
│   ├── cli.py              # CLI entry point
│   ├── pipeline.py         # Main pipeline orchestration
│   ├── types.py            # Base interfaces
│   ├── sources/            # Data source implementations
│   │   ├── kafka_source.py
│   │   └── redis_source.py
│   ├── sinks/              # Data sink implementations
│   │   ├── s3_sink.py
│   │   └── hdfs_sink.py
│   ├── config/             # Configuration management
│   │   └── settings.py
│   └── telemetry/          # Observability
│       └── setup.py
├── tests/                  # Test suite
├── config/                 # Example configurations
├── pyproject.toml          # Project metadata
└── README.md
```
Structured JSON logging is available:

```json
{
  "time": "2024-01-01T12:00:00",
  "level": "INFO",
  "name": "fs_data_sink.pipeline",
  "message": "Processed batch 1: 1000 records",
  "function": "run",
  "line": 145
}
```

When OpenTelemetry is enabled, the following metrics are exposed:

- `fs_data_sink.batches_processed`: Number of batches processed
- `fs_data_sink.records_processed`: Number of records processed
- `fs_data_sink.errors`: Number of errors encountered
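As a sketch, counters with these names can be emitted through the OpenTelemetry Python API as shown below. Without a configured SDK and exporter the calls are no-ops; this is illustrative rather than the project's instrumentation code.

```python
# Sketch: emit the counters listed above via the OpenTelemetry API.
from opentelemetry import metrics

meter = metrics.get_meter("fs_data_sink")
batches_counter = meter.create_counter("fs_data_sink.batches_processed")
records_counter = meter.create_counter("fs_data_sink.records_processed")
errors_counter = meter.create_counter("fs_data_sink.errors")

batches_counter.add(1)       # one batch handled
records_counter.add(1000)    # records in that batch
```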
Distributed traces are available for:
- Pipeline execution
- Source operations (`connect`, `read_batch`, `close`)
- Sink operations (`connect`, `write_batch`, `flush`, `close`)
- Individual batch processing
- Batch Size: Adjust `batch_size` based on message size and available memory
- Compression: Use `snappy` for a balance of speed and compression, `zstd` for better compression
- Partitioning: Use appropriate partition columns for your query patterns (see the sketch after this list)
- Arrow IPC: Use Arrow IPC format for maximum performance when possible
- Parallel Pipelines: Run multiple pipeline instances with different consumer groups
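As referenced in the Partitioning tip above, `pyarrow`'s `write_to_dataset` illustrates how partition columns map to Hive-style directories. This is a sketch of the general technique; the sink's exact layout may differ.

```python
# Sketch: partitioning by column produces one directory per value,
# e.g. raw-data/date=2024-01-01/... and raw-data/date=2024-01-02/...
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "id": [1, 2, 3],
    "value": [42.5, 7.0, 3.14],
})
pq.write_to_dataset(table, root_path="raw-data", partition_cols=["date"])
```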
```bash
# Test Kafka connection
kafka-console-consumer --bootstrap-server localhost:9092 --topic my-topic --max-messages 1

# Test Redis connection
redis-cli ping

# Test S3 access
aws s3 ls s3://my-bucket/

# Test HDFS access
hdfs dfs -ls /
```

Enable debug logging for troubleshooting:

```bash
fs-data-sink --log-level DEBUG --config config.yaml
```

MIT License
Contributions are welcome! Please open an issue or pull request.
For issues and questions:
- GitHub Issues: https://github.com/poly-workshop/fs-data-sink/issues
- Documentation: See this README