Zero-disk ETL pipeline for streaming massive datasets directly from compressed sources to PostgreSQL via AWS S3
Stream 40GB+ compressed data without ever writing decompressed files to disk. Built for scale, designed for cost-efficiency.
Traditional ETL pipelines for large datasets:
Download (2.7GB) → Decompress to Disk (40GB) → Upload to S3 → Insert to DB
❌ Disk Space Exhausted
❌ Network Saturated
❌ Slow & Expensive
Hotel Stream-ETL:
Download → Decompress (Memory Only) → S3 → PostgreSQL
✅ Zero Intermediate Disk Writes
✅ 196+ Records/Sec
✅ <500MB Memory
✅ Production-Ready
| Metric | Value |
|---|---|
| Test Dataset | 40GB (2.7GB compressed) |
| Memory Used | <500MB |
| Insert Rate | 196 records/sec |
| Total Time | 1,049 seconds (17.5 min) |
| Success Rate | 100% |
| Disk Space Required | 0 bytes* |
*Other than the destination database's own storage
```
┌────────────────────────┐
│   Compressed Source    │
│   (2.7GB zstd file)    │
└───────────┬────────────┘
            │ Stream download + decompress
            ▼
┌────────────────────────┐
│  S3 Multipart Upload   │
│ (no disk intermediate) │
└───────────┬────────────┘
            │ Stream + batch
            ▼
┌────────────────────────┐
│   PostgreSQL Insert    │
│    (Batch UPSERT)      │
│    196 records/sec     │
└────────────────────────┘
```
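The zero-disk shape above can be sketched with Node's stream `pipeline`: bytes flow from the download through a decompressor into the sink without ever hitting the filesystem. This is a minimal illustration, not the project's code — it uses Node's built-in gzip as a stand-in for zstd (which would need a zstd library), and an in-memory sink in place of the AWS SDK multipart upload.

```typescript
// Illustrative sketch of the zero-disk streaming path. Assumptions: gzip
// stands in for zstd, and `sink` stands in for the S3 multipart upload.
import { Readable, Writable } from "node:stream";
import { createGzip, createGunzip } from "node:zlib";
import { pipeline } from "node:stream/promises";

async function streamDecompress(
  compressed: Readable,
  sink: Writable,
): Promise<void> {
  // download -> decompress -> sink, entirely in memory; backpressure
  // keeps peak usage bounded, and nothing is written to disk.
  await pipeline(compressed, createGunzip(), sink);
}

// Demo: round-trip a small payload through the streaming path.
async function demo(): Promise<string> {
  const payload = "hotel-records\n".repeat(3);

  // Produce a compressed "source" in memory.
  const gzipped: Buffer[] = [];
  await pipeline(
    Readable.from([Buffer.from(payload)]),
    createGzip(),
    new Writable({
      write(chunk, _enc, cb) { gzipped.push(chunk); cb(); },
    }),
  );

  // Stream-decompress it into an in-memory sink.
  const out: Buffer[] = [];
  await streamDecompress(
    Readable.from(gzipped),
    new Writable({
      write(chunk, _enc, cb) { out.push(chunk); cb(); },
    }),
  );
  return Buffer.concat(out).toString();
}
```

Because `pipeline` propagates backpressure, a slow consumer (the database) automatically throttles the producer (the download), which is what keeps memory under 500MB.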
- ✅ Pure Streaming - No decompressed files on disk
- ✅ High Throughput - 196+ records/sec insert rate
- ✅ Memory Efficient - Peak usage <500MB
- ✅ Type Safe - Full TypeScript, strict mode
- ✅ S3 Integration - AWS SDK v3, multipart uploads
- ✅ Connection Pooling - Optimized PostgreSQL access
- ✅ Real-time Monitoring - CLI dashboard with stats
- ✅ Error Handling - Retry logic with backoff
- ✅ SSL/TLS Support - Secure database connections
- ✅ Caching - Skip downloads for recent files
- ✅ Production Ready - Comprehensive logging
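The retry-with-backoff feature above can be sketched generically. This is an illustrative helper, not the project's actual implementation (its delays and error classification may differ):

```typescript
// Minimal retry with exponential backoff: 500ms, 1s, 2s, ... per attempt.
// Illustrative sketch only; the project's real retry logic may differ.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      const delay = baseDelayMs * 2 ** attempt; // double the wait each time
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A batch insert could then be wrapped as `await withRetry(() => db.insertHotels(batch))`, so transient network or database errors don't fail the whole run.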
- Node.js 18+
- PostgreSQL 13+
- AWS Account (S3 access)
```bash
# Clone repository
git clone https://github.com/yourusername/stream-etl.git
cd stream-etl

# Install dependencies
npm install

# Configure environment
cp .env.example .env
# Edit .env with your credentials

# Build
npm run build

# Run
npm run pipeline

# Monitor (in another terminal)
npm run monitor
```

Create a `.env` file:
```env
# Data Source
KEY_ID=your_api_key_id
API_KEY=your_api_key

# AWS S3
AWS_REGION=us-east-1
S3_BUCKET=your-bucket-name
AWS_ACCESS_KEY_ID=***
AWS_SECRET_ACCESS_KEY=***

# PostgreSQL
DB_HOST=your-db-host
DB_PORT=5432
DB_NAME=your_database
DB_USER=your_user
DB_PASSWORD=***
```

| Command | Description |
|---|---|
| `npm run pipeline` | Downloads → Decompresses → Uploads to S3 → Inserts into PostgreSQL |
| `npm run monitor` | Real-time statistics and progress |
| `npm run build` | Compile TypeScript |
| `npm run type-check` | Type-check only |

| Component | Technology |
|---|---|
| Language | TypeScript 5.0 |
| Runtime | Node.js 18+ |
| Cloud | AWS S3 |
| Database | PostgreSQL 13+ |
| Streaming | Node.js Transform streams |
| Compression | zstd |
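The streaming row above refers to Node's Transform streams. One common shape, sketched here as an illustration (not the project's internals), is a Transform that groups decompressed JSONL records into insert-sized batches:

```typescript
// Illustrative sketch: an object-mode Transform that groups incoming
// records into fixed-size batches for bulk insert.
import { Transform, TransformCallback } from "node:stream";

function batcher<T>(batchSize: number): Transform {
  let batch: T[] = [];
  return new Transform({
    objectMode: true,
    transform(record: T, _enc: BufferEncoding, cb: TransformCallback) {
      batch.push(record);
      if (batch.length >= batchSize) {
        this.push(batch); // emit a full batch downstream
        batch = [];
      }
      cb();
    },
    flush(cb: TransformCallback) {
      if (batch.length > 0) this.push(batch); // emit the final partial batch
      cb();
    },
  });
}
```

Piped between the S3 read stream and the database writer, this keeps only one batch in memory at a time while still amortizing insert overhead across many records.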
```
╔══════════════════════════════════════════════════════════╗
║ Stream-ETL: Hotel Data Sync Pipeline ║
╚══════════════════════════════════════════════════════════╝
Step 0: Testing database connection...
✓ PostgreSQL connection successful
Step 1: Checking for recent file...
✓ Found recent S3 file (0.0 hours old)
Step 2: (Skipped - using cached)
Step 3: Streaming from S3 and inserting...
✓ Processed 530 batches (106,000 records)

Performance Summary:
  Insert time: 542.18s
  Insert rate: 196 records/sec
  Success rate: 100% (0 failed)
  Total time: 1589.88s
```
✅ Good for:
- One-time or infrequent data syncs
- Fixed-size datasets (GBs to TBs)
- Constrained disk environments
- Node.js infrastructure
- Budget-conscious operations
❌ Consider alternatives for:
- Complex multi-step transformations → Apache Spark
- Mission-critical recurring ETL → Apache Airflow
- Multi-source orchestration → AWS Glue
- Real-time streaming → Kafka/Flink
No records inserted?

```sql
SELECT COUNT(*) FROM hotels;
```

Connection timeout?

```bash
# Verify database connectivity
telnet your-db-host 5432
```

S3 upload stalled?

```bash
# Check S3 bucket access
aws s3 ls s3://your-bucket/
```

See the Troubleshooting Guide for more.
We welcome contributions! Please see CONTRIBUTING.md for:
- Code standards
- Development setup
- Pull request process
- Reporting issues
Quick start:

```bash
git checkout -b feature/your-feature
npm run build
npm run test
git push origin feature/your-feature
# Open PR
```

MIT License - see LICENSE for details.
- Issues → GitHub Issues
- Discussions → GitHub Discussions
- Email → dev@yourdomain.com
If Stream-ETL helped you, please give us a star!
Made with ❤️ for data engineers who care about performance
Set up S3 credentials in `.env`:

```env
AWS_REGION=us-east-1
S3_BUCKET=your-bucket-name
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
```

Add your WorldOTA API credentials in `.env`:

```env
KEY_ID=your_api_key_id
API_KEY=your_api_key
```

Complete `.env` example:

```env
# WorldOTA API
KEY_ID=your_key_id
API_KEY=your_api_key

# AWS S3
AWS_REGION=us-east-1
S3_BUCKET=hotel-dumps
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key

# PostgreSQL
DB_HOST=localhost
DB_PORT=5432
DB_NAME=hotel_sync
DB_USER=postgres
DB_PASSWORD=your_password
```

Build and run:

```bash
npm run build
npm start
```

Or with TypeScript directly:
```bash
npx ts-node src/pipelines/usage.ts
```

```typescript
import HotelSyncPipeline from "./pipelines/hotelSyncPipeline";

const pipeline = new HotelSyncPipeline({
  keyId: process.env.KEY_ID!,
  apiKey: process.env.API_KEY!,
  s3Config: {
    region: process.env.AWS_REGION!,
    bucket: process.env.S3_BUCKET!,
    accessKeyId: process.env.AWS_ACCESS_KEY_ID!,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY!,
  },
  postgresConfig: {
    host: process.env.DB_HOST!,
    port: parseInt(process.env.DB_PORT || "5432", 10),
    database: process.env.DB_NAME!,
    user: process.env.DB_USER!,
    password: process.env.DB_PASSWORD!,
  },
  batchSize: 200,
  keepLocalFiles: false,
});

const stats = await pipeline.run();
console.log(stats);
```

`HotelDumpService` handles downloading and decompressing hotel dumps from WorldOTA.
```typescript
const service = new HotelDumpService({
  keyId: "your_key",
  apiKey: "your_api_key",
  downloadDir: "./downloads",
  maxRetries: 3,
});

const { records, stats } = await service.processHotelDump();
```

`PostgresService` manages PostgreSQL connections and inserts.
```typescript
const db = new PostgresService({
  host: "localhost",
  port: 5432,
  database: "hotel_sync",
  user: "postgres",
  password: "password",
});

await db.testConnection();
await db.createHotelsTable();
await db.insertHotels(records, { upsert: true });
```

The S3 streaming helpers handle S3 operations and streaming.
```typescript
// Decompress directly to S3
await decompressStreamToS3(compressedPath, s3Config, "hotels.jsonl");

// Stream from S3 with batching
await streamHotelsFromS3(s3Config, "hotels.jsonl", async (batch) => {
  console.log(`Processing batch of ${batch.length}`);
});
```

`HotelSyncPipeline` orchestrates the complete workflow.

```typescript
const pipeline = new HotelSyncPipeline(config);
const stats = await pipeline.run();
```

The pipeline creates a `hotels` table with the following structure:
```sql
CREATE TABLE hotels (
  id SERIAL PRIMARY KEY,
  hotel_id VARCHAR(255) UNIQUE NOT NULL,
  name VARCHAR(500),
  description TEXT,
  country VARCHAR(255),
  state VARCHAR(255),
  city VARCHAR(255),
  zip_code VARCHAR(20),
  address VARCHAR(500),
  latitude DECIMAL(10, 8),
  longitude DECIMAL(11, 8),
  star_rating DECIMAL(2, 1),
  phone VARCHAR(20),
  fax VARCHAR(20),
  website VARCHAR(500),
  email VARCHAR(255),
  check_in_time VARCHAR(10),
  check_out_time VARCHAR(10),
  images TEXT[],
  amenities TEXT[],
  languages TEXT[],
  raw_data JSONB,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

Indexes are created on:

- `hotel_id` (unique)
- `city`
- `country`
- `star_rating`
- `updated_at`
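Because `hotel_id` is unique, batch UPSERTs can target it with `ON CONFLICT`. The sketch below shows what a query builder behind `insertHotels(records, { upsert: true })` could look like; the function name, the reduced column set, and the exact SQL are illustrative assumptions, not the project's actual code.

```typescript
// Illustrative sketch: build one parameterized batch UPSERT statement
// keyed on the hotels table's unique hotel_id. Column set reduced for brevity.
interface HotelRow {
  hotel_id: string;
  name: string;
  city: string;
}

function buildUpsert(rows: HotelRow[]): { text: string; values: string[] } {
  const cols = ["hotel_id", "name", "city"] as const;
  const values: string[] = [];
  const tuples = rows.map((row, i) => {
    cols.forEach((c) => values.push(row[c]));
    const base = i * cols.length; // placeholder offset for this row
    return `($${base + 1}, $${base + 2}, $${base + 3})`;
  });
  const text =
    `INSERT INTO hotels (${cols.join(", ")}) VALUES ${tuples.join(", ")} ` +
    `ON CONFLICT (hotel_id) DO UPDATE SET ` +
    `name = EXCLUDED.name, city = EXCLUDED.city, ` +
    `updated_at = CURRENT_TIMESTAMP`;
  return { text, values };
}
```

Sending one multi-row statement per batch instead of one `INSERT` per record is what makes the ~200 records/sec rate feasible over a remote database connection.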
- Batch Size: Adjust `batchSize` based on memory and network:
  - Small machines: 50-100
  - Normal machines: 200-500
  - Large machines: 1000+
- Database Connection Pool: Increase `maxConnections` for faster inserts (default 20)
- AWS Region: Use the same region as your database if possible to reduce latency
- Disk Space: Ensure at least 5GB free space for download/decompression
The pipeline provides detailed logs:
```
Step 0: Testing database connection...
✓ PostgreSQL connection successful: 2024-02-05 12:00:00
✓ Hotels table ready

Step 1: Fetching and downloading dump file...
Downloading dump file...
Progress: 45.2%
✓ Dump file downloaded: ./downloads/hotel_dump_1770163443318.jsonl.zst (2345.67MB)

Step 2: Decompressing and uploading to S3...
Decompressing and uploading to S3...
Upload progress: 78.3%
✓ File decompressed and uploaded to S3: s3://bucket/hotel_dump_1770163443318.jsonl

Step 3: Streaming from S3 and inserting into PostgreSQL...
Inserted: 1000 records
Inserted: 2000 records
✓ Streamed 50000 hotel records from S3

Performance Summary:
  Download: 120.45s
  Decompress+S3: 45.20s
  DB Insert: 180.30s
  Total: 345.95s

Data Summary:
  Total Records: 50000
  Successful: 49850
  Failed: 150
  Success Rate: 99.70%
```
Database connection failed:

- Check PostgreSQL is running: `psql -h localhost`
- Verify credentials in `.env`

S3 errors:

- Verify AWS credentials
- Check the bucket exists and is accessible
- Ensure the region is correct

High memory usage:

- Reduce `batchSize`
- Check available system memory
- Close other applications

Slow inserts:

- Increase `maxConnections` in the PostgreSQL config
- Check database disk space
- Verify network latency to the database
MIT