
Drava - Ultra-High-Performance Synthetic Log Generator

Drava is an ultra-high-performance synthetic log generator designed to simulate distributed microservices producing logs at extreme rates. It generates structured JSON logs with realistic fields and can either stream them to stdout or write them directly to Parquet files.

Features

  • 🚀 Extreme Performance: Generate logs from 100/sec to 2M+/sec with real-time precision
  • ⚡ Maximum Speed Generation: New --count option generates an exact number of logs at 700k+/sec
  • 🔥 Batch-Optimized: Dynamic batching system for maximum throughput at high rates
  • Multiple Output Modes:
    • stdout - JSON logs for piping with optimized I/O batching
    • parquet - Direct-to-Parquet with hourly partitioning and high-rate batching
  • Realistic Data: Configurable services, log levels, and message templates
  • Deterministic: Optional seed for reproducible generation
  • Lightweight: Single Rust binary with no external runtime dependencies
  • Smart Timing: Hybrid spin-wait/sleep mechanism for microsecond precision

Installation

From Source

git clone https://github.com/pondwatch/drava
cd drava
cargo build --release
./target/release/drava --help

Usage

Basic Examples

Generate exact number of logs at maximum speed:

# Generate exactly 10,000 logs at max speed (~700k/sec)
drava --count 10000

# Generate 1 million logs to Parquet files at max speed
drava --count 1000000 --output parquet

# Generate 50k logs with deterministic seed
drava --count 50000 --seed 12345

Generate logs at specific rates:

# Generate 1000 logs/sec for 10 seconds
drava --rate 1000 --duration 10

# Generate logs indefinitely at 5000/sec
drava --rate 5000

# Generate 10k logs/sec for 60 seconds, save to Parquet
drava --rate 10000 --duration 60 --output parquet

Customize services and log levels:

# Only generate logs for specific services
drava --services "api,database,cache" --levels "INFO,ERROR"

# Generate exact count with custom services
drava --count 25000 --services "payments,auth" --levels "ERROR,WARN"
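
Conceptually, --services, --levels, and --seed boil down to sampling from fixed value lists with an optionally seeded RNG, so the same seed yields the same sequence of picks. The sketch below illustrates that idea only; it assumes the rand crate (version 0.8, which is not among the dependencies listed later) and hypothetical helper names, not the project's actual code.

use rand::{rngs::StdRng, Rng, SeedableRng};

// Hypothetical helper: pick one (service, level) pair, reproducibly when seeded.
fn pick<'a>(rng: &mut StdRng, services: &'a [&str], levels: &'a [&str]) -> (&'a str, &'a str) {
    let service = services[rng.gen_range(0..services.len())];
    let level = levels[rng.gen_range(0..levels.len())];
    (service, level)
}

fn main() {
    let services = ["payments", "auth"];
    let levels = ["ERROR", "WARN"];
    // Same seed => same sequence of picks on every run.
    let mut rng = StdRng::seed_from_u64(12345);
    for _ in 0..3 {
        let (service, level) = pick(&mut rng, &services, &levels);
        println!("{service} {level}");
    }
}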

Command Line Options

Options:
  -r, --rate <RATE>          Log generation rate per second [default: 10000]
  -s, --services <SERVICES>  Service names (comma-separated) [default: api,auth,payments,users,frontend]
  -l, --levels <LEVELS>      Log levels (comma-separated) [default: INFO,WARN,ERROR,DEBUG]
  -d, --duration <DURATION>  Duration in seconds (0 for infinite) [default: 0]
  -c, --count <COUNT>        Number of logs to generate (overrides rate and duration for max speed generation)
      --seed <SEED>          Random seed for deterministic generation
  -o, --output <OUTPUT>      Output mode: stdout or parquet [default: stdout]
  -h, --help                 Print help
  -V, --version              Print version
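
For reference, the option set above maps naturally onto clap's derive API. The sketch below is a hypothetical reconstruction that mirrors the help text (assuming clap 4 with the derive feature), not the project's actual argument parser.

use clap::Parser;

// Hypothetical reconstruction of the CLI shown above.
#[derive(Parser, Debug)]
#[command(version, about = "Ultra-high-performance synthetic log generator")]
struct Args {
    /// Log generation rate per second
    #[arg(short, long, default_value_t = 10_000)]
    rate: u64,

    /// Service names (comma-separated)
    #[arg(short, long, default_value = "api,auth,payments,users,frontend")]
    services: String,

    /// Log levels (comma-separated)
    #[arg(short, long, default_value = "INFO,WARN,ERROR,DEBUG")]
    levels: String,

    /// Duration in seconds (0 for infinite)
    #[arg(short, long, default_value_t = 0)]
    duration: u64,

    /// Number of logs to generate (overrides rate and duration)
    #[arg(short, long)]
    count: Option<u64>,

    /// Random seed for deterministic generation
    #[arg(long)]
    seed: Option<u64>,

    /// Output mode: stdout or parquet
    #[arg(short, long, default_value = "stdout")]
    output: String,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}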

Maximum Speed Generation

The --count option is ideal for generating an exact number of logs at maximum speed, with no rate limiting:

Key Benefits

  • 🎯 Exact Count: Generate precisely the number of logs you need
  • ⚡ Maximum Speed: No rate limiting - generates as fast as your system allows
  • 📊 Performance Reporting: Shows actual generation rate achieved
  • 🔧 All Options Supported: Works with services, levels, seed, and output modes
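
As a rough illustration of what this mode amounts to (a minimal sketch, not the project's actual code), maximum-speed generation is essentially a tight loop that writes pre-formatted JSON lines through a large buffered writer and reports the achieved rate afterwards:

use std::io::{BufWriter, Write};
use std::time::Instant;

fn generate_max_speed(count: u64) -> std::io::Result<()> {
    let stdout = std::io::stdout();
    // One large buffer; flush once at the end instead of per line.
    let mut out = BufWriter::with_capacity(1 << 20, stdout.lock());
    let start = Instant::now();
    for i in 0..count {
        // In the real tool each line is a full log record; this placeholder stands in for it.
        writeln!(out, r#"{{"service":"api","level":"INFO","seq":{i}}}"#)?;
    }
    out.flush()?;
    let secs = start.elapsed().as_secs_f64();
    eprintln!("generated {count} logs in {secs:.3}s ({:.0}/sec)", count as f64 / secs);
    Ok(())
}

fn main() -> std::io::Result<()> {
    generate_max_speed(100_000)
}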

Performance Comparison

# Rate-limited: Generate 100k logs at 100k/sec (takes 1 second)
drava --rate 100000 --duration 1

# Maximum speed: Generate 100k logs as fast as possible (takes ~0.15 seconds!)
drava --count 100000

Use Cases for --count

  • Benchmarking: Generate exact dataset sizes for consistent testing
  • Quick Prototyping: Instantly create test datasets of any size
  • Performance Testing: Generate large datasets in minimal time
  • Controlled Experiments: Ensure reproducible dataset sizes

Log Schema

Each log entry includes:

  • timestamp - ISO8601 timestamp
  • service - Service name (e.g., "api", "auth", "payments")
  • level - Log level ("INFO", "WARN", "ERROR", "DEBUG")
  • message - Realistic message from predefined templates
  • trace_id - UUID for request tracing

Example Log Entry

{
  "timestamp": "2025-09-09T18:31:07.666301217+00:00",
  "service": "payments",
  "level": "INFO",
  "message": "User login successful",
  "trace_id": "9278ce8d-3960-4c7c-83a6-8b4673d77c4f"
}
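
In Rust terms, the schema could be modelled roughly as the struct below. This is a hedged sketch assuming serde and serde_json for serialization; the project's internal representation may differ (serde is not among the dependencies listed later).

use serde::Serialize;

#[derive(Serialize)]
struct LogRecord {
    /// ISO8601 timestamp, e.g. chrono's Utc::now().to_rfc3339()
    timestamp: String,
    /// Service name, e.g. "api", "auth", "payments"
    service: String,
    /// Log level: "INFO", "WARN", "ERROR", or "DEBUG"
    level: String,
    /// Message drawn from predefined templates
    message: String,
    /// UUID for request tracing
    trace_id: String,
}

fn main() {
    let record = LogRecord {
        timestamp: "2025-09-09T18:31:07.666301217+00:00".into(),
        service: "payments".into(),
        level: "INFO".into(),
        message: "User login successful".into(),
        trace_id: "9278ce8d-3960-4c7c-83a6-8b4673d77c4f".into(),
    };
    println!("{}", serde_json::to_string(&record).unwrap());
}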

Parquet Output

When using --output parquet, logs are written to files with hourly partitioning:

logs/
├── 2025/
│   └── 09/
│       └── 09/
│           └── 18/
│               ├── part-<uuid1>.parquet
│               └── part-<uuid2>.parquet

Files are compressed with ZSTD and can be queried directly with DuckDB:

-- Query all logs from a specific hour
SELECT COUNT(*), service, level 
FROM 'logs/2025/09/09/18/*.parquet' 
GROUP BY service, level;

-- Time-based analysis
SELECT DATE_TRUNC('minute', timestamp) as minute, COUNT(*) 
FROM 'logs/2025/09/09/18/*.parquet' 
GROUP BY minute 
ORDER BY minute;
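
To make the output path concrete, the sketch below builds an hourly partition directory and writes a ZSTD-compressed Parquet file with the arrow/parquet crates. It is an illustration under assumed crate versions (and an assumed uuid dependency for the part file name), not the tool's actual writer.

use std::{fs, path::PathBuf, sync::Arc};

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use chrono::Utc;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

fn write_hourly_partition(services: Vec<&str>) -> Result<(), Box<dyn std::error::Error>> {
    // logs/YYYY/MM/DD/HH/part-<uuid>.parquet, matching the layout shown above.
    let dir = PathBuf::from(Utc::now().format("logs/%Y/%m/%d/%H").to_string());
    fs::create_dir_all(&dir)?;
    let path = dir.join(format!("part-{}.parquet", uuid::Uuid::new_v4()));

    // A one-column batch stands in for the full log schema.
    let schema = Arc::new(Schema::new(vec![Field::new("service", DataType::Utf8, false)]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(StringArray::from(services))])?;

    // ZSTD compression, as described above.
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        .build();
    let mut writer = ArrowWriter::try_new(fs::File::create(&path)?, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    write_hourly_partition(vec!["api", "auth", "payments"])
}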

Performance

  • 🔥 Extreme Rates: Successfully tested from 100/sec to 2,000,000+ logs/sec
  • ⚡ Real-time Generation: Generates logs in real-time with microsecond precision
  • 💪 Massive Throughput: Up to 172 billion logs per day at maximum rates
  • 🧠 Smart Batching: Dynamic batch sizes (1% of rate) for optimal performance
  • 💾 Memory Efficient: Low memory footprint with buffer reuse and efficient batching
  • 📁 File Size: ~78KB parquet file for ~2400 logs with ZSTD compression
  • 🦆 DuckDB Compatible: Generated parquet files work seamlessly with DuckDB

Benchmark Results

Rate-Limited Generation:

Rate      Duration  Runtime  Efficiency  Logs Generated
50k/sec   2s        2.01s    99.5%       ~100k
100k/sec  1s        1.00s    100%        ~100k
500k/sec  1s        1.00s    100%        ~500k
1M/sec    1s        1.00s    100%        ~1M
2M/sec    1s        1.00s    100%        ~2M

Maximum Speed Generation (--count):

Count      Output   Time    Speed     Notes
1,000      stdout   0.002s  543k/sec  Small batch
100,000    stdout   0.156s  642k/sec  Medium batch
1,000,000  stdout   1.589s  629k/sec  Large batch
250,000    parquet  0.346s  722k/sec  Compressed Parquet

Use Cases

  • 🏋️ Extreme Performance Testing: Stress-test log ingestion pipelines at enterprise scale
  • 🚀 High-Rate Data Generation: Generate massive datasets for big data testing
  • ⚡ Real-time Simulation: Simulate high-traffic microservices environments
  • 📊 Development: Generate test data without real microservices
  • 🎓 Education: Learn log analytics with realistic datasets
  • 📈 Benchmarking: Test DuckDB query performance on log data
  • 🔬 Load Testing: Generate billions of logs for system stress testing
  • 🎯 Exact Dataset Creation: Generate precise number of logs for controlled experiments
  • ⏱️ Quick Prototyping: Generate specific log counts instantly for testing

Technical Details

  • Language: Rust for maximum performance and zero-cost abstractions
  • Dependencies: Arrow, Parquet, Chrono, Clap, Tokio
  • Compression: ZSTD for optimal parquet file sizes
  • Smart Batching:
    • Rate-limited: Dynamic batch sizes (1% of rate, 100-10k logs per batch)
    • Maximum speed: Large batches (10k stdout, 50k Parquet) for optimal throughput
    • Batch-level I/O operations to minimize syscalls
    • Buffer reuse to reduce memory allocations
  • Precision Timing (sketched after this list):
    • Hybrid spin-wait for sub-100μs intervals
    • Thread sleep for longer intervals
    • Processing time compensation for accurate rates
  • Threading: Single-threaded with microsecond-precise timing control
  • Optimizations:
    • JSON buffer reuse
    • Batch stdout flushing
    • Optimized timestamp parsing
    • Minimal memory allocations
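
The batching and pacing ideas above can be condensed into a few lines. The following is a minimal sketch of the stated design (batches of roughly 1% of the rate clamped to 100-10k, sleep for coarse waits, spin for the final sub-100µs stretch), not the actual implementation:

use std::time::{Duration, Instant};

// Dynamic batch size: ~1% of the target rate, clamped to 100..=10_000.
fn batch_size(rate: u64) -> u64 {
    (rate / 100).clamp(100, 10_000)
}

// Hybrid pacing: sleep away most of the interval, spin for the last ~100µs.
fn pace_until(deadline: Instant) {
    const SPIN_THRESHOLD: Duration = Duration::from_micros(100);
    loop {
        let now = Instant::now();
        if now >= deadline {
            return;
        }
        let remaining = deadline - now;
        if remaining > SPIN_THRESHOLD {
            std::thread::sleep(remaining - SPIN_THRESHOLD);
        } else {
            std::hint::spin_loop();
        }
    }
}

fn main() {
    let rate = 1_000_000u64;
    let batch = batch_size(rate);
    let interval = Duration::from_secs_f64(batch as f64 / rate as f64);
    let mut next = Instant::now();
    for _ in 0..5 {
        // emit_batch(batch) would run here; advancing `next` by a fixed interval
        // compensates for the time the batch itself took to produce.
        next += interval;
        pace_until(next);
    }
    println!("batch size {batch}, interval {:?}", interval);
}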

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request
