
pipeflow


中文文档 (Chinese documentation)

A lightweight, configuration-driven data pipeline framework for Rust.

Source → Transform → Sink

Features

  • YAML Configuration: Declarative pipeline definition with DAG validation (duplicate IDs, missing to targets, cycles)
  • Fan-out: One source can fan out to multiple downstream transforms or sinks
  • Built-in Nodes:
    • Sources: http_client, http_server, file, redis, sql
    • Sinks: console, file, blackhole, http_client, http_server, redis, sql, notify::email, notify::telegram, notify::webhook
  • Management API: Optional HTTP API (/health, /metrics, /config, /config/graph), internal use only
  • CLI: run, config validate, config show, config graph

Feature Flags

Pipeflow uses Cargo features to keep optional dependencies behind flags.

  • api: Enables the management API server.
  • http-client (default): Enables the http_client source and sink.
  • http-server: Enables the http_server source and sink.
  • database: Enables sql source and sink.
  • redis: Enables the redis source and sink.
  • file (default): Enables the file source and sink.
  • notify (default): Enables the notify::email, notify::telegram, and notify::webhook sinks.

Core-only build (no optional sources/sinks):

cargo build --no-default-features

If a pipeline config references a node behind a disabled feature, Engine::build() returns a configuration error explaining which feature is required.
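Feature selection can also be expressed in Cargo.toml. A minimal sketch (the version is a placeholder; pin to the release you actually use, and pick the feature names from the list above):

```toml
[dependencies]
# Disable defaults, then opt back in to only what this pipeline needs.
pipeflow = { version = "*", default-features = false, features = ["http-client", "file"] }
```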

Quick Start

Requirements

  • Rust 1.92 or later (uses Rust 2024 edition)

Installation

cargo add pipeflow

Configuration

Create a pipeline configuration file pipeline.yaml:

system:
  # Management API configuration (optional, internal use only)
  # Binds to localhost by default. Use a reverse proxy for external access.
  api:
    enabled: true
    port: 8000
    # bind: "127.0.0.1"  # Default: localhost only
  # channel_size: default buffer size for transform/sink/internal channels (default: 256)
  channel_size: 1024

pipeline:
  sources:
    - id: api_poller
      type: http_client
      config:
        urls:
          - name: "default"
            url: "https://httpbin.org/json"
        interval: "10s"
        # schedule: "0 0 * * *" # Run daily at 00:00 (local time, 5 fields; seconds default to 0)
      to: [pass_through]

  transforms:
    - id: pass_through
      to: [console]

  sinks:
    - id: console
      type: console
      config:
        format: pretty

Wiring Nodes

Pipeflow wiring is source → transform → sink:

  • Sources and transforms declare to (one or more downstream transforms, sinks, or internal channels).
  • to must be non-empty for sources and transforms.
  • Transforms may omit steps to act as pass-through nodes.
  • Sinks are terminal and do not declare to; their target is defined by sink type/config (e.g. file path).
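The fan-out feature falls out of this wiring: a source's to may list several targets. A minimal sketch (node IDs are made up for illustration, and the file sink's path key is an assumption; check docs/configuration/INDEX.md for the real schema):

```yaml
pipeline:
  sources:
    - id: poller
      type: http_client
      config:
        urls:
          - name: "default"
            url: "https://example.com/data"
        interval: "30s"
      to: [console_out, file_out]   # one source fans out to two sinks

  sinks:
    - id: console_out
      type: console
    - id: file_out
      type: file
      config:
        path: "./out.jsonl"         # assumed config key
```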

Loading from a directory

All commands that accept CONFIG also accept a directory. When a directory is provided, pipeflow loads all *.yaml / *.yml files in lexical order and merges them into a single configuration before normalization and validation.

This is useful for larger pipelines:

# Directory-based config
pipeflow run ./configs/
pipeflow config validate ./configs/
pipeflow config show ./configs/ --format yaml
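For example, a larger pipeline might be split by concern (file names are illustrative; files merge in lexical order, so numeric prefixes make the order explicit):

```yaml
# configs/00-system.yaml
system:
  channel_size: 1024

# configs/10-sources.yaml
pipeline:
  sources:
    - id: api_poller
      type: http_client
      config:
        urls:
          - name: "default"
            url: "https://httpbin.org/json"
        interval: "10s"
      to: [console]

# configs/20-sinks.yaml
pipeline:
  sinks:
    - id: console
      type: console
```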

Run (Programmatic)

use pipeflow::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let config = Config::from_file("pipeline.yaml")?;
    let mut engine = Engine::from_config(config)?;
    engine.build().await?;
    engine.run().await
}

Node Types

Sources

| Type        | Description       | Status      |
|-------------|-------------------|-------------|
| http_client | HTTP polling      | Implemented |
| http_server | HTTP push/webhook | Implemented |
| redis       | Redis GET polling | Implemented |
| sql         | SQL polling       | Implemented |
| file        | File watching     | Implemented |

Transforms

| Type    | Description                | I/O   | Status      |
|---------|----------------------------|-------|-------------|
| filter  | Conditional filtering      | 1:0/1 | Implemented |
| compute | Math expression eval       | 1:1   | Implemented |
| remap   | Field mapping (step)       | 1:1   | Implemented |
| split   | Split 1 to N messages      | 1:N   | Implemented |
| switch  | Route to different targets | 1:1   | Implemented |
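As a rough sketch of how a transform slots into a pipeline (the filter's config keys here are assumptions for illustration only; the actual per-transform schema is in docs/configuration/INDEX.md):

```yaml
pipeline:
  transforms:
    - id: drop_empty
      type: filter
      config:
        condition: "value != null"   # assumed key name and expression syntax
      to: [console]

  sinks:
    - id: console
      type: console
```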

Sinks

| Type             | Description            | Status      |
|------------------|------------------------|-------------|
| blackhole        | Discard messages       | Implemented |
| console          | Print to stdout        | Implemented |
| file             | Write to file          | Implemented |
| sql              | SQL database insert    | Implemented |
| redis            | Redis operations       | Implemented |
| http_client      | HTTP API calls         | Implemented |
| http_server      | HTTP pull endpoint     | Implemented |
| notify::email    | Email notifications    | Implemented |
| notify::telegram | Telegram notifications | Implemented |
| notify::webhook  | Webhook notifications  | Implemented |

Internal Channels

Internal channels are built-in message streams that can be routed like any other target:

  • IDs: internal::audit, internal::event, internal::notify, internal::metric, internal::dlq
  • Configure under pipeline.internal with to, optional channel_size, and optional log.
  • to can be empty (messages are dropped after optional logging).
  • Default log levels: audit/metric = debug, notify = info, event/dlq = warn
  • If a message targets an internal channel, its payload must match that channel’s schema; otherwise it is dropped with a warning.
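The to, channel_size, and log fields above can be sketched like this (the sink ID is made up, and the exact shape of the log field value is an assumption):

```yaml
pipeline:
  internal:
    notify:
      to: [email_alerts]   # route internal::notify into a sink
      channel_size: 64
      log: info            # assumed value shape for the log field

  sinks:
    - id: email_alerts
      type: notify::email
      # email settings omitted; see docs/configuration/INDEX.md
```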

Configuration Reference

See docs/configuration/INDEX.md for detailed configuration parameters for all supported sources and sinks.

Channel Size Configuration

Pipeflow uses per-node tokio::sync::mpsc channels for transforms, sinks, and internal channels. Set the default buffer size via system.channel_size, and override per node with channel_size.

system:
  channel_size: 1024

pipeline:
  transforms:
    - id: enrich
      to: [sink1]
      channel_size: 2048
  sinks:
    - id: sink1
      type: file
      channel_size: 512
  internal:
    event:
      channel_size: 128

Dead Letter Queue

Failed messages are routed to the internal internal::dlq channel. You can attach sinks or transforms via pipeline.internal.dlq.to. Chain-depth protection prevents infinite loops (max depth: 8).
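A sketch of persisting failed messages via pipeline.internal.dlq.to (the file sink's config keys are assumptions; see the configuration docs):

```yaml
pipeline:
  internal:
    dlq:
      to: [dlq_file]   # capture failed messages for later inspection

  sinks:
    - id: dlq_file
      type: file
      config:
        path: "./data/dlq.jsonl"   # assumed config key
```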

Current status:

  • internal::dlq routing: Implemented
  • Chain-depth protection: Implemented
  • Automatic DLQ routing on transform/sink errors: Implemented

See docs/DESIGN.md for the full design.

CLI Commands

# Run pipeline
pipeflow run config.yaml

# Run with verbose/debug output
pipeflow -v run config.yaml

# Validate configuration
pipeflow config validate config.yaml

# Show pipeline graph (ASCII)
pipeflow config graph config.yaml

# Show merged + normalized configuration
pipeflow config show config.yaml --format yaml

Global flags:

  • -v, --verbose - Enable debug logging

Notes:

  • pipeflow config validate checks YAML structure and pipeline wiring (IDs, references, cycles, internal routing). It does not validate node-specific config contents (e.g. required http_client.urls); those are validated during Engine::build() (and therefore pipeflow run).
  • If you use directory-based configs, config show displays the merged + normalized result.

Distributed & High Availability

Pipeflow is stand-alone by design.

To keep the architecture simple and robust (KISS principle), Pipeflow does not implement complex distributed coordination protocols (like Raft or Paxos).

  • Persistence: State (such as silence records) is stored on the local filesystem (./data by default). Redis-backed storage for silence state has been removed in favor of simplicity and filesystem atomicity.
  • Scaling: We recommend Manual Sharding. Deploy multiple independent instances, each handling a different subset of configuration files.
  • High Availability: Use detailed health checks (e.g., K8s liveness probes) to restart failed instances. If you need shared state across instances (e.g., shared silence), mount a shared volume (NFS/EFS) to the data_dir.
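The health-check approach can be sketched as a Kubernetes liveness probe against the management API's /health endpoint (assumes the API is enabled and reachable on port 8000 inside the pod; the probe timings are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
```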

Testing

# Unit + integration tests
cargo test --all-features

# Lint (clippy)
cargo clippy --all-targets --all-features -- -D warnings

# Format check
cargo fmt --all -- --check

License

MIT
