Sous-Chef

Sous-Chef is a data pipeline tool for building and executing data processing workflows with MediaCloud and other data sources.

Version 3.0.0

This version introduces a new Python-based flow architecture. Flows are defined as Python functions with Pydantic parameter models, which provides stronger type safety and easier testing than the previous YAML-based recipe system.
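
As a rough illustration of that idea, the sketch below pairs a flow function with a typed parameter object that validates its input on construction. It uses a standard-library dataclass as a stand-in for a Pydantic model so that it runs without extra dependencies. `DemoParams`, `demo_flow`, and the field names are hypothetical and are not taken from the Sous-Chef code base.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DemoParams:
    """Hypothetical parameter object (stand-in for a Pydantic model)."""

    query: str
    start_date: date
    end_date: date

    def __post_init__(self):
        # Reject invalid input before the flow body ever runs.
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if self.end_date < self.start_date:
            raise ValueError("end_date must not precede start_date")


def demo_flow(params: DemoParams) -> dict:
    """Hypothetical flow: a plain function that a test can call directly."""
    # Inclusive number of days covered by the date range.
    days = (params.end_date - params.start_date).days + 1
    return {"query": params.query, "days": days}


result = demo_flow(DemoParams("climate change", date(2024, 1, 1), date(2024, 1, 7)))
```

Because the flow is an ordinary function taking one typed argument, a unit test can build the parameter object and call the function without any orchestration layer.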

Quick Start

from sous_chef.flow import get_flow, list_flows
from sous_chef.flows import keywords_demo_flow

# List available flows
flows = list_flows()

# Run a flow
# NOTE: `params` must be defined first, as an instance of the flow's Pydantic parameter model
flow_meta = get_flow("keywords_demo")
result = flow_meta["func"].fn(params)

Running Flows Locally

Use the run_flow.py script to test flows without Prefect orchestration:

# List available flows
python run_flow.py --list

# Run a flow interactively
python run_flow.py keywords_demo --interactive

# Run with parameters
python run_flow.py keywords_demo --query "climate change" --start-date 2024-01-01 --end-date 2024-01-07
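
The parameterized invocation above could be parsed along the following lines. This is a sketch built on `argparse` from the standard library, not the actual contents of `run_flow.py`. Only the flag names shown in the commands above come from this README; `build_parser`, the `flow_name` positional, and the choice to parse dates with `date.fromisoformat` are assumptions.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(prog="run_flow.py")
    # Optional positional so that `--list` works without a flow name.
    parser.add_argument("flow_name", nargs="?", help="name of the flow to run")
    parser.add_argument("--list", action="store_true", help="list available flows")
    parser.add_argument("--interactive", action="store_true", help="prompt for parameters")
    parser.add_argument("--query", help="search query passed to the flow")
    # ISO dates (YYYY-MM-DD) are converted to datetime.date objects.
    parser.add_argument("--start-date", type=date.fromisoformat)
    parser.add_argument("--end-date", type=date.fromisoformat)
    return parser


args = build_parser().parse_args(
    [
        "keywords_demo",
        "--query", "climate change",
        "--start-date", "2024-01-01",
        "--end-date", "2024-01-07",
    ]
)
```

Converting the dates at parse time means a malformed value is reported as a usage error before any flow code runs.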

Architecture

  • Flows: Python functions decorated with @register_flow that define data processing pipelines
  • Tasks: Prefect-decorated functions that perform individual operations (querying, processing, exporting)
  • Flow Registry: Automatic discovery of flows via decorator registration
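
The decorator-based registry described above can be sketched in a few lines. This is a minimal illustration of the general pattern, not Sous-Chef's implementation: the real signatures of `register_flow`, `get_flow`, and `list_flows` are not shown in this README, so the arguments, return types, and the `word_count` flow below are all assumptions. Only the `"func"` key mirrors the Quick Start usage.

```python
from typing import Callable, Dict, List

# Module-level registry mapping a flow name to its metadata (assumed layout).
_FLOW_REGISTRY: Dict[str, dict] = {}


def register_flow(name: str, description: str = "") -> Callable:
    """Hypothetical decorator that records the decorated function under `name`."""

    def decorator(func: Callable) -> Callable:
        if name in _FLOW_REGISTRY:
            raise ValueError(f"flow {name!r} is already registered")
        _FLOW_REGISTRY[name] = {"name": name, "description": description, "func": func}
        return func  # the function itself is returned unchanged

    return decorator


def list_flows() -> List[str]:
    """Return the registered flow names in alphabetical order."""
    return sorted(_FLOW_REGISTRY)


def get_flow(name: str) -> dict:
    """Return the metadata dict for one flow; raises KeyError if unknown."""
    return _FLOW_REGISTRY[name]


@register_flow("word_count", description="Count whitespace-separated words")
def word_count_flow(text: str) -> int:
    return len(text.split())


flow_meta = get_flow("word_count")
count = flow_meta["func"]("media cloud sous chef")
```

Registration happens as a side effect of importing the module that defines the flow, which is why no explicit list of flows has to be maintained. The sketch calls `flow_meta["func"]` directly because it omits Prefect; on Prefect flow and task objects, `.fn` gives access to the undecorated function, which is presumably why the Quick Start example calls `.fn`.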

Package Structure

sous_chef/
├── flow.py              # Flow registry and decorators
├── flows/               # Flow definitions
│   └── keywords_demo_flow.py
├── tasks/               # Task library
│   ├── discovery_tasks.py
│   ├── keyword_tasks.py
│   ├── aggregator_tasks.py
│   └── export_tasks.py
└── secrets.py           # Secret management

tests/                   # Test suite
└── test_export_tasks.py
