A production-ready reference architecture for real-time feature engineering, combining streaming processing with low-latency serving.
Data Flow:
Kafka (Ingestion) → Flink (Stateful Processing) → Redis (Feature Store) → FastAPI (Serving)
Observability:
Jaeger (Distributed Tracing) + Prometheus (Metrics) + Grafana (Dashboards)
| Feature | Implementation Detail |
|---|---|
| Exactly-once | Flink Checkpointing + Transactional Connectors + Idempotent Redis Writes. |
| Backfill | Unified Source API (supports rewinding offsets or swapping to S3/Iceberg sources). |
| Schema Evolution | Confluent Schema Registry (Avro) ensures data compatibility. |
| Latency Monitor | End-to-End Tracing (OpenTelemetry) + Prometheus Metrics + Grafana Dashboards. |
| Cost Model | Serving layer tracks compute/network latency per feature with Prometheus histograms. |
| Batch Inference | High-performance batch API using Redis pipelines. |
| SDK Compiler | Auto-generate Flink SQL jobs from Python feature definitions. |
| 🤖 LLM Copilot | Natural language feature generation, recommendations, and interactive assistant. |
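To illustrate the exactly-once row above: if each windowed aggregate is written under a deterministic key (entity plus window end), replaying a window after a Flink checkpoint restore simply overwrites the same entry instead of double-counting. A minimal sketch with an in-memory dict standing in for Redis (the key scheme is an assumption, not the platform's actual layout):

```python
store = {}  # stand-in for Redis; in production this would be a SET/HSET call

def write_feature(user_id: str, window_end_ms: int, click_count: int) -> None:
    # Deterministic key: replaying the same window after recovery rewrites
    # the same entry (idempotent), so at-least-once delivery from the sink
    # still yields exactly-once feature state.
    key = f"user:{user_id}:click_count:{window_end_ms}"
    store[key] = click_count

write_feature("1", 1_700_000_060_000, 5)
write_feature("1", 1_700_000_060_000, 5)  # replayed after recovery: no double count
```

The same property is why the Redis sink tolerates Flink replays between checkpoints without transactional writes on the Redis side.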
- Docker & Docker Compose
- Java 11 (Required for PyFlink Local Execution)
- Python 3.8+
Launch Kafka, ZooKeeper, Redis, Schema Registry, and the monitoring stack.
cd deployment
docker-compose up -d

PyFlink requires specific JARs to interact with Kafka and Schema Registry.
cd processing
# 1. Install Python Deps
pip install -r requirements.txt
# 2. Download Java Dependencies
bash download_libs.sh

Note: Set JAVA_HOME to your Java 11 installation.
# macOS Example
export JAVA_HOME="/opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk/Contents/Home"
export PATH="$JAVA_HOME/bin:$PATH"
python processing/feature_job.py

Expected output: "Submitting Job..." is printed, and the process stays running.
Simulate user behavior (Clicks/Views).
python ingestion/producer.py

Start the HTTP Service.
python -m uvicorn serving.app:app --host 0.0.0.0 --port 8000

curl http://localhost:8000/features/1
# {"user_id":"1","click_count":5,"source":"feature_store"}

Optimized for ranking models: fetch features for many users in one request.
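Under the hood, a batch lookup like this typically issues one pipelined round-trip to Redis rather than N individual GETs. A minimal sketch, with an in-memory stand-in for the Redis pipeline so it runs anywhere (the key scheme `user:{id}:click_count` and the default value are assumptions, not the actual serving code):

```python
import time

class _FakePipeline:
    """In-memory stand-in for redis.Redis().pipeline(), for illustration only."""
    data = {"user:1:click_count": "5", "user:2:click_count": "3"}

    def __init__(self):
        self._keys = []

    def get(self, key):
        self._keys.append(key)  # queued client-side, nothing sent yet

    def execute(self):
        return [self.data.get(k) for k in self._keys]  # one round-trip

def fetch_batch(pipeline_factory, user_ids, default=0):
    """Fetch click counts for many users in one pipelined round-trip."""
    start = time.perf_counter()
    pipe = pipeline_factory()  # with redis-py: redis_client.pipeline()
    for uid in user_ids:
        pipe.get(f"user:{uid}:click_count")
    values = pipe.execute()
    latency_ms = (time.perf_counter() - start) * 1000
    results = [
        {"user_id": uid, "click_count": int(v) if v is not None else default}
        for uid, v in zip(user_ids, values)
    ]
    return {"results": results, "latency_ms": round(latency_ms, 2),
            "batch_size": len(user_ids)}

resp = fetch_batch(_FakePipeline, ["1", "2", "3"])
# resp["results"][0] == {"user_id": "1", "click_count": 5}
```

Missing keys fall back to the default rather than erroring, which mirrors the hit-rate behavior tracked in the Grafana dashboard.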
curl -X POST http://localhost:8000/features/batch \
-H "Content-Type: application/json" \
-d '{"user_ids": ["1", "2", "3", "4", "5"]}'
# {"results":[...], "latency_ms": 1.23, "batch_size": 5}

curl http://localhost:8000/metrics

A pre-built dashboard is available at monitoring/grafana-dashboard.json.
- Open Grafana: http://localhost:3000 (admin/admin)
- Go to Dashboards → Import
- Upload monitoring/grafana-dashboard.json
- Select Prometheus as the data source
- Feature Latency (p50/p95/p99) - Time series of serving latency
- Request Rate by Endpoint - Single vs Batch requests
- Feature Store Hit Rate - Percentage of requests served from Redis vs defaults
- Batch Request Size Distribution - Histogram of batch sizes
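The p50/p95/p99 panels are computed from Prometheus histograms: percentiles are estimated from cumulative bucket counts, in the spirit of PromQL's `histogram_quantile`. A pure-Python sketch of that estimation (the bucket bounds are illustrative, not the platform's actual buckets):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_ms, cumulative_count), sorted by bound,
    mirroring Prometheus's cumulative bucket encoding.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket, as histogram_quantile does.
            return prev_bound + (bound - prev_bound) * (target - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Cumulative counts for latency bounds of 1 ms, 5 ms, 10 ms, 50 ms:
buckets = [(1, 60), (5, 90), (10, 98), (50, 100)]
p95 = histogram_quantile(0.95, buckets)  # falls in the 5-10 ms bucket
```

Because the estimate interpolates inside a bucket, its accuracy depends on choosing histogram bucket bounds close to the latencies you actually care about.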
Define features in Python, auto-generate Flink SQL jobs.
python -m sdk.compiler --list
# Available FeatureViews:
# - user_click_stats
#   • click_count_1m: Total clicks in the last minute
#   • view_count_1m: Total views in the last minute
# - user_purchase_stats
#   • purchase_count_5m: Total purchases in the last 5 minutes

python -m sdk.compiler -f user_click_stats -o processing/generated_user_click_stats.py

Edit sdk/feature_definition.py:
from sdk import FeatureView, Feature, Source, Window, AggregationType, WindowType
my_feature = FeatureView(
    name="my_feature_view",
    source=Source(topic="my_topic"),
    features=[
        Feature(
            name="my_count",
            description="Count of events",
            aggregation=AggregationType.COUNT,
            source_field="*"
        )
    ],
    window=Window(type=WindowType.TUMBLE, size_minutes=5),
    entity_key="user_id"
)

The platform includes an LLM-powered assistant for feature engineering.
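Before turning to the assistant: a FeatureView like the one above compiles down to a windowed Flink SQL job. A rough illustration of the kind of statement the compiler might emit (the sink name, source table, and `event_time` column are assumptions, not the compiler's actual output):

```python
def to_flink_sql(view_name: str, topic: str, entity_key: str,
                 feature_name: str, size_minutes: int) -> str:
    """Illustrative only: render a tumbling-window COUNT as Flink SQL."""
    return f"""
INSERT INTO {view_name}_sink
SELECT
  {entity_key},
  COUNT(*) AS {feature_name},
  TUMBLE_END(event_time, INTERVAL '{size_minutes}' MINUTE) AS window_end
FROM {topic}
GROUP BY
  {entity_key},
  TUMBLE(event_time, INTERVAL '{size_minutes}' MINUTE)
""".strip()

sql = to_flink_sql("my_feature_view", "my_topic", "user_id", "my_count", 5)
```

The point of the SDK layer is that window type, size, and aggregation live once in the Python definition and the SQL is derived mechanically.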
# Install LLM dependencies
pip install openai anthropic
# Set API key
export OPENAI_API_KEY="sk-..."
# OR
export ANTHROPIC_API_KEY="..."

Chat with the SDK Copilot for help with feature definitions:
python sdk/cli.py chat
# 🤖 Feature Platform SDK Copilot
# Ask me anything about feature engineering!
#
# 🧑 You: How do I create a feature that counts purchases per hour?
# 🤖 Copilot: Here's how you can create a purchase count feature...

python sdk/cli.py generate
# 📝 Describe your features:
# > I want to track clicks and views per user in 5-minute windows
#
# 📦 Generated Code:
# from sdk import FeatureView, Feature, ...

python sdk/cli.py recommend
# 🎯 Use Case: CTR prediction for e-commerce
#
# 💡 Feature Recommendations:
# 1. click_through_rate_1h - Recent engagement signal
# 2. purchase_frequency_7d - Long-term conversion indicator
# ...

The LLM features are also available via REST API:
| Endpoint | Method | Description |
|---|---|---|
| /api/llm/generate | POST | Generate feature code from description |
| /api/llm/recommend | POST | Get AI-powered feature recommendations |
| /api/llm/chat | POST | Chat with SDK Copilot |
| /api/llm/health | GET | Check LLM API availability |
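The same endpoints can be called from Python with only the standard library. A minimal client sketch for `/api/llm/generate` (the payload fields mirror the curl example below; anything beyond those is an assumption):

```python
import json
from urllib import request

def build_generate_request(description: str, source_schema: dict) -> request.Request:
    """Build a POST request for the /api/llm/generate endpoint."""
    body = json.dumps({
        "description": description,
        "source_schema": source_schema,
    }).encode("utf-8")
    return request.Request(
        "http://localhost:8000/api/llm/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request(
    "Count user clicks and purchases in the last 10 minutes",
    {"user_id": "STRING", "action_type": "STRING", "timestamp": "BIGINT"},
)
# To actually send it (requires the serving stack to be running):
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```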
Example API call:
curl -X POST http://localhost:8000/api/llm/generate \
-H "Content-Type: application/json" \
-d '{
"description": "Count user clicks and purchases in the last 10 minutes",
"source_schema": {
"user_id": "STRING",
"action_type": "STRING",
"timestamp": "BIGINT"
}
}'

online_data_infra/
├── ingestion/                  # Kafka Producers (Avro + Schema Registry)
├── processing/                 # PyFlink Jobs (SQL + DataStream Hybrid API)
│   ├── feature_job.py          # Main Pipeline: Aggregation -> Redis Sink
│   ├── generated_*.py          # Auto-generated jobs from SDK
│   └── lib/                    # Dependency JARs
├── serving/                    # FastAPI + OpenTelemetry + Prometheus Metrics
│   ├── app.py                  # Single + Batch endpoints
│   └── llm_api.py              # LLM-powered feature engineering API
├── sdk/                        # Feature Definition SDK
│   ├── feature_definition.py   # Define features here
│   ├── compiler.py             # Transpiles Python → Flink SQL
│   ├── llm_integration.py      # LLM client and agents
│   └── cli.py                  # Interactive CLI tool
├── monitoring/                 # Grafana dashboards, Prometheus configs
│   └── grafana-dashboard.json
└── deployment/                 # Docker Compose Environment
| Service | URL |
|---|---|
| Feature API | http://localhost:8000 |
| API Docs (Swagger) | http://localhost:8000/docs |
| LLM API Docs | http://localhost:8000/api/docs |
| Prometheus Metrics | http://localhost:8000/metrics |
| Grafana | http://localhost:3000 |
| Jaeger (Traces) | http://localhost:16686 |
| Schema Registry | http://localhost:8082 |
| Flink Dashboard | http://localhost:8081 |
- Streaming: Apache Flink 1.18 (PyFlink)
- Messaging: Kafka + Confluent Schema Registry
- Storage: Redis (Online Store)
- Serving: FastAPI + Prometheus + OpenTelemetry
- Observability: Jaeger, Prometheus, Grafana
- AI/LLM: OpenAI GPT-4o / Anthropic Claude 3
See NEXT_ITERATION_PLAN.md for the roadmap towards v2.