This project implements an ETL infrastructure for sensor data with SQL-based reporting and a REST API.
```
.
├── data/                    # Sensor data files
├── etl/                     # ETL pipeline components
│   ├── base.py              # Abstract base classes
│   ├── extractors.py        # CSV data extraction
│   ├── transformers.py      # Data transformation
│   ├── loaders.py           # DuckDB loading
│   └── pipeline.py          # ETL orchestration
├── api/                     # FastAPI application
│   └── main.py              # REST API endpoints
├── reports/                 # SQL reports
│   └── summary_report.sql   # Summary report query
├── tests/                   # Unit and integration tests
│   ├── test_etl.py          # ETL tests
│   ├── test_report.py       # Report tests
│   └── test_api.py          # API tests
├── main.py                  # ETL entry point
├── generate_report.py       # Report generation script
└── verify_db.py             # Database verification script
```
Install the dependencies:
```
pip install -r requirements.txt
```
Run the ETL to load sensor data into DuckDB:
```
python main.py
```
This creates sensors.db with the sensor_readings table.
Storage Target: DuckDB database file (sensors.db)
Table Schema:
```sql
CREATE TABLE sensor_readings (
    machine_code    VARCHAR,
    component_code  VARCHAR,
    coordinate      VARCHAR,
    sample_time     BIGINT,     -- Epoch microseconds
    value           DOUBLE,
    inserted_at     TIMESTAMP
);
```
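A quick way to confirm the load worked is to query the database directly. The following is a minimal sketch of the kind of check verify_db.py performs; the actual script may differ:

```python
# Illustrative check only -- see verify_db.py for the project's own verification logic.
import duckdb

con = duckdb.connect("sensors.db", read_only=True)

# Show the table's columns and types.
print(con.execute("DESCRIBE sensor_readings").fetchall())

# Count loaded rows and peek at a few of them.
print(con.execute("SELECT COUNT(*) FROM sensor_readings").fetchone())
print(con.execute("SELECT * FROM sensor_readings LIMIT 5").fetchall())

con.close()
```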
Generate the summary report:
```
python generate_report.py
```
This outputs the report to the console and saves it to report.csv.
Start the API server:
```
uvicorn api.main:app --reload
```
Test the API:
```
curl -X POST "http://localhost:8000/report" \
  -F "sensor_files=@2024-01-01.csv" \
  -F "sensor_files=@2024-01-02.csv" \
  -F "sensors_metadata=@Sensors.csv" \
  -F "machines_metadata=@Machines.csv"
```
API Documentation: http://localhost:8000/docs
Run all tests:
```
pytest
```
Run specific test suites:
```
pytest tests/test_etl.py       # ETL tests
pytest tests/test_report.py    # Report tests
pytest tests/test_api.py       # API tests
```
The ETL pipeline uses abstract base classes (Extractor, Transformer, Loader), sketched below, to enable easy swapping of:
- Data sources: CSV → JSON, Parquet, API, etc.
- Processing engines: Polars → Pandas, DuckDB, Spark, etc.
- Storage targets: DuckDB → PostgreSQL, MySQL, Parquet, etc.
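A minimal sketch of what the interfaces in etl/base.py might look like (the method names and the Pipeline wiring shown here are illustrative assumptions, not the project's exact API):

```python
# Hypothetical shape of etl/base.py and etl/pipeline.py -- the real modules may differ.
from abc import ABC, abstractmethod
import polars as pl


class Extractor(ABC):
    """Reads raw sensor data from a source (CSV files in this project)."""
    @abstractmethod
    def extract(self) -> pl.DataFrame: ...


class Transformer(ABC):
    """Cleans and reshapes extracted data into the sensor_readings schema."""
    @abstractmethod
    def transform(self, df: pl.DataFrame) -> pl.DataFrame: ...


class Loader(ABC):
    """Persists transformed data into a storage target (DuckDB here)."""
    @abstractmethod
    def load(self, df: pl.DataFrame) -> None: ...


class Pipeline:
    """Orchestrates extract -> transform -> load over pluggable components."""
    def __init__(self, extractor: Extractor, transformer: Transformer, loader: Loader):
        self.extractor = extractor
        self.transformer = transformer
        self.loader = loader

    def run(self) -> None:
        self.loader.load(self.transformer.transform(self.extractor.extract()))
```

Swapping a component then means implementing one of these interfaces (for example a Parquet extractor or a PostgreSQL loader) and passing it to the pipeline without touching the other stages.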
Key dependencies:
- Polars: Fast, memory-efficient data processing
- DuckDB: Embedded analytical database
- FastAPI: Modern, high-performance API framework
- PyArrow: Zero-copy data transfer between Polars and DuckDB
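The PyArrow bridge is what lets a Polars DataFrame be handed to DuckDB without an intermediate copy. A minimal sketch of that hand-off, using the column names from the schema above (the actual loaders.py may do this differently):

```python
# Illustrative hand-off only; loaders.py is the authoritative implementation.
import duckdb
import polars as pl

# A tiny Polars frame shaped like the raw sensor columns (values are made up).
df = pl.DataFrame({
    "machine_code":   ["M1"],
    "component_code": ["C1"],
    "coordinate":     ["x"],
    "sample_time":    [1704067200000000],  # epoch microseconds
    "value":          [0.42],
})

con = duckdb.connect("sensors.db")
# Registering the Arrow view exposes the frame to SQL; DuckDB scans Arrow data without copying it.
con.register("readings_arrow", df.to_arrow())
con.execute(
    "INSERT INTO sensor_readings "
    "SELECT *, CAST(now() AS TIMESTAMP) AS inserted_at FROM readings_arrow"  # assumes the table exists
)
con.close()
```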
The summary report includes:
- machine_name: User-friendly machine name
- coordinate: Coordinate with the highest increase
- value_avg: Average value on 2024-01-02
- increase_in_value: Increase from 2024-01-01 to 2024-01-02
- samples_cnt: Number of samples
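The authoritative query lives in reports/summary_report.sql. As a rough illustration of how those columns could be derived (the join to a machines metadata table for machine_name and the exact ranking logic are assumptions):

```python
# Rough approximation of the summary report; reports/summary_report.sql is the
# real query and may differ. The "machines" table (machine_code -> machine_name)
# is an assumed metadata table.
import duckdb

QUERY = """
WITH daily AS (
    SELECT
        machine_code,
        coordinate,
        CAST(to_timestamp(sample_time / 1000000) AS DATE) AS day,
        AVG(value) AS value_avg,
        COUNT(*)   AS samples_cnt
    FROM sensor_readings
    GROUP BY ALL
),
delta AS (
    SELECT
        d2.machine_code,
        d2.coordinate,
        d2.value_avg,                                   -- average value on 2024-01-02
        d2.value_avg - d1.value_avg AS increase_in_value,
        d2.samples_cnt
    FROM daily d2
    JOIN daily d1
      ON d1.machine_code = d2.machine_code
     AND d1.coordinate   = d2.coordinate
     AND d1.day = DATE '2024-01-01'
     AND d2.day = DATE '2024-01-02'
)
SELECT
    m.machine_name,
    d.coordinate,
    d.value_avg,
    d.increase_in_value,
    d.samples_cnt
FROM delta d
JOIN machines m USING (machine_code)
QUALIFY row_number() OVER (
    PARTITION BY d.machine_code
    ORDER BY d.increase_in_value DESC
) = 1                                                   -- keep the coordinate with the highest increase
ORDER BY m.machine_name
"""

con = duckdb.connect("sensors.db", read_only=True)
print(con.execute(QUERY).fetchall())
```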