The goal of this assignment is to see how you apply practical data engineering skills in a realistic setting. You'll build a data pipeline for a simplified part of our product, with real-world expectations for data quality, analytics readiness, and completeness.
Modern greenhouses rely on many sensors to monitor both the external weather and the internal environment. The internal parameters—like temperature, humidity, CO₂ levels, and light—are influenced by both external weather and internal control systems (screens, windows, heating, etc.).
We want to give growers insight into how the environment evolves over time, so they can assess and improve their growing strategy. This assignment focuses on building the data infrastructure for analytics: ingesting, transforming, and modeling sensor (meteo) data for analytical queries.
Build an ETL/ELT data pipeline with the following functionality:
- Extract: Read raw meteo sensor data from provided JSON files
- Handle bulk data loading scenarios (batch processing)
- Design for idempotency - rerunning the pipeline should produce the same results
- Support incremental loads (process only new data since last run)
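As one possible shape for the idempotent, incremental extract step, here is a minimal sketch using a watermark file. The function and file names are illustrative, and the JSON layout (a list of records with an ISO-8601 `timestamp` field) is an assumption about the sample data:

```python
import json
from pathlib import Path

def load_new_readings(data_dir: Path, state_file: Path) -> list[dict]:
    """Read raw JSON sensor files and return only readings newer than the
    watermark recorded by the previous run."""
    # Watermark: highest timestamp processed so far ("" sorts before any ISO date)
    watermark = state_file.read_text().strip() if state_file.exists() else ""

    readings = []
    for path in sorted(data_dir.glob("*.json")):
        for record in json.loads(path.read_text()):
            # ISO-8601 strings sort chronologically, so string comparison is safe
            if record["timestamp"] > watermark:
                readings.append(record)

    if readings:
        # Persist the new watermark so a rerun skips already-processed rows
        state_file.write_text(max(r["timestamp"] for r in readings))
    return readings
```

Because the watermark advances only after a successful read, rerunning the pipeline over the same files yields no new rows, which gives you both idempotency and incremental loads from one mechanism.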
Implement a data quality framework that covers:
- Completeness checks: Detect missing timestamps, gaps in sensor readings
- Validity checks: Flag impossible values (e.g., temperature < -50°C or > 70°C, humidity > 100%)
- Consistency checks: Identify sudden spikes or drops that indicate sensor malfunction
- Duplicate detection: Handle duplicate readings with the same timestamp
- Anomaly flagging: Mark outliers for review without blocking the pipeline
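A minimal sketch of how such checks might attach flags without dropping rows, so anomalies are marked for review but never block the pipeline. The column names (`timestamp`, `sensor_id`, `parameter`, `value`) and the spike threshold are assumptions, not a spec:

```python
import pandas as pd

def flag_quality_issues(df: pd.DataFrame) -> pd.DataFrame:
    """Add boolean quality-flag columns; no rows are removed."""
    df = df.sort_values(["sensor_id", "parameter", "timestamp"]).copy()

    # Validity: physically impossible values per parameter
    limits = {"temperature": (-50, 70), "humidity": (0, 100)}
    def invalid(row):
        lo, hi = limits.get(row["parameter"], (float("-inf"), float("inf")))
        return not (lo <= row["value"] <= hi)
    df["invalid"] = df.apply(invalid, axis=1)

    # Duplicates: same sensor, parameter, and timestamp
    df["duplicate"] = df.duplicated(["sensor_id", "parameter", "timestamp"])

    # Consistency: sudden jumps between consecutive readings of one sensor
    jump = df.groupby(["sensor_id", "parameter"])["value"].diff().abs()
    df["spike"] = jump > 10  # illustrative threshold; tune per parameter

    return df
```

Downstream layers can then filter on the flags, while the raw layer keeps every reading for auditability.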
Transform raw sensor readings into analytics-ready datasets:
Raw Data Layer: Store ingested data with minimal transformation (staging)
Aggregated Views: Create pre-aggregated tables at multiple time grains:
- 15-minute aggregations: min, max, avg, stddev per parameter
- Hourly aggregations: min, max, avg, stddev per parameter
Business Logic:
- Calculate a daily derived metric (e.g., daily temperature range = max - min)
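The aggregation layers and the derived daily metric could be sketched with pandas resampling along these lines (function and column names are illustrative):

```python
import pandas as pd

def aggregate(df: pd.DataFrame, grain: str = "15min") -> pd.DataFrame:
    """Pre-aggregate raw readings to a time grain such as '15min', 'h', or 'D'.
    Assumes columns `timestamp`, `parameter`, `value`."""
    df = df.assign(timestamp=pd.to_datetime(df["timestamp"]))
    return (
        df.set_index("timestamp")
          .groupby("parameter")["value"]
          .resample(grain)
          .agg(["min", "max", "mean", "std"])
          .reset_index()
    )

def daily_range(df: pd.DataFrame, parameter: str = "temperature") -> pd.DataFrame:
    """Derived business metric: daily range = daily max - daily min."""
    daily = aggregate(df[df["parameter"] == parameter], grain="D")
    return daily.assign(range=daily["max"] - daily["min"])
```

Keeping one parameterized aggregation function means the 15-minute and hourly tables stay consistent by construction.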
Design a data model optimized for analytical queries:
- Choose between normalized vs. denormalized approaches (explain trade-offs)
- Consider dimensional modeling patterns (fact tables, dimension tables)
- Implement appropriate partitioning strategy (e.g., by date)
- Add indexes to optimize common query patterns
- Document how your schema would handle new sensor types being added
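One possible shape for the fact/dimension split, using SQLite purely as a stand-in store (all table, column, and index names are illustrative, not a required schema):

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_sensor (
    sensor_id   INTEGER PRIMARY KEY,
    greenhouse  TEXT,
    parameter   TEXT,   -- a new sensor type is just a new row, no schema change
    unit        TEXT
);
CREATE TABLE fact_reading (
    sensor_id   INTEGER REFERENCES dim_sensor(sensor_id),
    ts          TEXT,   -- ISO-8601; a real warehouse would partition on date(ts)
    value       REAL
);
-- Composite index serving the common "one sensor over a time range" pattern
CREATE INDEX idx_fact_sensor_ts ON fact_reading(sensor_id, ts);
"""

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con
```

The point of the sketch is the trade-off: a narrow fact table keeps writes cheap at high volume, while the dimension table absorbs new sensor types as data rather than schema changes.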
Demonstrate your data model by implementing SQL queries or DataFrame transformations that answer:
- What was the average temperature per day over the past 7 days?
- Which days had the highest temperature variance (max - min)?
- What percentage of expected sensor readings are missing per day?
- Show a time-series of hourly temperature and humidity for the last 24 hours
- Identify the top 3 most anomalous readings and explain why they're flagged
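The first question, for instance, might be answered with a DataFrame transformation like this (column names are assumed, and "past 7 days" is taken relative to the newest reading in the data):

```python
import pandas as pd

def avg_temp_per_day(df: pd.DataFrame, days: int = 7) -> pd.Series:
    """Average temperature per day over the trailing `days` window.
    Assumes columns `timestamp`, `parameter`, `value`."""
    t = df[df["parameter"] == "temperature"].copy()
    t["timestamp"] = pd.to_datetime(t["timestamp"])
    # Anchor the window on the newest reading, not wall-clock time,
    # so the query is reproducible against static sample data
    cutoff = t["timestamp"].max().normalize() - pd.Timedelta(days=days - 1)
    recent = t[t["timestamp"] >= cutoff]
    return recent.groupby(recent["timestamp"].dt.date)["value"].mean()
```

The other questions follow the same pattern against the aggregated layer, which is exactly what the pre-aggregated tables are meant to make cheap.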
Engineering requirements:
- Structure your code as a testable ETL pipeline with clear stages (extract, transform, load)
- Ensure the pipeline can be run end-to-end with a single command
- Use Python (and whichever Python libraries you deem appropriate)
- Choose appropriate storage and be able to justify your choice
Other Guidelines:
- Focus on data engineering fundamentals. Write clear, maintainable data transformation logic.
- Design for scale in your data modeling choices (document how your design would handle 1000 greenhouses with 50 sensors each at 1-minute resolution)
- You're encouraged to use LLM-assisted tools, but you own the code. Don't leave behind things you don't understand, and be ready to explain your choices.
- Timebox to ~4 hours. You don't need to perfect every detail — just be clear about trade-offs.
Provide a README.md with:
- Setup & run instructions: How to install dependencies and run the pipeline
- Testing instructions: How to run tests
- Architecture overview: A brief description of your pipeline design and data model
- Data quality approach: What checks you implemented and why
- Assumptions & trade-offs: What you prioritized and what you deferred
- Next steps: What you would do with 3 more months:
- How would you handle slowly changing dimensions (e.g., sensor calibration changes)?
- How would you implement data retention policies?
- How would you monitor pipeline health in production?
- What optimizations would you add for billion-row scale?
Submission:
- Send us a .git bundle via email
- In the technical interview you will briefly present your solution to the interviewers as the basis for technical discussion
Note: We adjust expectations based on experience level. A staff/senior data engineer is expected to cover more ground and make more sophisticated architectural decisions than someone early in their career — especially now that LLMs can help you move faster.
Use the sample data available here:
https://drive.google.com/drive/folders/1TV20EVuxcmaqro7e0HfmX2j6HtILNwXB?usp=sharing
The data represents sensor readings from greenhouse monitoring systems with timestamps, sensor IDs, parameter types (temperature, humidity, CO₂, light), and measured values.
Good luck! Feel free to email any clarifying questions to your Source.ag contact.