The goal of this assignment is to see how you apply practical data engineering skills in a realistic setting. You'll build a data pipeline for a simplified part of our product, with real-world expectations for data quality, analytics readiness, and completeness.
Modern greenhouses rely on many sensors to monitor both the external weather and the internal environment. The internal parameters—like temperature, humidity, CO₂ levels, and light—are influenced by both external weather and internal control systems (screens, windows, heating, etc.).
We want to give growers insight into how the environment evolves over time, so they can assess and improve their growing strategy. This assignment focuses on building the data infrastructure for analytics: ingesting, transforming, and modeling sensor (meteo) data for analytical queries.
Build an ETL/ELT data pipeline with the following functionality:
- Extract: Read raw meteo sensor data from provided JSON files
- Handle bulk data loading scenarios (batch processing)
- Design for idempotency - rerunning the pipeline should produce the same results
- Support incremental loads (process only new data since last run)
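As one possible shape for the idempotent, incremental extract step, here is a minimal sketch using a watermark file. The function and file names are illustrative, and the JSON layout (a list of records with an ISO-8601 `timestamp` field) is an assumption about the sample data:

```python
import json
from pathlib import Path

def load_new_readings(data_dir: Path, state_file: Path) -> list[dict]:
    """Read raw JSON sensor files and return only readings newer than the
    watermark recorded by the previous run."""
    # Watermark: highest timestamp processed so far ("" sorts before any ISO date)
    watermark = state_file.read_text().strip() if state_file.exists() else ""

    readings = []
    for path in sorted(data_dir.glob("*.json")):
        for record in json.loads(path.read_text()):
            # ISO-8601 strings sort chronologically, so string comparison is safe
            if record["timestamp"] > watermark:
                readings.append(record)

    if readings:
        # Persist the new watermark so a rerun skips already-processed rows
        state_file.write_text(max(r["timestamp"] for r in readings))
    return readings
```

Because the watermark advances only after a successful read, rerunning the pipeline over the same files yields no new rows, which gives you both idempotency and incremental loads from one mechanism.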
Implement a data quality framework that covers:
- Completeness checks: Detect missing timestamps, gaps in sensor readings
- Validity checks: Flag impossible values (e.g., temperature < -50°C or > 70°C, humidity > 100%)
- Consistency checks: Identify sudden spikes or drops that indicate sensor malfunction
- Duplicate detection: Handle duplicate readings with the same timestamp
- Anomaly flagging: Mark outliers for review without blocking the pipeline
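A minimal sketch of how such checks might attach flags without dropping rows, so anomalies are marked for review but never block the pipeline. The column names (`timestamp`, `sensor_id`, `parameter`, `value`) and the spike threshold are assumptions, not a spec:

```python
import pandas as pd

def flag_quality_issues(df: pd.DataFrame) -> pd.DataFrame:
    """Add boolean quality-flag columns; no rows are removed."""
    df = df.sort_values(["sensor_id", "parameter", "timestamp"]).copy()

    # Validity: physically impossible values per parameter
    limits = {"temperature": (-50, 70), "humidity": (0, 100)}
    def invalid(row):
        lo, hi = limits.get(row["parameter"], (float("-inf"), float("inf")))
        return not (lo <= row["value"] <= hi)
    df["invalid"] = df.apply(invalid, axis=1)

    # Duplicates: same sensor, parameter, and timestamp
    df["duplicate"] = df.duplicated(["sensor_id", "parameter", "timestamp"])

    # Consistency: sudden jumps between consecutive readings of one sensor
    jump = df.groupby(["sensor_id", "parameter"])["value"].diff().abs()
    df["spike"] = jump > 10  # illustrative threshold; tune per parameter

    return df
```

Downstream layers can then filter on the flags, while the raw layer keeps every reading for auditability.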
Transform raw sensor readings into analytics-ready datasets:
Raw Data Layer: Store ingested data with minimal transformation (staging)
Aggregated Views: Create pre-aggregated tables at multiple time grains:
- 15-minute aggregations: min, max, avg, stddev per parameter
- Hourly aggregations: min, max, avg, stddev per parameter
Business Logic:
- Calculate a daily derived metric (e.g., daily temperature range = max - min)
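The aggregation layers and the derived daily metric could be sketched with pandas resampling along these lines (function and column names are illustrative):

```python
import pandas as pd

def aggregate(df: pd.DataFrame, grain: str = "15min") -> pd.DataFrame:
    """Pre-aggregate raw readings to a time grain such as '15min', 'h', or 'D'.
    Assumes columns `timestamp`, `parameter`, `value`."""
    df = df.assign(timestamp=pd.to_datetime(df["timestamp"]))
    return (
        df.set_index("timestamp")
          .groupby("parameter")["value"]
          .resample(grain)
          .agg(["min", "max", "mean", "std"])
          .reset_index()
    )

def daily_range(df: pd.DataFrame, parameter: str = "temperature") -> pd.DataFrame:
    """Derived business metric: daily range = daily max - daily min."""
    daily = aggregate(df[df["parameter"] == parameter], grain="D")
    return daily.assign(range=daily["max"] - daily["min"])
```

Keeping one parameterized aggregation function means the 15-minute and hourly tables stay consistent by construction.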
Design a data model optimized for analytical queries:
- Choose between normalized vs. denormalized approaches (explain trade-offs)
- Consider dimensional modeling patterns (fact tables, dimension tables)
- Implement appropriate partitioning strategy (e.g., by date)
- Add indexes to optimize common query patterns
- Document how your schema would handle new sensor types being added
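One possible shape for the fact/dimension split, using SQLite purely as a stand-in store (all table, column, and index names are illustrative, not a required schema):

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_sensor (
    sensor_id   INTEGER PRIMARY KEY,
    greenhouse  TEXT,
    parameter   TEXT,   -- a new sensor type is just a new row, no schema change
    unit        TEXT
);
CREATE TABLE fact_reading (
    sensor_id   INTEGER REFERENCES dim_sensor(sensor_id),
    ts          TEXT,   -- ISO-8601; a real warehouse would partition on date(ts)
    value       REAL
);
-- Composite index serving the common "one sensor over a time range" pattern
CREATE INDEX idx_fact_sensor_ts ON fact_reading(sensor_id, ts);
"""

def init_db(path: str = ":memory:") -> sqlite3.Connection:
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con
```

The point of the sketch is the trade-off: a narrow fact table keeps writes cheap at high volume, while the dimension table absorbs new sensor types as data rather than schema changes.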
Demonstrate your data model by implementing SQL queries or DataFrame transformations that answer:
- What was the average temperature per day over the past 7 days?
- Which days had the highest temperature variance (max - min)?
- What percentage of expected sensor readings are missing per day?
- Show a time-series of hourly temperature and humidity for the last 24 hours
- Identify the top 3 most anomalous readings and explain why they're flagged
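The first question, for instance, might be answered with a DataFrame transformation like this (column names are assumed, and "past 7 days" is taken relative to the newest reading in the data):

```python
import pandas as pd

def avg_temp_per_day(df: pd.DataFrame, days: int = 7) -> pd.Series:
    """Average temperature per day over the trailing `days` window.
    Assumes columns `timestamp`, `parameter`, `value`."""
    t = df[df["parameter"] == "temperature"].copy()
    t["timestamp"] = pd.to_datetime(t["timestamp"])
    # Anchor the window on the newest reading, not wall-clock time,
    # so the query is reproducible against static sample data
    cutoff = t["timestamp"].max().normalize() - pd.Timedelta(days=days - 1)
    recent = t[t["timestamp"] >= cutoff]
    return recent.groupby(recent["timestamp"].dt.date)["value"].mean()
```

The other questions follow the same pattern against the aggregated layer, which is exactly what the pre-aggregated tables are meant to make cheap.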
Engineering requirements:
- Structure your code as a testable ETL pipeline with clear stages (extract, transform, load)
- Ensure the pipeline can be run end-to-end with a single command
- Use Python (and whichever Python libraries you deem appropriate)
- Choose appropriate storage and be able to justify your choice
Other Guidelines:
- Focus on data engineering fundamentals. Write clear, maintainable data transformation logic.
- Design for scale in your data modeling choices (document how your design would handle 1000 greenhouses with 50 sensors each at 1-minute resolution)
- You're encouraged to use LLM-assisted tools, but you own the code. Don't leave behind things you don't understand, and be ready to explain your choices.
- Timebox to ~4 hours. You don't need to perfect every detail — just be clear about trade-offs.
Provide a README.md with:
- Setup & run instructions: How to install dependencies and run the pipeline
- Testing instructions: How to run tests
- Architecture overview: A brief description of your pipeline design and data model
- Data quality approach: What checks you implemented and why
- Assumptions & trade-offs: What you prioritized and what you deferred
- Next steps: What you would do with 3 more months:
- How would you handle slowly changing dimensions (e.g., sensor calibration changes)?
- How would you implement data retention policies?
- How would you monitor pipeline health in production?
- What optimizations would you add for billion-row scale?
Submission:
- Send us a .git bundle via email
- In the technical interview you will briefly present your solution to the interviewers as the basis for technical discussion
Note: We adjust expectations based on experience level. A staff/senior data engineer is expected to cover more ground and make more sophisticated architectural decisions than someone early in their career — especially now that LLMs can help you move faster.
Use the sample data available here:
https://drive.google.com/drive/folders/1TV20EVuxcmaqro7e0HfmX2j6HtILNwXB?usp=sharing
The data represents sensor readings from greenhouse monitoring systems with timestamps, sensor IDs, parameter types (temperature, humidity, CO₂, light), and measured values.
Good luck! Feel free to email any clarifying questions to your Source.ag contact.