A technical exercise for data engineering candidates on the DataHub team.
The goal is to assess a candidate's knowledge of and skill with the technologies we use on the DataHub team. We want to get a sense of:
- Ability to code SQL using CTEs (Common Table Expressions)
- Ability to understand our transformation pipeline's structure and stages
- Healthcare data domain familiarity
- DBT (Data Build Tool) proficiency
- Git workflow and version control practices
This exercise simulates our real-world data transformation pipeline. You'll work with healthcare data (patients, encounters, diagnoses) and transform it through multiple stages using DBT and SQL, following the same patterns we use in production.
Time Limit: 1 hour
Note: Focus on completing the staged layer first, then intermediate, then final if time permits. Quality over quantity - we'd rather see one layer done well than all layers done poorly.
- Python 3.8 or higher
- Git
- A text editor or IDE of your choice
```
git clone <repository-url>
cd datahub-interview

# Create a branch with your name (use underscores, no spaces)
git checkout -b firstname_lastname
```

Example: If your name is Jane Doe, use: `git checkout -b jane_doe`
```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install -r requirements.txt

# Copy the profiles.yml to your home directory
mkdir -p ~/.dbt
cp profiles.yml ~/.dbt/profiles.yml

# Or set the DBT_PROFILES_DIR environment variable
export DBT_PROFILES_DIR=$(pwd)
```

If on Windows, the environment variable can be set as follows.

PowerShell:

```
$env:DBT_PROFILES_DIR = (Get-Location)
```

CMD prompt:

```
set DBT_PROFILES_DIR=%cd%
```

Then verify your setup:

```
dbt debug
```

You should see "All checks passed!"
Our transformation pipeline follows a four-stage architecture:
```
Seeds (CSV files)
        ↓
Raw Layer (ephemeral)
        ↓
Staged Layer (cleaned & standardized)
        ↓
Intermediate Layer (joined & business logic)
        ↓
Final Layer (analytics-ready aggregations)
```
- Raw models: `raw__<entity>.sql` (materialized as ephemeral; note the double underscore; a sketch follows this list)
- Staged models: `stg__<entity>.sql` (materialized as tables)
- Intermediate models: `int__<entity>.sql` (materialized as tables)
- Final models: `<entity>.sql` (materialized as tables, no prefix)
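For reference, a model's materialization can be set inline with dbt's `config()` block at the top of the model file (defaults may also live in `dbt_project.yml`). A minimal sketch of an ephemeral raw model; the seed name `patients` is an assumption for illustration:

```sql
-- raw__patient.sql: minimal sketch of an ephemeral raw model.
-- The seed name 'patients' is an assumption; use your actual seed name.
{{ config(materialized='ephemeral') }}

select *
from {{ ref('patients') }}
```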
You will complete the transformation logic for three layers: staged, intermediate, and final.
Key Requirements:
- Handle data quality issues (NULLs, formatting inconsistencies)
- Make the correct joins, being aware of where one-to-many relationships exist between tables
- Produce the final table of patient-centered encounter and diagnostic data.
Preferred Approaches
- Use CTEs to structure your queries
- Demonstrate proper use of the macros in the `macros/` directory in your queries (see the sketch below)
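For shape only, here is a minimal sketch of the CTE structure we look for; the macro name `standardize_phone` is hypothetical, so substitute whatever actually lives in `macros/`:

```sql
-- Minimal sketch of CTE structure (not a complete model).
-- 'standardize_phone' is a hypothetical macro name for illustration.
with source as (

    select * from {{ ref('raw__patient') }}

),

cleaned as (

    select
        patient_id,
        {{ standardize_phone('phone_number') }} as phone_number
    from source

)

select * from cleaned
```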
Complete the following models in `models/staged/` (a sketch of one possible approach follows this list):

- Clean and standardize patient demographics
  - Ensure that all phone numbers are formatted as XXX-XXX-XXXX
  - Handle NULL email addresses
  - Cast date fields appropriately
  - Add timestamps
- Clean and standardize encounter data
  - Filter for completed encounters only
  - Standardize encounter types
  - Cast date fields appropriately
- Clean and standardize diagnosis data
  - Uppercase and trim ICD-10 codes
  - Cast date and boolean fields appropriately
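As one possible shape for a staged model, here is a minimal sketch of the diagnosis cleanup; the column names (`diagnosis_id`, `icd10_code`, `diagnosis_date`, `is_primary`) are assumptions, so verify them against the raw model:

```sql
-- Minimal sketch for stg__diagnosis; column names are assumptions.
with source as (

    select * from {{ ref('raw__diagnosis') }}

),

cleaned as (

    select
        diagnosis_id,
        upper(trim(icd10_code))     as icd10_code,     -- standardize ICD-10 codes
        date(diagnosis_date)        as diagnosis_date, -- SQLite stores dates as text
        cast(is_primary as integer) as is_primary      -- SQLite booleans are 0/1
    from source

)

select * from cleaned
```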
Complete `models/intermediate/int__patient_encounter.sql` (a partial sketch follows the requirements below):
- Join patient, encounter, and diagnosis data
- Calculate patient age at time of encounter
- Identify primary diagnoses
- Count total diagnoses per encounter
- Flag encounters with chronic conditions
  - Chronic ICD-10 codes for this exercise: I10, E11.9, E11.65, J44.9, J44.1, I50.9
Key Requirements:
- Use multiple CTEs for logical organization
- Demonstrate proper join techniques
- Apply business logic calculations
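A partial sketch of the age calculation and chronic-condition flag, assuming staged columns named `date_of_birth`, `encounter_date`, and `icd10_code` (verify these against your staged models):

```sql
-- Partial sketch for int__patient_encounter; column names are assumptions.
with patient_encounters as (

    select
        p.patient_id,
        e.encounter_id,
        e.encounter_date,
        -- Age at time of encounter via julianday() arithmetic
        cast(
            (julianday(e.encounter_date) - julianday(p.date_of_birth)) / 365.25
            as integer
        ) as age_at_encounter
    from {{ ref('stg__patient') }} p
    inner join {{ ref('stg__encounter') }} e
        on e.patient_id = p.patient_id

),

encounter_diagnoses as (

    select
        encounter_id,
        count(*) as total_diagnoses,
        -- 1 if any diagnosis on the encounter is in the chronic list
        max(case when icd10_code in
            ('I10', 'E11.9', 'E11.65', 'J44.9', 'J44.1', 'I50.9')
            then 1 else 0 end) as has_chronic_condition
    from {{ ref('stg__diagnosis') }}
    group by encounter_id

)

select
    pe.*,
    coalesce(ed.total_diagnoses, 0)       as total_diagnoses,
    coalesce(ed.has_chronic_condition, 0) as has_chronic_condition
from patient_encounters pe
left join encounter_diagnoses ed
    on ed.encounter_id = pe.encounter_id
```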
Complete `models/final/patient_encounter_summary.sql` (a sketch of a window-function approach follows the requirements below):
- Create patient-level summary statistics
- Aggregate encounter counts, diagnosis counts
- Calculate temporal metrics (first/last encounter dates)
- Identify most common encounter types
- Count emergency visits
Key Requirements:
- Use CTEs for complex aggregations
- Demonstrate window functions or grouping techniques
- Create analytics-ready output
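One way to find the most common encounter type per patient is a grouped count ranked with `row_number()`. A minimal sketch, assuming `int__patient_encounter` exposes `patient_id`, `encounter_id`, `encounter_date`, and `encounter_type`:

```sql
-- Minimal sketch for patient_encounter_summary; column names are assumptions.
with type_counts as (

    select
        patient_id,
        encounter_type,
        row_number() over (
            partition by patient_id
            order by count(*) desc
        ) as type_rank
    from {{ ref('int__patient_encounter') }}
    group by patient_id, encounter_type

),

patient_stats as (

    select
        patient_id,
        count(distinct encounter_id) as total_encounters,
        min(encounter_date)          as first_encounter_date,
        max(encounter_date)          as last_encounter_date
    from {{ ref('int__patient_encounter') }}
    group by patient_id

)

select
    ps.*,
    tc.encounter_type as most_common_encounter_type
from patient_stats ps
left join type_counts tc
    on tc.patient_id = ps.patient_id
   and tc.type_rank = 1
```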
```
dbt seed
dbt run

# Run staged layer only
dbt run --select staged.*

# Run a specific model
dbt run --select stg__patient

dbt test
```

When you have completed the exercise, follow these steps:
Generate a CSV file from your final database with your name:

```
# Replace firstname and lastname with your actual name (lowercase, underscores)
sqlite3 main_final.db ".mode csv" ".headers on" \
    ".output firstname_lastname.csv" \
    "SELECT * FROM patient_encounter_summary;" \
    ".quit"
```

Example: For Jane Doe, the file should be named `jane_doe.csv`
```
# Add your models and CSV file
git add models/staged/ models/intermediate/ models/final/
git add firstname_lastname.csv

# Commit with a meaningful message
git commit -m "Complete DBT interview exercise - [Your Name]"

# Push your branch to remote
git push origin <firstname>_<lastname>
```

Go to the Datahub-Interview repo and open a pull request.
- Completed SQL Models: All TODOs in the staged, intermediate, and final models implemented
- Clean, Readable SQL: Proper CTE usage and clear structure
- Working Pipeline: All models run successfully with `dbt run` (no errors)
- CSV Output: Your results exported as `<firstname>_<lastname>.csv`
- Git History: Meaningful commit messages showing your work
- PR Interaction: How you handle comments on your pull request
- Date calculations: Use `julianday()` for date arithmetic, e.g. calculate age with `(julianday('now') - julianday(dob)) / 365.25`
- String functions: `substr()`, `replace()`, `trim()`, `upper()`, `lower()`
- NULL handling: `coalesce()`, `ifnull()`
```
# View compiled SQL
dbt compile

# Run with debug logging
dbt run --debug

# View logs
cat logs/dbt.log
```

- Phone number cleaning: Remove all non-numeric characters before formatting (see the sketch after this list)
- Date casting: SQLite stores dates as text, cast appropriately
- Boolean handling: SQLite uses 0/1 for booleans
- CTEs: Remember to reference them in your final SELECT
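Stock SQLite has no regex replace, so separators can be stripped with nested `replace()` calls. A minimal sketch, assuming raw values contain only digits plus spaces, dots, dashes, and parentheses:

```sql
-- Minimal sketch: strip common separators, then reformat as XXX-XXX-XXXX.
-- Assumes 10-digit US numbers; extend the replace() chain for other characters.
with digits as (

    select
        patient_id,
        replace(replace(replace(replace(replace(
            phone_number, '-', ''), '.', ''), ' ', ''), '(', ''), ')', ''
        ) as phone_digits
    from {{ ref('raw__patient') }}

)

select
    patient_id,
    substr(phone_digits, 1, 3) || '-' ||
    substr(phone_digits, 4, 3) || '-' ||
    substr(phone_digits, 7, 4) as phone_number
from digits
```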
If you have questions during the exercise, please reach out to your interview coordinator.
Good luck!