Authors: Group 3 – Juan Camilo Luján, Laurenz Jakob, Raluca Gogosoiu, Silvia Mendoza & Stephan Pentchev
Course: MLOps – Master in Business Analytics and Data Science
Status: Production Refactoring Phase
Develop a scalable machine learning system that accurately predicts residential house sale prices using structured property data.
The solution aims to:
- Improve valuation accuracy
- Reduce reliance on manual appraisal heuristics
- Provide consistent, data-driven pricing
- Enable scalable deployment from Ames (Iowa) to state-wide and eventually nationwide markets
The long-term vision is to build a fully automated housing valuation engine for real estate agencies, lenders, and investors.
Primary users:
- Real estate agencies
- Mortgage lenders
- Property investors
- Individual sellers
Outputs:
- Automated sale price predictions
- Evaluation reports
- Scalable production-ready pipeline
- Reduce pricing error compared to manual estimation
- Decrease time-to-price by 30%
- Increase pricing consistency across neighborhoods
- Improve listing competitiveness
Model: LassoCV (Regularized Linear Regression)
Production pipeline:
- R² (train) ≈ 0.804

Baseline notebook (before refactoring):
- R² (train) ≈ 0.904
- R² (validation) ≈ 0.893
- 163 / 256 coefficients retained (automatic feature selection)

Target:
- R² ≥ 0.8
- Stable residual distribution
- Controlled bias across price segments
Residential housing data from Ames, Iowa.
- Target variable: SalePrice (USD)
- 81 structured features, including:
  - Living area (GrLivArea)
  - Neighborhood
  - Year built
  - Basement and garage features
  - Overall quality indicators
Missing value handling:
- Categorical → "None"
- Numerical → 0

Rare category grouping:
- Categories with <10 observations grouped into "Other"

Log transformation:
- Applied to skewed features
- Target variable modeled in log-space

Multicollinearity reduction:
- Removed features with correlation > 0.8

One-hot encoding:
- Applied to categorical variables
- No personally identifiable information (PII)
- `/data`, `/models`, and `/reports` are excluded from Git
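The preprocessing steps above can be sketched with pandas. This is an illustrative reimplementation, not the project's `clean_data.py`: the <10 rare-category cutoff and the 0.8 correlation threshold come from the list above, while the 0.75 skewness cutoff is an assumption.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, rare_min: int = 10, corr_max: float = 0.8,
               skew_cut: float = 0.75) -> pd.DataFrame:
    """Illustrative replay of the documented cleaning steps (generic columns)."""
    df = df.copy()
    cat_cols = list(df.select_dtypes(include="object").columns)
    num_cols = list(df.select_dtypes(include="number").columns)

    # 1. Missing values: categorical -> "None", numerical -> 0
    df[cat_cols] = df[cat_cols].fillna("None")
    df[num_cols] = df[num_cols].fillna(0)

    # 2. Rare categories (< rare_min observations) -> "Other"
    for col in cat_cols:
        counts = df[col].value_counts()
        df[col] = df[col].replace(list(counts[counts < rare_min].index), "Other")

    # 3. Log-transform skewed numeric features (skew threshold is an assumption)
    skewed = [c for c in num_cols if abs(df[c].skew()) > skew_cut]
    df[skewed] = np.log1p(df[skewed])

    # 4. Drop one of each highly correlated numeric pair (|corr| > corr_max)
    corr = df[num_cols].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    df = df.drop(columns=[c for c in upper.columns if (upper[c] > corr_max).any()])

    # 5. One-hot encode remaining categorical variables
    return pd.get_dummies(df, columns=cat_cols)
```

Running the steps in this order (impute → group → transform → decorrelate → encode) keeps the correlation check on the original numeric columns, before one-hot encoding inflates the column count.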
This project follows a strict separation between experimental work, production code, data layers, and testing.
.
├── README.md # Project documentation
├── LICENSE # License file
├── .gitignore # Git ignored files
├── config.yaml # Global configuration (paths, parameters)
├── environment.yml # Conda environment specification
├── .coveragerc # Configures the coverage tool itself
├── pytest.ini # Configures how pytest runs
│
├── data/ # Local data storage (ignored by Git)
│ ├── raw/ # Immutable original datasets
│ ├── processed/ # Cleaned datasets for training
│ └── inference/ # Unseen data for inference (no SalePrice)
│
├── models/ # Saved trained model artifacts (ignored by Git)
│
├── reports/ # Generated metrics, evaluation outputs, predictions
│
├── notebook/ # Experimental sandbox (Jupyter notebooks)
│ ├── HousePred-LassoReg.ipynb # Original baseline notebook
│ ├── experiment.ipynb # Experimental notebook for data scientists
│
├── src/ # Production pipeline (core ML system)
│ ├── __init__.py # Makes src a Python package
│ ├── api.py # FastAPI service — /health and /predict endpoints
│ ├── clean_data.py # Data preprocessing & feature preparation
│ ├── evaluate.py # Model evaluation and metrics handling
│ ├── features.py # Feature engineering utilities
│ ├── infer.py # Inference logic (prediction on new data)
│ ├── load_data.py # Data loading utilities
│ ├── logger.py # Dual-output logging (console + file)
│ ├── main.py # Pipeline orchestrator (training workflow)
│ ├── train.py # Model training logic
│ ├── utils.py # Helper utilities (e.g., path handling)
│ └── validate.py # Data validation checks
│
└── tests/ # Automated unit tests (pytest)
The full machine learning pipeline will eventually be executable through:

1. Environment setup:

   ```bash
   conda env create -f environment.yml
   conda activate mlops_project
   ```

2. Launch the sandbox:

   ```bash
   code notebook/HousePred-LassoReg.ipynb
   ```

3. Run the test suite:

   ```bash
   python -m pytest -q
   ```

4. Run the orchestrator:

   ```bash
   python -m src.main
   ```
The model is deployed as a live FastAPI service.
https://my-project1-mlops.onrender.com/docs
Use this interface to test the API interactively.
GET /health
Returns:
- service status
- model_loaded flag
- model version
POST /predict
Example request:

```json
{
  "records": [
    {
      "Id": 1,
      "LotArea": 8450,
      "GrLivArea": 1710,
      "Neighborhood": "CollgCr",
      "OverallQual": 7,
      "YearBuilt": 2003
    }
  ]
}
```
Returns:
- predicted SalePrice
- model version
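The two endpoints can be called with only the standard library. The sketch below is an illustrative client, not project code: the function names `health` and `predict` are our own, and the record fields mirror the documented example request.

```python
import json
from urllib import request

BASE_URL = "https://my-project1-mlops.onrender.com"

def health(base_url: str = BASE_URL) -> dict:
    """GET /health and return the parsed status payload."""
    with request.urlopen(f"{base_url}/health", timeout=10) as resp:
        return json.load(resp)

def predict(records: list, base_url: str = BASE_URL) -> dict:
    """POST the documented {"records": [...]} shape to /predict."""
    req = request.Request(
        f"{base_url}/predict",
        data=json.dumps({"records": records}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# Example call (requires network access):
# predict([{"Id": 1, "LotArea": 8450, "GrLivArea": 1710,
#           "Neighborhood": "CollgCr", "OverallQual": 7, "YearBuilt": 2003}])
```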
1. data/processed/clean.csv: The deterministically cleaned input data
2. models/model.joblib: The deployable pipeline artifact
3. reports/predictions.csv: The inference log containing predictions
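The model artifact works as a joblib round-trip: the fitted pipeline is dumped to disk at training time and reloaded for inference. A toy sketch of that round-trip on synthetic data (the actual contents of `models/model.joblib` are the project's Lasso pipeline, not this one):

```python
import joblib
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fit a toy pipeline, persist it, and reload it -- the same round-trip that
# produces the model artifact and serves it at inference time.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, 1.5, 0.0, 2.0]) + rng.normal(scale=0.1, size=200)

pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42)).fit(X, y)
joblib.dump(pipe, "model.joblib")       # stand-in for models/model.joblib
reloaded = joblib.load("model.joblib")  # what the inference step does at startup
```

Persisting the whole pipeline (scaler plus model) rather than the bare estimator is what keeps the cleaned CSV and the artifact in sync.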
Real Estate Agency / Property Valuation Firm
Industry: Real Estate & Financial Services
- Property Valuation
- Sales Strategy
- Investment Analysis
- Structured housing data available
- Spreadsheet-based pricing workflow
- Manual appraisal processes
- Limited automation
- Strong interest in analytics-driven valuation
Current pricing relies heavily on:
- Manual valuation
- Comparable property heuristics
- Subjective experience
- Inconsistent pricing
- Bias in high-end property estimates
- Time-consuming valuation process
- Missed revenue opportunities
- Validation R² ≥ 0.8
- Reduced pricing variance
- Faster listing preparation
- Improved pricing consistency
A scalable machine learning valuation engine that:
- Cleans and preprocesses structured housing data
- Identifies key value drivers
- Produces automated sale price predictions
- Ensures reproducible and modular MLOps pipeline execution
- REST API deployment
- Real-time pricing dashboard
- Multi-region training framework
- Production model for Ames
- Expand to Iowa statewide data
- Incorporate regional economic indicators
- Nationwide automated valuation engine
- Integration with real estate listing platforms
- Continuous retraining and monitoring
- Higher pricing accuracy
- Reduced manual workload
- Faster time-to-market
- More consistent pricing strategy
- Data-driven credibility
- Competitive advantage
- Scalable analytics infrastructure
- Reduced prediction error
- Improved R²
- Reduced listing cycle time
- AI Specialist
- ML Engineer
- Data Engineer
- Product Manager
- Real Estate Subject Matter Expert
Estimated development phase: 12+ weeks
Estimated pilot budget: $60k – $120k
- Data infrastructure
- Cloud compute
- Deployment environment
- Limited generalization beyond Ames
- Bias at price extremes
- Data quality variations across regions
- Model drift over time
- Continuous monitoring of residuals
- Scheduled retraining
- Bias audits
- Geographic data expansion
| Field | Details |
|---|---|
| Model type | Lasso Regression (regularized linear model) |
| Target | SalePrice (USD, modeled in log-space via log1p, inverted at inference with expm1) |
| Input features | OverallQual, YearBuilt, LotArea, GrLivArea, Neighborhood |
| Training data | Ames, Iowa residential housing dataset (data/raw/train.csv) |
| Train/test split | 80% train / 20% test, random_state=42 |
| Hyperparameter tuning | GridSearchCV over alpha ∈ [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1], 5-fold KFold CV |
| Evaluation metrics | RMSE, MAE, R², RMSLE |
| Reported R² (train) | ≈ 0.804 |
| Reported R² (test) | ≈ 0.893 (baseline notebook) |
| Model registry | W&B artifact aliased prod under juan-lujan-/house-price-prediction |
| Inference | Served via FastAPI /predict endpoint, pulls prod artifact from W&B at startup |
| Known limitations | Trained on Ames, Iowa only — may not generalize to other markets without retraining |
| Fairness considerations | No demographic data used; Neighborhood encoding may reflect historical pricing biases |
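The model card maps almost line-by-line onto scikit-learn. The sketch below reproduces the described setup (log1p target, 80/20 split with `random_state=42`, GridSearchCV over the listed alphas with 5-fold CV, `expm1` at inference) on synthetic stand-in data; the one-hot/scaling preprocessing is a simplification, not the project's exact feature pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for data/raw/train.csv (made-up values, not real Ames rows)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, n),
    "YearBuilt": rng.integers(1900, 2010, n),
    "LotArea": rng.integers(5000, 20000, n),
    "GrLivArea": rng.integers(800, 3000, n),
    "Neighborhood": rng.choice(["CollgCr", "OldTown", "NAmes"], n),
})
df["SalePrice"] = 50 * df["GrLivArea"] + 10_000 * df["OverallQual"] + rng.normal(0, 5000, n)

X = df.drop(columns="SalePrice")
y = np.log1p(df["SalePrice"])                       # target modeled in log-space

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
    ("num", StandardScaler(), ["OverallQual", "YearBuilt", "LotArea", "GrLivArea"]),
])
grid = GridSearchCV(
    Pipeline([("pre", pre), ("lasso", Lasso(max_iter=10_000))]),
    {"lasso__alpha": [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]},  # grid from the card
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
grid.fit(X_tr, y_tr)

preds = np.expm1(grid.predict(X_te))                # invert log1p at inference
print("best alpha:", grid.best_params_["lasso__alpha"],
      "| R² (log-space, test):", round(r2_score(y_te, grid.predict(X_te)), 3))
```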
- Initial production release
- Modular `src/` pipeline: `load_data`, `clean_data`, `validate`, `features`, `train`, `evaluate`, `infer`
- `main.py` orchestrates end-to-end training with W&B tracking and model artifact upload
- FastAPI `/health` and `/predict` endpoints with a strict Pydantic contract
- Dual-output logging (console + file) via `src/logger.py`, with zero `print()` in production code
- Docker containerization with `.dockerignore` for a lean image
- CI pipeline (`.github/workflows/ci.yml`) runs tests and validates the Docker build on every PR
- CD pipeline (`.github/workflows/deploy.yml`) deploys to Render on GitHub Release
- `conda-lock.yml` for a fully reproducible Linux environment
- 54 tests across all modules, 84% coverage
- W&B model artifact promoted with alias `prod` for production inference
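The dual-output logging noted above can be built on the standard library's `logging` module. A minimal sketch, not `src/logger.py` verbatim (the log path and format string are assumptions):

```python
import logging
from pathlib import Path

def get_logger(name: str, log_file: str = "logs/pipeline.log") -> logging.Logger:
    """Logger that writes to both console and file (illustrative configuration)."""
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False            # keep messages out of the root logger
    if not logger.handlers:             # avoid duplicate handlers on repeated calls
        fmt = logging.Formatter("%(asctime)s | %(name)s | %(levelname)s | %(message)s")
        for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger
```

Guarding against re-adding handlers matters in pipelines: modules that each call `get_logger` would otherwise emit every message once per registered handler.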