This repo contains a class that implements k-nearest neighbors regression that avoids time leakage.
You can run python house_price_prediction.py in the command line to see the demo.
pandasnumpysklearntqdm
-
What is the performance of the model measured in Median Relative Absolute Error?
The Relative Median Absolute Error (RMAE) is 0.2128.
-
What would be an appropriate methodology to determine the optimal k?
We can use grid search to determine the optimal k, which performs cross-validation and choose the optimal k with lowest MRAE. Execute the following commands:
from sklearn.grid_search import GridSearchCV parameters = [{'k': np.arange(1,10)}] clf = GridSearchCV(housePricePredictor(), parameters, n_jobs=-1) clf.fit(X_train, y_train) clf.best_params_ -
Do you notice any spatial or temporal trends in error?
There is no noticable spatial or temporal trends in error.
-
How would you improve this model?
- Improve the accuracy:
- Adding more features to the model
- Using grid search cross-validation to find optimal parameters for the model.
- Improve the efficiency:
- Precomputing the distance matrix.
- Improve the accuracy:
-
How would you productionize this model?
Since our K-nearest Neighbors model is computationally intensive, we need to speed up the program. There are several ways we could try:
- Using
cythonto speed up computation - Using
numbato speed up computation - Using processes in parallel with
ProcessPoolExecutor - Using Apache Spark framework
- Using