In the past two lessons, you learned what Unsupervised Machine Learning is, which problems suit a solution based on it, and how to apply it, and you practiced implementing the basic phases of a solution using Scikit-learn. Now it is time to put all that conceptual and procedural knowledge to work on a larger project. Choose a problem domain that motivates you, and build a complete solution that implements all the phases you learned about in previous lessons. We provide some ideas for interesting problem domains in a dedicated section of this lesson, but we want you to be creative and adventurous and to explore other options as well. This lesson does not present any new material: everything you need to complete this project was discussed in previous lessons.
- Input requirement: capacity to read a dataset stored on disk.
- Output requirement: report on optimal number of clusters, centroid coordinates and quality metric.
- Output requirement: identifiers of classes corresponding to new instances classified by the model.
- The software must learn a clusterization of the dataset.
- The software must use the learned clusterization to classify new problem instances.
- The software must evaluate the quality of a clusterization.
- The software must be flexible enough to work with different preconfigured numbers of clusters.
- The software must compare results using different numbers of clusters and determine which number of clusters is best.
- Use Python as the programming language.
- Use Pandas for reading the dataset into a Pandas dataframe.
- Use Scikit-learn for training and testing the Machine Learning model.
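
As a starting point, here is a minimal sketch of how the requirements above might fit together: reading a dataset from disk with Pandas, learning a clusterization with Scikit-learn, and assigning cluster identifiers to new instances. Several choices here are our assumptions and not part of the requirements: K-Means as the algorithm, feature scaling with `StandardScaler`, the helper names `load_features`, `train_kmeans` and `assign_clusters`, and the synthetic two-column dataset written to a temporary CSV file. Treat it as an illustration, not as the expected solution.

```python
from pathlib import Path
import tempfile

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def load_features(csv_path, feature_columns):
    """Read a CSV file from disk and keep the selected columns, dropping rows with gaps."""
    frame = pd.read_csv(csv_path)
    return frame[feature_columns].dropna()


def train_kmeans(features, n_clusters, random_state=0):
    """Scale the features and fit K-Means with a preconfigured number of clusters."""
    scaler = StandardScaler().fit(features)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    model.fit(scaler.transform(features))
    return scaler, model


def assign_clusters(scaler, model, new_instances):
    """Return the cluster identifier of each new instance."""
    return model.predict(scaler.transform(new_instances))


# Synthetic stand-in for a real dataset: two well-separated groups of points.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
demo = pd.DataFrame(np.vstack([blob_a, blob_b]), columns=["x", "y"])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "demo.csv"
    demo.to_csv(csv_path, index=False)
    features = load_features(csv_path, ["x", "y"])

scaler, model = train_kmeans(features, n_clusters=2)

# New instances must have the same columns as the training features.
new_points = pd.DataFrame({"x": [0.1, 4.9], "y": [-0.2, 5.1]})
labels = assign_clusters(scaler, model, new_points)

# Centroids are learned in scaled space; convert them back to the original units.
centroids = scaler.inverse_transform(model.cluster_centers_)
print("Centroids (original units):", centroids)
print("Cluster identifiers for new instances:", labels)
```

Keeping the scaler together with the model matters in this sketch: new instances have to go through the same transformation as the training data before `predict` is called.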
- Python application that performs ETL, training, and testing.
- Report containing quality metrics, an explanation of the dataset, and the experimental procedure (the range of cluster counts that were tested, how the range was traversed, etc.).
- Find an interesting dataset! Look in the Useful Resources section for sources of ideas.
- If you do not find a pre-existing dataset on the problem domain that you like, be creative: consider building the dataset yourself and donating the dataset to one of the public Machine Learning repositories.
- Break down the project into smaller tasks, for instance: importing the dataset, training, etc.
- Decide whether you will create a single Python application or several Python applications.
- Segment smartphone users according to phone usage and apps installed.
- Segment healthy people under 50 years of age according to their risk of developing Alzheimer's disease after 70 years of age.
- Classify computer network traffic as a means to detect patterns of anomalous flows.
- University of California at Irvine's Machine Learning Repository
- OpenML datasets
- Kaggle datasets
- Read dataset into Pandas dataframe and select the training set from it: 1 point.
- Model trained and evaluated: 2 points.
- Model used for estimation of new instances: 1 point.
- Different experiments performed using different numbers of clusters, to determine the best choice: up to 2 points.
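
The experiments with different numbers of clusters could be organized along the lines of the sketch below. It is only one possible approach, under assumptions of ours: K-Means as the algorithm, the silhouette score as the quality metric (the project does not prescribe a specific one), a linear sweep over 2 to 6 clusters, and synthetic data from `make_blobs` in place of your real dataset. The helper names `sweep_cluster_counts` and `pick_best` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def sweep_cluster_counts(features, candidate_counts, random_state=0):
    """Fit one K-Means model per candidate count and record its silhouette score."""
    results = {}
    for k in candidate_counts:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(features)
        results[k] = (silhouette_score(features, labels), model)
    return results


def pick_best(results):
    """Return the cluster count with the highest silhouette score."""
    return max(results, key=lambda k: results[k][0])


# Synthetic data with three well-separated groups (illustrative only).
features, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

results = sweep_cluster_counts(features, range(2, 7))
best_k = pick_best(results)
best_score, best_model = results[best_k]

# Material for the report: score per cluster count, chosen count, and centroids.
for k, (score, _) in results.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best number of clusters:", best_k)
print("Centroids:", best_model.cluster_centers_)
```

The silhouette score needs at least two clusters, which is why the sweep starts at 2. Other metrics, such as inertia combined with the elbow heuristic, are equally valid choices as long as the report explains which one was used and why.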
