Guided Project: Clustering

Overview

In the past two lessons, you learned what Unsupervised Machine Learning is, which problems suit a solution based on it, and how to apply it, and you practiced implementing the basic phases of a solution using Scikit-learn. Now it is time to put all that conceptual and procedural knowledge to work on a larger project. Choose a problem domain that motivates you, and build a complete solution implementing all the phases you learned about in previous chapters. We provide some ideas for interesting problem domains in a dedicated section of this lesson, but we want you to be creative and adventurous and to explore other options as well. This lesson does not present any new material: everything you need to complete this project was discussed in previous lessons.

External Interface Requirements

Input requirement: capacity to read a dataset stored on disk.
Output requirement: report on optimal number of clusters, centroid coordinates and quality metric.
Output requirement: identifiers of classes corresponding to new instances classified by the model.
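
One possible shape for the two outputs is sketched below. It assumes K-means as the clustering algorithm and the silhouette score as the quality metric; neither is mandated by the requirements above. The function name `build_report` and the synthetic data are hypothetical and stand in for a dataset loaded from disk.

```python
import json

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def build_report(model, X):
    """Summarize a fitted KMeans model as a plain dictionary (hypothetical helper)."""
    labels = model.predict(X)
    return {
        "n_clusters": int(model.n_clusters),
        "centroids": model.cluster_centers_.tolist(),
        "silhouette": float(silhouette_score(X, labels)),
    }


# Synthetic stand-in for a dataset loaded from disk (made-up values).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# First output: a report with cluster count, centroids, and quality metric.
report = build_report(model, X)
print(json.dumps(report, indent=2))

# Second output: cluster identifiers assigned to new instances.
new_instances = np.array([[0.1, -0.2], [4.8, 5.1]])
new_labels = model.predict(new_instances)
print(new_labels.tolist())
```

Returning a dictionary keeps the report easy to print, save as JSON, or paste into the written deliverable.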

Functional Requirements

The software must learn a clusterization of the dataset.
The software must use the learned clusterization to classify new problem instances.
The software must evaluate the quality of a clusterization.
The software must be flexible enough to work with different preconfigured numbers of clusters.
The software must compare results using different numbers of clusters and determine which number of clusters is best.
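
The requirements above could be met with a loop like the following sketch. It assumes K-means and the silhouette score, where a higher score is better; other algorithms and metrics are equally valid choices. The helper `compare_cluster_counts`, the range of cluster counts, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def compare_cluster_counts(X, k_values, random_state=0):
    """Fit one KMeans model per k and return (best_k, scores, models)."""
    scores, models = {}, {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        # The silhouette score needs at least 2 clusters and fewer clusters than samples.
        scores[k] = silhouette_score(X, model.labels_)
        models[k] = model
    # Pick the k with the highest silhouette score.
    best_k = max(scores, key=scores.get)
    return best_k, scores, models


# Synthetic data with three well-separated groups (made-up values).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2)) for c in centers])

best_k, scores, models = compare_cluster_counts(X, range(2, 7))
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)

# The chosen model can then assign new instances to clusters.
print(models[best_k].predict(np.array([[5.8, 0.1]])).tolist())
```

Keeping every fitted model in a dictionary makes it possible to report the scores for the whole range and still reuse the best model without retraining.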

Technical Requirements

Use Python as the programming language.
Use Pandas for reading the dataset into a Pandas dataframe.
Use Scikit-learn for training and testing the Machine Learning model.
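
A minimal sketch of how these three pieces might fit together follows. The file name `usage.csv`, the column names, and the values are hypothetical; the example writes its own tiny CSV only so that it runs on its own. Standardizing the features and using K-means are assumptions, not requirements.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Write a tiny CSV so the example is self-contained; in a real project the
# file would already exist on disk. Column names and values are made up.
csv_text = """user_id,daily_minutes,apps_installed
1,30,12
2,35,15
3,28,10
4,240,80
5,260,95
6,250,88
"""
csv_path = Path(tempfile.mkdtemp()) / "usage.csv"
csv_path.write_text(csv_text)

# Read the dataset into a Pandas dataframe.
df = pd.read_csv(csv_path)

# Keep only the feature columns; the identifier is not a feature.
features = df[["daily_minutes", "apps_installed"]]

# Standardize so both features contribute comparably to distances.
scaler = StandardScaler()
X = scaler.fit_transform(features)

# Train the model and attach the cluster label of each row to the dataframe.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
df["cluster"] = model.labels_
print(df)
```

New instances must go through the same fitted `scaler` (with `scaler.transform`) before being passed to `model.predict`.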

Necessary Deliverables

Python application that performs ETL, training, and testing.
Report containing quality metrics, an explanation of the dataset, and the experimental procedure (the range of cluster counts that were tested, how the range was traversed, etc.).

Suggestions to Get Started

Find an interesting dataset! Look in the Useful Resources section for sources of ideas.
If you do not find a pre-existing dataset on the problem domain that you like, be creative: consider building the dataset yourself and donating the dataset to one of the public Machine Learning repositories.
Break down the project into smaller tasks, for instance: importing the dataset, training, etc.
Decide whether you will create a single Python application or several Python applications.

Potential Project Ideas

Segment smartphone users according to phone usage and apps installed.
Segment healthy people under 50 years of age according to their risk of developing Alzheimer's disease after 70 years of age.
Classify computer network traffic as a means to detect patterns of anomalous flows.

Useful Resources

University of California at Irvine's Machine Learning Repository
OpenML datasets
Kaggle datasets

Rubric

Read dataset into Pandas dataframe and select the training set from it: 1 point.
Model trained and evaluated: 2 points.
Model used for estimation of new instances: 1 point.
Different experiments performed using different numbers of clusters, to determine the best choice: up to 2 points.
