Motivation, references, and the bulk of the explanation of the methods are detailed in report.pdf.
The code is written in Python 3.7.3 in a Jupyter notebook. Only the .ipynb file has been uploaded; there is no .py version.
Running all of the code requires matplotlib, numpy, scipy, and scikit-learn, all of which can be installed with pip. If they are not already installed, paste the following block into a cell above the notebook's imports:
!pip install numpy
!pip install matplotlib
!pip install scikit-learn
!pip install scipy
Once all dependencies are installed, run all cells so that every function is defined. All cells except the last should finish in under a minute; the last cell takes around 2 minutes. Running all cells also runs three "playground" cells, which give example usage of the functions discussed below.
The primary function of interest is cluster_and_analyze, which clusters the data according to its parameters, displays the dendrogram, and reports the purity, Rand index, and normalized mutual information of the clustering. cluster_and_analyze takes the following parameters:
- normalized: boolean value, when True will normalize the data
- distance_func: function, should be one of the functions defined in the section Cluster Similarity Functions of the notebook
- distance_metric: function, should be one of the functions defined in the section Distance Functions of the notebook
- use_weights: 1D number array of length 4, optional, by default [1,1,1,1], specifies the weights to use on the measures sepal length, sepal width, petal length, and petal width, in that order
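To make two of the reported measures concrete, here is a minimal sketch of purity and the Rand index in plain Python. It is not the notebook's implementation: the function names purity and rand_index and the toy labels are assumptions chosen for illustration. Normalized mutual information is omitted because its value depends on the normalization used; scikit-learn provides it as sklearn.metrics.normalized_mutual_info_score.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    clusters = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        clusters.setdefault(cluster, []).append(truth)
    # For each cluster, count how many members share its most common true class.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(true_labels)


def rand_index(true_labels, cluster_labels):
    """Fraction of point pairs on which the two labelings agree.

    A pair agrees when both labelings put the two points together,
    or both put them apart. Requires at least two points.
    """
    pairs = list(combinations(range(len(true_labels)), 2))
    agreements = sum(
        (true_labels[i] == true_labels[j]) == (cluster_labels[i] == cluster_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example (hypothetical data): six points, three true classes, three clusters.
truth = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
found = [0, 0, 1, 1, 1, 2]
print("purity:", purity(truth, found))
print("Rand index:", rand_index(truth, found))
```

Both measures equal 1 when the clustering reproduces the true classes exactly and drop as points are grouped with members of other classes.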
The notebook also includes a section on varying the weights. Two functions in that section are of particular interest:
- greedy_optimize runs a greedy "optimization" of the four weights, whose explanation can be found in the Methods section of the report. greedy_optimize takes the following parameters:
  - clust_sim: function, should be one of the functions defined in the section Cluster Similarity Functions of the notebook
  - dist_metric: function, should be one of the functions defined in the section Distance Functions of the notebook
  - threshold: float value, the stopping condition for the function, based on the difference in cluster evaluation values between successive iterations
  - clust_eval: function, should be one of the functions defined in the section Cluster Evaluation Functions of the notebook
- plot_weight_range plots the cluster evaluation measures based on the parameters specified below:
  - use_weights: 1D number array of length 4, specifies the weights to use on the measures sepal length, sepal width, petal length, and petal width, in that order
  - weight_index: the index of use_weights to which weight_range will apply (making the value at this index defined in use_weights arbitrary)
  - weight_range: a list of values that specifies how the value of use_weights at the desired weight_index will vary. For example, passing in range(0,5) will use 0, then 1, then 2, then 3, then 4 for use_weights[weight_index] during clustering
  - clust_sim: function, should be one of the functions defined in the section Cluster Similarity Functions of the notebook
  - dist_metric: function, should be one of the functions defined in the section Distance Functions of the notebook
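The way weight_index and weight_range interact can be sketched as a simple sweep over one entry of use_weights. This is only an illustration under stated assumptions, not the notebook's code: sweep_weight and toy_score are hypothetical names, and toy_score stands in for "cluster with these weights, then evaluate the result".

```python
def sweep_weight(use_weights, weight_index, weight_range, evaluate):
    """Score each candidate value for one weight, holding the others fixed."""
    results = []
    for value in weight_range:
        weights = list(use_weights)    # copy so the caller's list is untouched
        weights[weight_index] = value  # the original entry at this index is ignored
        results.append((value, evaluate(weights)))
    return results


def toy_score(weights):
    # Hypothetical stand-in for clustering plus evaluation.
    # It peaks when the petal length weight (index 2) equals 3.
    return 1.0 - abs(weights[2] - 3) / 10


# Vary the petal length weight over 0, 1, 2, 3, 4 with the other weights at 1.
for value, score in sweep_weight([1, 1, 1, 1], 2, range(0, 5), toy_score):
    print(value, score)
```

Because the entry at weight_index is overwritten on every pass, whatever value use_weights holds at that position has no effect on the results.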
Library versions used:
Matplotlib: 3.3.2
Numpy: 1.16.4
SciPy: 1.2.1
SKLearn/SciKit-Learn: 0.21.2
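To compare an environment against the versions listed above, a short check along these lines can be run in a notebook cell. It is a sketch rather than part of the notebook: the EXPECTED dictionary and the installed_version helper are names introduced here, and it relies on each library exposing a __version__ attribute. Note that scikit-learn is imported under the name sklearn.

```python
import importlib

# Versions listed in this README, keyed by import name.
EXPECTED = {
    "matplotlib": "3.3.2",
    "numpy": "1.16.4",
    "scipy": "1.2.1",
    "sklearn": "0.21.2",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except Exception:  # treat any import failure as "not available"
        return None
    return getattr(module, "__version__", None)


for name, expected in EXPECTED.items():
    found = installed_version(name)
    status = "ok" if found == expected else "differs"
    print(f"{name}: expected {expected}, found {found} ({status})")
```

A differing version does not necessarily mean the notebook will fail; it only indicates that the environment is not identical to the one listed.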