Home

Jump to bottom Edit New page

patlas edited this page Dec 10, 2015 · 21 revisions

DBSCAN - python approach

Welcome on wiki page of DBSCAN clustering algorithm. Main target is prepareing our own implementation of DBSCAN clustering algorithm in object oriented script language like Python.

About DBscan - Wikipedia source

DBscan (Density-based spatial clustering of applications with noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

Environment requirements:

Python 3.5
Matplotlib
Scipy
Numpy
Scikit

Example of usage

Comparison of generated clusters on random data

The only thing that is necessary to run comparison test is to start script scikitresults_test.py included in test directory. Effects of our approache and scikit solution show pictures below. test1 test2

Random data clustering

test3 test4

Time comparison between scikit and selve approach:

Number of samples: 100

Scikit average time of computing = 0.00025
Our version average time of computing = 0.00308
Our algorithm is 12.42 slower than algorithm from scikit

Number of samples: 500

Scikit average time of computing = 0.00143
Our version average time of computing = 0.08387
Our algorithm is 58.70 slower than algorithm from scikit

Number of samples: 1000

Scikit average time of computing = 0.00317
Our version average time of computing = 0.33529
Our algorithm is 105.76 slower than algorithm from scikit

Number of samples: 2000

Scikit average time of computing = 0.01239
Our version average time of computing = 1.33483
Our algorithm is 107.69 slower than algorithm from scikit

Number of samples: 3000

Scikit average time of computing = 0.01406
Our version average time of computing = 3.01206
Our algorithm is 214.27 slower than algorithm from scikit

Number of samples: 4000

Scikit average time of computing = 0.02061
Our version average time of computing = 5.69492
Our algorithm is 276.27 slower than algorithm from scikit

Number of samples: 5000

Scikit average time of computing = 0.01736
Our version average time of computing = 8.03304
Our algorithm is 462.74 slower than algorithm from scikit

Number of samples: 10000

Scikit average time of computing = 0.08937
Our version average time of computing = 35.53449
Our algorithm is 397.63 slower than algorithm from scikit

Plots based on data include above: comp1 comp2