Skip to content
patlas edited this page Dec 10, 2015 · 21 revisions

DBSCAN - python approach

Welcome on wiki page of DBSCAN clustering algorithm. Main target is prepareing our own implementation of DBSCAN clustering algorithm in object oriented script language like Python.

About DBscan - Wikipedia source

DBscan (Density-based spatial clustering of applications with noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature.

Environment requirements:

  • Python 3.5
  • Matplotlib
  • Scipy
  • Numpy
  • Scikit

Example of usage

Comparison of generated clusters on random data

The only thing that is necessary to run comparison test is to start script scikitresults_test.py included in test directory. Effects of our approache and scikit solution show pictures below. test1 test2

Random data clustering

test3 test4

Time comparison between scikit and selve approach:

Number of samples: 100

  • Scikit average time of computing = 0.00025
  • Our version average time of computing = 0.00308
  • Our algorithm is 12.42 slower than algorithm from scikit

Number of samples: 500

  • Scikit average time of computing = 0.00143
  • Our version average time of computing = 0.08387
  • Our algorithm is 58.70 slower than algorithm from scikit

Number of samples: 1000

  • Scikit average time of computing = 0.00317
  • Our version average time of computing = 0.33529
  • Our algorithm is 105.76 slower than algorithm from scikit

Number of samples: 2000

  • Scikit average time of computing = 0.01239
  • Our version average time of computing = 1.33483
  • Our algorithm is 107.69 slower than algorithm from scikit

Number of samples: 3000

  • Scikit average time of computing = 0.01406
  • Our version average time of computing = 3.01206
  • Our algorithm is 214.27 slower than algorithm from scikit

Number of samples: 4000

  • Scikit average time of computing = 0.02061
  • Our version average time of computing = 5.69492
  • Our algorithm is 276.27 slower than algorithm from scikit

Number of samples: 5000

  • Scikit average time of computing = 0.01736
  • Our version average time of computing = 8.03304
  • Our algorithm is 462.74 slower than algorithm from scikit

Number of samples: 10000

  • Scikit average time of computing = 0.08937
  • Our version average time of computing = 35.53449
  • Our algorithm is 397.63 slower than algorithm from scikit

Plots based on data include above: comp1 comp2