ENH: hierarchical clustering #11

@kzsigmond

Speeding up hierarchical clustering

The Python implementation of hierarchical clustering could likely be made faster. In max_indices, the similarity is computed with gen_sim_dict, which calculates every similarity index in its dictionary using the default k=1. If I am not mistaken, calculate_isim gives the same values as gen_sim_dict for the RR, JT, and SM indices in this case, so building the full dictionary is unnecessary when only one of those indices is needed.

Proposed change

Replace line 67 in iSIM/iSIM/clustering.py with:
s = calculate_isim(data=fp1+fp2, n_objects=n, n_ary=n_ary)
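
To illustrate why a call of this shape can work on merged clusters, here is a minimal sketch of an instantaneous similarity computed from column sums. It is not the library's code: `isim_from_column_sums` is a hypothetical stand-in, and it assumes that fp1 and fp2 are the per-cluster column sums of binary fingerprints and that n is the combined number of objects. Under those assumptions, adding the two column-sum vectors gives the column sums of the merged cluster, so no pairwise comparisons are needed.

```python
import numpy as np


def isim_from_column_sums(c_total, n_objects, n_ary="RR"):
    """Hypothetical stand-in: similarity of a set from its column sums.

    c_total[j] is the number of objects with a 1 in column j.
    """
    c = np.asarray(c_total, dtype=float)
    n = n_objects
    a = np.sum(c * (c - 1) / 2)              # pairs that share a 1 in a column
    d = np.sum((n - c) * (n - c - 1) / 2)    # pairs that share a 0 in a column
    dis = np.sum(c * (n - c))                # pairs that disagree in a column
    total = a + d + dis
    if n_ary == "RR":
        return a / total
    if n_ary == "JT":
        return a / (a + dis)
    if n_ary == "SM":
        return (a + d) / total
    raise ValueError(f"unsupported index: {n_ary}")


# Toy data (assumption): two small clusters of 16-bit fingerprints.
rng = np.random.default_rng(0)
cluster1 = rng.integers(0, 2, size=(3, 16))
cluster2 = rng.integers(0, 2, size=(4, 16))

fp1 = cluster1.sum(axis=0)   # column sums of cluster 1
fp2 = cluster2.sum(axis=0)   # column sums of cluster 2
n = cluster1.shape[0] + cluster2.shape[0]

# Same call shape as the proposed line, using the stand-in function.
s = isim_from_column_sums(fp1 + fp2, n_objects=n, n_ary="RR")
```

In this sketch the RR and SM values equal the average over all pairwise comparisons in the merged cluster, since every pair contributes the same denominator. The JT value is a ratio of summed counts rather than a mean of pairwise ratios.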

Initial timing results

Measured on 50 fingerprints (size 2048):

  • gen_sim_dict took 5.3 seconds
  • calculate_isim took 1.2 seconds (roughly 4.4x faster)

Potential issues

With this change, hierarchical clustering would be limited to the RR, JT, and SM indices with k=1. The other similarity indices available in gen_sim_dict, and other k values, could no longer be used.
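
One possible way to keep both options is to take the faster path only when it applies and fall back to the general one otherwise. The sketch below is an assumption, not part of the current code: `choose_backend` and the `"fast"`/`"general"` labels are hypothetical, and it relies on the unverified claim above that the two functions agree for RR, JT, and SM at k=1.

```python
# Indices for which the single-index path is assumed to match gen_sim_dict.
FAST_INDICES = frozenset({"RR", "JT", "SM"})


def choose_backend(n_ary, k=1):
    """Hypothetical helper: decide which similarity routine to call.

    Returns "fast" when the single-index path is assumed valid
    (k == 1 and a supported index), otherwise "general".
    """
    if k == 1 and n_ary in FAST_INDICES:
        return "fast"
    return "general"


# Example: the default case takes the fast path, other settings fall back.
default_choice = choose_backend("RR")
fallback_choice = choose_backend("JT", k=2)
```

The caller would then invoke calculate_isim for `"fast"` and keep the existing gen_sim_dict call for `"general"`, so the other indices and k values stay usable at the original cost.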

Labels: enhancement
