HKDataMiner is a Python library for constructing statistical models of biomolecular dynamics data. It was originally developed by Prof. Xuhui Huang's group at HKUST.
Note: Prof. Xuhui Huang has since moved to the University of Wisconsin–Madison, where his current lab is based.
HKDataMiner was started in 2014 and built on the Python 2 ecosystem. After serving the community for several years, the project entered a dormant phase without active maintenance.
Recently, we have revitalized the project with a major modernization:
- Python 3.10+ Support: The codebase has been fully ported from Python 2 to Python 3.10, ensuring compatibility with modern scientific computing stacks.
- Modern Engineering: We reorganized the package structure, added mamba environment support, and introduced a unified CLI (hkdm) for better usability.
- Active Maintenance: We are committed to maintaining and updating HKDataMiner to support modern computational workflows and new research needs.
The core highlight of this package is APLoD (Adaptive Partitioning by Local Density-peaks), a highly efficient clustering algorithm designed specifically for analyzing large-scale Molecular Dynamics (MD) trajectories and constructing Markov State Models (MSMs).
Adaptive Partitioning by Local Density-peaks (APLoD) addresses the challenge of clustering ultra-large MD datasets containing millions of conformations.
Figure: Flowchart of the APLoD algorithm. The kNN search is efficiently performed by pre-organizing the dataset into a VP-tree structure. (Adapted from Liu et al., J. Comput. Chem. 2017)
- Extreme Efficiency: APLoD reduces running time and memory usage by 2–3 orders of magnitude compared to standard Density Peaks (DP) algorithms. It achieves a time complexity of $O(N \log N)$ and a space complexity of $O(N)$, making it feasible to run on standard desktops even for massive datasets.
- Adaptive Resolution: Unlike geometric clustering (e.g., K-Centers), which tends to partition space uniformly, APLoD is density-based. It produces clusters with adaptive sizes:
  - High-density regions (energy minima): Finer resolution with smaller clusters.
  - Low-density regions (transition states): Coarser resolution with larger clusters.
  - This automatically minimizes statistical error within clusters, preserving kinetic boundaries.
- Local Density Estimation: By using k-Nearest-Neighbors (kNN) search (via VP-trees), APLoD estimates density locally. This avoids the quadratic $O(N^2)$ cost of global density estimation found in traditional methods.
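The kNN-based density idea can be illustrated with a small, self-contained sketch. This is not HKDataMiner's code: the function names (`build_vp_tree`, `knn_distances`, `knn_density`) are hypothetical, the tree is a bare-bones VP-tree written for readability, and the density estimator (inverse mean distance to the k nearest neighbors) is an assumption chosen for simplicity rather than the exact definition used by APLoD.

```python
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_vp_tree(points):
    """Build a VP-tree as nested tuples: (vantage, radius, inner, outer)."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return (vantage, 0.0, None, None)
    d_to_vantage = [dist(vantage, p) for p in rest]
    radius = sorted(d_to_vantage)[len(rest) // 2]  # median distance
    inner = [p for p, d in zip(rest, d_to_vantage) if d < radius]
    outer = [p for p, d in zip(rest, d_to_vantage) if d >= radius]
    return (vantage, radius, build_vp_tree(inner), build_vp_tree(outer))


def knn_distances(tree, query, k):
    """Return the k smallest distances from query to the points in the tree."""
    heap = []  # max-heap of the best distances so far (stored negated)

    def tau():
        # Current search radius: distance to the k-th best candidate so far.
        return -heap[0] if len(heap) == k else math.inf

    def search(node):
        if node is None:
            return
        vantage, radius, inner, outer = node
        d = dist(query, vantage)
        if len(heap) < k:
            heapq.heappush(heap, -d)
        elif d < -heap[0]:
            heapq.heapreplace(heap, -d)
        # Visit the more promising side first; prune the other side with the
        # triangle inequality when it cannot hold a closer point.
        if d < radius:
            search(inner)
            if d + tau() >= radius:
                search(outer)
        else:
            search(outer)
            if d - tau() <= radius:
                search(inner)

    search(tree)
    return sorted(-h for h in heap)


def knn_density(tree, point, k):
    """Assumed estimator: inverse mean distance to the k nearest neighbors."""
    # Ask for k + 1 neighbors because the point itself is in the tree (distance 0).
    neighbors = knn_distances(tree, point, k + 1)[1:]
    return 1.0 / (sum(neighbors) / len(neighbors))


# Toy data: a tight cluster near the origin plus a sparse background.
random.seed(0)
dense = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(100)]
sparse = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(100)]
points = dense + sparse
tree = build_vp_tree(points)
densities = [knn_density(tree, p, k=10) for p in points]
```

Each query only descends into subtrees that could still contain a closer neighbor, which is what lets a tree-based search avoid comparing every pair of conformations. For MD data the Euclidean `dist` would be replaced by a structural metric such as RMSD.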
- APLoD (Recommended): Best for large-scale datasets. Groups conformations based on local density peaks.
- K-Centers: Standard geometric clustering minimizing the maximum cluster radius. Useful for ensuring uniform coverage of conformational space.
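For comparison, the geometric K-Centers approach can be sketched as a greedy farthest-point procedure. This is a minimal illustration under stated assumptions, not the package's implementation: `k_centers` is a hypothetical name, the first center is chosen arbitrarily (index 0), and plain Euclidean distance stands in for a structural metric such as RMSD.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_centers(points, n_clusters, first=0):
    """Greedy farthest-point K-Centers; returns (center indices, assignments)."""
    centers = [first]
    # Distance from every point to its closest center chosen so far.
    closest = [dist(p, points[first]) for p in points]
    assignments = [0] * len(points)
    while len(centers) < n_clusters:
        # The next center is the point farthest from all existing centers.
        new = max(range(len(points)), key=closest.__getitem__)
        centers.append(new)
        for i, p in enumerate(points):
            d = dist(p, points[new])
            if d < closest[i]:
                closest[i] = d
                assignments[i] = len(centers) - 1
    return centers, assignments


# Toy data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers, assignments = k_centers(points, n_clusters=3)
```

Because each new center is the point farthest from the existing ones, the isolated point receives its own cluster regardless of how few conformations sit near it. This is the uniform-coverage behavior described above, in contrast to a density-based partition.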
Once microstates are identified, HKDataMiner provides algorithms to lump them into kinetically metastable macrostates:
- PCCA (Perron Cluster Cluster Analysis): Spectral method based on the dominant eigenvectors of the transition probability matrix.
- PCCA+: Robust extension of PCCA with fuzzy memberships.
- Spectral Clustering: Graph-based clustering.
- Ward: Hierarchical clustering.
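To make the goal of lumping concrete, the sketch below estimates a transition probability matrix from a discrete trajectory and compares metastability before and after lumping. It does not implement PCCA or any of the methods listed above. The trajectory, the microstate-to-macrostate mapping, and the function names are hypothetical, and metastability is taken here as the trace of the transition matrix, a commonly used definition that may differ from the metric HKDataMiner reports.

```python
def transition_matrix(traj, n_states, lag=1):
    """Row-normalized transition probabilities from one discrete trajectory."""
    counts = [[0] * n_states for _ in range(n_states)]
    # Count transitions between frames separated by the lag time.
    for a, b in zip(traj[:-lag], traj[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def metastability(matrix):
    """Sum of self-transition probabilities (trace of the transition matrix)."""
    return sum(matrix[i][i] for i in range(len(matrix)))


# Hypothetical microstate assignments for one trajectory (4 microstates).
micro_traj = [0, 1, 0, 1, 1, 0, 2, 3, 2, 3, 3, 2, 0, 1]
# Hypothetical lumping: microstates 0, 1 -> macrostate 0; 2, 3 -> macrostate 1.
micro_to_macro = {0: 0, 1: 0, 2: 1, 3: 1}
macro_traj = [micro_to_macro[s] for s in micro_traj]

T_micro = transition_matrix(micro_traj, n_states=4)
T_macro = transition_matrix(macro_traj, n_states=2)

# Average self-transition probability per state, before and after lumping.
micro_score = metastability(T_micro) / 4
macro_score = metastability(T_macro) / 2
```

In this toy trajectory the system hops quickly between microstates 0 and 1 (and between 2 and 3) but rarely crosses between the two groups, so the lumped states have a much higher self-transition probability. With several trajectories, transitions should be counted within each trajectory separately so that no spurious jump is recorded across trajectory boundaries.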
A specialized module for cryo-EM data distribution analysis (requires Xmipp).
We recommend using mamba (or conda) to manage the environment.
# 1. Clone the repository
git clone https://github.com/liusong299/HK_DataMiner.git
cd HK_DataMiner
# 2. Create the environment (Python 3.10)
mamba env create -f environment.yml
mamba activate hkdataminer-py310
# 3. Install in editable mode
pip install -e .
We provide a one-line command to run a full tutorial on the Alanine Dipeptide dataset. This workflow unpacks data, performs clustering, builds an MSM via lumping, and plots the results.
hkdm tutorial
Results will be saved in _tutorial_run/Tutorial/, including:
- assignments_*.txt: Microstate assignments for every frame.
- cluster_centers.pdb: Structures of cluster centers.
- PCCA_*_Matrix.png: Transition probability matrix of the macrostates.
- *_Metastability_Modularity.txt: MSM quality metrics.
HKDataMiner provides a unified command-line interface hkdm.
# Cluster with APLoD
hkdm cluster aplod \
--trajlist trajlist.txt \
--atomlist atom_indices.txt \
--topology native.pdb \
--rho-cutoff 1.0 \
--delta-cutoff 1.0 \
--n-neighbors 100 \
--output-dir ./results
# Cluster with K-Centers
hkdm cluster kcenters \
--trajlist trajlist.txt \
--atomlist atom_indices.txt \
--topology native.pdb \
--n-clusters 100 \
--output-dir ./results
# Lump microstates into macrostates with PCCA
hkdm lump pcca \
--assignments ./results/assignments_kcenters_n_100.txt \
--traj-len ./results/traj_len.txt \
--n-macro 4 \
--homedir ./results
If you use APLoD or HKDataMiner in your research, please cite:
@article{liu2017adaptive,
title={Adaptive partitioning by local density-peaks: An efficient density-based clustering algorithm for analyzing molecular dynamics trajectories},
author={Liu, Song and Zhu, Lizhe and Sheong, Fu Kit and Wang, Wei and Huang, Xuhui},
journal={Journal of Computational Chemistry},
volume={38},
number={3},
pages={152--160},
year={2017},
publisher={Wiley Online Library}
}
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
- Prof. Xuhui Huang - Project Leader - University of Wisconsin–Madison (Previously HKUST)
- Dr. Song Liu - Developer - song.liu@connect.ust.hk