This repository provides tools for GPU-accelerated Leiden community detection using the RAPIDS package. It focuses on clustering encoded vectors from high-dimensional data, with specific scripts tailored for use on Jean-Zay.
Set up the environment with the following commands:
conda create -n cuGraph_Leiden python=3.10
conda activate cuGraph_Leiden
conda install -c rapidsai -c nvidia -c conda-forge cudf cugraph cuml
Install additional tools for development and visualization:
conda install anaconda::jupyter
conda install anaconda::seaborn
On Jean-Zay, load the RAPIDS module matching the target GPU:
- For V100 GPU:
module load rapids/24.04
- For A100 GPU:
module load cpuarch/amd rapids/24.04
The script LeidenCom_accGPU_job_array.py computes the Leiden partitions and explores hyperparameter combinations through job array submission. It performs the following steps:
- Tile Import: Import the projections to be clustered.
- Graph Creation:
Build an undirected graph based on the K-nearest neighbors (KNN) of encoded vectors.
Parameters:
D: Number of dimensions of the encoded vectors (note: the encoded vectors must occupy the first D columns of the input matrix).
K: Number of neighbors (set in the configuration file passed via --conf_files).
- Leiden Partitioning:
Parameters:
res_leiden: Resolution for the Leiden algorithm.
max_iter: Maximum number of iterations.
- Metric Computation: Assess clustering quality.
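The graph creation step above can be illustrated with a small CPU-only sketch. It is not the code of LeidenCom_accGPU_job_array.py, which presumably relies on RAPIDS for the same logic on GPU. The function name `knn_edge_list` and the toy data are hypothetical. The sketch only shows how the first D columns are used to build an undirected KNN edge list.

```python
import math


def knn_edge_list(rows, D, K):
    """Return the undirected KNN edges (i, j), i < j, using the first D columns."""
    # Keep only the encoded vector, i.e. the first D columns of each row.
    vectors = [tuple(float(v) for v in row[:D]) for row in rows]
    edges = set()
    for i, vi in enumerate(vectors):
        # Euclidean distance from point i to every other point.
        dists = [(math.dist(vi, vj), j) for j, vj in enumerate(vectors) if j != i]
        # Connect i to its K nearest neighbors; storing (min, max) makes the edge undirected.
        for _, j in sorted(dists)[:K]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy input: the second column is not part of the encoded vector when D=1.
rows = [[0.0, 100.0], [1.0, -50.0], [10.0, 100.0], [11.0, -50.0]]
edges = knn_edge_list(rows, D=1, K=1)
print(edges)  # [(0, 1), (2, 3)]
```

In the GPU workflow, an edge list of this kind would then be partitioned with the Leiden algorithm using res_leiden and max_iter.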
The main function saves:
- Partitions:
{outdir_exp}/partitions_{r}.csv
- Centroids:
{outdir_exp}/centroid_{r}.csv
- Clustering Metrics:
{outdir_exp}/clustering_scores_{r}.csv
Note: outdir_exp is a command-line argument.
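As an illustration of what a centroid file could contain, the sketch below averages the encoded vectors of each community. This assumes that a centroid is the per-dimension mean of a community's vectors, which the README does not state. The function name `community_centroids` is hypothetical.

```python
from collections import defaultdict


def community_centroids(vectors, labels):
    """Return {community label: mean vector} (assumed definition of a centroid)."""
    groups = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        groups[lab].append(vec)
    # zip(*vecs) iterates over dimensions, so each mean is taken column by column.
    return {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in groups.items()
    }


vectors = [[0.0, 0.0], [2.0, 4.0], [10.0, 10.0]]
labels = [0, 0, 1]
centroids = community_centroids(vectors, labels)
print(centroids)  # {0: [1.0, 2.0], 1: [10.0, 10.0]}
```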
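SLURM job arrays parallelize the hyperparameter search: each array task processes one row of the configuration file. Specify the configuration file path with --conf_files and the index of the row to process with --conf_nb (typically $SLURM_ARRAY_TASK_ID). The configuration file has the following format: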
| | K | res_leiden | max_iter |
|---|---|---|---|
| 0 | 75 | 0.05 | 100000 |
| 1 | 75 | 0.125 | 100000 |
| 2 | 75 | 0.25 | 100000 |
| 3 | 75 | 0.5 | 100000 |
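Selecting the hyperparameters for one array task could look like the sketch below. It assumes the configuration file is a CSV whose first, unnamed column is a row index and that --conf_nb selects a row by position. Neither detail is confirmed by this README, and the helper name `read_conf_row` is hypothetical.

```python
import csv
import io

# Stand-in for the content of a configuration file (assumed CSV layout).
CONF_TEXT = """,K,res_leiden,max_iter
0,75,0.05,100000
1,75,0.125,100000
2,75,0.25,100000
3,75,0.5,100000
"""


def read_conf_row(handle, conf_nb):
    """Return the hyperparameters stored in row number conf_nb (0-based)."""
    rows = list(csv.DictReader(handle))
    row = rows[conf_nb]
    return {
        "K": int(row["K"]),
        "res_leiden": float(row["res_leiden"]),
        "max_iter": int(row["max_iter"]),
    }


# With a real file: open("conf_job_array_random_bt.csv", newline="") instead of StringIO.
params = read_conf_row(io.StringIO(CONF_TEXT), conf_nb=2)
print(params)  # {'K': 75, 'res_leiden': 0.25, 'max_iter': 100000}
```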
To compute Leiden partitions for a random subset of Barlow Twins encoded vectors (method == bt_random_tiles):
python LeidenCom_accGPU_job_array.py \
--exp_dir_bt /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj \
--method bt_random_tiles \
--BT_proj bt_proj_for_leiden_search.csv \
--outdir LeidenCom \
--conf_files conf_job_array_random_bt.csv \
--D 128 \
--conf_nb $SLURM_ARRAY_TASK_ID
Output directory: /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj/LeidenCom.
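The relation between the arguments and the output directory can be sketched with argparse. The argument names are taken from the command above, but the types and the rule "output directory = exp_dir_bt joined with outdir" are assumptions inferred from the stated output path, not taken from the script.

```python
import argparse
from pathlib import PurePosixPath


def build_parser():
    """Parser mirroring the flags used in the command (types are assumptions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--exp_dir_bt")
    parser.add_argument("--method")
    parser.add_argument("--BT_proj")
    parser.add_argument("--outdir")
    parser.add_argument("--conf_files")
    parser.add_argument("--D", type=int)
    parser.add_argument("--conf_nb", type=int)
    return parser


args = build_parser().parse_args([
    "--exp_dir_bt", "/home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj",
    "--method", "bt_random_tiles",
    "--BT_proj", "bt_proj_for_leiden_search.csv",
    "--outdir", "LeidenCom",
    "--conf_files", "conf_job_array_random_bt.csv",
    "--D", "128",
    "--conf_nb", "0",  # stands in for $SLURM_ARRAY_TASK_ID
])

# Assumed rule: results are written under <exp_dir_bt>/<outdir>.
output_dir = str(PurePosixPath(args.exp_dir_bt) / args.outdir)
print(output_dir)
```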
To compute Leiden partitions for Barlow Twins encoded vectors associated with the highest attention scores (method == bt_highest_attn_scores_by_patient):
python LeidenCom_accGPU_job_array.py \
--exp_dir_bt /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_highest_attn_bt_proj \
--method bt_highest_attn_scores_by_patient \
--BT_proj barlowTwinsLNEN_LCNEC_with_attn_scores_correct_pred_select_leiden_search.csv \
--outdir LeidenCom \
--conf_files conf_job_array_attn_scores_bt.csv \
--D 128 \
--conf_nb $SLURM_ARRAY_TASK_ID
Output directory: /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_highest_attn_bt_proj/LeidenCom.
The script AppLeidenAccordingKNN.py approximates Leiden partitions for additional tiles by leveraging the results from previously computed partitions.
Arguments:
- proj: Matrix of encoded vectors for which partitions need to be approximated.
- conf: Configuration file containing WSI names for parallel processing.
- partitioning_train: Leiden partitions of the subset used in the initial computation.
- outfolder: Output directory.
- max_distance_comp: Number of tiles processed at once (controls memory usage).
- D: Dimension of encoded vectors.
- task_id: Index of the configuration file row to process ($SLURM_ARRAY_TASK_ID).
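One plausible way to approximate partitions from existing ones is a nearest-neighbor vote, sketched below on CPU. This is an assumption suggested by the script name, not a description of AppLeidenAccordingKNN.py. The function name `approximate_partitions`, the parameter `k`, and the majority-vote rule are hypothetical. `chunk_size` plays the role of max_distance_comp by limiting how many tiles are handled per batch.

```python
import math
from collections import Counter


def approximate_partitions(new_vectors, train_vectors, train_labels, D, k=3, chunk_size=2):
    """Assign each new tile the majority label of its k nearest training tiles."""
    labels = []
    # Process the new tiles in batches of chunk_size to bound the work per step.
    for start in range(0, len(new_vectors), chunk_size):
        chunk = new_vectors[start:start + chunk_size]
        for vec in chunk:
            # Distances are computed on the first D columns only.
            dists = sorted(
                (math.dist(vec[:D], train[:D]), idx)
                for idx, train in enumerate(train_vectors)
            )
            votes = Counter(train_labels[idx] for _, idx in dists[:k])
            labels.append(votes.most_common(1)[0][0])
    return labels


train_vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
train_labels = [0, 0, 0, 1, 1, 1]
new_vectors = [[0.5, 0.5], [10.5, 10.5], [1, 1]]
approx = approximate_partitions(new_vectors, train_vectors, train_labels, D=2)
print(approx)  # [0, 1, 0]
```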
Approximate partitions with:
sbatch RunAppLeidenKNNJobArray.sh
All approximated Leiden partitions used by the random forest models that conclude the self-supervised branch are stored in:
/home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj/LeidenComApproximated_75_res3_r1
The following methods and arguments are not used in the LungNENomics paper:
- Methods:
bt_highest_attn_scores_global, spca_random_tiles
- Arguments:
exp_dir_roformer, folder_to_attn_maps, exp_dir_spca, SPCA_proj
- Functions:
import_bt_data_global_attn_scores, import_spca_data