GPU-Accelerated Leiden Community Detection

This repository provides tools for GPU-accelerated Leiden community detection built on the NVIDIA RAPIDS libraries (cuDF, cuGraph, cuML). It focuses on clustering high-dimensional encoded vectors (tile projections), with scripts tailored for use on Jean-Zay.


Installation

Mandatory Packages

Set up the environment with the following commands:

conda create -n cuGraph_Leiden python=3.10
conda activate cuGraph_Leiden
conda install -c rapidsai -c nvidia -c conda-forge cudf cugraph cuml

Development Packages

Install additional tools for development and visualization:

conda install anaconda::jupyter
conda install anaconda::seaborn
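
As a quick sanity check (a minimal sketch, not part of the repository scripts), the RAPIDS stack can be verified from a Python session on a GPU node:

import cudf
import cugraph
import cuml

# Print the installed RAPIDS versions
print(cudf.__version__, cugraph.__version__, cuml.__version__)

# A tiny GPU DataFrame confirms that the CUDA runtime is reachable
print(cudf.DataFrame({"x": [1, 2, 3]})["x"].sum())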

Running Leiden Community Detection on Jean-Zay

Example SLURM Scripts

  • For V100 GPU:
    module load rapids/24.04
  • For A100 GPU:
    module load cpuarch/amd rapids/24.04

Computation of Leiden Partitions

Leiden partition computation is handled by the script LeidenCom_accGPU_job_array.py, which explores hyperparameter combinations through SLURM job array submission. The script performs the following steps (a minimal sketch of steps 2 and 3 follows the list):

  1. Tile Importation: Import the projections to be clustered.
  2. Graph Creation: Build an undirected graph based on the K-nearest neighbors (KNN) of encoded vectors.
    Parameters:
    • D: Number of dimensions of the encoded vectors (Note: Encoded vectors must occupy the first D columns of the input matrix).
    • K: Number of neighbors (specified in the configuration file via --conf_files).
  3. Leiden Partitioning: Partition the KNN graph with the Leiden algorithm.
    Parameters:
    • res_leiden: Resolution for the Leiden algorithm.
    • max_iter: Maximum number of iterations.
  4. Metric Computation: Assess clustering quality.
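
The snippet below is a minimal sketch of steps 2 and 3 with the RAPIDS API; proj, D, K, res_leiden and max_iter are placeholders for values the actual script obtains from its arguments and configuration file, and the code does not reproduce its exact logic:

import numpy as np
import cudf
import cugraph
from cuml.neighbors import NearestNeighbors

# proj: cudf.DataFrame whose first D columns hold the encoded vectors
X = proj.iloc[:, :D]
n = len(X)

# Step 2: KNN search on the GPU (K + 1 because each point is its own nearest neighbour)
knn = NearestNeighbors(n_neighbors=K + 1)
knn.fit(X)
_, indices = knn.kneighbors(X)

# Build an undirected KNN graph from the neighbour lists
idx = indices.to_pandas().values            # shape (n, K + 1); column 0 is the point itself
edges = cudf.DataFrame({
    "src": np.repeat(np.arange(n), K),
    "dst": idx[:, 1:].ravel(),
})
G = cugraph.Graph(directed=False)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Step 3: Leiden partitioning
parts, modularity = cugraph.leiden(G, max_iter=max_iter, resolution=res_leiden)
# parts is a cudf.DataFrame with 'vertex' and 'partition' columns

The resulting partition table is the basis for the partitions, centroids and metrics listed under Outputs below.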

Outputs

The main function saves:

  • Partitions: {outdir_exp}/partitions_{r}.csv
  • Centroids: {outdir_exp}/centroid_{r}.csv
  • Clustering Metrics: {outdir_exp}/clustering_scores_{r}.csv

Note: outdir_exp is derived from the command-line arguments --exp_dir_bt and --outdir (see the case studies below).


Parallelization with SLURM Job Arrays

SLURM job arrays handle hyperparameter parallelization. Specify the configuration file path with --conf_files and the row to run with --conf_nb (typically set to $SLURM_ARRAY_TASK_ID). The configuration file has one row per hyperparameter combination; the first, unnamed column is the row index selected by --conf_nb:

   K  res_leiden  max_iter
0  75  0.05        100000
1  75  0.125       100000
2  75  0.25        100000
3  75  0.5         100000
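
A minimal sketch of how such a file can be consumed inside a script (the column parsing below is an assumption, not the exact code of LeidenCom_accGPU_job_array.py):

import pandas as pd

conf_nb = 2  # in practice passed as --conf_nb $SLURM_ARRAY_TASK_ID

# Whitespace-separated file; the unnamed first column is the row index
conf = pd.read_csv("conf_job_array_random_bt.csv", sep=r"\s+",
                   names=["idx", "K", "res_leiden", "max_iter"],
                   skiprows=1).set_index("idx")
row = conf.loc[conf_nb]

K = int(row["K"])
res_leiden = float(row["res_leiden"])
max_iter = int(row["max_iter"])

Each array task thus reads exactly one row, so all hyperparameter combinations run in parallel.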

Case Studies

Case 1: Random Subset Clustering (Self-Supervised Branch)

To compute Leiden partitions for a random subset of Barlow Twins encoded vectors (method == bt_random_tiles):

python LeidenCom_accGPU_job_array.py \
    --exp_dir_bt /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj \
    --method bt_random_tiles \
    --BT_proj bt_proj_for_leiden_search.csv \
    --outdir LeidenCom \
    --conf_files conf_job_array_random_bt.csv \
    --D 128 \
    --conf_nb $SLURM_ARRAY_TASK_ID

Output directory: /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj/LeidenCom.


Case 2: Clustering with Highest Attention Scores (Supervised Branch)

To compute Leiden partitions for Barlow Twins encoded vectors associated with the highest attention scores (method == bt_highest_attn_scores_by_patient):

python LeidenCom_accGPU_job_array.py \
    --exp_dir_bt /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_highest_attn_bt_proj \
    --method bt_highest_attn_scores_by_patient \
    --BT_proj barlowTwinsLNEN_LCNEC_with_attn_scores_correct_pred_select_leiden_search.csv \
    --outdir LeidenCom \
    --conf_files conf_job_array_attn_scores_bt.csv \
    --D 128 \
    --conf_nb $SLURM_ARRAY_TASK_ID

Output directory: /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_highest_attn_bt_proj/LeidenCom.


Approximation of Leiden Partitions Using KNN

The script AppLeidenAccordingKNN.py approximates Leiden partitions for additional tiles by leveraging the partitions previously computed on the training subset; a minimal sketch of the underlying nearest-neighbour assignment follows the argument list below.

Required Arguments

  • proj: Matrix of encoded vectors for which partitions need to be approximated.
  • conf: Configuration file containing WSI names for parallel processing.
  • partitioning_train: Leiden partitions of the subset used in the initial computation.
  • outfolder: Output directory.
  • max_distance_comp: Number of tiles processed at once (controls memory usage).
  • D: Dimension of encoded vectors.
  • task_id: Index of the configuration file row to process ($SLURM_ARRAY_TASK_ID).
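
A minimal sketch of the nearest-neighbour assignment idea (placeholders: train_proj and train_labels are cudf structures holding the previously clustered vectors and their Leiden partitions, new_proj the additional tiles; the actual script's chunking and I/O may differ):

import numpy as np
from cuml.neighbors import NearestNeighbors

# Index the previously clustered tiles (first D columns are the encoded vectors)
nn = NearestNeighbors(n_neighbors=1)
nn.fit(train_proj.iloc[:, :D])

approx = []
# Process new tiles in chunks of max_distance_comp to bound GPU memory
for start in range(0, len(new_proj), max_distance_comp):
    chunk = new_proj.iloc[start:start + max_distance_comp, :D]
    _, idx = nn.kneighbors(chunk)
    nearest = idx.to_pandas().values.ravel()
    approx.append(train_labels.to_pandas().values[nearest])

# One approximated Leiden partition per additional tile
approx_labels = np.concatenate(approx)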

Example SLURM Script

Approximate partitions with:

sbatch RunAppLeidenKNNJobArray.sh

The approximated Leiden partitions used by the random forest models that conclude the self-supervised branch are stored in: /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj/LeidenComApproximated_75_res3_r1


Deprecated Methods

The following methods and arguments are not used in the LungNENomics paper:

  • Methods: bt_highest_attn_scores_global, spca_random_tiles
  • Arguments: exp_dir_roformer, folder_to_attn_maps, exp_dir_spca, SPCA_proj
  • Functions: import_bt_data_global_attn_scores, import_spca_data
