GPU-Accelerated Leiden Community Detection

This repository provides tools for GPU-accelerated Leiden community detection built on the NVIDIA RAPIDS libraries (cuDF, cuGraph, cuML). It focuses on clustering high-dimensional encoded vectors (tile projections), with scripts tailored for use on Jean-Zay.


Installation

Mandatory Packages

Set up the environment with the following commands:

conda create -n cuGraph_Leiden python=3.10
conda activate cuGraph_Leiden
conda install -c rapidsai -c nvidia -c conda-forge cudf cugraph cuml

Development Packages

Install additional tools for development and visualization:

conda install anaconda::jupyter
conda install anaconda::seaborn
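
As a quick sanity check (a minimal sketch, not part of the repository scripts), the RAPIDS stack can be verified from a Python session on a GPU node:

import cudf
import cugraph
import cuml

# Print the installed RAPIDS versions
print(cudf.__version__, cugraph.__version__, cuml.__version__)

# A tiny GPU DataFrame confirms that the CUDA runtime is reachable
print(cudf.DataFrame({"x": [1, 2, 3]})["x"].sum())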

Running Leiden Community Detection on Jean-Zay

Example SLURM Scripts

  • For V100 GPU:
    module load rapids/24.04
  • For A100 GPU:
    module load cpuarch/amd rapids/24.04

Computation of Leiden Partitions

Leiden partition computation is handled by the script LeidenCom_accGPU_job_array.py, which explores hyperparameter combinations through SLURM job array submission. The script performs the following steps (a minimal sketch of steps 2 and 3 follows the list):

  1. Tile Importation: Import the projections to be clustered.
  2. Graph Creation: Build an undirected graph based on the K-nearest neighbors (KNN) of encoded vectors.
    Parameters:
    • D: Number of dimensions of the encoded vectors (Note: Encoded vectors must occupy the first D columns of the input matrix).
    • K: Number of neighbors (specified in the configuration file via --conf_files).
  3. Leiden Partitioning: Partition the KNN graph with the Leiden algorithm.
    Parameters:
    • res_leiden: Resolution for the Leiden algorithm.
    • max_iter: Maximum number of iterations.
  4. Metric Computation: Assess clustering quality.
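
The snippet below is a minimal sketch of steps 2 and 3 with the RAPIDS API; proj, D, K, res_leiden and max_iter are placeholders for values the actual script obtains from its arguments and configuration file, and the code does not reproduce its exact logic:

import numpy as np
import cudf
import cugraph
from cuml.neighbors import NearestNeighbors

# proj: cudf.DataFrame whose first D columns hold the encoded vectors
X = proj.iloc[:, :D]
n = len(X)

# Step 2: KNN search on the GPU (K + 1 because each point is its own nearest neighbour)
knn = NearestNeighbors(n_neighbors=K + 1)
knn.fit(X)
_, indices = knn.kneighbors(X)

# Build an undirected KNN graph from the neighbour lists
idx = indices.to_pandas().values            # shape (n, K + 1); column 0 is the point itself
edges = cudf.DataFrame({
    "src": np.repeat(np.arange(n), K),
    "dst": idx[:, 1:].ravel(),
})
G = cugraph.Graph(directed=False)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Step 3: Leiden partitioning
parts, modularity = cugraph.leiden(G, max_iter=max_iter, resolution=res_leiden)
# parts is a cudf.DataFrame with 'vertex' and 'partition' columns

The resulting partition table is the basis for the partitions, centroids and metrics listed under Outputs below.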

Outputs

The main function saves:

  • Partitions: {outdir_exp}/partitions_{r}.csv
  • Centroids: {outdir_exp}/centroid_{r}.csv
  • Clustering Metrics: {outdir_exp}/clustering_scores_{r}.csv

Note: outdir_exp is derived from the command-line arguments --exp_dir_bt and --outdir (see the case studies below).


Parallelization with SLURM Job Arrays

SLURM job arrays handle hyperparameter parallelization. Specify the configuration file path with --conf_files and the row to run with --conf_nb (typically set to $SLURM_ARRAY_TASK_ID). The configuration file has one row per hyperparameter combination; the first, unnamed column is the row index selected by --conf_nb:

   K  res_leiden  max_iter
0  75  0.05        100000
1  75  0.125       100000
2  75  0.25        100000
3  75  0.5         100000
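
A minimal sketch of how such a file can be consumed inside a script (the column parsing below is an assumption, not the exact code of LeidenCom_accGPU_job_array.py):

import pandas as pd

conf_nb = 2  # in practice passed as --conf_nb $SLURM_ARRAY_TASK_ID

# Whitespace-separated file; the unnamed first column is the row index
conf = pd.read_csv("conf_job_array_random_bt.csv", sep=r"\s+",
                   names=["idx", "K", "res_leiden", "max_iter"],
                   skiprows=1).set_index("idx")
row = conf.loc[conf_nb]

K = int(row["K"])
res_leiden = float(row["res_leiden"])
max_iter = int(row["max_iter"])

Each array task thus reads exactly one row, so all hyperparameter combinations run in parallel.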

Case Studies

Case 1: Random Subset Clustering (Self-Supervised Branch)

To compute Leiden partitions for a random subset of Barlow Twins encoded vectors (method == bt_random_tiles):

python LeidenCom_accGPU_job_array.py \
    --exp_dir_bt /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj \
    --method bt_random_tiles \
    --BT_proj bt_proj_for_leiden_search.csv \
    --outdir LeidenCom \
    --conf_files conf_job_array_random_bt.csv \
    --D 128 \
    --conf_nb $SLURM_ARRAY_TASK_ID

Output directory: /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj/LeidenCom.


Case 2: Clustering with Highest Attention Scores (Supervised Branch)

To compute Leiden partitions for Barlow Twins encoded vectors associated with the highest attention scores (method == bt_highest_attn_scores_by_patient):

python LeidenCom_accGPU_job_array.py \
    --exp_dir_bt /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_highest_attn_bt_proj \
    --method bt_highest_attn_scores_by_patient \
    --BT_proj barlowTwinsLNEN_LCNEC_with_attn_scores_correct_pred_select_leiden_search.csv \
    --outdir LeidenCom \
    --conf_files conf_job_array_attn_scores_bt.csv \
    --D 128 \
    --conf_nb $SLURM_ARRAY_TASK_ID

Output directory: /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_highest_attn_bt_proj/LeidenCom.


Approximation of Leiden Partitions Using KNN

The script AppLeidenAccordingKNN.py approximates Leiden partitions for additional tiles by leveraging the partitions previously computed on the training subset; a minimal sketch of the underlying nearest-neighbour assignment follows the argument list below.

Required Arguments

  • proj: Matrix of encoded vectors for which partitions need to be approximated.
  • conf: Configuration file containing WSI names for parallel processing.
  • partitioning_train: Leiden partitions of the subset used in the initial computation.
  • outfolder: Output directory.
  • max_distance_comp: Number of tiles processed at once (controls memory usage).
  • D: Dimension of encoded vectors.
  • task_id: Index of the configuration file row to process ($SLURM_ARRAY_TASK_ID).
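
A minimal sketch of the nearest-neighbour assignment idea (placeholders: train_proj and train_labels are cudf structures holding the previously clustered vectors and their Leiden partitions, new_proj the additional tiles; the actual script's chunking and I/O may differ):

import numpy as np
from cuml.neighbors import NearestNeighbors

# Index the previously clustered tiles (first D columns are the encoded vectors)
nn = NearestNeighbors(n_neighbors=1)
nn.fit(train_proj.iloc[:, :D])

approx = []
# Process new tiles in chunks of max_distance_comp to bound GPU memory
for start in range(0, len(new_proj), max_distance_comp):
    chunk = new_proj.iloc[start:start + max_distance_comp, :D]
    _, idx = nn.kneighbors(chunk)
    nearest = idx.to_pandas().values.ravel()
    approx.append(train_labels.to_pandas().values[nearest])

# One approximated Leiden partition per additional tile
approx_labels = np.concatenate(approx)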

Example SLURM Script

Approximate partitions with:

sbatch RunAppLeidenKNNJobArray.sh

The approximated Leiden partitions used by the random forest models that conclude the self-supervised branch are stored in: /home/mathiane/LNENWork/LNEN_LCNEC_Molecular_barlow_twin/leinden_random_bt_proj/LeidenComApproximated_75_res3_r1


Deprecated Methods

The following methods and arguments are not used in the LungNENomics paper:

  • Methods: bt_highest_attn_scores_global, spca_random_tiles
  • Arguments: exp_dir_roformer, folder_to_attn_maps, exp_dir_spca, SPCA_proj
  • Functions: import_bt_data_global_attn_scores, import_spca_data
