This repository contains the slowdown prediction module of Echo: Simulating Distributed Training at Scale. The module predicts the slowdown of GPU kernels when they overlap with other kernels during distributed training.
The module consists of three core components:
- Kernel Metric Collection
  - Utilizes NVIDIA Nsight Compute and Nsight Systems to profile GPU kernels
  - Captures detailed execution metrics including:
    - Kernel duration
    - Memory bandwidth utilization
    - Compute throughput
    - Instruction mix statistics
  - Generates baseline performance profiles for isolated kernel execution
  - Outputs structured JSON files containing raw kernel metrics
- Slowdown Collection
  - Analyzes kernel behavior under various overlap scenarios
  - Measures actual slowdown factors through controlled experiments
  - Collects data on:
    - Resource contention patterns
    - Memory access interference
    - Compute unit saturation
  - Generates ground truth data for model training and validation
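As a rough illustration of what a slowdown factor measures, the sketch below assumes it is the ratio of a kernel's duration under overlap to its duration in isolation. That definition, the function name, and the sample durations are assumptions made for illustration and are not taken from this repository.

```python
def slowdown_factor(isolated_us: float, overlapped_us: float) -> float:
    """Ratio of overlapped to isolated kernel duration (assumed definition)."""
    if isolated_us <= 0:
        raise ValueError("isolated duration must be positive")
    return overlapped_us / isolated_us


# Hypothetical durations in microseconds, for illustration only.
factor = slowdown_factor(isolated_us=120.0, overlapped_us=150.0)
print(f"slowdown factor: {factor:.2f}")
```

Under this definition, a factor of 1.0 means the overlap had no measurable effect, and larger values mean the kernel ran proportionally longer.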
- Training & Testing
  - Implements machine learning models for slowdown prediction
  - Features include:
    - Multi-layer perceptron (MLP) regression
    - Gradient boosting decision trees
    - Feature importance analysis
  - Provides comprehensive evaluation metrics:
    - Mean absolute percentage error (MAPE)
    - R-squared scores
    - Prediction error distribution
  - Outputs trained models and prediction results for integration with the Echo simulator
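The two headline evaluation metrics named above can be computed in a few lines. This is a minimal sketch using their standard textbook definitions; it is not the repository's evaluation code, and the sample values are hypothetical. MAPE requires every measured value to be nonzero, which should hold for slowdown factors.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted slowdown factors.
measured = [1.0, 1.25, 1.5, 2.0]
predicted = [1.1, 1.2, 1.5, 1.9]
print(f"MAPE: {mape(measured, predicted):.2f}%")
print(f"R^2:  {r_squared(measured, predicted):.3f}")
```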
Running the module requires:

- NVIDIA GPU with CUDA support
- NVIDIA Nsight Compute CLI 2024.3.0.0 or later
- NVIDIA Nsight Systems 2024.4.2.133 or later
- Conda package manager
- Clone the Git repository:

      git clone https://github.com/NetX-lab/Echo-slowdown.git
      cd Echo-slowdown

- Set up the Conda environment:

      conda env create -f environment.yaml
      conda activate simulator_echo

Update the configuration by running the Python file:

    python update_configs.py

This script will automatically detect the paths for nsys, python, and ncu using the which command and update the global_config.json files in the following directories:

- kernel_metric/input/global_config.json
- merge/input/global_config.json
- slowdown_collection/input/global_config.json

Additionally, it will check the installed CUDA version using PyTorch and update the cuda_version_check field in the configuration files.
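The path-detection step described above can be approximated as follows. This is a sketch of the general idea, not the contents of update_configs.py: it uses `shutil.which` as a portable stand-in for the which command, and the `<tool>_path` key names are hypothetical, since the actual schema of global_config.json is not shown here. Run as a script, it only prints what it finds and does not modify any file.

```python
import json
import shutil
from pathlib import Path

# Config locations listed above, relative to the repository root.
CONFIG_PATHS = [
    "kernel_metric/input/global_config.json",
    "merge/input/global_config.json",
    "slowdown_collection/input/global_config.json",
]


def detect_tool_paths(tools=("nsys", "python", "ncu")):
    """Map each tool name to its absolute path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def update_config(config_file, tool_paths):
    """Merge detected paths into one JSON config file (hypothetical key names)."""
    path = Path(config_file)
    config = json.loads(path.read_text()) if path.exists() else {}
    for tool, location in tool_paths.items():
        if location is not None:
            config[f"{tool}_path"] = location  # key name is an assumption
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4))
    return config


if __name__ == "__main__":
    # Report only; call update_config(p, detected) for p in CONFIG_PATHS to write.
    detected = detect_tool_paths()
    for name, location in detected.items():
        print(f"{name}: {location or 'not found on PATH'}")
```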
Our script is tested on NVIDIA Nsight Compute CLI 2024.3.0.0 and NVIDIA Nsight Systems 2024.4.2.133.
Run the complete pipeline:

    sh ./run_all.sh

You can run individual modules separately:

- Collect kernel metrics:

      cd kernel_metric
      python main.py

- Merge and preprocess data:

      cd merge
      python main.py

- Train and evaluate slowdown prediction:

      cd slowdown_collection
      python main.py

The pipeline generates the following outputs:
    echo_slowdown/
    ├── training_testing/
    │   └── output/
    │       ├── train_dataset.csv                           # Processed training dataset that suits the input format of the XGBoost model
    │       ├── xgb_model.json                              # Trained XGBoost model
    │       └── prediction/                                 # Model predictions
    │           ├── output_df_merged_features.csv           # Simplified output of slowdown predictions
    │           ├── output_full_df_merged_features.csv      # Detailed output of slowdown predictions
    │           ├── output_metrics.txt                      # Evaluation metrics such as MAE, MSE and RMSE
    │           └── feature_importance_merged_features.png  # Plot of feature importance
    ├── kernel_metric/
    │   └── output/
    │       └── kernel_metric_output.csv                    # Profiled kernel metrics
    ├── slowdown_collection/
    │   └── output/
    │       └── slowdown_stats_output_device_0.xlsx         # Profiled slowdown statistics
    └── merge/
        └── output/
            └── merged_features.csv                         # Combined kernel metrics and slowdown statistics, aligned by kernel name
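To make "aligned by kernel name" concrete, the sketch below joins two sets of rows on a shared kernel-name column. It is an illustration only: the column names, the sample rows, and the choice of an inner join (dropping kernels that appear in only one input) are assumptions, and the merge module may behave differently.

```python
def merge_by_kernel_name(metric_rows, slowdown_rows, key="kernel_name"):
    """Inner-join two lists of row dicts on a shared kernel-name column."""
    slowdown_by_name = {row[key]: row for row in slowdown_rows}
    merged = []
    for row in metric_rows:
        match = slowdown_by_name.get(row[key])
        if match is not None:  # keep only kernels present in both inputs
            merged.append({**row, **match})
    return merged


# Hypothetical rows; real column names in the CSV/XLSX outputs may differ.
metrics = [
    {"kernel_name": "gemm_a", "duration_us": 120.0},
    {"kernel_name": "reduce_b", "duration_us": 40.0},
]
slowdowns = [{"kernel_name": "gemm_a", "slowdown": 1.25}]

for row in merge_by_kernel_name(metrics, slowdowns):
    print(row)
```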
After running the pipeline, you should expect:
- Trained machine learning models for slowdown prediction
- Prediction accuracy reports for different hardware configurations
- Detailed performance metrics for overlapping kernel scenarios
- Processed datasets ready for simulation integration
Common issues and fixes:

- Nsight Tools Not Found
  - Ensure Nsight Compute and Nsight Systems are properly installed
  - Verify that they are in your system PATH
- CUDA Version Mismatch
  - Check that the installed CUDA version matches your GPU driver
  - Update PyTorch to a compatible version if needed
- Permission Errors
  - Run scripts with appropriate permissions
  - Ensure output directories are writable
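Two of the checks above (tools on PATH, writable output directories) are easy to automate before a long profiling run. The following is a hypothetical helper, not a script shipped with this repository; the directory list is taken from the output layout described earlier and assumes the script is run from the repository root.

```python
import os
import shutil


def preflight(tools=("ncu", "nsys"), output_dirs=()):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    for directory in output_dirs:
        if not os.path.isdir(directory):
            problems.append(f"{directory} does not exist")
        elif not os.access(directory, os.W_OK):
            problems.append(f"{directory} is not writable")
    return problems


if __name__ == "__main__":
    dirs = ["kernel_metric/output", "merge/output", "slowdown_collection/output"]
    for problem in preflight(output_dirs=dirs) or ["no problems detected"]:
        print(problem)
```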
If you use this module in your research, please cite our paper:
    @article{echo2024,
      title={Echo: Simulating Distributed Training At Scale},
      author={Yicheng Feng and Yuetao Chen and Kaiwen Chen and Jingzong Li and Tianyuan Wu and Peng Cheng and Chuan Wu and Wei Wang and Tsung-Yi Ho and Hong Xu},
      journal={arXiv preprint arXiv:2412.12487},
      year={2024}
    }

Please email Yicheng Feng (yichengfeng@link.cuhk.edu.hk) or Kin Hang Sew (ericskh@link.cuhk.edu.hk) if you have any questions.
This project is licensed under the MIT license - see the LICENSE file for details.