Echo Slowdown Prediction Module

This repository contains the slowdown prediction module of Echo: Simulating Distributed Training at Scale. The module predicts the slowdown of GPU kernels when they overlap with other kernels during distributed training.

Project Overview

The Echo Slowdown Prediction Module is designed to predict the performance impact of kernel overlaps in distributed training scenarios. The system consists of three core components:

Kernel Metric Collection
- Utilizes NVIDIA Nsight Compute and Nsight Systems to profile GPU kernels
- Captures detailed execution metrics including:
  - Kernel duration
  - Memory bandwidth utilization
  - Compute throughput
  - Instruction mix statistics
- Generates baseline performance profiles for isolated kernel execution
- Outputs structured JSON files containing raw kernel metrics
Slowdown Collection
- Analyzes kernel behavior under various overlap scenarios
- Measures actual slowdown factors through controlled experiments
- Collects data on:
  - Resource contention patterns
  - Memory access interference
  - Compute unit saturation
- Generates ground truth data for model training and validation
Training & Testing
- Implements machine learning models for slowdown prediction
- Features include:
  - Multi-layer perceptron (MLP) regression
  - Gradient boosting decision trees
  - Feature importance analysis
- Provides comprehensive evaluation metrics:
  - Mean absolute percentage error (MAPE)
  - R-squared scores
  - Prediction error distribution
- Outputs trained models and prediction results for integration with the Echo simulator

Installation

Prerequisites

NVIDIA GPU with CUDA support
NVIDIA Nsight Compute CLI 2024.3.0.0 or later
NVIDIA Nsight Systems 2024.4.2.133 or later
Conda package manager

Setup Instructions

Clone git repository

git clone https://github.com/NetX-lab/Echo-slowdown.git
cd Echo-slowdown

Setup Conda environment

conda env create -f environment.yaml
conda activate simulator_echo

Configuration

Update the configuration by running the Python file:

python update_configs.py

This script will automatically detect the paths for nsys, python, and ncu using the which command and update the global_config.json files in the following directories:

kernel_metric/input/global_config.json
merge/input/global_config.json
slowdown_collection/input/global_config.json

Additionally, it will check the installed CUDA version using PyTorch and update the cuda_version_check field in the configuration files.

Our script is tested on NVIDIA Nsight Compute CLI 2024.3.0.0 and NVIDIA Nsight Systems 2024.4.2.133.

Usage

Basic Execution

Run the complete pipeline:

sh ./run_all.sh

Advanced Usage

You can run individual modules separately:

Collect kernel metrics:
```
cd kernel_metric
python main.py
```
Merge and preprocess data:
```
cd merge
python main.py
```
Train and evaluate slowdown prediction:
```
cd slowdown_collection
python main.py  
```

Output

The pipeline generates the following outputs:

echo_slowdown/
├── training_testing/
│   └── output/
│       ├── train_dataset.csv         # Processed training dataset that suits the input format of the XGBoost model
│       ├── xgb_model.json    # Trained XGBoost model
│       └── prediction/     # Model predictions
│           ├── output_df_merged_features.csv         # Simplified output of slowdown predictions
│           ├── output_full_df_merged_features.csv        # Detailed output of slowdown predictions
│           ├── output_metrics.txt         # Evaluation metrics such as MAE, MSE and RMSE
│           └── feature_importance_merged_features.png         # Plot of feature importance
├── kernel_metric/
│   └── output/
│       └── kernel_metric_output.csv    # Profiled kernel metrics
├── slowdown_collection/
│   └── output/
│       └── slowdown_stats_output_device_0.xlsx    # Profiled slowdown statistics
├── merge/
│   └── output/
│       └── merged_features.csv       # Combined kernel metrics and slowdown statistics, aligned by kernel name

Expected Results

After running the pipeline, you should expect:

Trained machine learning models for slowdown prediction
Prediction accuracy reports for different hardware configurations
Detailed performance metrics for overlapping kernel scenarios
Processed datasets ready for simulation integration

Troubleshooting

Nsight Tools Not Found
- Ensure Nsight Compute and Systems are properly installed
- Verify they are in your system PATH
CUDA Version Mismatch
- Check installed CUDA version matches your GPU driver
- Update PyTorch to compatible version if needed
Permission Errors
- Run scripts with appropriate permissions
- Ensure output directories are writable

Citation

If you use this module in your research, please cite our paper:

@article{echo2024,
  title={Echo: Simulating Distributed Training At Scale},
  author={Yicheng Feng, Yuetao Chen, Kaiwen Chen, Jingzong Li, Tianyuan Wu, Peng Cheng, Chuan Wu, Wei Wang, Tsung-Yi Ho, Hong Xu},
  journal={arXiv preprint arXiv:2412.12487},
  year={2024}
}

Contact

Please email Yicheng Feng (yichengfeng@link.cuhk.edu.hk) or Kin Hang Sew (ericskh@link.cuhk.edu.hk) if you have any questions.

License

This project is licensed under the MIT license - see the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Echo Slowdown Prediction Module

Project Overview

Installation

Prerequisites

Setup Instructions

Configuration

Usage

Basic Execution

Advanced Usage

Output

Expected Results

Troubleshooting

Citation

Contact

License

About

Uh oh!

Releases

Packages

Contributors 3

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
kernel_metric		kernel_metric
merge		merge
slowdown_collection		slowdown_collection
training_testing		training_testing
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yaml		environment.yaml
run_all.sh		run_all.sh
update_configs.py		update_configs.py

License

NetX-lab/Echo-slowdown

Folders and files

Latest commit

History

Repository files navigation

Echo Slowdown Prediction Module

Project Overview

Installation

Prerequisites

Setup Instructions

Configuration

Usage

Basic Execution

Advanced Usage

Output

Expected Results

Troubleshooting

Citation

Contact

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Uh oh!

Languages

Packages