gpumon is a Python-based GPU monitoring tool that tracks the available memory on GPUs and sends email notifications when the free memory exceeds a specified threshold. This tool is particularly useful for machine learning and deep learning practitioners who need to optimize GPU usage in shared environments.
- Monitors GPU memory usage using `nvidia-smi`
- Sends email notifications when free memory exceeds a customizable threshold
- Configurable check intervals
- Logs GPU memory statistics to a log file
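As a rough illustration of the first feature, free memory per GPU can be read with `nvidia-smi`'s CSV query mode. This is a minimal sketch, not necessarily how `gpumon.py` does it: the function names `parse_free_memory` and `query_free_memory` are hypothetical, and the parser assumes every GPU reports a plain integer (it does not handle `[N/A]` values).

```python
import subprocess

# nvidia-smi's CSV query mode: one "index, free_mib" line per GPU,
# with no header row and no unit suffix.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_free_memory(output):
    """Turn 'index, free_mib' lines into a {gpu_index: free_mib} dict."""
    free = {}
    for line in output.strip().splitlines():
        if not line.strip():
            continue
        index, free_mib = (field.strip() for field in line.split(","))
        free[int(index)] = int(free_mib)
    return free


def query_free_memory():
    """Run nvidia-smi and return free memory in MiB per GPU (hypothetical helper)."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_free_memory(result.stdout)
```

Keeping the parsing separate from the subprocess call means it can be checked on a machine without a GPU, for example `parse_free_memory("0, 24000\n1, 512\n")`.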
- Python 3.8 or higher
- NVIDIA GPUs with `nvidia-smi` installed
- SMTP server credentials for email notifications
- Clone this repository:
  `git clone https://github.com/yourusername/gpumon.git`
  `cd gpumon`
- Install the required Python packages:
  `pip install -r requirements.txt`
- Set up the environment variables for email notifications:
  `export SENDER_EMAIL=your_email@gmail.com`
  `export RECEIVER_EMAIL=receiver_email@gmail.com`
  `export EMAIL_PASSWORD=your_password`
- Adjust the configuration in `gpumon.py` as needed:
  - `MEMORY_THRESHOLD`: The free-memory threshold for notifications (in MiB). Default is 20480 (20 GiB).
  - `CHECK_INTERVAL`: The interval between checks (in seconds). Default is 60.
  - `TIMEZONE`: The timezone for logging timestamps. Default is `Asia/Taipei`.
- Run the script:
  `python gpumon.py`
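The configuration and environment variables above could fit together roughly as follows. This is a sketch under stated assumptions rather than the actual contents of `gpumon.py`: the helper names are hypothetical, and the SMTP host and port (`smtp.gmail.com`, 465) are assumed only because the example addresses are Gmail accounts.

```python
import os
import smtplib
from email.message import EmailMessage

MEMORY_THRESHOLD = 20480  # MiB, the documented default
CHECK_INTERVAL = 60       # seconds, the documented default


def gpus_over_threshold(free_by_gpu, threshold=MEMORY_THRESHOLD):
    """Keep only the GPUs whose free memory strictly exceeds the threshold."""
    return {gpu: free for gpu, free in free_by_gpu.items() if free > threshold}


def build_message(available, sender, receiver):
    """Compose the notification email (hypothetical subject and body wording)."""
    msg = EmailMessage()
    msg["Subject"] = "gpumon: GPU memory available"
    msg["From"] = sender
    msg["To"] = receiver
    lines = [f"GPU {gpu}: {free} MiB free" for gpu, free in sorted(available.items())]
    msg.set_content("\n".join(lines))
    return msg


def send_notification(available, host="smtp.gmail.com", port=465):
    """Send the email using credentials from the environment.

    The host and port are assumptions; adjust them for your SMTP provider.
    """
    sender = os.environ["SENDER_EMAIL"]
    receiver = os.environ["RECEIVER_EMAIL"]
    password = os.environ["EMAIL_PASSWORD"]
    msg = build_message(available, sender, receiver)
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(sender, password)
        server.send_message(msg)
```

A monitoring loop would then call `gpus_over_threshold` on each reading, call `send_notification` when the result is non-empty, and sleep for `CHECK_INTERVAL` seconds between checks.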
Logs are stored in gpu_monitor.log and include:
- Timestamp
- GPU memory statistics
- Notification events
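One way to produce log lines like these with timezone-aware timestamps is a custom `logging.Formatter`. The sketch below is an assumption about the approach, not the project's actual logging code: `TZFormatter` and `make_logger` are hypothetical names, and a fixed UTC+8 offset stands in for `Asia/Taipei` so that the example also runs on Python 3.8. On Python 3.9 or later, `zoneinfo.ZoneInfo("Asia/Taipei")` could be passed as `tz` instead.

```python
import logging
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for Asia/Taipei (UTC+8, no daylight saving time).
TAIPEI = timezone(timedelta(hours=8), "Asia/Taipei")


class TZFormatter(logging.Formatter):
    """Formatter that renders %(asctime)s in a chosen timezone."""

    def __init__(self, fmt=None, tz=TAIPEI):
        super().__init__(fmt)
        self.tz = tz

    def formatTime(self, record, datefmt=None):
        # record.created is a POSIX timestamp; convert it to the target zone.
        stamp = datetime.fromtimestamp(record.created, tz=self.tz)
        return stamp.strftime(datefmt or "%Y-%m-%d %H:%M:%S %z")


def make_logger(path="gpu_monitor.log"):
    """Return a logger that appends to the log file (hypothetical helper)."""
    logger = logging.getLogger("gpumon")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(path)
        handler.setFormatter(TZFormatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

With this in place, a call such as `make_logger().info("GPU 0: 24000 MiB free")` would append one timestamped line to `gpu_monitor.log`.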
- Error: `nvidia-smi` command not found: Ensure that the NVIDIA drivers are installed (`nvidia-smi` ships with the driver rather than the CUDA Toolkit) and that `nvidia-smi` is in your system's PATH.
- Email notification not working: Verify that the sender's email account has SMTP access enabled. For Gmail, you may need to use an App Password.
- Environment variables not set: Make sure `SENDER_EMAIL`, `RECEIVER_EMAIL`, and `EMAIL_PASSWORD` are correctly configured in your environment.
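Two of the issues above can be detected before monitoring starts. The following is a small pre-flight sketch, not part of the project as described: `missing_env_vars` and `preflight` are hypothetical helpers that check for `nvidia-smi` on the PATH and for the three required environment variables.

```python
import os
import shutil

REQUIRED_VARS = ("SENDER_EMAIL", "RECEIVER_EMAIL", "EMAIL_PASSWORD")


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def preflight(env=None):
    """Collect human-readable descriptions of setup problems, if any."""
    problems = []
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH")
    missing = missing_env_vars(env)
    if missing:
        problems.append("missing environment variables: " + ", ".join(missing))
    return problems
```

Calling `preflight()` at startup and printing each returned string gives a quick diagnosis before the first check runs.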
Contributions are welcome! Feel free to open an issue or submit a pull request to improve this project.
This project is licensed under the MIT License. See the LICENSE file for details.