TPUTOP is a TPU monitoring tool based on NVTOP, providing htop-like task monitoring for Google Cloud TPU Pods. It supports monitoring multiple TPU devices across all workers in a TPU Pod, displaying real-time utilization, memory usage, and process information.
Key Features:
- Monitor all TPU devices in a TPU Pod (local + remote workers)
- Auto-discovery of TPU Pod workers via GCP metadata service
- Real-time TPU utilization and memory monitoring
- Process information display (PID, USER, CPU, Memory, Command)
- 20fps refresh rate for real-time monitoring
tputop

| Command | Description |
|---|---|
| tputop | Default: Monitor local TPU + remote workers (auto-discovery via GCP metadata, fallback to NVTOP_TPU_POD_FILE) |
| tputop --local | Monitor local TPU only |
| tputop --podips | Monitor TPUs from ~/podips.txt only (no local TPU) |
| tputop --podips=name | Monitor TPUs from ~/name.txt (no local TPU) |
| tputop --podips=/path/to/file | Monitor TPUs from specified path (no local TPU) |
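The three `--podips` forms in the table can be read as a small path-resolution rule. The sketch below is illustrative only: `resolve_podips_path` is a hypothetical helper, not part of TPUTOP, and treating any value that contains a `/` as a path is an assumption about how a bare name is told apart from a path.

```python
from pathlib import Path
from typing import Optional


def resolve_podips_path(value: Optional[str], home: Path) -> Path:
    """Map the value given to --podips to a file path (hypothetical helper)."""
    if not value:
        # Bare `--podips`: fall back to ~/podips.txt.
        return home / "podips.txt"
    if "/" in value:
        # Assumption: a value containing a slash is used as a path as-is.
        return Path(value).expanduser()
    # Bare name: `--podips=name` maps to ~/name.txt.
    return home / f"{value}.txt"
```

For example, `resolve_podips_path("name", Path.home())` would point at `~/name.txt`.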
The podips file should contain one IP address per line:
10.130.0.25
10.130.0.26
10.130.0.27
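Before launching, it can help to sanity-check a podips file. The following is a minimal sketch, not TPUTOP's own parser: `parse_podips` is a hypothetical name, and skipping blank lines is an assumption about what the tool tolerates.

```python
import ipaddress
from typing import List


def parse_podips(text: str) -> List[str]:
    """Return the IP addresses listed one per line in a podips file."""
    ips = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            # Assumption: blank lines are ignored.
            continue
        # Raises ValueError if the line is not a valid IP address.
        ipaddress.ip_address(entry)
        ips.append(entry)
    return ips
```

A malformed line raises `ValueError`, which surfaces typos before any SSH connection is attempted.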
TPUTOP automatically discovers all workers in a TPU Pod using the GCP metadata service. It reads the worker-network-endpoints attribute to find all worker IPs, filters out the local worker, and connects to remote workers via SSH to collect TPU metrics.
- Google Cloud TPU VM
- libtpuinfo.so library (included with TPU runtime)
- SSH access to other workers in the Pod (passwordless)
- Python 3 on all workers
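The SSH and Python requirements can be checked ahead of time. The sketch below is a hypothetical pre-flight check, not something TPUTOP ships: it runs `python3 --version` on a worker with OpenSSH's `BatchMode=yes`, which makes `ssh` fail instead of prompting when passwordless login is not set up.

```python
import subprocess
from typing import List


def ssh_check_command(host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that never prompts for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # fail instead of prompting
        "-o", f"ConnectTimeout={timeout_s}",   # give up quickly on dead hosts
        host,
        "python3 --version",                   # also confirms Python 3 exists
    ]


def worker_is_ready(host: str, timeout_s: int = 5) -> bool:
    """Return True if passwordless SSH works and python3 runs on the host."""
    try:
        result = subprocess.run(
            ssh_check_command(host, timeout_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Calling `worker_is_ready("10.130.0.26")` for each worker IP gives a quick view of which hosts still need SSH keys or Python installed.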
# Install build dependencies
sudo apt install -y libdrm-dev libsystemd-dev libudev-dev cmake libncurses5-dev libncursesw5-dev git
# Install libtpuinfo
wget https://github.com/rdyro/libtpuinfo/releases/download/v0.0.1/libtpuinfo-linux-x86_64.so
sudo mv libtpuinfo-linux-x86_64.so /lib/libtpuinfo.so
# Clone and build
git clone https://github.com/hainuo-wang/tputop.git
cd tputop && mkdir build && cd build
cmake -DTPU_SUPPORT=ON ..
make
# Install
sudo make install

For TPU pods with multiple workers, other nodes only need libtpuinfo installed. The main node will connect via SSH to collect TPU metrics.
# Install libtpuinfo only (no need to install tputop)
wget https://github.com/rdyro/libtpuinfo/releases/download/v0.0.1/libtpuinfo-linux-x86_64.so
sudo mv libtpuinfo-linux-x86_64.so /lib/libtpuinfo.so

For each TPU device:
- Device Name: TPU model and device ID (e.g., "TPU v4 [0@10.130.0.25]")
- TPU Utilization: Duty cycle percentage
- Memory Usage: Used / Total HBM memory
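For scripts that post-process the display, the device name can be split into its parts. This sketch infers the `model [index@host]` layout from the single example above, so the pattern is an assumption, and `parse_device_label` is a hypothetical helper. Labels that do not match (for instance, if local devices are shown without an `@host` part) return `None`.

```python
import re
from typing import NamedTuple, Optional


class DeviceLabel(NamedTuple):
    model: str   # e.g. "TPU v4"
    index: int   # device ID on that worker
    host: str    # worker address after the "@"


# Assumed layout: "<model> [<index>@<host>]"
_LABEL_RE = re.compile(r"^(?P<model>.+?)\s*\[(?P<index>\d+)@(?P<host>[^\]]+)\]$")


def parse_device_label(label: str) -> Optional[DeviceLabel]:
    """Split a device name into model, index and host, or return None."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        return None
    return DeviceLabel(match["model"], int(match["index"]), match["host"])
```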
| Column | Description |
|---|---|
| PID | Process ID |
| USER | Process owner |
| DEV | TPU device index |
| TYPE | Process type (Compute) |
| TPU | TPU utilization % |
| TPU MEM | TPU memory usage |
| CPU | CPU usage % |
| HOST MEM | Host memory usage |
| Command | Process command line |
TPUTOP is based on NVTOP and is licensed under GPLv3.
- nvtop - The original GPU monitoring tool this project is based on.
- Google TPU Research Cloud (TRC) - For providing TPU resources for development and testing.
