Home
- Your GPUs are busy but inference is still slow
- Adding more nodes sometimes makes performance worse
- The same experiment gives different results
- No profiler shows a clear bottleneck
- Latency spikes appear without configuration changes
If these look familiar, the issue may not be compute utilization.
It may be execution behavior.
Argus is an execution observation protocol for distributed AI systems.
It does not optimize models and does not benchmark hardware.
Argus helps you check whether a system behaves consistently when the same workload is repeated.
In distributed AI workloads, systems often appear healthy:
- GPUs are utilized
- CPU usage looks normal
- Memory is sufficient
- No crashes occur
Yet performance becomes inconsistent.
You may observe:
- latency spikes without configuration changes
- adding more nodes making performance worse
- experiments that cannot be reproduced
- intermittent retries or timeouts
- tail latency dominating total runtime
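As a rough illustration of the last symptom, the sketch below compares the 99th-percentile latency with the median for one run. It is not part of Argus; the function name `tail_ratio` and the sample data are assumptions made for this example, and only the Python standard library is used.

```python
import statistics

def tail_ratio(latencies_ms):
    """Return p99 / p50 for a list of per-request latencies (hypothetical helper)."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100)
    p50, p99 = cuts[49], cuts[98]
    return p99 / p50

# Illustrative data only: most requests are fast, a few are very slow.
flat = [10.0] * 100
spiky = [10.0] * 97 + [500.0] * 3

print(tail_ratio(flat))
print(tail_ratio(spiky))
```

A ratio close to 1 means the slowest requests look like the typical ones. A large ratio means a small number of slow requests can dominate total runtime even when average utilization looks healthy.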
Traditional monitoring tools measure resource usage.
Argus observes execution behavior.
Instead of a single performance number, Argus generates a reproducible observation record:
- execution metrics (metrics.json)
- structured report (report.md)
- environment metadata (run_meta.json)
The goal is not to prove performance improvement.
The goal is to determine whether execution behavior is stable or unstable across repeated runs.
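One simple way to think about "stable or unstable across repeated runs" is run-to-run variation in wall-clock time. The sketch below is an assumption for illustration, not the criterion Argus applies: the function names and the 5% threshold are hypothetical, and the durations are made-up sample values.

```python
import statistics

def run_to_run_variation(durations_s):
    """Return the coefficient of variation (sample stdev / mean) of run durations."""
    if len(durations_s) < 2:
        raise ValueError("need at least two runs to compare")
    mean = statistics.mean(durations_s)
    if mean <= 0:
        raise ValueError("durations must be positive")
    return statistics.stdev(durations_s) / mean

def is_stable(durations_s, cv_threshold=0.05):
    """Treat a workload as stable if variation stays under a hypothetical threshold."""
    return run_to_run_variation(durations_s) <= cv_threshold

# Illustrative durations (seconds) for the same workload repeated five times.
steady = [10.0, 10.1, 9.9, 10.0, 10.0]
erratic = [10.0, 14.5, 9.8, 22.3, 10.1]

print(is_stable(steady), is_stable(erratic))
```

The point of the comparison is that neither set of runs needs to be fast. The first set is repeatable and the second is not, which is the distinction the observation record is meant to capture.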
Argus may be useful if:
- the system is slow but utilization is high
- scaling nodes does not improve performance
- identical runs produce different results
- debugging shows no obvious bottleneck
- performance issues are intermittent rather than constant
Argus is not a profiler and not a monitoring dashboard.
It is a protocol for observing whether execution behavior itself is reproducible.
Argus does not:
- optimize performance
- accelerate models
- replace profilers
- evaluate model quality
Argus only records and reports execution behavior.
- See When to use Argus
- See What Argus measures
- See How to read the report
- See Limitations
First time here? Start with When to use Argus.
Repository: https://github.com/tongro2025/Argus