Description
Currently, gProfiler has a high error rate during profiling. The main sources are:
- Short-lived processes
Issue: Profilers attempting to profile processes that exit during profiling
Impact: Multiple errors/day from rbspy and py-spy failing on transient processes
Root Cause: Race conditions with process lifecycle
Recommended Solution: Implemented "Smart Skipping Logic"
Skip processes younger than min_duration seconds
Enhanced error handling for processes that exit during profiling
Applied across Ruby, Java, and Python profilers
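The age check behind the skipping logic could look roughly like the sketch below. It is an illustration only, not gProfiler's actual code: the function name `should_skip_process`, its parameters, and the 10-second default for `min_duration` are assumptions made for the example.

```python
import time
from typing import Optional


def should_skip_process(
    create_time: float,
    min_duration: float = 10.0,  # assumed default, not taken from gProfiler
    now: Optional[float] = None,
) -> bool:
    """Return True if the process is younger than min_duration seconds.

    create_time is the process start time as a Unix timestamp, for example
    the value returned by psutil.Process(pid).create_time().
    """
    if now is None:
        now = time.time()
    age_seconds = now - create_time
    return age_seconds < min_duration
```

Passing `now` explicitly keeps the check deterministic in tests; in production code the default of `time.time()` would normally be used.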
- False-positive process identification
Issue: Profilers targeting non-target processes with embedded libraries
Examples: py-spy trying to profile Envoy servers, register services with embedded Python
Impact: 200+ false positive errors/day
Recommended Solution: Dynamic detection with graceful fallback
Implemented _is_embedded_python_process() detection
Added _is_likely_python_interpreter() validation
Changed error logging from ERROR to INFO for operational clarity
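One way such detection might work is sketched below. The function names echo the ones mentioned above, but the bodies are assumptions: the heuristic (executable name looks like `python*`, versus a non-Python executable that merely maps `libpython`) is a guess at the approach, not the code from the actual change.

```python
import os
import re
from typing import Iterable

# Matches python, python3, python3.11, python2.7 and similar names.
_PYTHON_EXE_RE = re.compile(r"^python(\d+(\.\d+)?)?$")


def is_likely_python_interpreter(exe_path: str) -> bool:
    """Heuristic: the executable's basename looks like a Python interpreter."""
    return _PYTHON_EXE_RE.match(os.path.basename(exe_path)) is not None


def is_embedded_python_process(exe_path: str, mapped_files: Iterable[str]) -> bool:
    """Heuristic: libpython is mapped, but the executable is not an interpreter.

    mapped_files is the list of file paths mapped into the process,
    for example the path column of /proc/<pid>/maps.
    """
    has_libpython = any(
        os.path.basename(path).startswith("libpython") for path in mapped_files
    )
    return has_libpython and not is_likely_python_interpreter(exe_path)
```

With this heuristic, an Envoy binary that maps `libpython3.10.so.1.0` is classified as embedded Python and can be skipped with an INFO log instead of producing an ERROR.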
- Processes with deleted libraries (ELF symbol errors)
Issue: PyPerf crashes when profiling processes with deleted libraries (Bazel, containers)
Impact: Multiple crashes/day with verbose ELF symbol error logs
Error Pattern: "Failed to iterate over ELF symbols: ... (deleted)"
Root Cause: Containerized processes and Bazel builds create temporary libraries that get deleted
Recommended Solution: Reactive error handling approach
Enhanced _process_pyperf_stderr() with graceful error detection
Added user-friendly error messages
Filters verbose debug output while maintaining profiling coverage
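The reactive filtering could be sketched as follows. This is a hypothetical stand-in for `_process_pyperf_stderr()`, whose real signature is not shown in this issue; only the error pattern is taken from the message quoted above.

```python
import re
from typing import List, Tuple

# Pattern based on the error message quoted in this issue.
_DELETED_ELF_RE = re.compile(r"Failed to iterate over ELF symbols:.*\(deleted\)")


def filter_pyperf_stderr(stderr: str) -> Tuple[List[str], int]:
    """Split stderr into lines worth keeping and a count of suppressed lines.

    Lines reporting ELF symbol failures on deleted libraries are dropped;
    everything else is returned unchanged and in order.
    """
    kept: List[str] = []
    suppressed = 0
    for line in stderr.splitlines():
        if _DELETED_ELF_RE.search(line):
            suppressed += 1
        else:
            kept.append(line)
    return kept, suppressed
```

The caller could then log a single friendly message such as "skipped N libraries deleted from disk" instead of one verbose line per library.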
- GPU Machine Segmentation Faults - CPU profiling on GPU machines #992
Issue: perf segfaults on GPU machines during symbol resolution
Impact: Multiple segfaults/day causing profiler restarts
Root Cause: GPU driver interactions with perf symbol resolution
Recommended Solution: Enhanced segfault detection and graceful recovery. However, this causes perf to return 0 samples, which breaks the runtime profilers in the merge logic, so the merge logic also needs to be fixed.
In the next phase, fix the root cause by updating the perf version.
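A minimal sketch of both pieces, under stated assumptions: the segfault check relies on the documented `subprocess` convention that a child killed by signal N has return code -N on POSIX, while `merge_profiles` and its per-PID dictionary layout are hypothetical and much simpler than gProfiler's real merge logic.

```python
import signal
from typing import Dict

# Hypothetical layout: PID -> {collapsed stack string -> sample count}.
StackCounts = Dict[str, int]
Profiles = Dict[int, StackCounts]


def is_segfault(returncode: int) -> bool:
    """On POSIX, subprocess reports death by signal N as returncode -N."""
    return returncode == -signal.SIGSEGV


def merge_profiles(perf: Profiles, runtime: Profiles) -> Profiles:
    """Merge per-PID results; runtime profiler data replaces perf data.

    An empty perf result (for example after a perf segfault) is tolerated:
    the runtime profiles are returned as they are.
    """
    merged: Profiles = dict(perf)
    merged.update(runtime)
    return merged
```

The point of the sketch is that the merge step should not assume perf produced any samples, so a recovered perf crash degrades the output rather than failing the whole run.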
- An invalid PID passed to perf crashes the entire gProfiler
Issue: gProfiler stops abruptly when PIDs are passed and at least one of them is invalid, due to perf's behavior.
Expectation: gProfiler should keep profiling the remaining valid PIDs rather than stopping abruptly.
Root Cause: gProfiler crashes when the perf target PIDs are invalid during initialization.
Recommended Solution: Graceful PID validation and fallback mechanisms in gprofiler/profilers/factory.py
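PID validation before starting perf could be sketched like this. It is not the code in gprofiler/profilers/factory.py; the helper names `pid_exists` and `split_pids` are made up for the example, and the check assumes a POSIX system where signal 0 tests for process existence without delivering a signal.

```python
import os
from typing import Callable, Iterable, List, Tuple


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups in kill(2), not a process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the check without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def split_pids(
    pids: Iterable[int], exists: Callable[[int], bool] = pid_exists
) -> Tuple[List[int], List[int]]:
    """Partition PIDs into (valid, invalid) so that only valid ones reach perf."""
    valid: List[int] = []
    invalid: List[int] = []
    for pid in pids:
        (valid if exists(pid) else invalid).append(pid)
    return valid, invalid
```

A caller could log the invalid list as a warning and continue with the valid PIDs, stopping only when no valid PID remains. A process can still exit between this check and perf startup, so the check narrows the race rather than removing it.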