Selective slowdown profiler for Python and native code.
Speed Bump introduces controlled, selective delays into code execution. By slowing specific modules/functions and measuring throughput impact, you can identify which code paths actually matter to overall system performance.
- Python code: Uses PEP 669 (sys.monitoring) or sys.setprofile to intercept function calls
- Native code: Uses kernel uprobes via the optional speed-bump-native-kmod kernel module
This is particularly useful for AI/ML workloads where traditional profiling misses the subtle interactions between Python, native extensions, and GPU execution.
Traditional profilers show where Python spends time, but in GPU-accelerated systems this is misleading. If Python is busy while the GPU is also busy, speeding up Python won't help. Conversely, micro-stalls where the GPU waits for Python won't show up as "hot" in a profiler.
The fundamental issue: time spent ≠ time that matters.
Speed Bump inverts the problem: instead of measuring how fast code runs, measure how much throughput drops when code is artificially slowed. If slowing module X doesn't affect throughput, don't bother optimising it.
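As a minimal sketch of that comparison, the sensitivity of a target can be expressed as the fractional throughput lost when it is slowed. The function name throughput_drop and the numbers below are hypothetical illustrations, not part of Speed Bump or real measurements.

```python
def throughput_drop(baseline: float, slowed: float) -> float:
    """Fractional throughput loss when a target is slowed (0.0 = no effect)."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (baseline - slowed) / baseline


# Hypothetical throughput figures in requests/second (illustrative only).
baseline_rps = 100.0
slowed_rps_by_target = {"module_a": 99.5, "module_b": 80.0}

for name, slowed_rps in slowed_rps_by_target.items():
    drop = throughput_drop(baseline_rps, slowed_rps)
    print(f"{name}: {drop:.1%} throughput drop")
```

A target with a drop near zero is a poor optimisation candidate; a large drop suggests the target sits on the critical path.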
From source:
git clone https://github.com/SonicField/speed-bump.git
cd speed-bump
pip install .
From wheel (if available):
pip install speed_bump-0.1.0-cp312-cp312-linux_aarch64.whl
Manual build (without pip/setuptools):
If pip or setuptools are unavailable, build the C extension directly:
cd speed-bump
PYTHON_INCLUDES=$(python3 -c 'import sysconfig; print(sysconfig.get_config_var("INCLUDEPY"))')
EXT_SUFFIX=$(python3 -c 'import sysconfig; print(sysconfig.get_config_var("EXT_SUFFIX"))')
# Core extension (required for all Python versions)
gcc -shared -fPIC -I"$PYTHON_INCLUDES" -O3 -Wall -std=c11 -D_GNU_SOURCE \
src/speed_bump/_core.c -o src/speed_bump/_core$EXT_SUFFIX
# Setprofile extension (required for Python 3.10-3.11)
gcc -shared -fPIC -I"$PYTHON_INCLUDES" -O3 -Wall -std=c11 -D_GNU_SOURCE \
src/speed_bump/_setprofile.c -o src/speed_bump/_setprofile$EXT_SUFFIX
# Then add src/ to PYTHONPATH
export PYTHONPATH=$PWD/src:$PYTHONPATH
python3 -c "import speed_bump; print(speed_bump.clock_overhead_ns)"
Requirements:
- Linux (x86_64 or aarch64)
- Python 3.10+
Speed Bump supports Python 3.10 and later with different backends:
| Python Version | Backend | Notes |
|---|---|---|
| 3.12+ | PEP 669 (sys.monitoring) | Full feature support |
| 3.10-3.11 | sys.setprofile (C extension) | clear_cache() is a no-op |
Python 3.10-3.11 Limitations:
- The match cache is stored in code objects' co_extra field and cannot be cleared
- clear_cache() has no effect; the cache persists for the lifetime of the process
- Qualified name construction is approximate (uses first argument type for methods)
- Use a fresh Python process if you need to change target patterns
- Create a targets file specifying what to slow:
# targets.txt
transformers.modeling_llama:LlamaAttention.*
vllm.worker.model_runner:ModelRunner.execute_model
- Run your application with Speed Bump:
export SPEED_BUMP_TARGETS=/path/to/targets.txt
export SPEED_BUMP_DELAY_NS=10000 # 10µs delay per call
export SPEED_BUMP_START_MS=5000 # Start after 5s warmup
export SPEED_BUMP_DURATION_MS=30000 # Run for 30s
python your_benchmark.py
- Compare throughput with and without SPEED_BUMP_TARGETS set.
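The with/without comparison can be scripted. The sketch below builds the two environments from the variables shown above and times a command under each; the helper names speed_bump_env and timed_run are hypothetical, and wall-clock time is only a stand-in for whatever throughput figure your benchmark reports.

```python
import os
import subprocess
import sys
import time


def speed_bump_env(targets=None, delay_ns=10_000, start_ms=5_000,
                   duration_ms=30_000, base=None):
    """Return an environment dict; targets=None gives the baseline (disabled)."""
    env = dict(os.environ if base is None else base)
    # Drop any inherited Speed Bump settings so both runs start clean.
    for key in [k for k in env if k.startswith("SPEED_BUMP_")]:
        del env[key]
    if targets is not None:
        env["SPEED_BUMP_TARGETS"] = str(targets)
        env["SPEED_BUMP_DELAY_NS"] = str(delay_ns)
        env["SPEED_BUMP_START_MS"] = str(start_ms)
        env["SPEED_BUMP_DURATION_MS"] = str(duration_ms)
    return env


def timed_run(cmd, env):
    """Run cmd to completion under env and return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


# Example usage (your_benchmark.py is the placeholder name from the steps above):
# cmd = [sys.executable, "your_benchmark.py"]
# baseline = timed_run(cmd, speed_bump_env())
# slowed = timed_run(cmd, speed_bump_env("/path/to/targets.txt"))
```

Prefer a throughput number printed by the benchmark itself where one exists, since total wall time also includes startup and warmup.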
| Environment Variable | Description | Default |
|---|---|---|
| SPEED_BUMP_TARGETS | Path to targets file | (disabled) |
| SPEED_BUMP_DELAY_NS | Delay in nanoseconds per trigger | 1000 |
| SPEED_BUMP_FREQUENCY | Trigger every Nth matching call | 1 |
| SPEED_BUMP_START_MS | Milliseconds after process start | 0 |
| SPEED_BUMP_DURATION_MS | Duration in milliseconds (0 = indefinite) | 0 |
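To illustrate how these settings interact, here is a rough sketch that reads them with the defaults from the table and decides whether a given matching call should be delayed. Config, load_config and should_delay are hypothetical names; the choice that the Nth, 2Nth, ... matching call triggers is an assumption, and the real logic lives in Speed Bump's extension code.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Config:
    targets: Optional[str] = None  # None means Speed Bump is disabled
    delay_ns: int = 1000
    frequency: int = 1
    start_ms: int = 0
    duration_ms: int = 0           # 0 means no end to the window


def load_config(environ=None) -> Config:
    """Read the SPEED_BUMP_* variables, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return Config(
        targets=env.get("SPEED_BUMP_TARGETS") or None,
        delay_ns=int(env.get("SPEED_BUMP_DELAY_NS", 1000)),
        frequency=max(1, int(env.get("SPEED_BUMP_FREQUENCY", 1))),
        start_ms=int(env.get("SPEED_BUMP_START_MS", 0)),
        duration_ms=int(env.get("SPEED_BUMP_DURATION_MS", 0)),
    )


def should_delay(cfg: Config, elapsed_ms: int, match_count: int) -> bool:
    """match_count is the 1-based count of matching calls seen so far."""
    if cfg.targets is None or elapsed_ms < cfg.start_ms:
        return False
    if cfg.duration_ms and elapsed_ms >= cfg.start_ms + cfg.duration_ms:
        return False
    # Assumption: every Nth matching call triggers (N, 2N, 3N, ...).
    return match_count % cfg.frequency == 0
```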
# Comments start with #
# Format: module_glob:qualified_name_glob
# Match all methods of a class
transformers.modeling_llama:LlamaAttention.*
# Match a specific function
vllm.worker.model_runner:ModelRunner.execute_model
# Match everything in a module
mypackage.slow_module:*
# Wildcard module matching
transformers.*:*Attention*
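To check which functions a targets file would select, the format above can be approximated with the standard library. This sketch assumes the globs behave like fnmatch patterns and that only whole-line comments are used; Speed Bump's own matcher may differ in edge cases, and parse_targets and matches are hypothetical helper names.

```python
from fnmatch import fnmatchcase


def parse_targets(text):
    """Parse 'module_glob:qualified_name_glob' lines into (module, name) pairs."""
    patterns = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        module_glob, sep, name_glob = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in target line: {line!r}")
        patterns.append((module_glob.strip(), name_glob.strip()))
    return patterns


def matches(patterns, module, qualname):
    """True if any pattern matches both the module and the qualified name."""
    return any(
        fnmatchcase(module, module_glob) and fnmatchcase(qualname, name_glob)
        for module_glob, name_glob in patterns
    )


example = """
# Match all methods of a class
transformers.modeling_llama:LlamaAttention.*
# Match everything in a module
mypackage.slow_module:*
"""
patterns = parse_targets(example)
print(matches(patterns, "transformers.modeling_llama", "LlamaAttention.forward"))
```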
Speed Bump uses Python's monitoring capabilities to register low-overhead callbacks on function calls:
- Python 3.12+: Uses PEP 669 (sys.monitoring) for per-code-object monitoring with zero overhead for non-matching functions
- Python 3.10-3.11: Uses sys.setprofile via a C extension, with match results cached in code objects
When a matching function is called during the active time window, Speed Bump executes a spin-delay loop to introduce the configured latency.
Key design decisions:
- Spin delay, not sleep: Delays hold the CPU (and GIL) to accurately simulate slower Python code
- Clock calibration: Measures clock_gettime overhead at startup to ensure accurate delays
- Per-code caching: Match results are cached per code object to minimise overhead
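The spin-delay and calibration ideas can be sketched in pure Python. This is an illustration only: Speed Bump implements both in C on top of clock_gettime, whereas the version below uses time.perf_counter_ns, so its overhead and accuracy will be noticeably worse.

```python
import time


def measure_clock_overhead_ns(samples: int = 10_000) -> float:
    """Estimate the average cost of one clock read (includes loop overhead)."""
    start = time.perf_counter_ns()
    for _ in range(samples):
        time.perf_counter_ns()
    return (time.perf_counter_ns() - start) / samples


def spin_delay_ns(delay_ns: int) -> None:
    """Busy-wait for at least delay_ns; unlike sleep, the CPU is never yielded."""
    deadline = time.perf_counter_ns() + delay_ns
    while time.perf_counter_ns() < deadline:
        pass


overhead = measure_clock_overhead_ns()
print(f"approximate clock read overhead: {overhead:.0f} ns")
spin_delay_ns(10_000)  # spin for roughly 10µs
```

Because the loop polls the clock, delays shorter than a few clock reads cannot be honoured accurately, which is why a calibrated minimum delay is useful.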
Speed Bump has fundamental constraints to be aware of:
Speed Bump's Python monitoring can only slow down Python code that runs through the interpreter. C extensions, NumPy ufuncs, and other native code execute outside Python's monitoring system.
For native code, use the speed_bump.native module (see Native Code Probing below), which uses kernel uprobes to inject delays into compiled binaries.
The spin delay holds the GIL while waiting. This accurately simulates slower Python code (which would also hold the GIL), but means:
- Other Python threads are blocked during the delay
- In multi-threaded applications, interpret results carefully
Verified with Python 3.14 free-threaded build (2026-02-01).
Speed Bump works correctly with free-threaded Python (--disable-gil builds):
- The C extension declares Py_mod_gil = Py_MOD_GIL_NOT_USED, so it runs without re-enabling the GIL
- Each thread receives accurate per-thread delays
- The spin_delay_ns function is thread-safe
- Parallel execution completes in constant wall-clock time regardless of thread count
Test Results (Python 3.14.0 FTP build, LTO):
- Runtime detection: PASS (correctly identifies FTP vs GIL)
- Per-thread delay accuracy: PASS (each thread gets correct delay)
- Parallel performance: PASS (N threads complete in ~1× delay time, not N×)
- Cache thread safety: PASS
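For reference, one way to detect a free-threaded interpreter at runtime is sketched below. It is not necessarily the check Speed Bump's tests use; gil_status is a hypothetical helper, and sys._is_gil_enabled only exists on Python 3.13 and later, so older versions are treated as GIL-enabled.

```python
import sys
import sysconfig


def gil_status():
    """Return (built_free_threaded, gil_currently_enabled)."""
    # Py_GIL_DISABLED is set to 1 in the build config of free-threaded builds.
    built_free_threaded = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Even a free-threaded build can re-enable the GIL at runtime.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if check is None else bool(check())
    return built_free_threaded, gil_enabled


built_ft, gil_on = gil_status()
print(f"free-threaded build: {built_ft}, GIL enabled: {gil_on}")
```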
- Methodology Guide: The systematic approach to finding Python bottlenecks
- Pattern Reference: How to write target patterns for different frameworks
import speed_bump
# Calibration results
speed_bump.clock_overhead_ns # Measured clock_gettime overhead
speed_bump.min_delay_ns # Minimum achievable delay (2x overhead)
# Low-level delay function (for testing)
speed_bump.spin_delay_ns(1000) # Spin for 1µs
The speed_bump.native module provides uprobe-based delays for native C functions, allowing you to measure sensitivity of compiled code (C extensions, CPython internals, system libraries).
Requirements:
- Linux with kernel uprobe support
- The speed-bump-native-kmod kernel module loaded
- Root privileges (or appropriate capabilities) for writing to sysfs
from speed_bump import native
# Probe a CPython internal function for this process and its children
with native.probe("/usr/bin/python3", "PyObject_GetAttr", delay_ns=1000):
    run_benchmark()  # Only this process tree is affected

from speed_bump import native
# Check if kernel module is available
if native.is_available():
    # Context manager for scoped probing
    with native.probe(binary_path, symbol, delay_ns=1000, pid=None):
        # pid defaults to current process (os.getpid())
        # Probe is automatically removed on exit
        pass
# Manual control
native.add_probe("/path/to/binary", "function_name", delay_ns=5000)
native.remove_probe("/path/to/binary", "function_name")
The native module writes to /sys/kernel/speed_bump/targets to configure the kernel module:
- Add probe: +/path/to/binary:symbol delay_ns pid=N
- Remove probe: -/path/to/binary:symbol
The kernel module uses uprobes to inject delays when the specified function is called. PID filtering ensures only the specified process and its descendants are affected.
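The command strings can be sketched as follows. This assumes delay_ns in the format above stands for the numeric delay and that each command is written as a single newline-terminated line; both are assumptions, and add_command, remove_command and write_command are hypothetical helpers rather than the speed_bump.native API.

```python
import os

TARGETS_PATH = "/sys/kernel/speed_bump/targets"


def add_command(binary, symbol, delay_ns, pid=None):
    """Build the 'add probe' line; pid defaults to the current process."""
    pid = os.getpid() if pid is None else pid
    return f"+{binary}:{symbol} {delay_ns} pid={pid}"


def remove_command(binary, symbol):
    """Build the 'remove probe' line."""
    return f"-{binary}:{symbol}"


def write_command(command, path=TARGETS_PATH):
    """Write one command to the sysfs control file (needs suitable privileges)."""
    with open(path, "w") as control:
        control.write(command + "\n")  # trailing newline is an assumption


print(add_command("/usr/bin/python3", "PyObject_GetAttr", 1000, pid=1234))
print(remove_command("/usr/bin/python3", "PyObject_GetAttr"))
```

In normal use native.add_probe and native.remove_probe handle this for you; the raw format is mainly useful when debugging from a shell.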
Use standard tools to find symbols in binaries:
# List symbols in Python
nm -D /usr/bin/python3 | grep PyObject
# List symbols in a shared library
nm -D /usr/lib/libcuda.so | grep cuLaunch
The kernel module source is available at speed-bump-native-kmod.
To install:
- Clone the repository: git clone https://github.com/SonicField/speed-bump-native-kmod.git
- Follow the README for building and loading (or use the QEMU test environment)
- Once loaded, /sys/kernel/speed_bump/targets becomes available
Without the kernel module, Speed Bump works normally for Python code profiling; only the speed_bump.native module is affected, and native.is_available() returns False.
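A benchmark script that should run on machines with or without the kernel module can guard its probes. The sketch below uses only native.is_available() and native.probe() as documented above; maybe_probe is a hypothetical wrapper that falls back to a do-nothing context manager.

```python
from contextlib import nullcontext

try:
    from speed_bump import native
except ImportError:  # speed_bump itself is not installed
    native = None


def maybe_probe(binary_path, symbol, delay_ns=1000):
    """Return a native probe if the kernel module is usable, else a no-op."""
    if native is not None and native.is_available():
        return native.probe(binary_path, symbol, delay_ns=delay_ns)
    return nullcontext()


with maybe_probe("/usr/bin/python3", "PyObject_GetAttr"):
    pass  # run the benchmark here
```

Record in your results whether the probe was actually active, since a silent fallback would otherwise look like "slowing this function had no effect".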
git clone https://github.com/SonicField/speed-bump.git
cd speed-bump
pip install -e .[dev]
pytest
For C-level sanitiser tests (ThreadSanitizer, AddressSanitizer), see docs/testing.md.
See CONTRIBUTING.md for more details.
MIT. See LICENSE.
v0.1.0 - Core functionality complete:
- Clock calibration
- Spin delay (C extension)
- Target pattern parsing (glob-based)
- PEP 669 monitoring integration
- Python 3.10+ support via sys.setprofile backend
- Timing window control (start delay, duration)
- Frequency control (every Nth call)
- Native code probing via kernel uprobes
- Statistics collection