PICO: Performance Insights for Collective Operations
💫 If you find PICO useful for your research or benchmarking work, please consider giving it a ⭐ on GitHub!
PICO is a lightweight, extensible, and reproducible benchmarking suite for evaluating and tuning collective communication operations across diverse libraries and hardware platforms.
Built for researchers, developers, and system administrators, PICO streamlines the entire benchmarking workflow, from configuration to execution, tracing, and analysis, across MPI, NCCL, and user-defined collectives.
- 📦 Unified micro-benchmarking of both CPU and GPU collectives across a variety of MPI libraries (Open MPI, MPICH, Cray MPICH), NCCL, and user-defined algorithms.
- 🎛️ Guided configuration via a fully fledged Textual TUI or a CLI-driven JSON/flag workflow with per-site presets.
- 🔁 Reproducible runs through environment capture, metadata logging, and timestamped result directories.
- 🧩 Built-in correctness checks for custom collectives and automatic ground-truth validation.
- 🧠 Per-phase instrumentation that goes beyond micro-benchmarking (hence the name PICO).
- 🧵 Queue-friendly orchestration that compiles, ships, and archives jobs seamlessly on SLURM clusters or in local mode for debugging.
- 📊 Bundled plotting, tracing, and scheduling utilities for streamlined post-processing and algorithm engineering.
```
📋 Configuration
├─ 🧩 Sources: Textual TUI • JSON • CLI flags
└─ ⚙️ Validation & module loading via submit_wrapper.sh

🚀 Orchestration
├─ 🧵 scripts/orchestrator.sh iterates over:
│    • Libraries × Collectives × Message Sizes
└─ 🏗️ Builds binaries and dispatches jobs (SLURM or local)

🔧 Execution
├─ pico_core / libpico executables
├─ ✅ Correctness checks
└─ 🧠 Optional per-phase instrumentation

📊 Results
├─ results/<system>/<timestamp>/
│    • CSV metrics
│    • Logs
│    • Metadata
│    • Archives
└─ Post-processing utilities:
     • plot/ • tracer/ • schedgen/
```
The recommended way to use PICO is through its Textual TUI, which guides you from configuration to job submission.
Ensure you have at least one valid environment definition under `config/environment/`. A working local sample is provided; modify it for your local machine. For remote clusters, mirror one of the existing environment templates and adapt it to your site (a setup wizard to simplify this configuration is on its way!).
Create and activate a Python virtual environment, then install the Python dependencies used by the TUI and analysis tools:

```bash
pip install -r requirements.txt
```

Start the interactive interface and follow the four-step wizard: configure the environment, select libraries, choose algorithms, and export.

```bash
python tui/main.py
```

Within the TUI, define:
- The target collective(s)
- Message sizes and iteration counts
- Backends (MPI / NCCL / custom)
- Instrumentation and validation settings
The TUI will produce a test descriptor file encapsulating all these options.
The export lands in `tests/<name>.json` (full configuration) and `tests/<name>.sh` (shell exports).
Execute the generated descriptor using the wrapper script, which handles compilation, dispatch, and archival:

```bash
scripts/submit_wrapper.sh -f [path_to_test_sh_file]
```

This command orchestrates the full benchmarking workflow, locally or on SLURM clusters, using your defined environment.
You can still invoke PICO directly via the CLI to explore options or run ad-hoc tests. If that is desired, after step 1 run:

```bash
scripts/submit_wrapper.sh --help
```
⚠️ Note: The CLI path is currently partially maintained; some flags may be deprecated as functionality transitions to the TUI.
Example CLI invocation:
```bash
scripts/submit_wrapper.sh \
    --location leonardo \
    --nodes 8 \
    --ntasks-per-node 32 \
    --collectives allreduce,allgather \
    --types int32,double \
    --sizes 64,1024,65536 \
    --segment-sizes 0 \
    --time 01:00:00 \
    --gpu-awareness no
```

- Provide comma-separated lists for datatypes, message sizes, and segment sizes.
- Use `--gpu-awareness yes` and `--gpu-per-node` to benchmark NCCL or CUDA-aware MPI collectives.
- Pass `--debug yes` for quick validation runs with reduced iterations and debug builds.
- When `--compile-only yes` is set, the script stops after building `bin/pico_core` and its GPU counterpart.
- A C/C++ compiler and an MPI implementation (Open MPI, MPICH, or Cray MPICH). CUDA-aware MPI or NCCL is optional for GPU runs.
- (Optional) CUDA toolkit and a compatible NCCL build for GPU collectives.
- Python 3.9+ with `pip` for the TUI and analysis utilities (`pip install -r requirements.txt`).
- SLURM for cluster submissions; local mode is supported for functional testing.
- Basic build tools (`make`) and a Bash-compatible shell.
- `pico_core/` – C benchmarking driver that allocates buffers, times collectives, checks results, and writes output.
- `libpico/` – Library of custom collective algorithms and instrumentation helpers, selectable alongside vendor MPI/NCCL paths.
- `scripts/submit_wrapper.sh` – Entry point that parses CLI flags or TUI exports, loads site modules, builds binaries, activates Python envs, and launches SLURM or local runs.
- `scripts/orchestrator.sh` – Node-side runner that sweeps libraries, algorithm sets, GPU modes, message sizes, and datatypes while invoking metadata capture and optional compression.
- `config/` – Declarative environment, library, and algorithm descriptions consumed by the TUI and CLI (modules to load, compiler wrappers, task/GPU limits).
- `tui/` – Textual-based UI that guides the user through environment selection, library selection, and algorithm mix, and exports the shell/JSON bundle for later submission.
- `plot/` – Python package and CLI (`python -m plot …`) that turns CSV summaries into line charts, bar charts, heatmaps, and tables.
- `tracer/` – Tools for network-awareness studies (link utilization estimates, cluster job monitoring, scatterplots/boxplots).
- `schedgen/` – Adapted SPCL scheduler generator used to derive algorithm schedules from communication traces.
- `results/` – Storage for raw outputs, metadata CSVs (per system), and helper scripts such as `generate_metadata.py`.
- Environment sourcing loads modules, compiler wrappers, MPI/NCCL paths, and queue defaults from `config/environments/<location>.sh`.
- The Makefile builds `libpico` first, then `pico_core` (CPU) and optionally `pico_core_cuda` (GPU), honouring debug and instrumentation flags.
- A Python virtual environment is activated and populated with plotting/tracing dependencies on demand.
- `scripts/orchestrator.sh` iterates over every selected library, collective, datatype, message size, and GPU mode. For each combination it:
  - Prepares per-collective environment variables and propagates algorithm lists to the workers.
  - Generates metadata entries through `results/generate_metadata.py`, capturing cluster, job, library, GPU, and note fields.
  - Runs `pico_core`, which allocates buffers, initializes randomized inputs (deterministic when debugging), executes warmups, measures iterations, and compares the outcome against vendor MPI results (a simplified sketch of this loop follows the list).
  - Optionally enables LibPICO instrumentation tags to time internal algorithm phases.
- Outputs are written under `results/<location>/<timestamp>/`; in non-debug runs the directory can be tarred and optionally deleted.
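For intuition, the warmup/measure/validate pattern described above resembles the simplified C sketch below. This is a conceptual illustration only, not `pico_core`'s actual code: the function name and buffer handling are assumptions, and output writing and error handling are omitted.

```c
/* Conceptual sketch of a warmup/measure/validate loop, NOT pico_core's
 * actual implementation. Buffer setup is simplified and output writing
 * is omitted. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void bench_allreduce(int count, int warmups, int iters,
                            double *samples, MPI_Comm comm)
{
    double *in  = malloc(count * sizeof(double));
    double *out = malloc(count * sizeof(double));
    double *ref = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        in[i] = (double)rand() / RAND_MAX;   /* randomized inputs */

    /* Untimed warmup iterations. */
    for (int w = 0; w < warmups; w++)
        MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, comm);

    /* Timed iterations, barrier-synchronized across ranks. */
    for (int it = 0; it < iters; it++) {
        MPI_Barrier(comm);
        double t0 = MPI_Wtime();
        MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, comm);
        samples[it] = MPI_Wtime() - t0;
    }

    /* Ground truth from the vendor MPI path; a custom LibPICO algorithm
     * under test would be compared against this result. */
    MPI_Allreduce(in, ref, count, MPI_DOUBLE, MPI_SUM, comm);
    if (memcmp(out, ref, count * sizeof(double)) != 0) {
        /* report a correctness failure */
    }

    free(in); free(out); free(ref);
}
```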
- CSV files follow the `<count>_<algorithm>[_<segment>]_datatype.csv` naming scheme, with per-iteration timing, statistics-only, or summarized rows depending on `--output-level`.
- Allocation maps (`alloc_<tasks>.csv`) record rank-to-node placement. GPU runs append `_GPU`.
- SLURM logs reside alongside the CSVs (`slurm_<jobid>.out`/`.err`) unless in debug mode.
- Metadata is appended to `results/<location>_metadata.csv`, enabling cross-run filtering by timestamp, collective, library version, GPU involvement, and notes.
- Example plotting commands:

```bash
python -m plot summary --summary-file results/leonardo/<timestamp>/summary.csv
python -m plot heatmap --system leonardo --timestamp <timestamp> --collective allreduce
python -m plot boxplot --system lumi --notes "production"
```

- The tracer package (`tracer/trace_communications.py`) estimates traffic on global links for recorded allocations, while `tracer/sinfocan` processes week-long job snapshots from monitored clusters.
- Building with `-DPICO_INSTRUMENT` exposes the `PICO_TAG_BEGIN`/`PICO_TAG_END` macros defined in `include/libpico.h`.
  - These can be inserted into LibPICO collective implementations to record per-phase timings, which are emitted into `_instrument.csv` files. Detailed usage and examples are provided in `libpico/instrument.md`.
  - Instrumentation is supported for CPU collectives; the macros are transparent when GPU paths are enabled.
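As a rough illustration, such tags could wrap the phases of a LibPICO collective as sketched below. The macro argument style and the surrounding function signature are assumptions made for this example; `libpico/instrument.md` documents the real usage.

```c
/* Hypothetical excerpt from a LibPICO collective implementation.
 * Assumes PICO_TAG_BEGIN/PICO_TAG_END take a phase identifier; see
 * libpico/instrument.md for the authoritative macro usage. */
#include <mpi.h>
#include "libpico.h"

int example_allreduce_ring(const void *sendbuf, void *recvbuf, int count,
                           MPI_Datatype dtype, MPI_Op op, MPI_Comm comm)
{
    PICO_TAG_BEGIN(reduce_scatter);
    /* ... reduce-scatter phase: each rank ends up owning one reduced chunk ... */
    PICO_TAG_END(reduce_scatter);

    PICO_TAG_BEGIN(allgather);
    /* ... allgather phase: circulate the reduced chunks around the ring ... */
    PICO_TAG_END(allgather);

    return MPI_SUCCESS;
}
```

With `-DPICO_INSTRUMENT` set, each tagged region would contribute a per-phase timing row to the run's `_instrument.csv` output.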
- To add new algorithms, implement them in `libpico_<collective>.c`, declare them in `include/libpico.h`, and list them in `config/algorithms/<standard>/<library>/<collective>.json`. The TUI and CLI automatically surface the new options.
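For example, a minimal skeleton for a new allreduce variant might look like the sketch below. The `MPI_Allreduce`-style signature is an assumption for illustration; match the conventions of the existing algorithms in `libpico/` instead.

```c
/* Hypothetical skeleton for a new variant in libpico_allreduce.c.
 * The MPI_Allreduce-style signature is assumed for illustration only. */
#include <mpi.h>
#include "libpico.h"

int allreduce_my_variant(const void *sendbuf, void *recvbuf, int count,
                         MPI_Datatype dtype, MPI_Op op, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* ... exchange-and-reduce rounds implementing the new algorithm ... */

    return MPI_SUCCESS;
}
```

After implementing the function, declare it in `include/libpico.h` and list its name in the matching `config/algorithms/<standard>/<library>/<collective>.json` entry so the TUI and CLI can surface it.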
- Environments: Add new cluster profiles by cloning the `config/environment/<env>` JSON descriptors and creating a matching `config/environments/<env>.sh` wrapper that sets modules, compiler wrappers, and queue defaults.
- Libraries: Update `<env>_libraries.json` to expose additional MPI/NCCL builds, compiler flags, GPU capabilities, and metadata strings. The TUI reads these files at runtime.
```
pico/
├── include/    # Public LibPICO API and instrumentation macros
├── libpico/    # Custom collective implementations
├── pico_core/  # Benchmark driver and MPI/NCCL glue code
├── config/     # Environment, library, and algorithm JSON descriptors
├── scripts/    # Submission, orchestration, metadata, and shell helpers
├── tui/        # Textual UI for configuration authoring
├── plot/       # Plotting package and CLI
├── tracer/     # Network tracing and allocation analysis tools
├── schedgen/   # Communication schedule generator (SPCL fork)
├── tests/      # Sample exported configurations
└── results/    # Generated data, metadata CSVs, and helper scripts
```
PICO is developed by Daniele De Sensi and Saverio Pasqualoni at the Department of Computer Science, Sapienza University of Rome. The project is licensed under the MIT License.
Schedgen code was originally released by SPCL @ ETH Zurich under the BSD 4-Clause license. The version bundled with PICO includes targeted modifications to support its extended scheduling and tracing workflow.