A research-grade polyhedral compiler for optimizing loop nests
Features • Installation • Quick Start • Examples • Commands • Theory
PolyOpt is a polyhedral compiler that automatically optimizes loop nests for:
- Parallelism: Identify and exploit parallel loops
- Data Locality: Improve cache utilization through tiling
- Vectorization: Enable SIMD optimizations
It takes programs written in a simple .poly language and generates optimized C code with OpenMP annotations.
┌─────────────┐ ┌──────────────┐ ┌────────────────┐ ┌─────────────┐
│ .poly file │ ──► │ Parse & IR │ ──► │ Dependence │ ──► │ Transform │
│ │ │ Lowering │ │ Analysis │ │ & Schedule │
└─────────────┘ └──────────────┘ └────────────────┘ └──────┬──────┘
│
▼
┌─────────────┐
│ Generate │
│ C/OpenMP │
└─────────────┘
- Frontend: Custom parser for a simple loop-focused language
- Analysis:
- Dependence analysis (GCD test, Banerjee test; see the sketch after this list)
- Parallelism detection
- Loop-carried vs loop-independent dependencies
- Transformations:
- Loop tiling (blocking) for cache optimization
- Loop interchange for better memory access
- Loop fusion and distribution
- Automatic scheduling (Pluto, Feautrier, Greedy algorithms)
- Code Generation:
- Clean C output
- OpenMP parallelization pragmas
- Vectorization hints
- Benchmark code generation
- Advanced Features (optional):
- ISL Integration: Exact polyhedral math via Integer Set Library
- Auto-Tuning: Automatic parameter optimization
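To make the dependence-analysis bullet above concrete, here is a minimal sketch of the textbook GCD test (an illustration, not PolyOpt's actual implementation). For subscripts `a*i + b` (write) and `c*j + d` (read), the accesses can only touch the same element if `gcd(a, c)` divides `d - b`:

```c
#include <stdio.h>
#include <stdlib.h>

// Greatest common divisor (Euclid's algorithm).
static int gcd(int a, int b) {
    while (b != 0) { int t = b; b = a % b; a = t; }
    return abs(a);
}

// GCD test: can A[a*i + b] (write) and A[c*j + d] (read) ever touch
// the same element? a*i - c*j = d - b has an integer solution iff
// gcd(a, c) divides d - b. The test is conservative: "maybe" only
// means a dependence cannot be ruled out.
static int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(a, c);
    if (g == 0) return b == d;       // both subscripts are constants
    return (d - b) % g == 0;
}

int main(void) {
    // A[2*i] vs A[2*i + 1]: gcd(2, 2) = 2 does not divide 1,
    // so the accesses are provably independent.
    printf("%d\n", gcd_test_may_depend(2, 0, 2, 1)); // 0 (independent)
    printf("%d\n", gcd_test_may_depend(1, 0, 1, 1)); // 1 (may depend)
    return 0;
}
```

The test can prove independence but never confirm a dependence; a "maybe" answer is refined by stronger tests such as Banerjee's.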
PolyOpt can use the Integer Set Library (ISL) for exact polyhedral computations. ISL is the same library used by LLVM's Polly and GCC's Graphite.
Installation:
# macOS
brew install isl
# Ubuntu/Debian
sudo apt install libisl-dev isl-utils
# Fedora
sudo dnf install isl-devel
Build with ISL support:
cargo build --release --features isl
Usage:
# Check ISL installation
polyopt isl input.poly --check
# Analyze domains in ISL format
polyopt isl input.poly --domain
# Compute optimal schedule
polyopt isl input.poly --schedule
# Show dependences in ISL format
polyopt isl input.poly --deps
Example output:
=== ISL Analysis: matmul ===
--- Domains (ISL format) ---
S0:
{ [i0,i1,i2] : 0 <= i0 < N and 0 <= i1 < M and 0 <= i2 < K }
ISL canonical: [N, M, K] -> { [i0, i1, i2] : 0 <= i0 < N and 0 <= i1 < M and 0 <= i2 < K }
PolyOpt includes an auto-tuning framework that finds optimal transformation parameters by empirical search.
Build with auto-tuning:
cargo build --release --features autotuning
Usage:
# Basic auto-tuning (60 second budget)
polyopt autotune examples/matmul.poly -N 1000
# Quick mode (30 seconds)
polyopt autotune examples/matmul.poly --quick
# Custom search parameters
polyopt autotune examples/matmul.poly \
--time 120 \
--tiles "16,32,64,128" \
--threads "1,2,4,8" \
--search genetic
# Save results to CSV
polyopt autotune examples/matmul.poly -o results.csv
Search strategies:
- `grid`: Exhaustive search (default; best for small spaces)
- `random`: Random sampling (good for large spaces)
- `genetic`: Genetic algorithm (good for complex optimization)
Example output:
=== Auto-Tuning: matmul.poly ===
Configuration:
Max time: 60s
Tile sizes: [16, 32, 64, 128]
Thread counts: [1, 2, 4, 8]
Search: Grid
Baseline: 0.4521s
[ 1] t16_p1 -> 0.3892s (1.16x)
[ 2] t32_p1 -> 0.3456s (1.31x)
[ 3] t32_p4 -> 0.1234s (3.66x)
*** New best: t32_p4 (3.66x speedup)
[ 4] t64_p4 -> 0.0987s (4.58x)
*** New best: t64_p4 (4.58x speedup)
...
=== Best Configuration Found ===
Tile sizes: [64]
Threads: 4
Speedup: 4.58x
To generate optimized code:
polyopt compile matmul.poly --openmp --tile 64
- Rust 1.70+ (install from rustup.rs)
- GCC or Clang (for compiling generated code)
# Clone the repository
git clone https://github.com/yourusername/polyopt.git
cd polyopt
# Build in release mode
cargo build --release
# Run tests
cargo test
# Install (optional)
cargo install --path .
# Check version
cargo run --bin polyopt -- --version
# See help
cargo run --bin polyopt -- --help
Create a file hello.poly:
// Vector addition: C[i] = A[i] + B[i]
func vector_add(A[N], B[N], C[N]) {
for i = 0 to N {
C[i] = A[i] + B[i];
}
}
cargo run --bin polyopt -- compile hello.poly
Output:
// Generated by PolyOpt - Polyhedral Optimizer
#include <stdio.h>
#include <stdlib.h>
...
void vector_add(int N, double* A, double* B, double* C) {
// Statement S0: S0
for (int i = 0; i < N; i++) {
C[i] = (A[i] + B[i]);
}
}
# With OpenMP parallelization
cargo run --bin polyopt -- compile hello.poly --openmp
# Analyze dependences and parallelism
cargo run --bin polyopt -- analyze hello.poly --deps --parallel
# Visualize the iteration space
cargo run --bin polyvis -- hello.poly --params N=8
// Comments use // or /* */
func function_name(ArrayA[N][M], ArrayB[K], scalar_param) {
// For loops
for i = start to end {
// Statements
ArrayA[i][j] = expression;
}
// For loops with step
for i = 0 to N step 2 {
...
}
}
// 1D array with symbolic size N
A[N]
// 2D array with sizes N x M
B[N][M]
// 3D array
C[N][M][K]
// Arithmetic
A[i] + B[j]
A[i] * 2.5
A[i] / B[j]
A[i] - B[j]
// Array access with affine indices
A[i]
B[i][j]
C[i+1][j-1]
C[2*i][j]
// Constants
0
1.5
3.14159
// Simple bounds
for i = 0 to N { ... }
// Offset bounds
for i = 1 to N-1 { ... }
// Triangular bounds (planned)
for j = 0 to i { ... }
Compile a .poly file to C code.
# Basic compilation
polyopt compile input.poly
# Output to file
polyopt compile input.poly -o output.c
# With OpenMP
polyopt compile input.poly --openmp
# With vectorization hints
polyopt compile input.poly --openmp --vectorize
# Generate benchmark wrapper
polyopt compile input.poly --benchmark
Options:
| Option | Description |
|---|---|
| `-o, --output` | Output file (default: stdout) |
| `-t, --target` | Target: `c`, `openmp`, `cuda`, `opencl` |
| `--openmp` | Enable OpenMP parallelization |
| `--vectorize` | Add SIMD pragmas |
| `--benchmark` | Generate timing code |
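For reference, the SIMD pragmas behind `--vectorize` typically look like the following in OpenMP-annotated C (a hand-written illustration; PolyOpt's actual output may differ):

```c
// Element-wise addition with both thread-level and SIMD parallelism.
// "simd" asks the compiler to vectorize the loop body; combining it
// with "parallel for" also distributes iterations across threads.
void vector_add_simd(int N, const double *A, const double *B, double *C) {
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}
```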
Analyze dependencies and parallelism opportunities.
# Basic analysis
polyopt analyze input.poly
# Show dependencies
polyopt analyze input.poly --deps
# Show parallelism
polyopt analyze input.poly --parallel
# Full analysis
polyopt analyze input.poly --deps --parallel --stats
# JSON output
polyopt analyze input.poly --json
Example Output:
=== Analysis: matmul ===
Statements: 2
Parameters: ["N", "K", "M"]
Arrays: ["A", "C", "B"]
--- Dependences ---
Total: 6
Flow (RAW): 2
Anti (WAR): 2
Output (WAW): 2
Loop-carried: 6
Loop-independent: 0
--- Parallelism ---
Level 0: ✗ sequential
Level 1: ✗ sequential
Level 2: ✗ sequential
No directly parallelizable loops found
Consider applying transformations (tiling, skewing)
Apply optimizations and generate code.
# Auto-optimize with Pluto scheduler
polyopt optimize input.poly --schedule pluto
# Apply tiling
polyopt optimize input.poly --tile 32
# Combined optimizations
polyopt optimize input.poly --schedule pluto --tile 32 --openmp
# Output to file
polyopt optimize input.poly --tile 32 -o optimized.c
Options:
| Option | Description |
|---|---|
| `--tile SIZE` | Apply tiling with given block size |
| `--schedule ALG` | Scheduling: `pluto`, `feautrier`, `greedy` |
| `--openmp` | Enable OpenMP in output |
| `-o, --output` | Output file |
View the parsed representation of a program.
# Show AST
polyopt parse input.poly
# Show HIR (High-level IR)
polyopt parse input.poly --hir
# Show PIR (Polyhedral IR)
polyopt parse input.poly --pir
Compile and run benchmarks.
# Run with default size (N=1000)
polyopt bench input.poly
# Specify size
polyopt bench input.poly -N 2000
# Multiple iterations
polyopt bench input.poly -N 1000 --iterations 10
# With OpenMP
polyopt bench input.poly -N 1000 --openmp
Text-based visualization of iteration spaces and dependencies.
# Basic visualization
polyvis input.poly
# With parameter values
polyvis input.poly --params N=10,M=10
# Limit displayed iterations
polyvis input.poly --params N=20 --max-iters 100
# Verbose mode (show dependency details)
polyvis input.poly --params N=8 --verbose
Example Output:
╔══════════════════════════════════════════════════════════════════╗
║ PolyOpt Iteration Space Visualizer ║
╚══════════════════════════════════════════════════════════════════╝
Program: matmul
Parameters: N=8, K=8, M=8
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Statement S0 (2D domain):
Iteration space (8 x 8 = 64 points):
j→ 0 1 2 3 4 5 6 7
i↓
0 ● ● ● ● ● ● ● ●
1 ● ● ● ● ● ● ● ●
2 ● ● ● ● ● ● ● ●
...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Dependence Summary:
Flow (RAW): 2
Anti (WAR): 2
Output (WAW): 2
Parallelism: No parallel loops detected at any level
# File: examples/basic/vector_add.poly
polyopt analyze examples/basic/vector_add.poly --parallel
# Output: Level 0: ✓ PARALLEL
# File: examples/basic/array_copy.poly
polyopt compile examples/basic/array_copy.poly --openmp
# Analyze dependencies
polyopt analyze examples/linalg/matmul.poly --deps
# Generate tiled code
polyopt optimize examples/linalg/matmul.poly --tile 32 --openmp -o matmul_tiled.c
# Benchmark
polyopt bench examples/linalg/matmul.poly -N 512 --openmp
# File: examples/linalg/matvec.poly
polyopt compile examples/linalg/matvec.poly --openmp
# File: examples/linalg/transpose.poly
# Both loops parallelizable
polyopt analyze examples/linalg/transpose.poly --parallel
# File: examples/stencils/jacobi_2d.poly
# Reads from A, writes to B - no loop-carried dependencies
polyopt analyze examples/stencils/jacobi_2d.poly --parallel
polyopt compile examples/stencils/jacobi_2d.poly --openmp
# File: examples/stencils/gauss_seidel_2d.poly
# In-place update creates dependencies
polyopt analyze examples/stencils/gauss_seidel_2d.poly --deps
# Shows loop-carried dependencies - needs wavefront parallelization
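A hand-written wavefront version of such a sweep illustrates the idea (a C sketch of the skewing transformation, not PolyOpt output): all points on an anti-diagonal `i + j = d` are mutually independent, so each diagonal can run in parallel while the diagonals themselves execute in order.

```c
// Wavefront traversal of an in-place 5-point update: every point with
// the same d = i + j reads only from earlier diagonals, so the inner
// loop over one diagonal is safe to parallelize.
void gauss_seidel_wavefront(int N, double A[N][N]) {
    for (int d = 2; d <= 2 * (N - 2); d++) {     // anti-diagonals, in order
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++) {
            int j = d - i;
            if (j < 1 || j >= N - 1) continue;   // stay inside the interior
            A[i][j] = 0.25 * (A[i-1][j] + A[i+1][j]
                            + A[i][j-1] + A[i][j+1]);
        }
    }
}
```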
# File: examples/stencils/stencil_2d_5pt.poly
polyvis examples/stencils/stencil_2d_5pt.poly --params N=10
# File: examples/reductions/vector_sum.poly
polyopt compile examples/reductions/vector_sum.poly --openmp
# Generates: #pragma omp parallel for reduction(+:result)
# File: examples/reductions/dot_product.poly
polyopt compile examples/reductions/dot_product.poly --openmp
# File: examples/transformations/interchange_example.poly
# Original has poor cache behavior (column-major access)
polyopt analyze examples/transformations/interchange_example.poly --deps
polyopt optimize examples/transformations/interchange_example.poly --schedule pluto
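The effect of interchange is easiest to see in plain C (a hand-written before/after for illustration): swapping the loops turns the cache-hostile column walk into a row walk.

```c
// Before interchange: the inner loop strides down a column, touching
// a new cache line on every iteration (C arrays are row-major).
void scale_colmajor(int N, double A[N][N], double B[N][N]) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            B[i][j] = 2.0 * A[i][j];
}

// After interchange: the inner loop walks a row, so consecutive
// iterations touch consecutive memory locations.
void scale_rowmajor(int N, double A[N][N], double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B[i][j] = 2.0 * A[i][j];
}
```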
# File: examples/transformations/tiling_example.poly
# Matrix multiply benefits from tiling
polyopt optimize examples/transformations/tiling_example.poly --tile 32 -o tiled.c
# File: examples/transformations/fusion_example.poly
# Multiple loops over same iteration space
polyopt analyze examples/transformations/fusion_example.poly --deps
Create mykernel.poly:
func saxpy(X[N], Y[N]) {
for i = 0 to N {
Y[i] = 2.0 * X[i] + Y[i];
}
}
$ polyopt analyze mykernel.poly --deps --parallel
=== Analysis: saxpy ===
Statements: 1
Parameters: ["N"]
Arrays: ["X", "Y"]
--- Dependences ---
Total: 0
Flow (RAW): 0
Anti (WAR): 0
Output (WAW): 0
--- Parallelism ---
Level 0: ✓ PARALLEL
Recommendation: Parallelize at level 0
$ polyvis mykernel.poly --params N=10
Statement S0 (1D domain):
Iteration space (10 points):
i→ 0 1 2 3 4 5 6 7 8 9
● ● ● ● ● ● ● ● ● ●
Parallelism: Parallel at level 0
$ polyopt compile mykernel.poly --openmp
void saxpy(int N, double* restrict X, double* restrict Y) {
// Statement S0: S0
#pragma omp parallel for
for (int i = 0; i < N; i++) {
Y[i] = ((2.0 * X[i]) + Y[i]);
}
}
$ polyopt bench mykernel.poly -N 10000000 --iterations 5 --openmp
=== Benchmark: saxpy ===
Size: N = 10000000
Iterations: 5
OpenMP: enabled
Running 5 iterations...
Run 1: Time: 0.0234 seconds
Run 2: Time: 0.0231 seconds
Run 3: Time: 0.0229 seconds
Run 4: Time: 0.0232 seconds
Run 5: Time: 0.0230 seconds
The polyhedral model represents loop nests as geometric objects:
- Iteration Space: The set of all loop iterations, represented as integer points in a polyhedron
- Schedule: A mapping that determines execution order
- Access Functions: Affine functions describing array accesses
| Traditional Compiler | Polyhedral Compiler |
|---|---|
| Pattern matching | Mathematical reasoning |
| Limited scope | Global optimization |
| Heuristic | Exact dependence analysis |
Original code:
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
for (k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j];
Polyhedral representation:
- Domain: `{ S[i,j,k] : 0 ≤ i,j,k < N }`
- Schedule: `{ S[i,j,k] → [i,j,k] }` (original order)
- Read accesses: `{ S[i,j,k] → A[i,k] }`, `{ S[i,j,k] → B[k,j] }`, `{ S[i,j,k] → C[i,j] }`
- Write access: `{ S[i,j,k] → C[i,j] }`
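From these relations, dependences follow mechanically: the write to `C[i][j]` in iteration `(i,j,k)` conflicts with the read of `C[i][j]` in every later iteration sharing `i` and `j`, a loop-carried flow dependence along `k`, written in the same notation:

```
{ S[i,j,k] -> S[i,j,k'] : 0 <= i,j,k,k' < N and k' > k }
```

Exact dependence analysis tightens this to the last writer, `k' = k + 1`.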
| Type | Also Called | Pattern | Example |
|---|---|---|---|
| Flow | RAW (Read After Write) | Write → Read | A[i] = ...; ... = A[i]; |
| Anti | WAR (Write After Read) | Read → Write | ... = A[i]; A[i] = ...; |
| Output | WAW (Write After Write) | Write → Write | A[i] = ...; A[i] = ...; |
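A dependence is loop-carried when its source and sink lie in different iterations, and loop-independent when both lie in the same iteration; only the former serializes the loop. A minimal C illustration:

```c
void carried_vs_independent(int N, double *A, double *B, double *C) {
    // Loop-carried flow dependence: iteration i reads what iteration
    // i-1 wrote, so iterations must execute in order.
    for (int i = 1; i < N; i++)
        A[i] = A[i - 1] + 1.0;

    // Loop-independent flow dependence: the write to B[i] and the read
    // of B[i] happen in the same iteration, so this loop is parallel.
    for (int i = 0; i < N; i++) {
        B[i] = 2.0 * A[i];
        C[i] = B[i] + 1.0;
    }
}
```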
| Transformation | Effect | When to Use |
|---|---|---|
| Tiling | Block loops for cache (see the sketch below) | Large working sets |
| Interchange | Swap loop order | Improve memory access pattern |
| Fusion | Merge loops | Improve temporal locality |
| Distribution | Split loops | Enable parallelization |
| Skewing | Shear the iteration space | Enable wavefront parallelism |
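To picture the tiling row, here is a hand-written tiled matrix multiply with 32x32 blocks, the shape of code `--tile 32` aims for (illustration only, not actual PolyOpt output):

```c
#define T 32  // tile (block) size

// Tiled matrix multiply: each (ii, jj, kk) step works on T x T
// subblocks that fit in cache, so elements of A and B are reused
// many times before being evicted.
void matmul_tiled(int N, double A[N][N], double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < N; kk += T)
                for (int i = ii; i < ii + T && i < N; i++)
                    for (int j = jj; j < jj + T && j < N; j++)
                        for (int k = kk; k < kk + T && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```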
examples/
├── basic/
│ ├── vector_add.poly # Simple parallel loop
│ ├── vector_scale.poly # Scalar multiplication
│ ├── array_copy.poly # 2D parallel copy
│ └── array_init.poly # Index-dependent init
│
├── linalg/
│ ├── matmul.poly # Matrix multiplication
│ ├── matvec.poly # Matrix-vector multiply
│ ├── transpose.poly # Matrix transpose
│ ├── trisolve.poly # Triangular solve
│ ├── symm.poly # Symmetric matrix multiply
│ ├── atax.poly # A^T * A * x kernel
│ └── bicg.poly # Biconjugate gradient
│
├── stencils/
│ ├── stencil_1d.poly # 1D 3-point stencil
│ ├── stencil_2d_5pt.poly # 2D 5-point stencil
│ ├── stencil_2d_9pt.poly # 2D 9-point stencil
│ ├── jacobi_2d.poly # Jacobi iteration (parallel)
│ ├── gauss_seidel_2d.poly # Gauss-Seidel (sequential)
│ └── conv2d.poly # 2D convolution
│
├── reductions/
│ ├── vector_sum.poly # Sum reduction
│ ├── dot_product.poly # Dot product
│ ├── frobenius_norm.poly # Matrix norm
│ └── prefix_sum.poly # Prefix sum (sequential)
│
├── parallel/
│ ├── embarrassingly_parallel.poly # Perfect parallelism
│ └── rowsum.poly # Parallel outer, reduction inner
│
└── transformations/
├── interchange_example.poly # Loop interchange demo
├── tiling_example.poly # Tiling demo
├── fusion_example.poly # Loop fusion demo
└── distribution_example.poly # Loop distribution demo
- Use affine bounds: `0 to N`, `1 to N-1`, not `0 to f(x)`
- Use affine indices: `A[i][j]`, `A[i+1][j-1]`, not `A[B[i]]`
- Separate read/write arrays when possible (enables more parallelism)
- Use compound assignment for reductions: `sum = sum + x`
- Analyze first: Understand dependencies before optimizing
- Check parallelism: Use `--parallel` to find parallel loops
- Consider tiling: Large iteration spaces benefit from tiling
- Benchmark: Always verify optimizations improve performance
| Pattern | Parallelizable? | Optimization |
|---|---|---|
| Element-wise ops | ✓ Yes | OpenMP parallel for |
| Stencil (double-buffer) | ✓ Yes | Tiling + OpenMP |
| Stencil (in-place) | ✗ No | Wavefront/skewing |
| Reduction | Partial | OpenMP reduction clause |
| Sequential scan | ✗ No | Algorithm change needed |
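On the reduction row: a plain `#pragma omp parallel for` over a sum races on the accumulator, which is why reductions are only partially parallelizable. The OpenMP reduction clause, which PolyOpt emits for recognized reduction patterns (see the vector_sum example above), gives each thread a private accumulator and combines them at the end. A minimal hand-written version:

```c
// Parallel sum reduction: each thread accumulates into a private copy
// of "sum"; OpenMP combines the copies with "+" when the loop ends.
double vector_sum(int N, const double *A) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += A[i];
    return sum;
}
```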
The variable wasn't declared as a loop iterator or parameter.
// Wrong - T not defined
for i = 0 to T { ... }
// Right - T is a parameter
func foo(A[T]) {
for i = 0 to T { ... }
}
Check for loop-carried dependencies:
polyopt analyze input.poly --deps
Common causes:
- In-place array updates: `A[i] = f(A[i-1])`
- Reduction without proper pattern
- Complex index expressions
Ensure indices match array dimensions:
// Wrong - A is 1D but accessed as 2D
func foo(A[N]) {
for i = 0 to N {
A[i][j] = 0; // Error!
}
}
PolyOpt includes a polyhedral math module with both a pure Rust simulation and optional native ISL support.
# Parse and analyze an ISL set expression directly
polyopt isl "{ [i, j] : 0 <= i < 10 and 0 <= j < 10 }" --expr --enumerate
# Analyze a .poly file's domains
polyopt isl examples/basic/array_copy.poly --domain
# Generate tiled loop code
polyopt isl "{ [i, j] : 0 <= i < N and 0 <= j < N }" --expr --tile 32
# Show scheduling info
polyopt isl examples/linalg/matmul.poly --schedule
Example output:
=== ISL Polyhedral Analysis ===
Using: Pure Rust polyhedral simulation
(Install ISL for exact operations: brew install isl)
Set: { [i, j] : i >= 0 and -i + 9 >= 0 and j >= 0 and -j + 9 >= 0 }
Variables: ["i", "j"]
Constraints: 4
--- Point Enumeration ---
0: ([0, 0])
1: ([0, 1])
...
Total points: 100
For exact polyhedral operations, install ISL:
# macOS
brew install isl
# Ubuntu/Debian
sudo apt install libisl-dev isl-utils
# Fedora
sudo dnf install isl-devel
The auto-tuner automatically searches for optimal transformation parameters by compiling and benchmarking multiple variants.
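Under the hood, each candidate configuration is compiled, run, and timed. Conceptually the measurement looks like this minimal C harness (a sketch of the idea; the harness PolyOpt actually generates may differ):

```c
#include <time.h>

// Time one run of a kernel variant. The auto-tuner compiles each
// candidate (tile size, thread count, ...), measures it like this a
// few times, and keeps the fastest configuration.
double time_variant(void (*kernel)(void)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```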
# Basic auto-tuning with OpenMP
polyopt autotune input.poly --openmp
# Specify tile sizes to try
polyopt autotune input.poly --tiles "8,16,32,64,128" --openmp
# Use different search strategies
polyopt autotune input.poly --strategy exhaustive # Try all combinations
polyopt autotune input.poly --strategy random # Random sampling
polyopt autotune input.poly --strategy genetic # Genetic algorithm
polyopt autotune input.poly --strategy annealing # Simulated annealing
# Customize benchmark parameters
polyopt autotune input.poly -N 1000 --iterations 5 --openmp
# Save results to file
polyopt autotune input.poly -o results.csv --openmp
| Strategy | Description | Use When |
|---|---|---|
| `exhaustive` | Tests all combinations | Small search space, need optimal |
| `random` | Random sampling (20 samples) | Quick exploration |
| `genetic` | Evolutionary optimization | Large search space |
| `annealing` | Simulated annealing | Avoid local minima |
Contributions welcome! Areas of interest:
- Additional scheduling algorithms
- GPU code generation (CUDA/OpenCL)
- More transformation passes
- Performance benchmarking
MIT License - see LICENSE file
PolyOpt is inspired by:
- ISL - Integer Set Library
- Pluto - Automatic parallelizer
- Polly - LLVM polyhedral optimizer
- PolyBench - Polyhedral benchmark suite
Made with ❤️ for the HPC community