CUDA Flocking Simulation

Overview

This project implements a real-time boid flocking simulation using CUDA to leverage GPU parallelization. The simulation features thousands of autonomous agents (boids) that exhibit emergent flocking behaviors through simple rules of cohesion, separation, and alignment.

Key Features

High-performance CUDA implementation supporting up to millions of boids
Three simulation methods:
- Naive (brute force) approach
- Scattered grid-based neighbor search
- Coherent grid-based neighbor search with memory optimization
Interactive visualization for real-time observation
Performance testing mode for benchmarking different methods
Block size optimization for finding optimal CUDA execution configurations
Headless operation support for running performance tests without a display

System Requirements

For Visualization Mode

CUDA-capable NVIDIA GPU
OpenGL 3.3+ support
GLFW and GLEW libraries
Display server (X11, Wayland, XQuartz)

For Performance Testing Only

CUDA-capable NVIDIA GPU
No GLFW, OpenGL, or display server required
Can run on headless servers and cloud instances

Quick Start

Building the Project

# Navigate to project directory
cd CUDA-Flocking

# Build the project
make

Running the Visualization

To run the visualization with default settings (5000 boids, naive method):

./build/bin/cis565_boids

To specify simulation method and boid count:

./build/bin/cis565_boids --mode [naive|scattered|coherent] --boids [count]

Example:

./build/bin/cis565_boids --mode coherent --boids 25000

Controls

Left mouse button: Rotate camera
Right mouse button: Zoom in/out
ESC: Exit the application

Performance Testing

Single Method Performance Test

To run performance tests for a specific method and boid count:

./build/bin/cis565_boids --perf-test [naive|scattered|coherent] [count]

Example:

./build/bin/cis565_boids --perf-test coherent 50000

Full Performance Testing

To test all boid sizes with a specific method:

./build/bin/cis565_boids --perf-test [naive|scattered|coherent]

Example:

./build/bin/cis565_boids --perf-test coherent

Block Size Performance Testing

To test how different CUDA block sizes affect performance across all methods:

./build/bin/cis565_boids --perf-test-block-size

This will run a comprehensive test of all three methods (Naive, Scattered Grid, and Coherent Grid) with different block sizes (32, 64, 128, 256, 512, and 1024) to find the optimal execution configuration.

Headless Operation

All performance test modes (--perf-test and --perf-test-block-size) can now run on headless systems without GLFW or display server requirements. The code automatically uses std::chrono for timing in performance test modes instead of GLFW, making it possible to benchmark on cloud servers and compute instances that don't have displays attached.

Implementation Details

Optimization Approaches

1. Parameter Optimization

We carefully tuned the flocking parameters to achieve natural-looking behavior while maintaining performance:

Increased maxSpeed from 1.8 to 2.5 for faster and more dynamic movement
Enhanced alignment factor (rule3Scale) from 0.15 to 0.25 for smoother directional changes
Adjusted cohesion and separation weights for better flocking behavior
Reduced time step (dt) from 0.2 to 0.1 for smoother animation

2. Algorithmic Optimizations

Naive Implementation:

Brute-force approach that compares each boid with every other boid
O(n²) complexity, suitable for small simulations
Used squared distances to avoid expensive sqrt operations

Scattered Grid Implementation:

Divides space into a uniform grid for efficient neighbor searching
Only examines boids in neighboring cells instead of all boids
Uses Thrust library for efficient sorting of boids by grid cell
Maintains original data structure with indirect access using indices

Coherent Grid Implementation:

Similar to scattered grid but reorganizes data for memory coherence
Rearranges boid data to be contiguous in memory based on spatial locality
Enables more efficient memory access patterns on the GPU
Swaps pointers instead of copying data back to avoid expensive transfers
Added explicit synchronization points to ensure data consistency
Velocity-copy fix – a silent black-screen bug turned out to be caused by using std::swap on velocity pointers twice in the coherent path.
The second swap left the next-frame neighbour search reading stale data, so every boid computed a zero update and never reached the renderer.
The cure was to keep the first ping-pong swap but replace the second with an explicit device-to-device copy:
```
cudaMemcpy(d_sortedVelocitiesA,            // dst – coherent buffer
           d_velocitiesA,                  // src – freshly-updated velocities
           numObjects * sizeof(glm::vec3),
           cudaMemcpyDeviceToDevice);
```

3. Memory Optimizations

Used ping-pong buffering for velocity updates to avoid race conditions
Pre-computed squared distances for performance
Implemented memory-efficient grid cell indexing
Optimized memory access patterns for better coalescing

4. Block Size Optimization

Added ability to test different CUDA thread block sizes
Implemented dynamic block size selection for optimal performance
Performance varies significantly based on block size and simulation method
Block sizes between 64-128 typically offer best performance for compute-bound workloads
Grid-based methods are less sensitive to block size due to memory-bound nature

Technical Details

Key CUDA Kernels

kernUpdateVelocityBruteForce: Updates velocities using brute-force approach
kernUpdateVelNeighborSearchScattered: Updates velocities using grid-based neighbor search with scattered memory layout
kernUpdateVelNeighborSearchCoherent: Updates velocities using grid-based search with coherent memory layout
kernUpdatePos: Updates boid positions based on velocities and handles boundary conditions
kernResetAndComputeIndices: Assigns boids to grid cells and resets grid data structures
kernIdentifyCellStartEnd: Identifies the start and end indices for each grid cell
kernRearrangeBoidData: Reorganizes boid data for memory coherence

Simulation Steps

For the coherent grid implementation (most optimized):

Reset grid cell buffers and compute boid grid indices
Sort boids by grid cell index
Identify start/end indices for each grid cell
Rearrange boid data to be coherent in memory
Update velocities using neighbor search
Update positions using newly calculated velocities
Swap buffer pointers for next iteration

Performance Analysis

Method Efficiency

The performance hierarchy from least to most efficient:

Naive: O(n²) complexity, suitable only for small simulations
Scattered Grid: O(n) complexity with some memory access inefficiency
Coherent Grid: O(n) complexity with optimized memory access patterns

Block Size Impact

Based on performance testing with 25,000 boids:

Naive Implementation:

Most sensitive to block size changes
Optimal block size: 128 threads (261 FPS)
Performance drops with both smaller (32) and larger (1024) block sizes
This suggests the workload is compute-bound and benefits from balanced occupancy

Scattered Grid Implementation:

Less sensitive to block size after 64 threads
Optimal block size: 64-128 threads (~8880 FPS)
Performance is relatively stable across block sizes 64-1024
This suggests the workload becomes memory-bound after sufficient threads are available

Coherent Grid Implementation:

Minimal sensitivity to block size
Consistent performance across all tested block sizes (~9500-9600 FPS)
Block size 32 shows slightly better performance in some cases
This suggests memory access patterns are well-optimized and the workload is primarily memory-bandwidth limited

Observations

Linear scaling
Both grid-based approaches scale O(n) in practice. FPS degrades roughly inversely with boid count, while the naive method collapses once pairwise n² comparisons dominate cache and SM occupancy.
Coherent vs. Scattered
Re-ordering boid data for memory coherence yields an extra 25 – 40 % speed-up across the tested range. Benefits grow with scene density because global memory transactions become the dominant cost.
Block Size Optimization Block size has varying impact depending on the simulation method:
- Naive method: Block size matters significantly (up to 16% performance difference)
- Grid methods: Less sensitive once minimum occupancy is achieved
- For general purposes, a block size of 128 offers good performance across all methods
Velocity-copy fix
The black-screen bug (see "Algorithmic Optimizations → Coherent Grid Implementation") masked itself as a performance hit. After copying freshly-updated velocities back into the coherent buffer, FPS jumped by > 30 % and boids rendered correctly.
Million-boid milestone
The coherent implementation sustains ~800 FPS at one million boids, limited chiefly by grid-cell occupancy in shared memory rather than raw arithmetic throughput.

The coherent grid implementation typically provides 10-100x speedup compared to naive for large boid counts.

Note that optimal performance depends on your specific GPU. For reference, on an NVIDIA GeForce RTX 3080:

Coherent grid achieves 60+ FPS with 100,000 boids
Scattered grid achieves 60+ FPS with 50,000 boids
Naive implementation achieves 60+ FPS with only 5,000 boids

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
.vscode		.vscode
cmake		cmake
external		external
images		images
shaders		shaders
src		src
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
GNUmakefile		GNUmakefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUDA Flocking Simulation

Overview

Key Features

System Requirements

For Visualization Mode

For Performance Testing Only

Quick Start

Building the Project

Running the Visualization

Controls

Performance Testing

Single Method Performance Test

Full Performance Testing

Block Size Performance Testing

Headless Operation

Implementation Details

Optimization Approaches

1. Parameter Optimization

2. Algorithmic Optimizations

3. Memory Optimizations

4. Block Size Optimization

Technical Details

Key CUDA Kernels

Simulation Steps

Performance Analysis

Method Efficiency

Block Size Impact

Observations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CUDA Flocking Simulation

Overview

Key Features

System Requirements

For Visualization Mode

For Performance Testing Only

Quick Start

Building the Project

Running the Visualization

Controls

Performance Testing

Single Method Performance Test

Full Performance Testing

Block Size Performance Testing

Headless Operation

Implementation Details

Optimization Approaches

1. Parameter Optimization

2. Algorithmic Optimizations

3. Memory Optimizations

4. Block Size Optimization

Technical Details

Key CUDA Kernels

Simulation Steps

Performance Analysis

Method Efficiency

Block Size Impact

Observations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages