A high-performance Rust framework for GPU computation with unified support for NVIDIA (CUDA) and AMD (ROCm/HIP) platforms.
⚠️ WARNING: This library is in an early experimental stage and NOT production ready. It has many limitations and known issues and is primarily intended for research and experimentation. APIs may change without notice, performance optimizations are incomplete, and error handling is still developing.
GPU Compute provides a tensor-based computational model similar to PyTorch/TensorFlow, but in native Rust with strong type safety and memory management. It takes care of the complexities of GPU programming behind a high-level API, so developers can focus on algorithms rather than hardware details.
- Cross-Platform: Unified API for both NVIDIA (CUDA) and AMD (ROCm/HIP) GPUs
- Tensor Operations: Comprehensive set of operations for numerical computing and ML
- Memory Management: Efficient device, host, and unified memory abstractions
- Kernel Execution: Easy launching of custom GPU kernels with automatic configuration
- Neural Network Primitives: Optimized convolutions, pooling, and normalization operations
- BLAS Integration: Level 1-3 BLAS operations for linear algebra
- JIT Compilation: Runtime code generation for specialized kernels
- Debugging Tools: Profiling, visualization, and memory inspection utilities
- Safe Abstractions: Rust wrappers around unsafe FFI calls with robust error handling
Add this to your `Cargo.toml`:

```toml
[dependencies]
# For NVIDIA GPUs
gpu-compute = { path = "path/to/gpu-compute", features = ["cuda"] }

# For AMD GPUs
gpu-compute = { path = "path/to/gpu-compute", features = ["rocm"] }

# For both
gpu-compute = { path = "path/to/gpu-compute", features = ["cuda", "rocm"] }
```

For NVIDIA GPUs, you will need:

- CUDA Toolkit (version 11.0 or higher recommended)
- Set the `CUDA_PATH` environment variable to your CUDA installation path
For AMD GPUs, you will need:

- ROCm installation (version 4.0 or higher recommended)
- Set the `ROCM_PATH` environment variable to your ROCm installation path
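For example, on a typical Linux setup (adjust the paths to match your installation), the environment can be pointed at the toolkit and the crate built like this:

```bash
# Example paths only; point these at your actual installations
export CUDA_PATH=/usr/local/cuda
export ROCM_PATH=/opt/rocm

# Build with the matching backend feature enabled
cargo build --features cuda   # or --features rocm
```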
Initialize the GPU subsystem once, do your work inside `with_context`, and shut it down when you are done:

```rust
use gpu_compute::context;
use gpu_compute::error::GpuResult;

fn main() -> GpuResult<()> {
    // Initialize GPU subsystem
    context::initialize()?;

    // Use the GPU context
    context::with_context(|ctx| {
        println!("Using device: {}", ctx.get_device(ctx.current_device()).unwrap().name());
        // Do GPU operations...
        Ok(())
    })?;

    // Cleanup
    context::shutdown()?;
    Ok(())
}
```
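If you prefer not to `unwrap()` the device lookup, a more defensive variant might look like the sketch below. This assumes `get_device` returns an `Option` of a device handle, as the `unwrap()` above suggests; adapt it if the real API differs. The snippet goes inside the `with_context` closure:

```rust
// Sketch only: assumes get_device returns Option<_>; adjust if it is a Result instead.
match ctx.get_device(ctx.current_device()) {
    Some(device) => println!("Using device: {}", device.name()),
    None => eprintln!("No usable GPU device found for the current context"),
}
```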
Create tensors on the device, run element-wise operations, and copy the result back to the host:

```rust
use gpu_compute::context;
use gpu_compute::error::GpuResult;
use gpu_compute::tensor::{Tensor, Shape};
use gpu_compute::ops::elementwise;

fn main() -> GpuResult<()> {
    context::initialize()?;

    context::with_context(|ctx| {
        // Create tensors on device
        let shape = Shape::new(vec![2, 3]);
        let a = Tensor::ones(shape.clone(), ctx.current_device())?;
        let b = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape, ctx.current_device())?;

        // Perform operations
        let c = elementwise::add(&a, &b)?;

        // Transfer result to host and print
        let host_result = c.to_host_vec()?;
        println!("Result: {:?}", host_result);
        Ok(())
    })?;

    context::shutdown()?;
    Ok(())
}
```
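Other element-wise operations compose the same way. The sketch below assumes the `elementwise` module also provides `mul` with the same signature as `add` (check the module for what actually exists); it continues inside the `with_context` closure above, reusing `a` and `b`:

```rust
// Assumption: elementwise::mul exists with the same signature as elementwise::add.
let sum = elementwise::add(&a, &b)?;
let product = elementwise::mul(&sum, &b)?;
println!("Product: {:?}", product.to_host_vec()?);
```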
Matrix multiplication goes through the BLAS layer:

```rust
use gpu_compute::context;
use gpu_compute::error::GpuResult;
use gpu_compute::tensor::{Tensor, Shape};
use gpu_compute::ops::blas_level2_3;

fn main() -> GpuResult<()> {
    context::initialize()?;

    context::with_context(|ctx| {
        // Create matrices
        let a = Tensor::from_vec(
            vec![1.0, 2.0, 3.0, 4.0],
            Shape::new(vec![2, 2]),
            ctx.current_device(),
        )?;
        let b = Tensor::from_vec(
            vec![5.0, 6.0, 7.0, 8.0],
            Shape::new(vec![2, 2]),
            ctx.current_device(),
        )?;

        // Perform matrix multiplication
        let c = blas_level2_3::matmul(&a, &b, false, false)?;

        // Transfer result to host and print
        let host_result = c.to_host_vec()?;
        println!("Result: {:?}", host_result);
        Ok(())
    })?;

    context::shutdown()?;
    Ok(())
}
```
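The two boolean arguments to `matmul` presumably select whether `a` and `b` are treated as transposed, following the usual BLAS convention; assuming that reading, computing `a * bᵀ` with the matrices above would look like this:

```rust
// Assumption: the third and fourth arguments transpose `a` and `b` respectively.
let c_t = blas_level2_3::matmul(&a, &b, false, true)?;
println!("a * b^T: {:?}", c_t.to_host_vec()?);
```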
To build, test, and benchmark the crate:

```bash
# Build with CUDA support
cargo build --features cuda

# Build with ROCm support
cargo build --features rocm

# Run tests
cargo test --features cuda    # or --features rocm

# Run benchmarks
cargo bench --features cuda   # or --features rocm
```
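If your own crate forwards matching features to `gpu-compute` in its `[features]` table (for example `cuda = ["gpu-compute/cuda"]`), backend-specific code paths can be selected with ordinary conditional compilation; this is standard Cargo/Rust, not an API of this library:

```rust
// Hypothetical downstream helper: each block compiles only when its feature is enabled.
fn report_backends() {
    #[cfg(feature = "cuda")]
    println!("Built with CUDA support");

    #[cfg(feature = "rocm")]
    println!("Built with ROCm support");
}
```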
The built-in profiler can be used to time GPU work:

```rust
use gpu_compute::context;
use gpu_compute::error::GpuResult;
use gpu_compute::debug::profiler;

fn main() -> GpuResult<()> {
    context::initialize()?;

    context::with_context(|ctx| {
        // Start profiling
        let mut profile_session = profiler::ProfileSession::new("my_operation")?;

        // ... your GPU operations here ...

        // End profiling and print results
        profile_session.end()?;
        profile_session.print_summary();
        Ok(())
    })?;

    context::shutdown()?;
    Ok(())
}
```
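For instance, to time the element-wise addition from the earlier example, the session can simply wrap those calls (same APIs as above, combined here for illustration; this goes inside `with_context` with the tensor imports from the earlier examples):

```rust
// Profile a concrete operation end to end.
let mut session = profiler::ProfileSession::new("elementwise_add")?;

let shape = Shape::new(vec![2, 3]);
let a = Tensor::ones(shape.clone(), ctx.current_device())?;
let b = Tensor::ones(shape, ctx.current_device())?;
let c = elementwise::add(&a, &b)?;
println!("Result: {:?}", c.to_host_vec()?);

session.end()?;
session.print_summary();
```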
This library has several significant limitations and known issues:

- No Automatic Differentiation: Unlike PyTorch or TensorFlow, there is no autodiff support for deep learning
- Limited Tensor Operations: Many common operations are still missing or incomplete
- Performance Issues: Several operations have suboptimal implementations
- Memory Management: Inefficient memory slicing and no caching allocators
- No Multi-GPU Support: No infrastructure for distributed computation
- Limited Documentation: API documentation is minimal
- No Python Bindings: No easy integration with the Python ML ecosystem
- Testing Coverage: Incomplete test coverage for many components
Contributions are welcome! Please feel free to submit a Pull Request.
Areas where help is especially needed:
- Adding missing tensor operations
- Improving performance of existing operations
- Expanding test coverage
- Enhancing documentation
- Adding Python bindings
This project is licensed under the MIT License - see the LICENSE file for details.