This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
strided-rs is a Rust library providing cache-optimized kernels for strided multidimensional array operations. It is a port of Julia's Strided.jl/StridedViews.jl libraries, currently built on top of the mdarray crate.
Current Status (v0.1):
- Broadcasting with `CaptureArgs` for lazy evaluation (stride-0 for size-1 dims)
- Zero-copy transformations: slice, reshape, permute, transpose
- Lazy element operations with type-level composition (`Identity`, `Conj`, `Transpose`, `Adjoint`)
- Cache-optimized map/reduce/broadcast kernels
- Overlapping src/dest memory is not supported
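
The stride-0 rule for size-1 dimensions can be illustrated with a small standalone sketch. `broadcast_strides` is a hypothetical helper written for this note, not a function of the crate; it only assumes that a shape and its strides are given as slices of equal length.

```rust
/// Hypothetical helper (not the crate's API): compute the strides a source
/// array needs so it can be read as if it had shape `target`.
/// A size-1 dimension gets stride 0, so every index along it reads the
/// same element. Returns `None` when the shapes are not compatible.
fn broadcast_strides(
    shape: &[usize],
    strides: &[isize],
    target: &[usize],
) -> Option<Vec<isize>> {
    if shape.len() != strides.len() || shape.len() != target.len() {
        return None;
    }
    let mut out = Vec::with_capacity(shape.len());
    for ((&dim, &stride), &want) in shape.iter().zip(strides).zip(target) {
        if dim == want {
            // Same extent: keep the original stride.
            out.push(stride);
        } else if dim == 1 {
            // Size-1 dimension stretched to `want`: stride 0 repeats the element.
            out.push(0);
        } else {
            return None;
        }
    }
    Some(out)
}
```

For example, a `3 x 1` column with strides `[1, 3]` broadcast to `3 x 4` would get strides `[1, 0]`, so no data is copied.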
Before pushing or creating a pull request, all of the following must pass:
- `cargo fmt --check` (formatting)
- `cargo test` (all tests)

If `cargo fmt --check` fails, run `cargo fmt` to fix formatting automatically.
# Build
cargo build
# Run all tests
cargo test
# Run a single test
cargo test test_map_into_transposed
# Check formatting
cargo fmt --check
# Run benchmarks
cargo bench
# Run a specific benchmark
cargo bench -- copy_permuted

When adding or modifying benchmarks (especially "naive" baselines), optimize the baseline as well:
- Avoid per-element high-level indexing (`a[[i, j]]`) inside hot loops when the data is contiguous; prefer pointer-based loops or precomputed strides so the "naive" number reflects math + memory traffic, not indexing overhead.
- Keep setup/allocation out of the timed region and use `black_box` to prevent dead-code elimination.
- For parity with Julia scripts, run single-threaded (`RAYON_NUM_THREADS=1` / `JULIA_NUM_THREADS=1`) unless explicitly testing threading.
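
A baseline that follows these rules might look like the sketch below. It is an illustration only: the function names are hypothetical, the data is assumed to be contiguous column-major `f64`, and `std::time::Instant` stands in for whatever benchmark harness the repository actually uses.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Baseline transposed copy over column-major data using precomputed
/// strides on plain slices, so no multi-index lookup sits in the hot loop.
fn naive_transpose_copy(src: &[f64], dst: &mut [f64], rows: usize, cols: usize) {
    assert_eq!(src.len(), rows * cols);
    assert_eq!(dst.len(), rows * cols);
    for j in 0..cols {
        let src_col = &src[j * rows..(j + 1) * rows];
        for (i, &x) in src_col.iter().enumerate() {
            // dst is cols x rows, column-major: element (j, i) lives at j + i * cols.
            dst[j + i * cols] = x;
        }
    }
}

/// Hypothetical timing helper: allocation happens before the clock starts,
/// and `black_box` keeps the optimizer from discarding the work.
fn time_naive_transpose(rows: usize, cols: usize, iters: u32) -> Duration {
    let src: Vec<f64> = (0..rows * cols).map(|k| k as f64).collect();
    let mut dst = vec![0.0; rows * cols];
    let start = Instant::now();
    for _ in 0..iters {
        naive_transpose_copy(black_box(src.as_slice()), black_box(dst.as_mut_slice()), rows, cols);
    }
    black_box(&dst);
    start.elapsed()
}
```

Only the loop between `Instant::now()` and `elapsed()` is timed; the two `Vec` allocations stay outside it.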
By default rustc targets a generic x86-64 baseline (SSE2 only). To enable AVX2/AVX-512 auto-vectorization for the host CPU:
RUSTFLAGS="-C target-cpu=native" cargo bench

This can yield significant improvements for contiguous inner loops that LLVM auto-vectorizes.
`StridedArrayView<'a, T, N, Op>` / `StridedArrayViewMut<'a, T, N, Op>`: Const-generic strided views where:
- `T`: Element type
- `N`: Number of dimensions (const generic)
- `Op`: Element operation (`Identity`, `Conj`, `Transpose`, `Adjoint`), applied lazily on access
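
To make "applied lazily on access" concrete, here is a minimal sketch of the pattern. Everything in it is an assumption made for illustration: `C64`, `View`, and this two-variant `ElementOp` are simplified stand-ins, not the crate's actual definitions (which use `num-complex` and four operations).

```rust
use std::marker::PhantomData;

/// Minimal stand-in for a complex number (the crate itself uses num-complex).
#[derive(Clone, Copy, Debug, PartialEq)]
struct C64 {
    re: f64,
    im: f64,
}

/// Hypothetical element-operation trait: a zero-sized marker type decides
/// what happens to each element at access time.
trait ElementOp {
    fn apply(x: C64) -> C64;
}

struct Identity;
struct Conj;

impl ElementOp for Identity {
    fn apply(x: C64) -> C64 {
        x
    }
}

impl ElementOp for Conj {
    fn apply(x: C64) -> C64 {
        C64 { re: x.re, im: -x.im }
    }
}

/// Simplified view: `N` is the number of dimensions, `Op` the lazy operation.
struct View<'a, const N: usize, Op: ElementOp> {
    data: &'a [C64],
    strides: [usize; N],
    _op: PhantomData<Op>,
}

impl<'a, const N: usize, Op: ElementOp> View<'a, N, Op> {
    fn new(data: &'a [C64], strides: [usize; N]) -> Self {
        View { data, strides, _op: PhantomData }
    }

    /// Compute the linear offset, then apply `Op` to the element read.
    fn get(&self, idx: [usize; N]) -> C64 {
        let offset: usize = idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum();
        Op::apply(self.data[offset])
    }
}

impl<'a, const N: usize> View<'a, N, Identity> {
    /// Conjugation only changes the type parameter; no data is touched.
    fn conj(self) -> View<'a, N, Conj> {
        View::new(self.data, self.strides)
    }
}
```

Because `Op` is a type parameter, the compiler can resolve `Op::apply` statically, which is the point of avoiding runtime dispatch.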
| Module | Purpose | Julia Equivalent |
|---|---|---|
| `view.rs` | `StridedArrayView`/`StridedArrayViewMut` types with slicing, permutation, reshape, broadcast | `stridedview.jl` |
| `element_op.rs` | Element operations (`Identity`, `Conj`, `Transpose`, `Adjoint`) with type-level composition | `FN`, `FC`, `FT`, `FA` |
| `kernel.rs` | `StridedView`/`StridedViewMut` internal wrappers, `_mapreduce_kernel!` implementation | `mapreduce.jl` |
| `map.rs` | `map_into`, `zip_map2_into`, `zip_map3_into`, `zip_map4_into` | `Base.map!` |
| `reduce.rs` | `reduce`, `reduce_axis`, `mapreducedim_into` | `Base.mapreduce`, `Base.mapreducedim!` |
| `broadcast.rs` | `CaptureArgs`, `promoteshape`, `broadcast_into`, `Arg`, `Scalar` | `broadcast.jl` |
| `ops.rs` | High-level operations: `copy_into`, `add`, `mul`, `axpy`, `fma`, `sum`, `dot`, `symmetrize_into` | Various |
| `order.rs` | Dimension ordering algorithm: sorts dimensions by stride magnitude | `indexorder` |
| `block.rs` | Block size computation to fit within L1 cache | `_computeblocks` |
| `fuse.rs` | Dimension fusion for contiguous dimensions | `_mapreduce_fuse!` |
| `auxiliary.rs` | Helper functions: `index_order`, `normalize_strides`, `simplify_dims` | `auxiliary.jl` |
The library uses a blocking strategy faithful to Strided.jl:
- Dimension Fusion: Contiguous dimensions are fused to reduce loop overhead (`_mapreduce_fuse!`)
- Dimension Reordering: Dimensions are sorted by stride importance for optimal cache access (`_mapreduce_order!`)
- Tiled Iteration: Operations are blocked into tiles fitting L1 cache (`_computeblocks`)
- Contiguous Fast Paths: Contiguous arrays bypass blocking for direct iteration
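
The first two steps can be sketched for a single array as follows. This is a simplified illustration with hypothetical function names: the Julia routines consider the strides of all arrays in the operation at once and use a weighted importance measure, whereas this sketch looks at one stride vector and sorts by plain magnitude. Dimensions are assumed to be listed fastest-varying first (column-major).

```rust
/// Fuse adjacent dimensions that are contiguous with respect to each other:
/// dimension k+1 can be merged into dimension k when
/// stride[k+1] == dim[k] * stride[k].
fn fuse_dims(dims: &[usize], strides: &[isize]) -> (Vec<usize>, Vec<isize>) {
    let mut out_dims: Vec<usize> = Vec::new();
    let mut out_strides: Vec<isize> = Vec::new();
    for (&d, &s) in dims.iter().zip(strides) {
        let fusable = match (out_dims.last(), out_strides.last()) {
            (Some(&last_d), Some(&last_s)) => s == last_s * (last_d as isize),
            _ => false,
        };
        if fusable {
            // Merge into the previous dimension; its stride stays the same.
            *out_dims.last_mut().unwrap() *= d;
        } else {
            out_dims.push(d);
            out_strides.push(s);
        }
    }
    (out_dims, out_strides)
}

/// Return a permutation of dimension indices sorted by increasing |stride|,
/// so the smallest-stride dimension becomes the innermost loop.
fn order_by_stride(strides: &[isize]) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..strides.len()).collect();
    perm.sort_by_key(|&i| strides[i].unsigned_abs());
    perm
}
```

A fully contiguous `4 x 5 x 6` array collapses to a single dimension of length 120, which is what lets the contiguous fast path use one flat loop.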
| Constant | Value | Julia Equivalent |
|---|---|---|
| `BLOCK_MEMORY_SIZE` | 32 KB | `BLOCKMEMORYSIZE` |
| `CACHE_LINE_SIZE` | 64 bytes | `_cachelinelength` |
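
As a rough illustration of how a 32 KB budget translates into tiles, the sketch below picks a square tile so that one source tile and one destination tile fit in the budget together, then uses it for a tiled transpose. `tile_edge` and `tiled_transpose` are hypothetical names, and the square-tile rule is an assumption made for simplicity; the actual `_computeblocks` logic derives per-dimension block sizes from the strides of all arrays involved.

```rust
const BLOCK_MEMORY_SIZE: usize = 32 * 1024; // 32 KB budget

/// Hypothetical helper: largest square tile edge such that a source tile and
/// a destination tile of `elem_size`-byte elements together fit in the budget.
fn tile_edge(elem_size: usize) -> usize {
    let elems_per_array = BLOCK_MEMORY_SIZE / (2 * elem_size);
    let mut edge = 1;
    while (edge + 1) * (edge + 1) <= elems_per_array {
        edge += 1;
    }
    edge
}

/// Tiled transpose of a column-major `rows x cols` matrix into a
/// column-major `cols x rows` matrix.
fn tiled_transpose(src: &[f64], dst: &mut [f64], rows: usize, cols: usize) {
    assert_eq!(src.len(), rows * cols);
    assert_eq!(dst.len(), rows * cols);
    let b = tile_edge(std::mem::size_of::<f64>());
    for j0 in (0..cols).step_by(b) {
        for i0 in (0..rows).step_by(b) {
            // Work on one tile at a time so both tiles stay cache-resident.
            for j in j0..(j0 + b).min(cols) {
                for i in i0..(i0 + b).min(rows) {
                    dst[j + i * cols] = src[i + j * rows];
                }
            }
        }
    }
}
```

For `f64` this gives a tile edge of 45, since 45 x 45 x 8 bytes x 2 arrays is 32,400 bytes, just under the budget.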
- `mdarray` (v0.7.2): Base multidimensional array type
- `num-traits` / `num-complex`: Numeric trait bounds
- `thiserror`: Error type derivation
- `bytemuck`: POD trait for byte-copy fast paths
| Julia Module | Rust Module | Status |
|---|---|---|
| `StridedViews.jl/stridedview.jl` | `view.rs` | ✅ Complete |
| `StridedViews.jl/auxiliary.jl` | `auxiliary.rs` | ✅ Complete |
| `Strided.jl/mapreduce.jl` | `kernel.rs`, `map.rs`, `reduce.rs`, `fuse.rs`, `block.rs`, `order.rs` | ✅ Complete |
| `Strided.jl/broadcast.jl` | `broadcast.rs` | ✅ Complete |
| `Strided.jl/convert.jl` | (via `copy_into`) | ✅ Complete |
| `Strided.jl/macros.jl` | N/A | |
| Julia | Rust | Notes |
|---|---|---|
| `StridedView(array)` | `StridedArrayView::new()` | Const-generic `N` |
| `sview(a, indices...)` | `view.slice()` | Zero-copy slicing |
| `sreshape(a, dims)` | `view.sreshape_strided()` | Stride-preserving reshape |
| `permutedims(a, perm)` | `view.permute()` | Zero-copy permutation |
| `Base.map!(f, dest, srcs...)` | `map_into`, `zip_map*_into` | Up to 4 sources |
| `Base.mapreducedim!(f, op, dest, src)` | `mapreducedim_into` | Dimension reduction |
| `promoteshape(dims, arrays...)` | `promoteshape`, `promoteshape2`, `promoteshape3` | Broadcasting |
| `CaptureArgs` | `CaptureArgs<F, A>` | Lazy broadcast |
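
Why permutation can be zero-copy is easiest to see in a stripped-down view type. The sketch below is an assumption-laden illustration: `SimpleView` and its methods are hypothetical and do not reproduce the signatures of `StridedArrayView`; they only show that permuting a view amounts to reordering its `dims` and `strides` arrays while the data slice stays shared.

```rust
/// Hypothetical minimal view: shape and strides describe how to walk `data`.
#[derive(Debug)]
struct SimpleView<'a, const N: usize> {
    data: &'a [f64],
    dims: [usize; N],
    strides: [usize; N],
}

impl<'a, const N: usize> SimpleView<'a, N> {
    /// Read one element by multi-index.
    fn get(&self, idx: [usize; N]) -> f64 {
        let offset: usize = idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum();
        self.data[offset]
    }

    /// Reorder dimensions without touching the data.
    /// Returns `None` if `perm` is not a permutation of 0..N.
    fn permute(&self, perm: [usize; N]) -> Option<Self> {
        let mut seen = [false; N];
        for &p in &perm {
            if p >= N || seen[p] {
                return None;
            }
            seen[p] = true;
        }
        let mut dims = [0; N];
        let mut strides = [0; N];
        for (k, &p) in perm.iter().enumerate() {
            dims[k] = self.dims[p];
            strides[k] = self.strides[p];
        }
        Some(SimpleView { data: self.data, dims, strides })
    }
}
```

A `2 x 3` column-major view with strides `[1, 2]` becomes a `3 x 2` view with strides `[2, 1]` after `permute([1, 0])`, and both views read from the same six elements.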
The extern/ directory contains reference implementations:
- `extern/Strided.jl/`: Original Julia implementation (fully ported)
- `extern/StridedViews.jl/`: Julia package defining the `StridedView` type (fully ported)
- `extern/mdarray/`: The Rust mdarray crate source for reference
Design documentation:
docs/STRIDED_DESIGN.md: Detailed analysis of Julia implementations and Rust porting guide
- Const generics for dimension count: `StridedView<T, N>` where `N` is const
- Type-level element operations: Avoid runtime dispatch for conj/transpose via the `ElementOp` trait
- Result-based error handling: Return `Result<_, StridedError>` for invalid operations
- Trait-based extensibility: `ElementOp`, `Reduce`, `Map` traits for customization
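
The Result-based style can be sketched as below. The error variant, its fields, and `map_into_checked` are all hypothetical and chosen for illustration; the crate's real `StridedError` is derived with `thiserror` and its variants may differ. The sketch also assumes contiguous data so the shape check is the only interesting part.

```rust
use std::fmt;

/// Hypothetical error type; the crate's real `StridedError` variants may differ.
#[derive(Debug, PartialEq)]
enum StridedError {
    ShapeMismatch { expected: Vec<usize>, found: Vec<usize> },
}

impl fmt::Display for StridedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StridedError::ShapeMismatch { expected, found } => {
                write!(f, "shape mismatch: expected {:?}, found {:?}", expected, found)
            }
        }
    }
}

impl std::error::Error for StridedError {}

/// Apply `f` element-wise from `src` into `dest`, returning an error
/// instead of panicking when the shapes disagree.
fn map_into_checked<F: Fn(f64) -> f64>(
    dest: &mut [f64],
    dest_dims: &[usize],
    src: &[f64],
    src_dims: &[usize],
    f: F,
) -> Result<(), StridedError> {
    if dest_dims != src_dims {
        return Err(StridedError::ShapeMismatch {
            expected: dest_dims.to_vec(),
            found: src_dims.to_vec(),
        });
    }
    for (d, &s) in dest.iter_mut().zip(src) {
        *d = f(s);
    }
    Ok(())
}
```

Callers can then propagate failures with `?` rather than guarding every call with manual shape assertions.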
- Explore explicit SIMD intrinsics to close the remaining gap with Julia's `@simd`
See README.md for benchmark results comparing Rust strided vs naive baselines and Julia Strided.jl.