Status: Proposal
Version: 0.1.0-draft
Target Audience: Rust library authors, scientific computing engineers, numerical programmers
Companion document: rust-ml-design.md (ferrolearn) — ferray is its foundational dependency
NumPy is not a machine learning library. It is the substrate that everything else is built on. Its power comes from three things: a single, expressive N-dimensional array type; a complete and internally consistent mathematical function library operating on that type; and a unified, one-import-covers-everything experience that means a practitioner never has to think about which crate handles FFTs versus which one handles random number generation versus which one handles linear algebra.
The Rust ecosystem has fragments of this, but not the whole. ndarray provides the core array type and basic arithmetic. faer provides high-performance linear algebra. rustfft provides FFTs. ndarray-linalg wraps LAPACK (but requires the user to install system libraries). ndarray-rand provides random array generation. ndarray-stats provides statistics. These are all good crates, but they are not NumPy — they are a pile of parts that require the user to assemble them, manage version compatibility between them, and hold the mental model of which namespace to look in for which operation.
This separation of concerns causes friction for new users: where NumPy requires a single import numpy, the equivalent Rust stack requires cross-referencing documentation for several crates maintained by different sets of developers.
ferray closes this gap. It is a single crate providing the complete NumPy surface area, with Rust-native improvements where the language makes them possible. It does not replace ndarray or faer — it wraps and unifies them into a coherent whole that a practitioner can depend on without also depending on six other crates. ferrolearn depends on ferray as its array primitive.
ferray differs from ferrolearn in a critical way: NumPy is a primitive, not a toolkit. Where ferrolearn algorithms are composable units that sit on top of an array layer, ferray IS the array layer. Its design philosophy follows from this:
Completeness over minimalism. Partial coverage of NumPy is the current state of the art and it isn't good enough. ferray must cover all of NumPy's primary namespaces in a single dependency.
One import. use ferray::prelude::* should cover 95% of use cases, just as import numpy as np does in Python.
No mandatory system dependencies. The default build must compile on any Rust target without requiring OpenBLAS, LAPACK, or any system library. BLAS/LAPACK acceleration is opt-in via feature flags.
Rust improvements over NumPy where possible. The borrow checker, type generics, and const generics let ferray make guarantees at compile time that NumPy cannot — dimensionality mismatches, type errors, and mutability violations should be compiler errors, not runtime exceptions. Where Rust can prove correctness statically, it must.
Numerical parity with NumPy. The same six-layer correctness stack from ferrolearn applies here in full.
ferray is not a fork or replacement of any existing crate. It is a unification layer.
| Existing crate | Role in ferray |
|---|---|
| ndarray | Used internally as the storage engine for ferray-core. ferray::Array is not a newtype or re-export — it is an owned type with its own public API. Conversion traits (From<ndarray::Array>, Into<ndarray::Array>) are provided for interoperability, but ndarray does not appear in ferray's public type signatures. This insulates users from ndarray version churn and avoids the diamond-dependency problem that would arise if ndarray were part of the public ABI. |
| faer | Powers all BLAS-level operations internally via the ferray::linalg module |
| rustfft | Powers ferray::fft internally |
| ndarray-linalg | Superseded by ferray::linalg, which wraps faer directly rather than LAPACK |
| ndarray-rand | Superseded by ferray::random, which implements NumPy's Generator/BitGenerator model |
| ndarray-stats | Superseded by ferray's statistics methods on arrays |
| num-complex | Complex<f32> and Complex<f64> are the complex element types throughout |
| half | f16 support (for ML workloads) |
Users who have existing code using ndarray can interoperate with ferray directly: ferray::Array implements From<ndarray::Array> and Into<ndarray::Array>. These conversions are zero-copy where memory layouts are compatible. Because ndarray is not part of the public API surface, users do not need to add ndarray as a direct dependency — ferray manages the version internally.
The central type. Generic over element type T and dimensionality D.
pub struct NdArray<T, D: Dimension> {
data: OwnedRepr<T>, // or ViewRepr, MutViewRepr, ArcRepr, CowRepr
shape: D,
strides: D,
layout: MemoryLayout,
}
pub enum MemoryLayout {
C, // Row-major (C-contiguous) — default
Fortran, // Column-major (Fortran-contiguous)
Custom, // Arbitrary strides (views, non-contiguous slices)
}

Type aliases mirror NumPy's naming conventions:
pub type Array1<T> = NdArray<T, Ix1>;
pub type Array2<T> = NdArray<T, Ix2>;
pub type Array3<T> = NdArray<T, Ix3>;
pub type ArrayD<T> = NdArray<T, IxDyn>; // Dynamic rank, like np.ndarray
// Float aliases
pub type F32Array1 = Array1<f32>;
pub type F64Array2 = Array2<f64>;
// ... etc

Unlike NumPy — which allows multiple arrays to mutably alias the same data, tracked only at runtime — ferray expresses ownership in the type system:
| Type alias | Ownership | NumPy equivalent |
|---|---|---|
| Array<T, D> | Owned, uniquely | np.ndarray with no views alive |
| ArrayView<'a, T, D> | Borrowed immutably | Read-only slice/view |
| ArrayViewMut<'a, T, D> | Borrowed mutably | np.ndarray view |
| ArcArray<T, D> | Ref-counted, copy-on-write | Shared ndarray |
| CowArray<'a, T, D> | Either owned or borrowed | — |
ArcArray copy-on-write semantics: ArcArray wraps the data buffer in an Arc. A mutation (any operation requiring &mut access to elements) triggers a clone of the buffer if the Arc reference count is greater than 1, then proceeds on the cloned buffer. Views (ArrayView) derived from an ArcArray hold a clone of the Arc — they keep the buffer alive and prevent CoW from being triggered on the view. If the originating ArcArray is mutated after a view is taken, the view continues to see the old data (it holds the old Arc). There is no silent invalidation. The invariant: a live ArrayView<'a, ..> derived from an ArcArray always observes the data that existed at the moment the view was created, regardless of subsequent mutations to the source.
This is a strict improvement over NumPy. Aliasing bugs — where two views of the same array are modified in conflicting ways — are a known source of subtle NumPy errors that ferray makes impossible at compile time.
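The copy-on-write contract above maps directly onto std::sync::Arc. A minimal std-only sketch, with a plain Vec<f64> standing in for the array buffer (CowBuffer, set, and is_shared are illustrative names, not ferray API):

```rust
use std::sync::Arc;

// Minimal stand-in for an ArcArray buffer: clone-on-write via Arc::make_mut.
// Arc::make_mut clones the inner data only when the reference count is > 1,
// which is exactly the CoW trigger described above.
#[derive(Clone)]
pub struct CowBuffer {
    data: Arc<Vec<f64>>,
}

impl CowBuffer {
    pub fn new(data: Vec<f64>) -> Self {
        Self { data: Arc::new(data) }
    }

    /// Mutation entry point: clones the buffer iff it is currently shared.
    pub fn set(&mut self, i: usize, v: f64) {
        Arc::make_mut(&mut self.data)[i] = v;
    }

    pub fn get(&self, i: usize) -> f64 {
        self.data[i]
    }

    /// True if this handle shares its buffer with another handle.
    pub fn is_shared(&self) -> bool {
        Arc::strong_count(&self.data) > 1
    }
}
```

A clone taken before a mutation keeps the old Arc alive and continues to observe the old data; an unshared buffer mutates in place at zero cost.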
NumPy's dtype system is one of its most powerful features and the hardest to replicate in Rust. ferray must support all NumPy numeric dtypes as Rust generic parameters:
Floating point: f16, f32, f64, Complex<f32>, Complex<f64>
Signed integers: i8, i16, i32, i64, i128
Unsigned integers: u8, u16, u32, u64, u128
Boolean: bool
String: StringArray — a separate type backed by Vec<String> (see Section 10)
Structured dtypes (NumPy's np.dtype([('x', np.float64), ('y', np.int32)])) are supported via a derive macro:
#[derive(FerrayRecord)]
struct Point {
x: f64,
y: f64,
label: i32,
}
let points: Array1<Point> = ferray::zeros(100); // array-of-structs layout: [x0,y0,label0, x1,y1,label1, ...]
let xs: ArrayView1<f64> = points.field("x"); // zero-copy strided view of the x field

ferray implements NumPy's full broadcasting rules, including the cases that ndarray currently cannot handle. Broadcasting is implemented generically over dimension type and does not require materializing the broadcast array unless the operation demands it.
Broadcasting rules (identical to NumPy):
- If arrays have different numbers of dimensions, prepend 1s to the shape of the smaller-dimensioned array
- Arrays with size 1 along a dimension are stretched to match the other
- Arrays whose sizes disagree and neither is 1 produce a ShapeError
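The three rules above can be sketched as a pure shape-combination function. This is a std-only illustration over plain slices (broadcast_shapes here is a hypothetical free function; the real implementation would be generic over the Dimension type and return a ShapeError rather than None):

```rust
/// Compute the broadcast result shape per the three rules above, or None on mismatch.
/// Shapes are listed from the leading (outermost) axis, as in NumPy.
pub fn broadcast_shapes(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let ndim = a.len().max(b.len());
    let mut out = Vec::with_capacity(ndim);
    for i in 0..ndim {
        // Rule 1: missing leading axes behave as size 1.
        let da = if i + a.len() >= ndim { a[i + a.len() - ndim] } else { 1 };
        let db = if i + b.len() >= ndim { b[i + b.len() - ndim] } else { 1 };
        // Rule 2: size-1 axes stretch; Rule 3: anything else is an error.
        match (da, db) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return None,
        }
    }
    Some(out)
}
```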
let a: Array2<f64> = ferray::ones((4, 3));
let b: Array1<f64> = ferray::array([1.0, 2.0, 3.0]);
let c = &a + &b; // Broadcasts b from (3,) to (4, 3) — works
let d: Array2<f64> = ferray::ones((4, 1));
let e = &a * &d; // Broadcasts d from (4, 1) to (4, 3) — works
// Binary broadcasting between two non-trivial shapes:
let x: Array3<f32> = ferray::zeros((2, 1, 4));
let y: Array2<f32> = ferray::ones((3, 4));
let z = &x + &y; // z.shape() == [2, 3, 4]

ferray must support all NumPy indexing modes:
Basic indexing — integer and slice indexing, always returns a view:
let a = ferray::arange(0.0f64, 12.0, 1.0).reshape((3, 4));
let row = a.slice(s![1, ..]); // row 1
let col = a.slice(s![.., 2]); // column 2
let sub = a.slice(s![..2, 1..3]); // submatrix

Advanced (fancy) indexing — integer array or boolean array indexing, always returns a copy:
let idx = ferray::array([0usize, 2]);
let rows = a.index_select(Axis(0), &idx); // select rows 0 and 2
let mask: Array2<bool> = a.greater(&ferray::scalar(5.0)); // elementwise comparison (the `greater` ufunc — Rust's `>` operator must return bool, so it cannot yield an array)
let selected = a.boolean_index(&mask); // 1D array of elements where mask is true
a.boolean_index_assign(&mask, 0.0); // np.ndarray[mask] = 0

np.newaxis / None indexing:
let v: Array1<f64> = ferray::ones(5);
let col: Array2<f64> = v.insert_axis(Axis(1)); // shape (5,) → (5, 1)

All NumPy array creation routines must be present in the ferray top-level namespace:
ferray::array([1.0, 2.0, 3.0]) // np.array(...)
ferray::asarray(&existing_vec) // np.asarray(...)
ferray::frombuffer(&bytes, dtype: f32) // np.frombuffer(...)
ferray::fromiter(iter, count: 100) // np.fromiter(...)

ferray::zeros((3, 4)) // np.zeros
ferray::ones((3, 4)) // np.ones
ferray::full((3, 4), 7.5f64) // np.full
ferray::zeros_like(&other) // np.zeros_like
ferray::ones_like(&other)
ferray::full_like(&other, 0.0)
// np.empty — uninitialized allocation.
// Returns Array<MaybeUninit<T>, D>; caller must initialize all elements
// before calling .assume_init() to obtain Array<T, D>.
// This forces explicit acknowledgment of uninitialized state rather than
// silently handing back a typed array with garbage values.
ferray::empty((3, 4)) // → Array2<MaybeUninit<f64>>
ferray::empty_like(&other) // → Array<MaybeUninit<T>, D>

ferray::arange(0.0, 10.0, 0.5) // np.arange
ferray::linspace(0.0, 1.0, 50) // np.linspace
ferray::logspace(0.0, 3.0, 10) // np.logspace
ferray::geomspace(1.0, 1000.0, 4) // np.geomspace
ferray::meshgrid(&[&x, &y]) // np.meshgrid
ferray::mgrid(s![0.0..1.0:0.1]) // np.mgrid
ferray::ogrid(s![0.0..1.0:0.1]) // np.ogrid

ferray::eye::<f64>(4) // np.eye(4)
ferray::identity::<f64>(4) // np.identity(4)
ferray::diag(&v, k: 0) // np.diag
ferray::diagflat(&v) // np.diagflat
ferray::tri::<f64>(3, 3, 0) // np.tri
ferray::tril(&a, k: 0) // np.tril
ferray::triu(&a, k: 0) // np.triu

a.reshape((2, 6)) // a.reshape(...)
a.ravel() // a.ravel() — contiguous 1D view
a.flatten() // a.flatten() — owned copy
a.squeeze() // remove length-1 axes
a.expand_dims(axis: 1) // np.expand_dims
a.broadcast_to((4, 3, 2)) // np.broadcast_to — zero-copy

ferray::concatenate(arrays: &[&dyn AnyArray], axis: 0) // np.concatenate
ferray::stack(arrays: &[&dyn AnyArray], axis: 0) // np.stack
ferray::vstack(arrays: &[&dyn AnyArray]) // np.vstack
ferray::hstack(arrays: &[&dyn AnyArray]) // np.hstack
ferray::dstack(arrays: &[&dyn AnyArray]) // np.dstack
ferray::block(nested: &[&[&dyn AnyArray]]) // np.block
a.split(3, axis: 0) // np.split
a.array_split(5, axis: 0) // np.array_split
a.vsplit(3)
a.hsplit(3)
a.dsplit(3)

a.transpose() // a.T
a.swapaxes(0, 1) // np.swapaxes
a.moveaxis(source: 0, destination: 2) // np.moveaxis
a.rollaxis(axis: 2, start: 0) // np.rollaxis
a.flip(axis: 0) // np.flip
a.fliplr()
a.flipud()
a.rot90(k: 1)
a.roll(shift: 3, axis: 0) // np.roll

NumPy's universal function (ufunc) machinery is its core performance primitive — elementwise operations that broadcast, type-promote, and support output arrays. ferray must implement the complete ufunc surface.
The ufunc trait is generic over input and output element types. The reduce, accumulate, and outer methods use associated types so that, for example, a u8 sum ufunc produces u64 accumulations rather than attempting to operate on f64.
/// A universal function: a broadcast-aware, type-promoting elementwise operation.
pub trait Ufunc {
    /// The input element type.
    type In: Element;
    /// The output element type (may differ — e.g., comparison ufuncs produce bool).
    type Out: Element;
    /// The accumulator type for reductions (e.g., a u8 sum accumulates in u64).
    type Acc: Element;
    fn call(&self, input: &NdArray<Self::In, IxDyn>) -> NdArray<Self::Out, IxDyn>;
    fn call_into(&self, input: &NdArray<Self::In, IxDyn>, out: &mut NdArray<Self::Out, IxDyn>);
    /// Reduce along an axis: e.g., `add.reduce(a, axis=0)` → sum along axis 0.
    /// The accumulator type may differ from `In` (e.g., integer sum accumulates to i64).
    fn reduce(&self, a: &NdArray<Self::In, IxDyn>, axis: usize)
        -> NdArray<Self::Acc, IxDyn>;
    /// Accumulate along an axis: running application of the operation.
    fn accumulate(&self, a: &NdArray<Self::In, IxDyn>, axis: usize)
        -> NdArray<Self::Acc, IxDyn>;
    /// Outer product: apply the operation to all pairs of elements.
    fn outer(
        &self,
        a: &NdArray<Self::In, IxDyn>,
        b: &NdArray<Self::In, IxDyn>,
    ) -> NdArray<Self::Out, IxDyn>;
}

Binary ufuncs (e.g., add, multiply) are expressed as BinaryUfunc with two input type parameters:
pub trait BinaryUfunc {
type Lhs: Element;
type Rhs: Element;
type Out: Element;
fn call(
&self,
lhs: &NdArray<Self::Lhs, IxDyn>,
rhs: &NdArray<Self::Rhs, IxDyn>,
) -> NdArray<Self::Out, IxDyn>;
// reduce, accumulate, outer follow the same pattern
}Type promotion (e.g., f32 + i64 → f64) is handled at the call site by the promotion rules defined in Section 16.1, not inside the trait itself.
All functions are available both as free functions and as array methods.
sin, cos, tan, arcsin, arccos, arctan, arctan2, hypot, sinh, cosh, tanh, arcsinh, arccosh, arctanh, degrees, radians, deg2rad, rad2deg, unwrap
exp, exp2, expm1, log, log2, log10, log1p, logaddexp, logaddexp2
round (banker's rounding, matching NumPy), floor, ceil, trunc, fix, rint, around
add, subtract, multiply, divide, true_divide, floor_divide, power, remainder, mod_, fmod, divmod, absolute, fabs, sign, negative, positive, reciprocal, sqrt, cbrt, square, heaviside, gcd, lcm
isnan, isinf, isfinite, isneginf, isposinf, nan_to_num, nextafter, spacing, ldexp, frexp, signbit, copysign, float_power, fmax, fmin, maximum, minimum, clip
real, imag, conj, conjugate, angle, abs (returns real magnitude)
bitwise_and, bitwise_or, bitwise_xor, bitwise_not, invert, left_shift, right_shift
equal, not_equal, less, less_equal, greater, greater_equal, array_equal, array_equiv, allclose, isclose
logical_and, logical_or, logical_xor, logical_not, all, any
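For the rounding family listed above, Rust's standard library already provides the required tie-to-even behavior: f64::round_ties_even (stable since Rust 1.77) matches NumPy's np.round on halfway cases, so a ferray round kernel could plausibly defer to it. A small sketch (round_numpy_style is an illustrative helper, not ferray API):

```rust
// Banker's rounding (round-half-to-even), matching NumPy's np.round:
// halfway cases go to the nearest even integer instead of away from zero.
fn round_numpy_style(xs: &[f64]) -> Vec<f64> {
    xs.iter().map(|x| x.round_ties_even()).collect()
}
```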
Mirrors numpy.linalg completely. Internally powered by faer with an optional BLAS backend.
linalg::dot(&a, &b) // np.dot
linalg::vdot(&a, &b) // np.vdot (complex conjugate)
linalg::inner(&a, &b) // np.inner
linalg::outer(&a, &b) // np.outer
linalg::matmul(&a, &b) // np.matmul / @
linalg::tensordot(&a, &b, axes: &[(0,1)]) // np.tensordot
linalg::einsum("ij,jk->ik", &[&a, &b]) // np.einsum — full Einstein notation parser
linalg::kron(&a, &b) // np.kron

The einsum implementation is a hard requirement. It is used pervasively in scientific computing, ML, physics simulations, and signal processing, and its absence is a significant gap in the current Rust ecosystem.
linalg::cholesky(&a) // np.linalg.cholesky → L
linalg::qr(&a, mode: QrMode::Reduced) // np.linalg.qr → (Q, R)
linalg::svd(&a, full_matrices: false) // np.linalg.svd → (U, S, Vt)
linalg::eig(&a) // np.linalg.eig → (eigenvalues, eigenvectors)
linalg::eigh(&a) // np.linalg.eigh — symmetric/Hermitian
linalg::eigvals(&a) // eigenvalues only
linalg::eigvalsh(&a)
linalg::lu(&a) // LU with partial pivoting → (P, L, U)

linalg::solve(&a, &b) // np.linalg.solve — Ax = b
linalg::lstsq(&a, &b, rcond: None) // np.linalg.lstsq — least squares
linalg::inv(&a) // np.linalg.inv
linalg::pinv(&a, rcond: None) // np.linalg.pinv — Moore-Penrose pseudoinverse

linalg::norm(&a, ord: NormOrder::Fro) // np.linalg.norm
linalg::cond(&a, p: None) // np.linalg.cond
linalg::det(&a) // np.linalg.det
linalg::slogdet(&a) // np.linalg.slogdet → (sign, logdet)
linalg::matrix_rank(&a, tol: None) // np.linalg.matrix_rank
linalg::trace(&a) // np.trace

A key improvement over NumPy: all linalg functions that accept 2D arrays must also accept stacked arrays (3D+) and apply the operation along the last two axes, parallelized via Rayon. NumPy added this progressively and inconsistently; ferray requires it for all linalg functions from day one.
let batch: Array3<f64> = ferray::random::standard_normal((100, 4, 4));
let det: Array1<f64> = linalg::det(&batch); // 100 determinants, parallelized

Mirrors numpy.fft completely. Internally powered by rustfft.
fft::fft(&a, n: None, axis: -1, norm: FftNorm::Backward) // np.fft.fft
fft::ifft(&a, n: None, axis: -1, norm: FftNorm::Backward) // np.fft.ifft
fft::fft2(&a, s: None, axes: (-2, -1)) // np.fft.fft2
fft::ifft2(&a, s: None, axes: (-2, -1)) // np.fft.ifft2
fft::fftn(&a, s: None, axes: None) // np.fft.fftn
fft::ifftn(&a, s: None, axes: None) // np.fft.ifftn
// Real FFTs (input is real, output is Hermitian-symmetric — more efficient)
fft::rfft(&a, n: None, axis: -1) // np.fft.rfft
fft::irfft(&a, n: None, axis: -1) // np.fft.irfft
fft::rfft2(&a, s: None, axes: (-2, -1))
fft::irfft2(&a, s: None, axes: (-2, -1))
fft::rfftn(&a, s: None, axes: None)
fft::irfftn(&a, s: None, axes: None)
// Hermitian input
fft::hfft(&a, n: None, axis: -1) // np.fft.hfft
fft::ihfft(&a, n: None, axis: -1) // np.fft.ihfft
// Frequencies
fft::fftfreq(n: 128, d: 1.0f64) // np.fft.fftfreq
fft::rfftfreq(n: 128, d: 1.0f64) // np.fft.rfftfreq
fft::fftshift(&a, axes: None) // np.fft.fftshift
fft::ifftshift(&a, axes: None) // np.fft.ifftshift

Plan caching: rustfft uses planning to optimize FFTs for repeated calls with the same size. ferray::fft must expose a FftPlan type that caches the plan and can be reused:
let plan = fft::FftPlan::new(1024);
for signal in signals.iter() {
let spectrum = plan.execute(signal);
}

Mirrors numpy.random's modern Generator API (not the legacy np.random.* module functions). This is one of the most commonly needed and most scattered parts of the current Rust ecosystem.
// NumPy-style: create a Generator backed by a BitGenerator
let rng = random::default_rng(); // np.random.default_rng()
let rng = random::default_rng_seeded(42u64); // np.random.default_rng(42)
// Explicit BitGenerators
let rng = random::Generator::new(random::Pcg64::new(seed));
let rng = random::Generator::new(random::Philox::new(seed));
let rng = random::Generator::new(random::Xoshiro256StarStar::new(seed));

Generator methods take &mut self — the generator is stateful and not Sync. Generating large arrays in parallel (e.g., a 10M-element normal sample) uses per-thread generators derived from the root via jump-ahead or stream-splitting:
// Parallel generation: internally creates per-Rayon-thread generators via
// BitGenerator::jump() (Xoshiro256**) or stream IDs (Philox).
// The resulting array is identical regardless of thread count — deterministic
// given the same root seed and output size.
let samples = rng.standard_normal_parallel((1_000_000,));
// Manual splitting for user-controlled parallelism:
let child_rngs: Vec<Generator> = rng.spawn(num_threads);

The jump() operation advances the generator state by 2^128 steps, providing non-overlapping subsequences for each thread. Only BitGenerators that support jump-ahead (Xoshiro256StarStar, Philox) are eligible for parallel generation; Pcg64 falls back to sequential generation with a warning.
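Deterministic seed derivation for spawn() could plausibly be built on SplitMix64, the seeding generator the xoshiro authors recommend. A std-only sketch (spawn_seeds is an illustrative helper, not the ferray API; the real spawn() would construct full Generator values from these seeds):

```rust
/// One SplitMix64 step — the seeding generator recommended by the xoshiro authors.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Derive `n` child seeds from one root seed. Deterministic: the same root
/// always yields the same children, independent of thread count, which is
/// the property the parallel-generation contract above requires.
pub fn spawn_seeds(root: u64, n: usize) -> Vec<u64> {
    let mut state = root;
    (0..n).map(|_| splitmix64(&mut state)).collect()
}
```

Because child seeds depend only on the root seed and the child index, the family of streams is reproducible regardless of how work is later distributed across threads.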
rng.random(size: (3, 4)) // Uniform [0, 1) — np.random.random
rng.uniform(low: 0.0, high: 1.0, size: (100,)) // np.random.uniform
rng.standard_normal(size: (100, 5)) // np.random.standard_normal
rng.normal(loc: 0.0, scale: 1.0, size: (100,)) // np.random.normal
rng.standard_exponential(size: (100,))
rng.exponential(scale: 1.0, size: (100,))
rng.gamma(shape: 2.0, scale: 1.0, size: (100,))
rng.beta(a: 2.0, b: 5.0, size: (100,))
rng.chisquare(df: 5.0, size: (100,))
rng.f(dfnum: 5.0, dfden: 2.0, size: (100,))
rng.student_t(df: 10.0, size: (100,))
rng.laplace(loc: 0.0, scale: 1.0, size: (100,))
rng.logistic(loc: 0.0, scale: 1.0, size: (100,))
rng.lognormal(mean: 0.0, sigma: 1.0, size: (100,))
rng.rayleigh(scale: 1.0, size: (100,))
rng.weibull(a: 5.0, size: (100,))
rng.pareto(a: 3.0, size: (100,))
rng.gumbel(loc: 0.0, scale: 1.0, size: (100,))
rng.power(a: 0.5, size: (100,))
rng.triangular(left: -3.0, mode: 0.0, right: 8.0, size: (100,))
rng.vonmises(mu: 0.0, kappa: 4.0, size: (100,))
rng.wald(mean: 3.0, scale: 2.0, size: (100,))
rng.standard_cauchy(size: (100,))
rng.standard_gamma(shape: 2.0, size: (100,))

rng.integers(low: 0, high: 10, size: (100,)) // np.random.integers
rng.binomial(n: 10, p: 0.5, size: (100,))
rng.negative_binomial(n: 5, p: 0.5, size: (100,))
rng.poisson(lam: 5.0, size: (100,))
rng.geometric(p: 0.5, size: (100,))
rng.hypergeometric(ngood: 5, nbad: 10, nsample: 5, size: (100,))
rng.logseries(p: 0.9, size: (100,))
rng.multinomial(n: 20, pvals: &[0.2, 0.3, 0.5], size: 100)
rng.multivariate_normal(&mean, &cov, size: 100)
rng.dirichlet(&alpha, size: 100)

rng.shuffle(&mut a) // np.random.shuffle — in-place
rng.permutation(&a) // np.random.permutation — returns copy
rng.permuted(&a, axis: 0) // np.random.permuted — along axis
rng.choice(&a, size: 10, replace: false, p: None) // np.random.choice

Mirrors numpy.polynomial completely. Provides polynomial representations with stable numerical behavior.
Each polynomial class represents a different basis:
// Power basis (standard): p(x) = c[0] + c[1]*x + c[2]*x² + ...
let p = polynomial::Polynomial::new(&[1.0, -3.0, 2.0]); // np.polynomial.polynomial.Polynomial
// Chebyshev (better conditioned for approximation)
let c = polynomial::Chebyshev::new(&coeffs); // np.polynomial.chebyshev.Chebyshev
// Legendre
let l = polynomial::Legendre::new(&coeffs); // np.polynomial.legendre.Legendre
// Laguerre
let lag = polynomial::Laguerre::new(&coeffs);
// Hermite (physicist's and probabilist's)
let h = polynomial::Hermite::new(&coeffs);
let he = polynomial::HermiteE::new(&coeffs);

All polynomial classes implement a common Poly trait. Basis conversion uses a canonical pivot — every type converts through Polynomial (power basis) — avoiding the coherence hole that would arise if convert<Q> required each Poly to know about every other.
/// Conversion through the power basis as canonical pivot.
pub trait ToPowerBasis {
fn to_power_basis(&self) -> Polynomial;
}
pub trait FromPowerBasis: Sized {
fn from_power_basis(p: &Polynomial) -> Self;
}
/// Blanket conversion: any two Poly types convert through the power basis.
/// (A blanket `impl<P, Q> From<P> for Q` would be rejected by coherence —
/// it overlaps core's reflexive `impl<T> From<T> for T` — so conversion is
/// exposed as a default method instead.)
pub trait PolyConvert: ToPowerBasis {
    fn convert<Q: FromPowerBasis>(&self) -> Q {
        Q::from_power_basis(&self.to_power_basis())
    }
}
impl<P: ToPowerBasis> PolyConvert for P {}
pub trait Poly: Sized + ToPowerBasis + FromPowerBasis {
fn eval(&self, x: f64) -> f64; // p(x)
fn deriv(&self, m: usize) -> Self; // p'(x), p''(x), ...
fn integ(&self, m: usize, k: &[f64]) -> Self; // ∫p(x)dx
fn roots(&self) -> Array1<Complex<f64>>; // companion matrix eigenvalues
fn degree(&self) -> usize;
fn trim(&self, tol: f64) -> Self; // remove trailing near-zero coefficients
fn truncate(&self, size: usize) -> Self;
// Arithmetic (all return the appropriate type)
fn add(&self, other: &Self) -> Self;
fn sub(&self, other: &Self) -> Self;
fn mul(&self, other: &Self) -> Self;
fn pow(&self, n: usize) -> Self;
fn divmod(&self, other: &Self) -> (Self, Self);
// Fitting
fn fit(x: &Array1<f64>, y: &Array1<f64>, deg: usize) -> Self;
fn fit_weighted(x: &Array1<f64>, y: &Array1<f64>, deg: usize, w: &Array1<f64>) -> Self;
}Mirrors numpy.strings (NumPy 2.0+). Vectorized string operations on arrays of strings, operating elementwise with broadcasting.
let a: StringArray1 = ferray::strings::array(["hello", "world", "foo"]);
let b: StringArray1 = ferray::strings::array(["HELLO", "WORLD", "BAR"]);
strings::add(&a, &b) // elementwise concatenation
strings::multiply(&a, 3) // "hellohellohello"
strings::upper(&a) // np.strings.upper
strings::lower(&b) // np.strings.lower
strings::capitalize(&a)
strings::title(&a)
strings::center(&a, width: 10, fillchar: ' ')
strings::ljust(&a, width: 10)
strings::rjust(&a, width: 10)
strings::zfill(&a, width: 8)
strings::strip(&a, chars: None)
strings::lstrip(&a, chars: None)
strings::rstrip(&a, chars: None)
strings::replace(&a, old: "l", new: "L", count: None)
strings::startswith(&a, prefix: "h") // → Array1<bool>
strings::endswith(&a, suffix: "o") // → Array1<bool>
strings::find(&a, sub: "ll") // → Array1<i64>
strings::count(&a, sub: "l") // → Array1<usize>
strings::split(&a, sep: " ") // → Array1<Vec<String>>
// ragged output: each element is a Vec of the split parts.
// ferray does not have a jagged array type; Array1<Vec<String>>
// is the Rust-idiomatic representation for this case.
strings::join(sep: "-", sequence: &a) // → Array1<String>
// Regex support
strings::match_(&a, pattern: r"\w+") // → Array1<bool>
strings::extract(&a, pattern: r"(\w+)") // → StringArray

These are array methods and free functions in the top-level ferray namespace, not a submodule:
a.sum(axis: None) // np.sum
a.prod(axis: None) // np.prod
a.min(axis: None) // np.min
a.max(axis: None) // np.max
a.argmin(axis: None) // np.argmin
a.argmax(axis: None) // np.argmax
a.mean(axis: None) // np.mean
a.var(axis: None, ddof: 0) // np.var
a.std(axis: None, ddof: 0) // np.std
a.median(axis: None) // np.median
a.percentile(q: 50.0, axis: None) // np.percentile
a.quantile(q: 0.5, axis: None) // np.quantile
// NaN-aware variants (critical for real data)
a.nansum(axis: None) // np.nansum
a.nanprod(axis: None)
a.nanmin(axis: None)
a.nanmax(axis: None)
a.nanmean(axis: None)
a.nanvar(axis: None, ddof: 0)
a.nanstd(axis: None, ddof: 0)
a.nanmedian(axis: None)
a.nanpercentile(q: 50.0, axis: None)
// Correlations and covariance
ferray::correlate(&a, &v, mode: CorrelateMode::Full) // np.correlate
ferray::corrcoef(&x, rowvar: true) // np.corrcoef
ferray::cov(&m, rowvar: true, ddof: 1) // np.cov
// Histograms
ferray::histogram(&a, bins: 10, range: None, density: false) // np.histogram
ferray::histogram2d(&x, &y, bins: 10) // np.histogram2d
ferray::histogramdd(&sample, bins: 10) // np.histogramdd
ferray::bin_count(&x, weights: None, minlength: 0) // np.bincount
ferray::digitize(&x, &bins, right: false) // np.digitize
// Sorting and searching
a.sort(axis: -1, kind: SortKind::Stable) // np.sort
a.argsort(axis: -1) // np.argsort
ferray::searchsorted(&a, &v, side: Side::Left) // np.searchsorted
ferray::unique(&a, return_index: false, return_counts: false) // np.unique
ferray::nonzero(&a) // np.nonzero
ferray::where_(&condition, &x, &y) // np.where
ferray::count_nonzero(&a, axis: None) // np.count_nonzero

ferray::union1d(&a, &b) // np.union1d
ferray::intersect1d(&a, &b, assume_unique: false) // np.intersect1d
ferray::setdiff1d(&a, &b, assume_unique: false) // np.setdiff1d
ferray::setxor1d(&a, &b, assume_unique: false) // np.setxor1d
ferray::in1d(&a, &test, assume_unique: false) // np.in1d
ferray::isin(&element, &test_elements) // np.isin

Mirrors numpy.lib.npyio — serialization to .npy and .npz formats, and text I/O.
// Binary formats
io::save("array.npy", &a)?; // np.save
io::load::<f64, _>("array.npy")? // np.load — caller specifies dtype at compile time
io::load_dynamic("array.npy")? // → Result<DynArray, FerrayError>
// DynArray is an enum over all supported dtypes;
// use when the file's dtype is not known ahead of time
io::savez("arrays.npz", &[("x", &x), ("y", &y)])? // np.savez
io::savez_compressed("arrays.npz.gz", ...)? // np.savez_compressed
// Text formats
io::savetxt("data.csv", &a, delimiter: b',', fmt: "%.6f")? // np.savetxt
io::loadtxt::<f64>("data.csv", delimiter: b',', skiprows: 1)? // np.loadtxt
io::genfromtxt("data.csv", delimiter: b',', filling_values: f64::NAN)? // np.genfromtxt
// Memory mapping for large arrays
io::memmap::<f64>("large_array.npy", mode: MemmapMode::ReadOnly)? // np.memmap

DynArray is defined as:
pub enum DynArray {
F32(ArrayD<f32>),
F64(ArrayD<f64>),
I32(ArrayD<i32>),
I64(ArrayD<i64>),
U8(ArrayD<u8>),
Bool(ArrayD<bool>),
Complex64(ArrayD<Complex<f32>>),
Complex128(ArrayD<Complex<f64>>),
// ... all supported dtypes
}

io::load::<T>() is a convenience that calls load_dynamic internally and returns Err(FerrayError::InvalidDtype { .. }) if the file's dtype does not match T.
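The dispatch pattern load_dynamic callers use can be illustrated with a cut-down enum over plain Vecs (DynVec, dtype, and into_f64 are illustrative stand-ins for DynArray and the typed-load path, not ferray API):

```rust
/// Cut-down DynArray over Vec instead of ArrayD, illustrating the dispatch pattern.
pub enum DynVec {
    F32(Vec<f32>),
    F64(Vec<f64>),
    I64(Vec<i64>),
}

impl DynVec {
    /// The runtime dtype tag, as callers would match on it.
    pub fn dtype(&self) -> &'static str {
        match self {
            DynVec::F32(_) => "f32",
            DynVec::F64(_) => "f64",
            DynVec::I64(_) => "i64",
        }
    }

    /// Typed extraction: the Err arm mirrors FerrayError::InvalidDtype when
    /// the stored dtype is not the one the caller asked for.
    pub fn into_f64(self) -> Result<Vec<f64>, String> {
        match self {
            DynVec::F64(v) => Ok(v),
            other => Err(format!("expected f64, file holds {}", other.dtype())),
        }
    }
}
```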
ferray follows NumPy's promotion rules for mixed-type operations. The rule is: the output type is the smallest type that can represent all values of both input types without loss of precision.
| Lhs | Rhs | Result |
|---|---|---|
| f32 | f64 | f64 |
| f32 | i64 | f64 |
| f64 | i64 | f64 |
| i32 | i64 | i64 |
| i32 | f32 | f64 (i32 doesn't fit in f32 without loss) |
| u8 | i16 | i16 |
| Complex<f32> | f64 | Complex<f64> |
| bool | i32 | i32 |
Mixed-type binary operations on arrays do not compile: there is no implicit promotion. The caller must either call .astype() to convert explicitly, or use ferray::add_promoted(&a, &b), which returns NdArray<promoted_type!(A, B), _>. The promoted_type! macro resolves at compile time where both types are known statically; for dynamic dtypes it resolves at runtime via ferray::result_type.
This is a deliberate difference from NumPy, where np.array([1], dtype=np.int32) + np.array([1.0], dtype=np.float32) silently promotes. Silent promotion is a footgun when the intent was to operate on same-type arrays — ferray surfaces it as a type error and provides an explicit escape hatch.
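The runtime side (ferray::result_type) amounts to a symmetric lookup over dtype pairs. A sketch covering the rows of the table above (DType and promote are illustrative std-only stand-ins, not ferray API):

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum DType { Bool, U8, I16, I32, I64, F32, F64, C64, C128 }

/// Symmetric promotion over the table rows above: the result is the smallest
/// type that represents all values of both inputs without loss.
pub fn promote(a: DType, b: DType) -> DType {
    use DType::*;
    match (a, b) {
        (x, y) if x == y => x,
        (Bool, t) | (t, Bool) => t, // bool promotes to the other type
        (F32, F64) | (F64, F32) => F64,
        (F32, I64) | (I64, F32) | (F64, I64) | (I64, F64) => F64,
        (I32, I64) | (I64, I32) => I64,
        (I32, F32) | (F32, I32) => F64, // i32 doesn't fit in f32 without loss
        (U8, I16) | (I16, U8) => I16,
        (C64, F64) | (F64, C64) => C128,
        // Remaining pairs follow the same rules; elided in this sketch.
        _ => unimplemented!("remaining pairs follow the same lossless-width rule"),
    }
}
```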
a.astype::<f32>() // a.astype(np.float32)
a.view::<u8>() // reinterpret memory layout
ferray::result_type::<f32, i64>() // np.result_type
ferray::can_cast::<f32, f64>(CastKind::Safe) // np.can_cast
ferray::common_type(&[dtype1, dtype2]) // np.common_type
ferray::min_scalar_type(3.14f64) // np.min_scalar_type
ferray::promote_types::<i32, f64>() // np.promote_types
// Type checking predicates
ferray::issubdtype::<f32, FloatingPoint>() // np.issubdtype
ferray::isrealobj(&a) // np.isrealobj
ferray::iscomplexobj(&a) // np.iscomplexobj

as_strided is marked unsafe because it can construct overlapping views of the same memory, which is unsound with Rust's aliasing rules if the resulting view is used mutably. This is an explicit exception to ferray's general guarantee that aliasing bugs are impossible — it is the intentional escape hatch for operations (windowed convolution, Toeplitz construction, overlapping tiles) that require it. The safety contract is documented at the call site.
// SAFE: non-overlapping strides (in bytes). A (3, 3) view of a contiguous
// f64 buffer: row stride 24 (3 elements), column stride 8 (1 element),
// equivalent to a reshape
let view = stride_tricks::as_strided(&a, shape: (3, 3), strides: (24, 8));
// UNSAFE: overlapping strides (each "row" starts one element after the last);
// caller must ensure no concurrent mutation through any alias of the
// same memory region
let overlapping: ArrayView2<f64> = unsafe {
    stride_tricks::as_strided_unchecked(&a, shape: (3, 3), strides: (8, 8))
};
// SAFE variants (no unsafe required):
stride_tricks::sliding_window_view(&a, window_shape: (3,)) // np.lib.stride_tricks.sliding_window_view — read-only views only
stride_tricks::broadcast_to(&a, shape: (10, 3, 4)) // np.broadcast_to — always read-only
ferray::broadcast_arrays(&[&a, &b]) // np.broadcast_arrays
ferray::broadcast_shapes(&[(3, 1), (1, 4)]) // np.broadcast_shapes

Masked arrays allow representing missing or invalid data inline with the array — equivalent to numpy.ma.
let data = ferray::array([1.0, 2.0, f64::NAN, 4.0, 5.0]);
let mask = ferray::array([false, false, true, false, false]);
let ma = ma::MaskedArray::new(data, mask);
ma.mean() // ignores masked elements
ma.filled(fill_value: 0.0) // replace masked with fill value
ma.compressed() // 1D array of unmasked elements
ma.count() // number of valid (unmasked) elements
ma::masked_where(&condition, &a) // mask where condition is true
ma::masked_invalid(&a) // mask NaN and Inf
ma::masked_equal(&a, value: 0.0) // mask where equal to value
ma::masked_greater(&a, 5.0)
ma::masked_less(&a, 0.0)

All inner loops that operate on contiguous memory must be SIMD-accelerated. This is not optional — it is a performance correctness requirement.
Runtime CPU dispatch is required. Compile-time SIMD width selection alone is insufficient: a binary compiled for the x86_64 baseline target would not use AVX2 even on a machine that has it. ferray uses the multiversion crate (or equivalent #[target_feature] + runtime detection) to ship one binary that selects the widest available SIMD path at startup via cpuid:
- 512-bit (AVX-512) if available
- 256-bit (AVX2) if available
- 128-bit (SSE2 / NEON) as the baseline fallback
- Scalar for non-x86/aarch64 targets
Compile-time target_feature flags can still lock to a specific width for embedded or controlled-environment deployments where the target is fully known.
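A minimal std-only sketch of the "#[target_feature] + runtime detection" alternative mentioned above (function names are illustrative, not ferray's confirmed API):

```rust
// Scalar baseline, always available.
fn add_f64_scalar(a: &[f64], b: &[f64], out: &mut [f64]) {
    for ((x, y), o) in a.iter().zip(b).zip(out) {
        *o = x + y;
    }
}

// AVX2-compiled variant: the attribute lets LLVM emit 256-bit code for the
// whole function body even though the source is scalar.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_f64_avx2(a: &[f64], b: &[f64], out: &mut [f64]) {
    for ((x, y), o) in a.iter().zip(b).zip(out) {
        *o = x + y;
    }
}

// Public entry point: pick the widest available path at runtime.
pub fn add_f64(a: &[f64], b: &[f64], out: &mut [f64]) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 presence was just verified via cpuid.
            return unsafe { add_f64_avx2(a, b, out) };
        }
    }
    add_f64_scalar(a, b, out)
}

fn main() {
    let a = [1.0f64, 2.0, 3.0];
    let b = [10.0f64, 20.0, 30.0];
    let mut out = [0.0f64; 3];
    add_f64(&a, &b, &mut out);
    assert_eq!(out, [11.0, 22.0, 33.0]);
}
```

In production the detection result would be cached (e.g. in a function pointer initialized once) rather than re-queried per call; the multiversion crate generates exactly this boilerplate.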
The SIMD implementation pattern:
// Internal implementation of vectorized addition
// Internal implementation of vectorized addition
fn add_f64_contiguous(a: &[f64], b: &[f64], out: &mut [f64]) {
use std::simd::f64x4; // portable SIMD: nightly-only until std::simd stabilizes
let chunks = a.len() / 4;
let (a_chunks, a_tail) = a.split_at(chunks * 4);
let (b_chunks, b_tail) = b.split_at(chunks * 4);
let (out_chunks, out_tail) = out.split_at_mut(chunks * 4);
// SIMD main loop: 4 doubles at a time
for ((ac, bc), oc) in a_chunks.chunks_exact(4)
.zip(b_chunks.chunks_exact(4))
.zip(out_chunks.chunks_exact_mut(4))
{
let va = f64x4::from_slice(ac);
let vb = f64x4::from_slice(bc);
(va + vb).copy_to_slice(oc);
}
// Scalar tail
for ((av, bv), ov) in a_tail.iter().zip(b_tail).zip(out_tail) {
*ov = av + bv;
}
}

Requirements:
- All unary and binary elementwise operations on contiguous arrays must use SIMD paths for f32, f64, i32, i64, u8, and u32
- The SIMD width adapts to the target architecture — 128-bit (SSE2/NEON), 256-bit (AVX2), or 512-bit (AVX-512) — selected at runtime as described above, or locked at compile time via target_feature flags
- Non-contiguous arrays fall through to scalar loops; contiguity is checked at the start of each operation
- criterion benchmarks must verify that SIMD paths achieve at least 80% of theoretical peak memory bandwidth on contiguous inputs
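For context on the 80% target, the bandwidth accounting can be sanity-checked outside criterion with a crude std-only probe (illustrative only — the real gate is the criterion suite; the figure printed depends entirely on the host machine):

```rust
use std::time::Instant;

fn main() {
    let n = 1 << 22; // 4M f64 per array
    let a = vec![1.0f64; n];
    let b = vec![2.0f64; n];
    let mut out = vec![0.0f64; n];
    let iters = 20;
    let t = Instant::now();
    for _ in 0..iters {
        for ((x, y), o) in a.iter().zip(&b).zip(out.iter_mut()) {
            *o = x + y;
        }
    }
    let secs = t.elapsed().as_secs_f64();
    // Each iteration reads a and b and writes out: 3 streams of 8-byte elements.
    let gbytes = (iters * n * 3 * 8) as f64 / 1e9;
    println!("approx {:.1} GB/s (compare against theoretical peak)", gbytes / secs);
    assert!((out[0] - 3.0).abs() < 1e-12);
}
```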
ferray uses Rayon for automatic parallelism on operations that are large enough to benefit. The threshold for parallel dispatch is calibrated so that small arrays do not incur threading overhead.
// Parallel operations are transparent to the user:
let large: Array2<f64> = ferray::random::standard_normal((10_000, 10_000));
let result = (&large * 2.0 + 1.0).mapv(f64::exp); // auto-parallelized via Rayon
// Thread count control uses a ferray-owned thread pool, not Rayon's global pool.
// This avoids the "first initializer wins" problem when multiple crates share the
// Rayon global pool. ferray initializes its own rayon::ThreadPool at first use.
ferray::set_num_threads(4);
// Scoped serial execution via a per-call ThreadPool:
ferray::with_num_threads(1, || { /* serial execution — uses a 1-thread pool */ });
// Note: with_num_threads creates a temporary ThreadPool per call; avoid it in hot paths.

Parallel dispatch thresholds are calibrated empirically per operation class and architecture during the benchmark suite, not hardcoded. The initial defaults (subject to tuning):
- Elementwise ops (memory-bandwidth bound): parallel above ~10M elements on typical 4-core hardware
- Transcendental ufuncs (exp, log, sin — compute-bound): parallel above ~1M elements
- Reductions (sum, min, max): parallel above ~10k elements with tree-reduce
- Matrix multiplication: always parallel via faer's threading model
Thresholds are exposed as configurable constants (ferray::config::PARALLEL_THRESHOLD_ELEMENTWISE, etc.) so downstream users can tune for their hardware. All parallel code must be deterministic wherever the result does not depend on floating-point associativity; reductions may differ in operand ordering across parallel runs, and this caveat is documented explicitly.
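The tree-reduce mentioned for reductions can be sketched serially; a parallel version would recurse into rayon::join at the same fixed split points, which is one way to keep operand ordering deterministic regardless of thread count (a sketch, not ferray's confirmed internals):

```rust
// Pairwise (tree) summation: splits at fixed midpoints, so the reduction
// order does not depend on how work is later distributed across threads.
// It also improves worst-case rounding error to O(log n) growth versus a
// left-to-right fold's O(n).
fn pairwise_sum(x: &[f64]) -> f64 {
    const LEAF: usize = 8; // below this, a plain serial sum is cheapest
    if x.len() <= LEAF {
        return x.iter().sum();
    }
    let mid = x.len() / 2;
    // In the parallel variant these two calls become rayon::join(..).
    pairwise_sum(&x[..mid]) + pairwise_sum(&x[mid..])
}

fn main() {
    let data: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    assert_eq!(pairwise_sum(&data), 5050.0);
}
```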
ferray arrays must implement a safe buffer protocol allowing zero-copy interoperation with other memory-owning types:
pub trait AsRawBuffer {
fn as_ptr(&self) -> *const u8;
fn shape(&self) -> &[usize];
fn strides(&self) -> &[isize];
fn dtype(&self) -> DType;
fn is_c_contiguous(&self) -> bool;
fn is_f_contiguous(&self) -> bool;
}

A ferray-numpy companion crate (feature-flagged, not part of the main crate) provides zero-copy conversion between ferray::Array and NumPy arrays via PyO3:
#[pyo3::pyfunction]
fn process(py: Python, np_array: PyReadonlyArray2<f64>) -> PyResult<Py<PyArray2<f64>>> {
let arr: ferray::Array2<f64> = np_array.as_ferray()?; // zero-copy
let result = arr.mapv(f64::sqrt);
Ok(result.into_pyarray(py).into())
}

ferray arrays must be convertible to/from Apache Arrow ArrayData with zero copy where memory layouts are compatible:
let arrow_array: arrow::array::Float64Array = ferray_array.to_arrow()?;
let ferray_array: ferray::Array1<f64> = arrow_array.into_ferray()?;

All public functions return Result<T, FerrayError>. Panics are forbidden in library code.
#[non_exhaustive]
pub enum FerrayError {
ShapeMismatch { lhs: Vec<usize>, rhs: Vec<usize> },
BroadcastFailure { lhs: Vec<usize>, rhs: Vec<usize> },
AxisOutOfBounds { axis: usize, ndim: usize },
IndexOutOfBounds { index: usize, size: usize },
SingularMatrix,
ConvergenceFailure { iterations: usize },
InvalidDtype { expected: DType, got: DType },
NumericalInstability { context: String },
IoError(std::io::Error),
InvalidValue { context: String },
}

The full six-layer correctness stack from ferrolearn Section 20 applies to ferray in its entirety, with the following ferray-specific additions:
Oracle fixture generation must use NumPy's output as ground truth. For every function, fixtures must cover all supported dtypes (f32, f64, Complex<f64>, relevant integer types), all valid shapes from 0-D to 5-D, and all edge cases (empty arrays, single-element arrays, NaN/Inf inputs, very large and very small values).
ufunc correctness requires special attention: broadcasting must be verified to produce bit-identical results to NumPy for all valid broadcast combinations, not just the simple cases. A fixture generator must enumerate a representative set of broadcast shape pairs for each ufunc.
SIMD path verification: both the SIMD and scalar code paths must be tested against the same fixture set. A test flag FERRAY_FORCE_SCALAR=1 must disable SIMD dispatch and allow running all fixture tests in scalar mode.
Numerical parity tolerance follows the same ULP budget as ferrolearn, with one addition: for transcendental functions (sin, exp, log, etc.) the budget is 1 ULP for correctly-rounded implementations (matching NumPy's target) and at most 4 ULPs for approximation-based implementations, with the budget documented per function.
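The ULP budget check itself is straightforward to implement; a sketch of the harness helper (the name ulp_distance is hypothetical, and the sketch assumes finite, non-NaN inputs):

```rust
// ULP distance between two finite f64 values: map the IEEE-754 bit pattern
// onto a monotonically ordered integer line, then take the difference.
fn ulp_distance(a: f64, b: f64) -> u64 {
    fn ordered_key(x: f64) -> i64 {
        let bits = x.to_bits() as i64;
        // Negative floats sort in reverse bit order; flip them onto the line
        // so that -0.0 and +0.0 both map to 0.
        if bits < 0 { i64::MIN - bits } else { bits }
    }
    (ordered_key(a) as i128 - ordered_key(b) as i128).unsigned_abs() as u64
}

fn main() {
    assert_eq!(ulp_distance(1.0, 1.0), 0);
    // 1.0 + EPSILON is the very next representable double after 1.0.
    assert_eq!(ulp_distance(1.0, 1.0 + f64::EPSILON), 1);
    assert_eq!(ulp_distance(-0.0, 0.0), 0);
}
```

A fixture test then asserts ulp_distance(ferray_output, numpy_output) against the per-function budget.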
ferray/
├── ferray/ # Main crate — one import covers everything
├── ferray-core/ # NdArray type, Dimension, DType, FerrayError
├── ferray-ufunc/ # Ufunc trait + all elementwise math
├── ferray-linalg/ # numpy.linalg — wraps faer
├── ferray-fft/ # numpy.fft — wraps rustfft
├── ferray-random/ # numpy.random — Generator/BitGenerator model
├── ferray-polynomial/ # numpy.polynomial — all polynomial classes
├── ferray-strings/ # numpy.strings — vectorized string ops
├── ferray-stats/ # Statistical functions
├── ferray-io/ # .npy/.npz, loadtxt, savetxt
├── ferray-ma/ # Masked arrays
├── ferray-stride-tricks/ # as_strided, sliding_window_view
└── ferray-numpy/ # PyO3 interop (optional, behind feature flag)
Feature flags on the main crate:
| Feature | Default | Description |
|---|---|---|
| full | No | Enables all features |
| blas | No | Link system BLAS for matmul and linalg |
| rayon | Yes | Parallel execution |
| simd | Yes | SIMD-accelerated inner loops |
| complex | Yes | Complex number support |
| f16 | No | Half-precision float support |
| strings | Yes | ferray::strings module |
| ma | Yes | ferray::ma masked array module |
| io | Yes | .npy/.npz file I/O |
| serde | No | Serialize/deserialize arrays via serde |
| arrow | No | Apache Arrow interop |
| polars | No | Polars DataFrame interop |
| numpy | No | PyO3 NumPy interop (requires pyo3) |
| no_std | No | Disable std for embedded targets (core numeric ops only) |
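Under these flags, a typical downstream manifest might look like the following (a sketch; the crate name and version are this proposal's, not a published release):

```toml
[dependencies]
# Defaults already include rayon, simd, complex, strings, ma, io
ferray = "0.1"

# Or: opt into everything, including the interop layers
# ferray = { version = "0.1", features = ["full"] }

# Or: minimal numeric core for embedded use
# ferray = { version = "0.1", default-features = false, features = ["no_std"] }
```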
Minimum Supported Rust Version: 1.78 (stable, May 2024). Note that std::simd (portable SIMD) is not yet stabilized; on stable toolchains the SIMD paths are built on std::arch intrinsics with runtime dispatch (per the SIMD section above), migrating to std::simd when it reaches stable.
The public API of ferray must reach 1.0 before ferrolearn can reach 1.0, since ferrolearn depends on ferray as a primitive. The API stability contract at 1.0:
- No breaking changes to public types, traits, or function signatures without a major version bump
- Numerical output stability: for a fixed compiler version, target architecture, and feature flag set, output does not change between patch versions. A global guarantee of ≤1 ULP across compiler and toolchain updates is not feasible — LLVM floating-point optimization passes and FMA legalization differ between LLVM versions, and locking the LLVM version across patch releases is impractical. The guarantee is scoped to a fixed build environment.
- Changes that alter numerical output under a fixed build environment (e.g., replacing an approximation algorithm) are breaking with respect to the numerical-stability contract and require at least a minor version bump, with a changelog entry documenting the maximum observed difference.
- NdArray<T, D> with full ownership model
- All dtypes: f32, f64, Complex<f64>, i32, i64, u8, bool
- Full broadcasting (all NumPy broadcast cases)
- Basic and advanced (fancy + boolean) indexing
- All array creation routines
- All array manipulation (reshape, concatenate, stack, transpose, etc.)
- Complete ufunc surface (all math, bitwise, comparison, logical)
- SIMD paths for f32 and f64 on x86_64 and aarch64
- ferray::stats (sum, mean, var, sort, unique, etc.)
- .npy/.npz I/O

- ferray::linalg — full numpy.linalg parity including einsum
- ferray::fft — full numpy.fft parity with plan caching
- ferray::random — full Generator API with all distributions
- ferray::polynomial — all polynomial classes
- Rayon parallelism throughout
- Fixture test suite against NumPy for all P0 functions

- ferray::strings — full numpy.strings
- ferray::ma — masked arrays
- ferray::stride_tricks
- Structured dtype support via derive macro
- Arrow and Polars interop
- ferray-numpy PyO3 companion crate
- Fuzz corpus (24 CPU-hours)
- Statistical equivalence benchmark suite

- f16 support throughout (currently gated behind a feature flag due to limited hardware support in std::simd)
- no_std core for embedded targets
- Compile-time dimensionality checking via const generics (static shapes à la nalgebra, opt-in via Array<T, Shape<3, 4>> syntax)
- Automatic differentiation support via operator overloading (foundation for an autograd layer)
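The const-generics item above can already be sketched on stable Rust (StaticMatrix and its methods are hypothetical stand-ins, not the proposed Array<T, Shape<R, C>> API): a dimension mismatch in matmul is rejected by the compiler rather than surfacing at runtime.

```rust
// Dimensions as const generics: the inner dimension C must match between
// the two operands, so StaticMatrix::<2, 3>.matmul(&StaticMatrix::<2, 2>)
// simply does not type-check.
struct StaticMatrix<const R: usize, const C: usize> {
    data: [[f64; C]; R],
}

impl<const R: usize, const C: usize> StaticMatrix<R, C> {
    fn zeros() -> Self {
        Self { data: [[0.0; C]; R] }
    }

    fn matmul<const K: usize>(&self, rhs: &StaticMatrix<C, K>) -> StaticMatrix<R, K> {
        let mut out = StaticMatrix::<R, K>::zeros();
        for i in 0..R {
            for j in 0..K {
                for k in 0..C {
                    out.data[i][j] += self.data[i][k] * rhs.data[k][j];
                }
            }
        }
        out
    }
}

fn main() {
    let a = StaticMatrix::<2, 3> { data: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]] };
    let b = StaticMatrix::<3, 1> { data: [[1.0], [1.0], [1.0]] };
    let c = a.matmul(&b); // type: StaticMatrix<2, 1>
    assert_eq!(c.data, [[6.0], [15.0]]);
}
```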
| Crate / Library | Relationship |
|---|---|
| ndarray | Used internally as the storage backend for ferray-core. Not part of the public API; insulates users from ndarray version churn |
| faer | Powers all ferray::linalg operations internally |
| rustfft | Powers ferray::fft internally |
| nalgebra | Complementary — fixed-size, compile-time-checked matrices; ferray covers the dynamic-shape case |
| ndarray-linalg | Superseded by ferray::linalg for users who want a single dependency |
| ndarray-rand | Superseded by ferray::random |
| ndarray-stats | Superseded by ferray statistics methods |
| num-complex | Complex<f32> / Complex<f64> used as element types |
| candle | Deep learning framework that could depend on ferray as its array primitive |
| polars | DataFrame library; ferray arrays interop with Polars Series via Arrow |
| ferrolearn | Depends on ferray as its foundational array layer |
Key improvements over NumPy itself:
- Mutability aliasing bugs are impossible — the borrow checker enforces what NumPy's view system only documents (as_strided is explicitly an unsafe fn with documented caller preconditions)
- predict() on unfitted models and similar logical errors are compile errors, not runtime exceptions (inherited from ferrolearn's design)
- SIMD dispatch is more explicit and auditable than NumPy's internal C/Fortran dispatch, with runtime CPU feature detection for AVX2/AVX-512
- Uninitialized arrays (empty) return Array<MaybeUninit<T>, D>, forcing explicit initialization before use rather than silently handing back a typed array with garbage values
- Broadcasting failures are compile errors for statically known shapes (Phase 4 const generics work)
- Mixed-type operations require explicit promotion rather than silent type coercion
- All parallel code is explicitly opt-in via a ferray-owned Rayon pool rather than implicitly controlled by environment variables (contrast: NumPy's OMP_NUM_THREADS)
- Zero-copy views are zero-cost and enforced by the type system rather than tracked at runtime