```
num-wasm/
├── build.zig         # Zig build config (wasm32-freestanding target)
├── src/
│   ├── ndarray.zig   # Core NDArray struct, get/set, creation, shape ops
│   ├── ops.zig       # Element-wise and reduction operations
│   ├── linalg.zig    # dot, matmul
│   ├── wasm_api.zig  # WASM exports (thin wrapper over core)
│   └── root.zig      # Module root, re-exports ndarray
├── web/
│   ├── numwasm.js    # JS glue: typed API over raw WASM pointers
│   └── test.js       # Node.js test runner
└── tests/            # (future) additional test files
```
Design principle: Keep it simple. One data type (`f64`), flat array storage, no strides. All core logic lives in platform-agnostic `.zig` files. `wasm_api.zig` is a thin wrapper for WASM exports. Test natively with `zig build test`; test WASM with `node web/test.js`.

Simplification choices (can be upgraded later):
- `f64` only — no DType enum, no generic type dispatch
- Flat `[]f64` — no `[*]u8` pointer casting, no strides
- Always copy — no views, no ownership tracking. Simpler memory model.
- Row-major (C-contiguous) only — no Fortran order
Goal: Prove Zig → WASM → JS works end-to-end.
- 1.1 `zig init` the project, configure `build.zig` with the `wasm32-freestanding` target, `.entry = .disabled`, `.rdynamic = true`, optimize for `.ReleaseFast`
- 1.2 Write a single `export fn add(a: i32, b: i32) i32` in `wasm_api.zig`
- 1.3 Write `web/test.js` (Node.js) that loads the `.wasm` via `WebAssembly.instantiate`, tests `add()`, `wasm_alloc`/`free`, and `sum_f64`
- 1.4 Export `wasm_alloc(len) → [*]u8` and `wasm_free(ptr, len)` using `std.heap.wasm_allocator` — this is the memory bridge for everything later
- 1.5 Test passing a `Float64Array` from JS into WASM: allocate a buffer in WASM, copy JS data into it via a `Float64Array` view of `memory.buffer`, call a Zig `sum_f64(ptr, len) → f64`, verify the result via `node web/test.js`
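The JS half of step 1.5 can be sketched without the compiled module: the same `Float64Array`-view-of-`memory.buffer` copy works against a bare `WebAssembly.Memory`. The `ptr` offset here is hardcoded for illustration; in the real test it would come from `wasm_alloc`.

```javascript
// Sketch of the JS side of the memory bridge (step 1.5), using a bare
// WebAssembly.Memory instead of an instantiated module.
const memory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 64 KiB

const data = [1.5, 2.5, 3.0];
const ptr = 8; // pretend wasm_alloc returned byte offset 8 (8-byte aligned)

// View the WASM linear memory as f64 and copy the JS data in.
const view = new Float64Array(memory.buffer, ptr, data.length);
view.set(data);

// What the Zig sum_f64(ptr, len) would compute over that buffer:
const sum = Array.from(view).reduce((acc, x) => acc + x, 0);
console.log(sum); // 7
```

Note the alignment assumption: a `Float64Array` view requires its byte offset to be a multiple of 8, which is why the allocator side matters.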
Goal: Build the struct everything else depends on. Simple flat []f64 storage.
- 2.1 Define the `NDArray` struct in `src/ndarray.zig`: `data: []f64`, `shape: []usize`, `ndim: usize`, `allocator: Allocator`
- 2.2 Implement `init(allocator, shape) → NDArray` — compute the total element count from the shape, allocate a `[]f64` of that size, duplicate the shape slice (the caller may pass stack memory)
- 2.3 Implement `deinit(self)` — free `data`, then free `shape`
- 2.4 Implement `flatIndex(self, indices) → usize` — convert N-dimensional indices to a flat index. For shape `(3, 4)`: `flatIndex([1, 2])` = `1*4 + 2` = `6`. Formula: walk right-to-left, multiplying by the cumulative product of dimensions
- 2.5 Implement `getItem(self, indices) → f64` — calls `flatIndex`, returns `self.data[flat]`
- 2.6 Implement `setItem(self, indices, value)` — calls `flatIndex`, writes `self.data[flat] = value`
- 2.7 Update `src/root.zig` to re-export: `pub const NDArray = @import("ndarray.zig").NDArray;`
- 2.8 Write tests: create a `(3, 4)` array, set/get values, verify the `flatIndex` math, verify memory cleanup with `std.testing.allocator`
Index math (no strides needed):
- `flatIndex([i, j])` for shape `(R, C)` = `i*C + j`
- `flatIndex([i, j, k])` for shape `(A, B, C)` = `i*B*C + j*C + k`
- General formula: `flat = Σ(indices[d] * product(shape[d+1..]))`
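The same index math in JavaScript, as a sanity check for what the Zig `flatIndex` should compute (shape and indices are plain arrays here):

```javascript
// flat = Σ(indices[d] * product(shape[d+1..])): walk right-to-left,
// carrying the cumulative product of the trailing dimensions.
function flatIndex(indices, shape) {
  let flat = 0;
  let stride = 1; // product of shape[d+1..], built up right-to-left
  for (let d = shape.length - 1; d >= 0; d--) {
    flat += indices[d] * stride;
    stride *= shape[d];
  }
  return flat;
}

console.log(flatIndex([1, 2], [3, 4]));       // 1*4 + 2 = 6
console.log(flatIndex([1, 2, 3], [2, 3, 4])); // 1*12 + 2*4 + 3 = 23
```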
Goal: Equivalent of np.zeros(), np.ones(), np.arange(), etc.
- 3.1 `zeros(allocator, shape) → NDArray` — init + fill data with `0.0` using `@memset`
- 3.2 `ones(allocator, shape) → NDArray` — init + fill data with `1.0`
- 3.3 `full(allocator, shape, value) → NDArray` — init + fill data with an arbitrary value
- 3.4 `arange(allocator, start, stop, step) → NDArray` — compute count = `@intFromFloat(@ceil((stop - start) / step))`, create a 1D array with `shape = [count]`, fill with `start, start+step, start+2*step, ...`
- 3.5 `linspace(allocator, start, stop, count) → NDArray` — 1D array of `count` evenly spaced values from `start` to `stop` inclusive
- 3.6 Export the creation functions via `wasm_api.zig`, test from Node.js
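The count and spacing math for 3.4 and 3.5, mirrored in JavaScript (function names are illustrative; the Zig versions allocate an NDArray instead of returning a plain array):

```javascript
// arange: count = ceil((stop - start) / step); values are start + i*step.
function arange(start, stop, step) {
  const count = Math.ceil((stop - start) / step);
  return Array.from({ length: count }, (_, i) => start + i * step);
}

// linspace: count evenly spaced values, endpoints inclusive,
// so the step is (stop - start) / (count - 1).
function linspace(start, stop, count) {
  const step = (stop - start) / (count - 1);
  return Array.from({ length: count }, (_, i) => start + i * step);
}

console.log(arange(0, 5, 2));   // [0, 2, 4]
console.log(linspace(0, 1, 5)); // [0, 0.25, 0.5, 0.75, 1]
```

The `ceil` matters: `arange(0, 5, 2)` must produce 3 elements, not 2, matching NumPy's half-open `[start, stop)` convention.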
Goal: reshape, transpose, flatten. All return new arrays (copies).
- 4.1 `reshape(self, allocator, new_shape) → NDArray` — validate `product(new_shape) == product(old_shape)`, create a new NDArray, copy data (`@memcpy`), assign the new shape
- 4.2 `transpose(self, allocator) → NDArray` — 2D only. Create a new `(cols, rows)` array, copy `result[j][i] = self[i][j]`
- 4.3 `flatten(self, allocator) → NDArray` — reshape to `[total_elements]`
- 4.4 `squeeze(self, allocator) → NDArray` — remove dimensions of size 1 from the shape, copy data
- 4.5 Tests: reshape `(2, 6)` → `(3, 4)`, verify values are preserved. Transpose `(3, 4)` → verify `(4, 3)` with correct element positions
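The transpose copy in 4.2, written out for flat row-major storage in JavaScript (a sketch of the loop the Zig version would run):

```javascript
// 2D transpose on flat row-major storage: out[j][i] = in[i][j].
// (rows, cols) is the input shape; the output shape is (cols, rows).
function transpose2d(data, rows, cols) {
  const out = new Float64Array(rows * cols);
  for (let i = 0; i < rows; i++) {
    for (let j = 0; j < cols; j++) {
      out[j * rows + i] = data[i * cols + j];
    }
  }
  return out;
}

// (2, 3) → (3, 2)
console.log(Array.from(transpose2d([1, 2, 3, 4, 5, 6], 2, 3)));
// [1, 4, 2, 5, 3, 6]
```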
Goal: Enable operations between arrays of different shapes.
- 5.1 Implement `broadcastShapes(allocator, shape_a, shape_b) → []usize` — align from the right; each pair must be equal or one must be 1; result = max of the pair; error if incompatible
- 5.2 Implement `broadcastIndex(indices, original_shape, broadcast_shape) → []usize` — for each dimension, if `original_shape[d] == 1`, use index `0` regardless of the broadcast index (the element repeats)
- 5.3 Tests:
  - `(3, 4) + (4,)` → `(3, 4)`
  - `(3, 1) + (1, 4)` → `(3, 4)`
  - scalar + `(3, 4)` → `(3, 4)`
  - `(3, 4) + (3, 5)` → error
How broadcasting works without strides: Instead of the stride=0 trick, use index remapping. When iterating the output shape, map each output index back to each input's index — if an input dim is 1, always use 0 for that dimension.
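The two helpers from 5.1 and 5.2 sketched in JavaScript; errors are thrown here where the Zig versions would return an error union:

```javascript
// broadcastShapes: align from the right; each pair must be equal or
// contain a 1; the result takes the max of each pair.
function broadcastShapes(a, b) {
  const ndim = Math.max(a.length, b.length);
  const out = new Array(ndim);
  for (let d = 1; d <= ndim; d++) {
    const da = a[a.length - d] ?? 1; // missing leading dims act as size 1
    const db = b[b.length - d] ?? 1;
    if (da !== db && da !== 1 && db !== 1) {
      throw new Error(`incompatible shapes: (${a}) vs (${b})`);
    }
    out[ndim - d] = Math.max(da, db);
  }
  return out;
}

// broadcastIndex: map an output index back into an input. Wherever the
// input dim is 1, pin that index to 0 (the single element repeats).
function broadcastIndex(outIndex, originalShape) {
  const offset = outIndex.length - originalShape.length;
  return originalShape.map((dim, d) => (dim === 1 ? 0 : outIndex[offset + d]));
}

console.log(broadcastShapes([3, 1], [1, 4])); // [3, 4]
console.log(broadcastIndex([2, 3], [3, 1]));  // [2, 0]
```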
Goal: add(a, b), sqrt(a), etc. — using broadcasting.
- 6.1 Implement `binaryOp(allocator, a, b, comptime op_fn) → NDArray` — broadcast the shapes, allocate the output, iterate all indices in the output, use `broadcastIndex` to fetch values from `a` and `b`, apply op
- 6.2 Wire up: `add`, `subtract`, `multiply`, `divide`
- 6.3 Implement `unaryOp(allocator, a, comptime op_fn) → NDArray` — allocate an output with the same shape, apply op to each element
- 6.4 Wire up: `negate`, `abs`, `sqrt`, `exp`, `log` — use the `@sqrt`, `@exp`, `@log` builtins and `std.math`
- 6.5 Scalar-array ops: `addScalar`, `mulScalar` — simpler fast path, no broadcasting needed
- 6.6 Export the key ops via WASM, test from Node.js
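A compact 2D-only sketch of the `binaryOp` loop from 6.1 in JavaScript, with broadcasting done by inline index remapping (size-1 dims read index 0). The name `binaryOp2d` and the flat-array calling convention are illustrative; it assumes the shapes are already broadcast-compatible:

```javascript
// binaryOp, 2D case: broadcast two flat row-major arrays and apply op.
function binaryOp2d(a, shapeA, b, shapeB, op) {
  const rows = Math.max(shapeA[0], shapeB[0]);
  const cols = Math.max(shapeA[1], shapeB[1]);
  const out = new Float64Array(rows * cols);
  for (let i = 0; i < rows; i++) {
    for (let j = 0; j < cols; j++) {
      // Index remapping: a size-1 dim always reads index 0.
      const ia = (shapeA[0] === 1 ? 0 : i) * shapeA[1] + (shapeA[1] === 1 ? 0 : j);
      const ib = (shapeB[0] === 1 ? 0 : i) * shapeB[1] + (shapeB[1] === 1 ? 0 : j);
      out[i * cols + j] = op(a[ia], b[ib]);
    }
  }
  return out;
}

// (3, 1) + (1, 4) → (3, 4)
const col = [10, 20, 30];  // shape (3, 1)
const row = [1, 2, 3, 4];  // shape (1, 4)
const result = binaryOp2d(col, [3, 1], row, [1, 4], (x, y) => x + y);
console.log(Array.from(result));
// [11, 12, 13, 14, 21, 22, 23, 24, 31, 32, 33, 34]
```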
Goal: sum(), mean(), max(), min() — with optional axis.
- 7.1 Full reduction (no axis): iterate `self.data`, accumulate → return `f64`
- 7.2 Axis reduction: compute the output shape (the input shape with the axis removed). Iterate output indices; for each one, loop over the reduced dimension and accumulate
- 7.3 Implement a generic `reduce(allocator, arr, comptime op, comptime identity, axis)` — shared iteration logic
- 7.4 Wire up: `sum`, `mean` (sum / count), `max`, `min`, `prod`
- 7.5 `argmax`, `argmin` — return the index of the best value
- 7.6 Tests: `(3, 4)` sum axis 0 → `(4,)`, sum axis 1 → `(3,)`, no axis → scalar
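The axis-reduction iteration from 7.2, sketched for the 2D sum case in JavaScript (the generic Zig `reduce` would parameterize the op and identity):

```javascript
// Sum over one axis of a flat row-major (rows, cols) array.
// Output shape = input shape with the reduced axis removed.
function sumAxis2d(data, rows, cols, axis) {
  const outLen = axis === 0 ? cols : rows;
  const out = new Float64Array(outLen);
  for (let i = 0; i < rows; i++) {
    for (let j = 0; j < cols; j++) {
      const o = axis === 0 ? j : i; // which output slot accumulates
      out[o] += data[i * cols + j];
    }
  }
  return out;
}

const m = [1, 2, 3, 4, 5, 6]; // shape (2, 3)
console.log(Array.from(sumAxis2d(m, 2, 3, 0))); // [5, 7, 9]  (shape (3,))
console.log(Array.from(sumAxis2d(m, 2, 3, 1))); // [6, 15]    (shape (2,))
```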
Goal: Extract sub-arrays. All operations return copies (no views in simplified approach).
- 8.1 `slice(self, allocator, dim, start, stop, step) → NDArray` — compute the output shape, iterate and copy matching elements
- 8.2 Multi-dim slicing: chain single-dim slices
- 8.3 `indexAxis(self, allocator, dim, i) → NDArray` — select one position along `dim`, reduce `ndim` by 1
- 8.4 Boolean masking `where(self, allocator, mask) → NDArray` — copy elements where the mask is non-zero
- 8.5 Negative indexing: `shape[dim] + index` for negative values
- 8.6 Tests: slice a `(4, 5)` array, verify correct elements. Negative index. Boolean mask
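The negative-index normalization (8.5) and a single-dim slice along axis 0 (8.1), sketched in JavaScript; `sliceRows` is a hypothetical helper name and assumes already-normalized, non-negative bounds:

```javascript
// Negative indexing: shape[dim] + index for negative values.
function normalizeIndex(i, dimSize) {
  return i < 0 ? dimSize + i : i;
}

// Slice along axis 0 of a flat row-major array: copy whole rows
// from [start, stop) with the given step.
function sliceRows(data, cols, start, stop, step) {
  const out = [];
  for (let i = start; i < stop; i += step) {
    out.push(...data.slice(i * cols, (i + 1) * cols));
  }
  return out;
}

const m = [1, 2, 3, 4, 5, 6, 7, 8]; // shape (4, 2)
console.log(normalizeIndex(-1, 4));    // 3 (last row)
console.log(sliceRows(m, 2, 0, 4, 2)); // rows 0 and 2 → [1, 2, 5, 6]
```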
Goal: Dot product, matrix multiply.
- 9.1 `dot(a, b) → f64` — 1D only, `Σ(a.data[i] * b.data[i])`
- 9.2 `matmul(allocator, a, b) → NDArray` — 2D, `(m,k) × (k,n) → (m,n)`, naive triple loop
- 9.3 `outer(allocator, a, b) → NDArray` — `result[i*n + j] = a[i] * b[j]`
- 9.4 Tests: dot product, 2×3 times 3×2 matmul, outer product
Goal: Clean JS API that hides pointer/memory management.
- 10.1 `NumWasm` class — async `init()` loads the WASM, stores the exports
- 10.2 `NdArray` JS class — holds pointer + shape. Methods: `.toArray()`, `.toTypedArray()`, `.free()`
- 10.3 `nw.array(jsData)` — detect the shape from nesting, flatten, copy into WASM, return an `NdArray`
- 10.4 Method wrappers: `nw.add(a, b)`, `nw.sum(a, {axis: 0})`, `nw.reshape(a, [3, 4])`, etc.
- 10.5 `FinalizationRegistry` — auto-cleanup with a console warning if the user forgets `.free()`
- 10.6 Error handling — check for null pointers from WASM (allocation failures), throw typed JS errors
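A sketch of the 10.5 safety net. The `NdArray` wrapper shape is assumed from 10.2, and `wasmFree` (commented out) is a stand-in for the real `wasm_free` export; note that `FinalizationRegistry` callbacks run at the GC's discretion, so this is a leak detector, not a deterministic destructor:

```javascript
// Warn (and reclaim WASM memory) when an NdArray is GC'd without .free().
const registry = new FinalizationRegistry(({ ptr, len }) => {
  console.warn(`NdArray leaked: ptr=${ptr}, len=${len} — call .free() explicitly`);
  // wasmFree(ptr, len); // hypothetical binding to the wasm_free export
});

class NdArray {
  constructor(ptr, len) {
    this.ptr = ptr;
    this.len = len;
    // Register with `this` as the unregister token so .free() can cancel.
    registry.register(this, { ptr, len }, this);
  }
  free() {
    // wasmFree(this.ptr, this.len); // hypothetical binding
    this.ptr = 0;
    registry.unregister(this); // explicit free: no warning later
  }
}
```

The held value is a plain `{ ptr, len }` object rather than the `NdArray` itself; holding the target would keep it alive and the callback would never fire.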
Target API:

```js
const nw = await NumWasm.init();
const a = nw.array([[1, 2, 3], [4, 5, 6]]);
const b = nw.ones([2, 3]);
const c = nw.add(a, b);
const s = nw.sum(c, { axis: 0 });
console.log(s.toArray()); // [7, 9, 11]
a.free(); b.free(); c.free(); s.free();
```

Tips:
- Test natively first — `zig build test` is instant. Don't debug in the browser until you must.
- `std.testing.allocator` detects memory leaks — if your `deinit` misses a free, the test fails. This is your safety net.
- Print from WASM — import a `consoleLog(ptr, len)` function from the JS env for debugging.
- Upgrade path — when you hit performance limits later, the upgrade to strides + `[*]u8` + DType is mechanical: same API surface, different internals.
These aren't needed now, but here's how to evolve the design:
| Upgrade | When | What changes |
|---|---|---|
| Strides | Transpose/slice copies become a bottleneck | Replace `[]f64` with `[*]u8` + strides, views instead of copies |
| Multi-dtype | Need `i32`/`f32`/`u8` arrays | Add a DType enum, use `[*]u8` with `@ptrCast` for type dispatch |
| WASM SIMD | Element-wise ops need more speed | Use `@Vector(4, f64)` in hot loops |
| Views | Memory usage too high from copies | Add an `owns_data` flag + base pointer |