Interactive examples demonstrating WebGPU primitives and operations.
+
+
+
+
Scan Example
+
Compact example that demonstrates exclusive vs inclusive scans using DLDFScan: how to choose scanType, build the
+ input array, run the primitive, and validate outputs. Good for understanding prefix-sum semantics and the effect of
+ different binary ops (add/max/min).
+
Open the page, pick parameters in the pane (datatype/binop/input length), click Start, then inspect the result
+ and plots.
+
+
Scan Types
+
Exclusive: output[i] is the combination, under the chosen binary operation, of all elements before position i (a sum for Add); output[0] is the operation's identity.
+
Inclusive: output[i] is the combination of all elements up to and including position i.
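
The two scan types can be checked against a small CPU reference. The sketch below is plain JavaScript and does not call Gridwise; `scanReference`, its parameters, and the sample input are hypothetical names chosen for illustration.

```javascript
// CPU reference for scan semantics (illustrative; not part of Gridwise).
// `op` is an associative binary function and `identity` is its identity element.
function scanReference(input, op, identity, type) {
  const output = new Array(input.length);
  let running = identity;
  for (let i = 0; i < input.length; i++) {
    if (type === "exclusive") {
      output[i] = running; // combination of elements strictly before i
      running = op(running, input[i]);
    } else {
      running = op(running, input[i]);
      output[i] = running; // combination of elements up to and including i
    }
  }
  return output;
}

const add = (a, b) => a + b;
const exclusive = scanReference([3, 1, 4, 1, 5], add, 0, "exclusive");
const inclusive = scanReference([3, 1, 4, 1, 5], add, 0, "inclusive");
console.log(exclusive); // [0, 3, 4, 8, 9]
console.log(inclusive); // [3, 4, 8, 9, 14]
```

A reference like this is one way to validate GPU output element by element after readback.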
+
+
Binary Operations
+
Add, Min, Max: each combines elements using the specified operation. Custom operations are supported via binop.mjs.
+
+
Supported Data Types
+
u32, i32, f32 — all data types work with all binary operations.
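
Because every datatype can be paired with every operation, the identity element (the value an exclusive scan writes to output[0]) depends on both. The table below lists the mathematical identities as a sketch; `IDENTITY`, `OPS`, and `identityHolds` are hypothetical names, and the values Gridwise actually uses are whatever binop.mjs defines.

```javascript
// Mathematical identity per operation and datatype (illustrative; not Gridwise's API).
const IDENTITY = {
  add: { u32: 0, i32: 0, f32: 0 },
  min: { u32: 0xffffffff, i32: 0x7fffffff, f32: Infinity },
  max: { u32: 0, i32: -0x80000000, f32: -Infinity },
};

const OPS = {
  add: (a, b) => a + b,
  min: (a, b) => Math.min(a, b),
  max: (a, b) => Math.max(a, b),
};

// Returns true if op(identity, x) === x for every sample value.
function identityHolds(opName, datatype, samples) {
  const id = IDENTITY[opName][datatype];
  return samples.every((x) => OPS[opName](id, x) === x);
}

console.log(identityHolds("min", "u32", [0, 7, 0xffffffff])); // true
console.log(identityHolds("max", "i32", [-5, 0, 0x7fffffff])); // true
```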
+
+
Code Structure
+
+ // Exclusive scan with add operation
+ const primitive = new DLDFScan({
+ device,
+ binop: new BinOpAdd({ datatype: "u32" }),
+ type: "exclusive",
+ datatype: "u32"
+ });
+
+ // Inclusive scan with max operation
+ const inclusiveMaxScan = new DLDFScan({
+ device,
+ binop: new BinOpMax({ datatype: "i32" }),
+ type: "inclusive",
+ datatype: "i32"
+ });
+
Demonstrates OneSweepSort with two modes: key-only sorting and key-value pair sorting with payload validation. Shows
+ how to configure the primitive for different operations and data types.
+
Open the page, pick parameters in the pane (for example, datatype and input length), click Start, then inspect the result
+ and plots.
+
+
Sort Operations
+
Sort Keys: Sorts only the keys array in ascending or descending order.
+
Sort Pairs: Sorts keys while maintaining associated payloads, useful for sorting complex data
+ structures.
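
The difference between the two modes, and what payload validation has to check, can be sketched on the CPU. This is plain JavaScript, not OneSweepSort; `sortKeysReference` and `sortPairsReference` are hypothetical helpers, and the stable tie-breaking shown here is a property of `Array.prototype.sort`, not a claim about the GPU implementation.

```javascript
// CPU references for validating sort output (illustrative; not Gridwise's API).
function sortKeysReference(keys, direction = "ascending") {
  const sorted = Array.from(keys).sort((a, b) => a - b);
  return direction === "descending" ? sorted.reverse() : sorted;
}

// Sorts an index permutation by key so each payload travels with its key.
function sortPairsReference(keys, payloads, direction = "ascending") {
  const order = Array.from(keys, (_, i) => i);
  order.sort((i, j) =>
    direction === "descending" ? keys[j] - keys[i] : keys[i] - keys[j]
  );
  return {
    keys: order.map((i) => keys[i]),
    payloads: order.map((i) => payloads[i]),
  };
}

const pairs = sortPairsReference([30, 10, 20, 10], ["a", "b", "c", "d"]);
console.log(pairs.keys); // [10, 10, 20, 30]
console.log(pairs.payloads); // ["b", "d", "c", "a"]
```

When keys repeat, a validator that does not want to assume stability can instead check that every output (key, payload) pair also appears in the input.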
+
+
Supported Data Types
+
u32, i32, f32 — all data types support both sort modes with configurable sort direction.
A minimal, hands-on example showing how to run the reduce primitive end-to-end: device setup, buffer upload, a single
+ execution, and result readback. Use this to learn the basic API calls and validation pattern. Parameters shown:
+ datatype and binop (add/max/min). Ideal as the first example before any benchmarking.
+
Open the page, pick parameters in the pane (datatype/binop/input length), click Start, then inspect the result
+ and plots.
+
+
What is Reduce?
+
Reduces an entire array to a single value by repeatedly applying a binary operation across all elements.
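
As a point of comparison for validation, the same operation can be written as a short CPU loop. This is a sketch in plain JavaScript; `reduceReference` and the sample input are hypothetical and not part of Gridwise.

```javascript
// CPU reference for reduce (illustrative; not part of Gridwise).
// Folds `op` over `input`, starting from the operation's identity element.
function reduceReference(input, op, identity) {
  let accumulator = identity;
  for (const x of input) {
    accumulator = op(accumulator, x);
  }
  return accumulator;
}

const data = [3, 1, 4, 1, 5];
console.log(reduceReference(data, (a, b) => a + b, 0)); // 14
console.log(reduceReference(data, (a, b) => Math.min(a, b), Infinity)); // 1
console.log(reduceReference(data, (a, b) => Math.max(a, b), -Infinity)); // 5
```

For f32 with Add, a GPU reduction combines elements in a different order than this loop, and floating-point addition is not associative, so a comparison with a small tolerance is usually more appropriate than exact equality.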
+
+
Binary Operations
+
Add, Min, Max: each aggregates all elements using the specified operation. Custom operations are supported via binop.mjs.
+
+
Supported Data Types
+
u32, i32, f32 — all data types work with all binary operations.
+
+
Code Structure
+
+ // Reduce with add operation
+ const primitive = new DLDFScan({
+ device,
+ binop: new BinOpAdd({ datatype: "u32" }),
+ type: "reduce",
+ datatype: "u32"
+ });
+
+ // Reduce with min operation
+ const minReduce = new DLDFScan({
+ device,
+ binop: new BinOpMin({ datatype: "i32" }),
+ type: "reduce",
+ datatype: "i32"
+ });
+
Comprehensive guides and technical documentation for GridWise WebGPU primitives.
+
+
+
+
Architecture
+
Overview of GridWise's system design, module structure, and how primitives are organized for extensibility and
+ performance. Learn about the high-level organization of GridWise components, including how different primitives
+ (scan, reduce, sort) are implemented as modular, reusable units. Understand the architectural decisions that enable
+ performance optimization while maintaining clean separation of concerns and ease of extension.
Deep dive into the design principles behind GridWise primitives with focus on single-pass chained algorithms for
+ sort, scan, and reduce. Explores the tradeoffs between using subgroup instructions for maximum performance versus
+ software emulation for broader compatibility. Covers memory bandwidth considerations, the lookback and fallback
+ optimization techniques, and how to choose between chained algorithms and hybrid approaches for different use cases.
+
Comprehensive guide to scan (prefix sum) and reduce operations in GridWise. Explains the difference between exclusive
+ scan (first element is identity), inclusive scan (each element includes itself), and reduce (single output value).
+ Covers binary operations (Add, Min, Max), data type support (u32, i32, f32), API usage patterns with code examples,
+ and when to use each variant for optimal performance.
Complete documentation for GridWise's OneSweepSort implementation. Covers both key-only sorting and key-value pair
+ sorting with full payload support. Explains configurable sort direction (ascending/descending), supported data
+ types, buffer management strategies, and in-place versus temporary buffer approaches. Includes detailed API
+ documentation and performance characteristics across different input sizes and configurations.
Guide to binary operations used in GridWise's scan and reduce primitives. Documents available operations (Add, Min,
+ Max, Multiply) and their properties. Explains how to implement custom binary operations by extending the binop
+ interface, including implementation requirements, data type constraints, and validation patterns. Critical for users
+ who need domain-specific aggregation operations.
Best practices for allocating, managing, and optimizing GPU buffers in GridWise applications. Covers buffer creation
+ strategies, memory usage patterns, and how to minimize memory allocation overhead. Explains the relationship between
+ buffer sizes and performance, copy strategies for input/output, and how to handle edge cases with non-aligned input
+ lengths. Essential for building efficient GridWise applications.
Detailed explanation of timing mechanisms in GridWise for accurate performance measurement and benchmarking. Covers
+ both CPU timing (performance.now) and GPU timing (timestamp queries) approaches, their accuracy tradeoffs, and when
+ to use each. Explains warmup strategies, trial averaging, and how to interpret results across different hardware
+ configurations for reliable performance comparisons.
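
The CPU-timing side of this (warmup runs followed by averaged trials) can be sketched as follows. This is a minimal illustration, not Gridwise's timing code; `timeTrials`, `summarize`, and the default counts are assumptions.

```javascript
// Summarize a list of per-trial durations in milliseconds.
function summarize(samples) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  return { mean, min: Math.min(...samples), max: Math.max(...samples) };
}

// Time an async `run` callback: untimed warmup runs, then timed trials.
// Defaults are arbitrary illustration values, not Gridwise settings.
async function timeTrials(run, { warmup = 3, trials = 10 } = {}) {
  for (let i = 0; i < warmup; i++) {
    await run(); // warmup: results discarded
  }
  const samples = [];
  for (let i = 0; i < trials; i++) {
    const start = performance.now();
    await run();
    samples.push(performance.now() - start);
  }
  return { ...summarize(samples), samples };
}
```

For GPU work, `run` would typically submit commands and then await `device.queue.onSubmittedWorkDone()` so that each sample covers completion. A measurement taken this way also includes submission and synchronization overhead, which is one reason timestamp queries can give a tighter estimate of kernel time.
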
Detailed guide to GPU subgroups and their critical role in GridWise primitive performance. Explains what subgroups
+ are, how different GPU architectures have different subgroup sizes, and the performance benefits of subgroup
+ operations. Covers GridWise's approach to subgroup detection, optional subgroup acceleration, and fallback
+ strategies for hardware without subgroup support to maintain broad device compatibility.
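
Optional subgroup acceleration generally starts with a feature check at device creation. The sketch below uses the standard WebGPU `"subgroups"` feature name with `adapter.features.has` and `adapter.requestDevice`; the helper names are hypothetical, and how Gridwise itself performs detection is not shown here.

```javascript
// Returns the feature list to request: subgroups only when the adapter offers them.
function subgroupFeaturesFor(adapter) {
  return adapter.features.has("subgroups") ? ["subgroups"] : [];
}

// Requests a device, enabling subgroups when available and falling back otherwise.
async function requestDeviceWithOptionalSubgroups(adapter) {
  const requiredFeatures = subgroupFeaturesFor(adapter);
  const device = await adapter.requestDevice({ requiredFeatures });
  return { device, hasSubgroups: requiredFeatures.length > 0 };
}
```

In a browser the adapter would come from `await navigator.gpu.requestAdapter()`; the returned `hasSubgroups` flag can then select between a subgroup-accelerated shader variant and an emulated one.
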
Exploration of WebGPU WGSL built-in functions and how GridWise strategically selects and optimizes their use in
+ primitive implementations. Explains which built-ins provide the best performance for reduction operations,
+ aggregation patterns, and data movement. Covers vendor-specific optimizations and how to identify when built-in
+ usage versus hand-tuned WGSL code provides the best performance on different hardware.
Comprehensive guide to GridWise's approach for caching and reusing WebGPU objects (compute pipelines, bind groups,
+ buffer layouts) across multiple invocations. Explains how object caching reduces GPU state setup overhead and
+ improves throughput for repeated operations. Covers caching strategies for different primitive configurations,
+ memory management of cached objects, and invalidation patterns for long-running applications.
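
The core idea of configuration-keyed caching can be sketched with a `Map`. This is an illustration under assumptions, not Gridwise's cache: `PipelineCache`, `getOrCreate`, and the flat config shape are hypothetical.

```javascript
// Minimal configuration-keyed cache (illustrative; not Gridwise's implementation).
class PipelineCache {
  constructor() {
    this.entries = new Map();
  }

  // `config` must be a flat object of JSON-serializable values.
  // Sorting the keys makes the cache key independent of property order.
  getOrCreate(config, create) {
    const key = JSON.stringify(config, Object.keys(config).sort());
    if (!this.entries.has(key)) {
      this.entries.set(key, create(config));
    }
    return this.entries.get(key);
  }

  // Invalidation: drop everything, for example after device loss.
  clear() {
    this.entries.clear();
  }
}

let created = 0;
const cache = new PipelineCache();
const make = (config) => ({ id: ++created, config }); // stand-in for pipeline creation
const first = cache.getOrCreate({ datatype: "u32", binop: "add" }, make);
const second = cache.getOrCreate({ binop: "add", datatype: "u32" }, make);
console.log(first === second, created); // true 1
```

In real use the `create` callback would build the expensive object, for example by calling `device.createComputePipeline`, so repeated invocations with the same configuration skip that setup.
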
In-depth tutorial on implementing custom workgroup-level reduce functions in WGSL for integration with GridWise
+ primitives. Covers reduction patterns, memory synchronization with workgroup barriers, handling of non-power-of-2
+ workgroup sizes, and optimization techniques using subgroups where available. Includes complete code examples and
+ validation strategies for custom reduce operations.
diff --git a/docs/_posts/index.markdown b/docs/_posts/index.markdown
new file mode 100644
index 0000000..1e72ecf
--- /dev/null
+++ b/docs/_posts/index.markdown
@@ -0,0 +1,104 @@
+---
+layout: home
+permalink: /
+---
+
+
+
+Comprehensive guides and technical documentation for Gridwise WebGPU primitives.
+
+## Architecture
+
+Overview of Gridwise's system design, module structure, and how primitives are organized for extensibility and performance. Learn about the high-level organization of Gridwise components, including how different primitives (scan, reduce, sort) are implemented as modular, reusable units. Understand the architectural decisions that enable performance optimization while maintaining clean separation of concerns and ease of extension.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/architecture/){:target="_blank" class="doc-btn"}
+
+## Primitive Design
+
+Deep dive into the design principles behind Gridwise primitives with focus on single-pass chained algorithms for sort, scan, and reduce. Explores the tradeoffs between using subgroup instructions for maximum performance versus software emulation for broader compatibility. Covers memory bandwidth considerations, the lookback and fallback optimization techniques, and how to choose between chained algorithms and hybrid approaches for different use cases.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/primitive-design/){:target="_blank" class="doc-btn"}
+
+## Scan and Reduce
+
+Comprehensive guide to scan (prefix sum) and reduce operations in Gridwise. Explains the difference between exclusive scan (first element is identity), inclusive scan (each element includes itself), and reduce (single output value). Covers binary operations (Add, Min, Max), data type support (u32, i32, f32), API usage patterns with code examples, and when to use each variant for optimal performance.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/scan-and-reduce/){:target="_blank" class="doc-btn"}
+
+## Sort
+
+Complete documentation for Gridwise's OneSweepSort implementation. Covers both key-only sorting and key-value pair sorting with full payload support. Explains configurable sort direction (ascending/descending), supported data types, buffer management strategies, and in-place versus temporary buffer approaches. Includes detailed API documentation and performance characteristics across different input sizes and configurations.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/sort/){:target="_blank" class="doc-btn"}
+
+## Binary Operations
+
+Guide to binary operations used in Gridwise's scan and reduce primitives. Documents available operations (Add, Min, Max, Multiply) and their properties. Explains how to implement custom binary operations by extending the binop interface, including implementation requirements, data type constraints, and validation patterns. Critical for users who need domain-specific aggregation operations.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/binop/){:target="_blank" class="doc-btn"}
+
+## Buffer Management
+
+Best practices for allocating, managing, and optimizing GPU buffers in Gridwise applications. Covers buffer creation strategies, memory usage patterns, and how to minimize memory allocation overhead. Explains the relationship between buffer sizes and performance, copy strategies for input/output, and how to handle edge cases with non-aligned input lengths. Essential for building efficient Gridwise applications.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/buffer/){:target="_blank" class="doc-btn"}
+
+## Timing Strategy
+
+Detailed explanation of timing mechanisms in Gridwise for accurate performance measurement and benchmarking. Covers both CPU timing (performance.now) and GPU timing (timestamp queries) approaches, their accuracy tradeoffs, and when to use each. Explains warmup strategies, trial averaging, and how to interpret results across different hardware configurations for reliable performance comparisons.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/timing-strategy/){:target="_blank" class="doc-btn"}
+
+## Subgroup Strategy
+
+Detailed guide to GPU subgroups and their critical role in Gridwise primitive performance. Explains what subgroups are, how different GPU architectures have different subgroup sizes, and the performance benefits of subgroup operations. Covers Gridwise's approach to subgroup detection, optional subgroup acceleration, and fallback strategies for hardware without subgroup support to maintain broad device compatibility.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/subgroup-strategy/){:target="_blank" class="doc-btn"}
+
+## Built-ins Strategy
+
+Exploration of WebGPU WGSL built-in functions and how Gridwise strategically selects and optimizes their use in primitive implementations. Explains which built-ins provide the best performance for reduction operations, aggregation patterns, and data movement. Covers vendor-specific optimizations and how to identify when built-in usage versus hand-tuned WGSL code provides the best performance on different hardware.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/builtins-strategy/){:target="_blank" class="doc-btn"}
+
+## WebGPU Object Caching Strategy
+
+Comprehensive guide to Gridwise's approach for caching and reusing WebGPU objects (compute pipelines, bind groups, buffer layouts) across multiple invocations. Explains how object caching reduces GPU state setup overhead and improves throughput for repeated operations. Covers caching strategies for different primitive configurations, memory management of cached objects, and invalidation patterns for long-running applications.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/webgpu-object-caching-strategy/){:target="_blank" class="doc-btn"}
+
+## Writing a WebGPU WGSL Workgroup Reduce Function
+
+In-depth tutorial on implementing custom workgroup-level reduce functions in WGSL for integration with Gridwise primitives. Covers reduction patterns, memory synchronization with workgroup barriers, handling of non-power-of-2 workgroup sizes, and optimization techniques using subgroups where available. Includes complete code examples and validation strategies for custom reduce operations.
+
+[Read](https://gridwise-webgpu.github.io/gridwise/writing-a-webgpu-wgsl-workgroup-reduce-function/){:target="_blank" class="doc-btn"}
diff --git a/docs/docs.html b/docs/docs.html
index 0ac26be..4bc3af1 100644
--- a/docs/docs.html
+++ b/docs/docs.html
@@ -1,7 +1,7 @@
---
layout: home
title: Documentation
-permalink: /docs/
+permalink: /
---
@@ -42,7 +42,7 @@
Architecture
(scan, reduce, sort) are implemented as modular, reusable units. Understand the architectural decisions that enable
performance optimization while maintaining clean separation of concerns and ease of extension.
Covers binary operations (Add, Min, Max), data type support (u32, i32, f32), API usage patterns with code examples,
and when to use each variant for optimal performance.
types, buffer management strategies, and in-place versus temporary buffer approaches. Includes detailed API
documentation and performance characteristics across different input sizes and configurations.
interface, including implementation requirements, data type constraints, and validation patterns. Critical for users
who need domain-specific aggregation operations.
buffer sizes and performance, copy strategies for input/output, and how to handle edge cases with non-aligned input
lengths. Essential for building efficient GridWise applications.
to use each. Explains warmup strategies, trial averaging, and how to interpret results across different hardware
configurations for reliable performance comparisons.
operations. Covers GridWise's approach to subgroup detection, optional subgroup acceleration, and fallback
strategies for hardware without subgroup support to maintain broad device compatibility.
aggregation patterns, and data movement. Covers vendor-specific optimizations and how to identify when built-in
usage versus hand-tuned WGSL code provides the best performance on different hardware.
improves throughput for repeated operations. Covers caching strategies for different primitive configurations,
memory management of cached objects, and invalidation patterns for long-running applications.
workgroup sizes, and optimization techniques using subgroups where available. Includes complete code examples and
validation strategies for custom reduce operations.
\ No newline at end of file
diff --git a/docs/example.html b/docs/examples.html
similarity index 80%
rename from docs/example.html
rename to docs/examples.html
index 0130729..cf1946e 100644
--- a/docs/example.html
+++ b/docs/examples.html
@@ -1,56 +1,43 @@
---
layout: home
title: Examples
-permalink: /examples/
+permalink: /examples-guide/
---
-
-
Interactive examples demonstrating WebGPU primitives and operations.
-
+
Explore practical examples demonstrating how to use Gridwise WebGPU primitives for scan, sort, and reduce
+ operations. Each example includes code snippets, explanations of key concepts, and links to source code and
+ performance benchmarks.
Scan Example
Compact example that demonstrates exclusive vs inclusive scans using DLDFScan: how to choose scanType, build the
@@ -90,9 +77,9 @@
+class="doc-btn">Performance
\ No newline at end of file
diff --git a/docs/index.markdown b/docs/index.markdown
deleted file mode 100644
index d57b7a8..0000000
--- a/docs/index.markdown
+++ /dev/null
@@ -1,6 +0,0 @@
----
-layout: home
-permalink: /
----
-
-Gridwise provides WebGPU compute primitives in JavaScript. Its current supported primitives are reduce, scan, and sort, and it is built atop infrastructure to make the development and performance analysis of future primitives as straightforward as possible. Gridwise was developed during a sabbatical year at Google from August 2024--August 2025.
From 2d25b13770fa58287c06543038206e5cb5fa2fab Mon Sep 17 00:00:00 2001
From: jayshah1819
Date: Mon, 24 Nov 2025 19:37:56 -0500
Subject: [PATCH 11/17] small changes(location)
---
docs/_includes/header.html | 6 +++---
docs/docs.html | 6 +++---
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/docs/_includes/header.html b/docs/_includes/header.html
index 8662f35..1fac2cb 100644
--- a/docs/_includes/header.html
+++ b/docs/_includes/header.html
@@ -2,7 +2,7 @@
aggregation patterns, and data movement. Covers vendor-specific optimizations and how to identify when built-in
usage versus hand-tuned WGSL code provides the best performance on different hardware.
improves throughput for repeated operations. Covers caching strategies for different primitive configurations,
memory management of cached objects, and invalidation patterns for long-running applications.
workgroup sizes, and optimization techniques using subgroups where available. Includes complete code examples and
validation strategies for custom reduce operations.
(scan, reduce, sort) are implemented as modular, reusable units. Understand the architectural decisions that enable
performance optimization while maintaining clean separation of concerns and ease of extension.
Covers binary operations (Add, Min, Max), data type support (u32, i32, f32), API usage patterns with code examples,
and when to use each variant for optimal performance.
types, buffer management strategies, and in-place versus temporary buffer approaches. Includes detailed API
documentation and performance characteristics across different input sizes and configurations.
interface, including implementation requirements, data type constraints, and validation patterns. Critical for users
who need domain-specific aggregation operations.
buffer sizes and performance, copy strategies for input/output, and how to handle edge cases with non-aligned input
lengths. Essential for building efficient GridWise applications.
to use each. Explains warmup strategies, trial averaging, and how to interpret results across different hardware
configurations for reliable performance comparisons.
operations. Covers GridWise's approach to subgroup detection, optional subgroup acceleration, and fallback
strategies for hardware without subgroup support to maintain broad device compatibility.
aggregation patterns, and data movement. Covers vendor-specific optimizations and how to identify when built-in
usage versus hand-tuned WGSL code provides the best performance on different hardware.
improves throughput for repeated operations. Covers caching strategies for different primitive configurations,
memory management of cached objects, and invalidation patterns for long-running applications.
workgroup sizes, and optimization techniques using subgroups where available. Includes complete code examples and
validation strategies for custom reduce operations.