diff --git a/docs/README.md b/docs/README.md
index f2d56079..3712710b 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -2,7 +2,7 @@
 deepx proposes a natively distributed, automatically parallel deep learning framework that unifies training and inference.
-By strictly following the principle of separating storage, compute, and control, the framework can perform automatic distributed training and inference as one system at very large scale, handle failures such as lost GPUs or lost nodes, and improve both compute-resource utilization and system reliability.
+By strictly following the principle of separating storage, compute, communication, and control, the framework can perform automatic distributed training and inference as one system at very large scale, handle failures such as lost GPUs or lost nodes, and improve both compute-resource utilization and system reliability.
 Through an open compute-scheduling framework, vendors and developers from different backgrounds can adapt and develop their own operator engines, compute-storage engines, network engines, and so on under a unified protocol.
@@ -99,6 +99,7 @@
 $$ \frac{T_{compute}}{T_{communication}} > 10 $$
 The ratio of an op's compute time to its communication time needs to be greater than 10.
+
 Therefore, we need operator fusion to increase the compute time of each launched operator and to reduce the communication time introduced by scheduling.
diff --git a/executor/cpp-common/CMakeLists.txt b/executor/cpp-common/CMakeLists.txt
index 0d48cb85..cd28bf54 100644
--- a/executor/cpp-common/CMakeLists.txt
+++ b/executor/cpp-common/CMakeLists.txt
@@ -23,11 +23,14 @@ add_library(deepx_common SHARED
 find_package(yaml-cpp REQUIRED)
+set(YAMLCPP_LIB "")
 if (TARGET yaml-cpp::yaml-cpp)
- target_link_libraries(deepx_common PUBLIC yaml-cpp::yaml-cpp)
+ set(YAMLCPP_LIB yaml-cpp::yaml-cpp)
 else()
- target_link_libraries(deepx_common PUBLIC yaml-cpp)
+ set(YAMLCPP_LIB yaml-cpp)
 endif()
+
+target_link_libraries(deepx_common PUBLIC ${YAMLCPP_LIB})
 target_include_directories(deepx_common PUBLIC $
diff --git a/executor/mem-cuda/README.md b/executor/mem-cuda/README.md
new file mode 100644
index 00000000..2f7fa359
--- /dev/null
+++ b/executor/mem-cuda/README.md
@@ -0,0 +1,91 @@
+# mem-cuda design draft
+
+This directory holds the design and implementation of a single-machine, multi-process unified storage plane for GPU tensors (CUDA IPC), with Redis providing centralized registration and control of the name → IPC handle mapping.
+
+## Goals
+- Multiple processes on one machine share **named** tensors (the same name refers to the same block of GPU memory).
+- Redis maintains metadata such as name, shape, dtype, device, and the IPC handle.
+- A Redis List receives create/get/delete commands, forming a unified control plane.
+
+## Design overview
+### 1) Redis metadata (KV/Hash)
+Create one Hash per tensor name:
+- `name`: string
+- `dtype`: string (e.g. f32/i8)
+- `shape`: string/json (e.g. "[2,3,4]")
+- `device`: int (GPU id)
+- `bytes`: int
+- `ipc_handle`: binary/base64
+- `owner_pid`: int
+- `refcount`: int
+- `ctime/mtime`: int64
+
+### 2) Redis command queue (List)
+Use a list as the control channel (producer/consumer):
+- list key: `tensor:cmd`
+- suggested command format is JSON:
+  ```json
+  {"op":"create|get|delete", "name":"X", "dtype":"f32", "shape":[2,3], "device":0}
+  ```
+- Processing flow:
+  - **create**: allocate GPU memory → `cudaIpcGetMemHandle` → write the Hash
+  - **get**: read the Hash → `cudaIpcOpenMemHandle`
+  - **delete**: `refcount--`; when it reaches 0, free the GPU memory and delete the Hash
+
+### 3) Basic CUDA IPC flow
+- `cudaIpcGetMemHandle`: exports a pointer from `cudaMalloc` as a handle
+- `cudaIpcOpenMemHandle`: lets another process map the same block of GPU memory
+- **Same machine only**; the device id must match
+- Reads and writes across streams need explicit synchronization (event/stream synchronization strategy)
+
+## GPU memory pool options
+The requirement is to manage GPU memory through a pool, **modeled on PyTorch's memory-pool management**. Two implementation routes:
+
+### Option A: adopt a mature open-source memory pool (recommended)
+Candidate projects:
+- **RMM (RAPIDS Memory Manager)**
+  - Pros: mature, supports pool/async allocators, thorough statistics
+  - Fit: production environments with high stability and observability requirements
+- **CNMeM (NVIDIA)**
+  - Pros: lightweight, easy to integrate
+  - Fit: scenarios that need minimal dependencies
+- **CUB caching allocator**
+  - Pros: good performance, simple implementation
+  - Fit: embedding directly into the CUDA code path
+
+> Recommendation: prefer RMM; use CNMeM or CUB to keep dependencies minimal.
+
+### Option B: a simplified in-house memory pool (AI-drafted proposal)
+If no external dependency is introduced, start with a simplified pool:
+- Maintain size-bucketed free lists (e.g. 1MB, 2MB, 4MB, ...)
+- On allocation, reuse free blocks first and call `cudaMalloc` for a new block only when needed
+- On release, return blocks to the free list instead of calling `cudaFree` immediately
+- Support `recordStream`/event-based deferred reclamation to avoid cross-stream free hazards
+
+**Suggested MVP first**:
+1) single GPU
+2) only `create/get/delete`
+3) dtype restricted to f32
+4) get it working in a single process first, then enable multi-process + IPC
+
+## Safety and consistency
+- Redis writes and refcount updates need to be atomic (Lua scripts/transactions)
+- Crash recovery: periodically clean up entries whose owner_pid no longer exists
+- An IPC handle must be paired with its device id, otherwise mapping fails
+
+## Suggested directory layout
+```
+mem-cuda/
+  README.md
+  doc/
+  src/
+    registry/   # Redis metadata and command handling
+    allocator/  # memory-pool implementation or adapter layer
+    ipc/        # cudaIpcGet/Open wrappers
+```
+
+## Follow-up checklist
+- [ ] Choose the memory-pool approach (RMM / CNMeM / CUB / in-house)
+- [ ] Define the Redis data structures and command protocol
+- [ ] Write the IPC wrappers and a single-machine multi-process demo
+- [ ] Build error recovery and GC mechanisms
diff --git a/executor/op-mem-cuda/docs/00_quickstart.md
b/executor/op-mem-cuda/docs/00_quickstart.md deleted file mode 100644 index 05998e77..00000000 --- a/executor/op-mem-cuda/docs/00_quickstart.md +++ /dev/null @@ -1,119 +0,0 @@ -# Getting Started With CuTe - -CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly packages the type, shape, memory space, and layout of data, while performing the complicated indexing for the user. This lets programmers focus on the logical descriptions of their algorithms while CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design, implement, and modify all dense linear algebra operations. - -The core abstraction of CuTe are the hierarchically multidimensional layouts which can be composed with data arrays to represent tensors. The representation of layouts is powerful enough to represent nearly everything we need to implement efficient dense linear algebra. Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning. - -## System Requirements - -CuTe shares CUTLASS 3.x's software requirements, -including NVCC with a C++17 host compiler. - -## Knowledge prerequisites - -CuTe is a CUDA C++ header-only library. It requires C++17 -(the revision of the C++ Standard that was released in 2017). - -Throughout this tutorial, we assume intermediate C++ experience. -For example, we assume that readers know -how to read and write templated functions and classes, and -how to use the `auto` keyword to deduce a function's return type. -We will be gentle with C++ and explain some things -that you might already know. - -We also assume intermediate CUDA experience. -For example, readers must know -the difference between device and host code, -and how to launch kernels. - -## Building Tests and Examples - -CuTe's tests and examples build and run as part of CUTLASS's normal build process. - -CuTe's unit tests live in the [`test/unit/cute`](../../../test/unit/cute) subdirectory. - -CuTe's examples live in the [`examples/cute`](../../../examples/cute) subdirectory. - -## Library Organization - -CuTe is a header-only C++ library, so there is no source code that needs building. Library headers are contained within the top level [`include/cute`](../../../include/cute) directory, with components of the library grouped by directories that represent their semantics. - -| Directory | Contents | -|------------------------|------------------------| -| [`include/cute`](../../../include/cute) | Each header in the top level corresponds to one of the fundamental building blocks of CuTe, such as [`Layout`](../../../include/cute/layout.hpp) and [`Tensor`](../../../include/cute/tensor.hpp). | -| [`include/cute/container`](../../../include/cute/container) | Implementations of STL-like objects, such as tuple, array, and aligned array. | -| [`include/cute/numeric`](../../../include/cute/numeric) | Fundamental numeric data types that include nonstandard floating-point types, nonstandard integer types, complex numbers, and integer sequence. | -| [`include/cute/algorithm`](../../../include/cute/algorithm) | Implementations of utility algorithms such as copy, fill, and clear that automatically leverage architecture-specific features if available. 
| -| [`include/cute/arch`](../../../include/cute/arch) | Wrappers for architecture-specific matrix-matrix multiply and copy instructions. | -| [`include/cute/atom`](../../../include/cute/atom) | Meta-information for instructions in `arch` and utilities like partitioning and tiling. - -## Tutorial - -This directory contains a CuTe tutorial in Markdown format. -The file -[`0x_gemm_tutorial.md`](./0x_gemm_tutorial.md) -explains how to implement dense matrix-matrix multiply using CuTe components. -It gives a broad overview of CuTe and thus would be a good place to start. - -Other files in this directory discuss specific parts of CuTe. - -* [`01_layout.md`](./01_layout.md) describes `Layout`, CuTe's core abstraction. - -* [`02_layout_algebra.md`](./02_layout_algebra.md) describes more advanced `Layout` operations and the CuTe layout algebra. - -* [`03_tensor.md`](./03_tensor.md) describes `Tensor`, - a multidimensional array abstraction which composes `Layout` - with an array of data. - -* [`04_algorithms.md`](./04_algorithms.md) summarizes CuTe's - generic algorithms that operate on `Tensor`s. - -* [`0t_mma_atom.md`](./0t_mma_atom.md) demonstrates CuTe's meta-information and interface to our GPUs' - architecture-specific Matrix Multiply-Accumulate (MMA) instructions. - -* [`0x_gemm_tutorial.md`](./0x_gemm_tutorial.md) walks through building a GEMM from scratch using CuTe. - -* [`0y_predication.md`](./0y_predication.md) explains what to do - if a tiling doesn't fit evenly into a matrix. - -* [`0z_tma_tensors.md`](./0z_tma_tensors.md) explains an advanced `Tensor` type that CuTe uses to support TMA loads and stores. - -## Quick Tips - -### How do I print CuTe objects on host or device? - -The `cute::print` function has overloads for almost all CuTe types, including Pointers, Integers, Strides, Shapes, Layouts, and Tensors. When in doubt, try calling `print` on it. - -CuTe's print functions work on either host or device. -Note that on device, printing is expensive. -Even just leaving print code in place on device, -even if it is never called -(e.g., printing in an `if` branch that is not taken at run time), -may generate slower code. -Thus, be sure to remove code that prints on device after debugging. - -You might also only want to print on thread 0 of each threadblock, or threadblock 0 of the grid. The `thread0()` function returns true only for global thread 0 of the kernel, that is, for thread 0 of threadblock 0. A common idiom for printing CuTe objects to print only on global thread 0. - -```c++ -if (thread0()) { - print(some_cute_object); -} -``` - -Some algorithms depend on some thread or threadblock, -so you may need to print on threads or threadblocks other than zero. -The header file -[`cute/util/debug.hpp`](../../../include/cute/util/debug.hpp), -among other utilities, -includes the function `bool thread(int tid, int bid)` -that returns `true` if running on thread `tid` and threadblock `bid`. - -#### Other output formats - -Some CuTe types have special printing functions that use a different output format. - -The `cute::print_layout` function will display any rank-2 layout in a plain test table. This is excellent for visualizing the map from coordinates to indices. - -The `cute::print_tensor` function will display any rank-1, rank-2, rank-3, or rank-4 tensor in a plain text multidimensional table. The values of the tensor are printed so you can verify the tile of data is what you expect after a copy, for example. 
- -The `cute::print_latex` function will print LaTeX commands that you can use to build a nicely formatted and colored tables via `pdflatex`. This work for `Layout`, `TiledCopy`, and `TiledMMA`, which can be very useful to get a sense of layout patterns and partitioning patterns within CuTe. diff --git a/executor/op-mem-cuda/docs/01_layout.md b/executor/op-mem-cuda/docs/01_layout.md deleted file mode 100644 index 2f5ba5a5..00000000 --- a/executor/op-mem-cuda/docs/01_layout.md +++ /dev/null @@ -1,537 +0,0 @@ -# CuTe Layouts - -This document describes `Layout`, CuTe's core abstraction. -Fundamentally, a `Layout` maps from coordinate space(s) -to an index space. - -`Layout`s present a common interface to multidimensional array access -that abstracts away the details of how the array's elements are organized in memory. -This lets users write algorithms that access multidimensional arrays generically, -so that layouts can change, without users' code needing to change. For example, a row-major MxN layout and a column-major MxN layout can be treated identically in software. - -CuTe also provides an "algebra of `Layout`s." -`Layout`s can be combined and manipulated -to construct more complicated layouts -and to tile layouts across other layouts. -This can help users do things like partition layouts of data over layouts of threads. - -## Fundamental Types and Concepts - -### Integers - -CuTe makes great use of dynamic (known only at run-time) and static (known at compile-time) integers. - -* Dynamic integers (or "run-time integers") are just ordinary integral types like `int` or `size_t` or `uint16_t`. Anything that is accepted by `std::is_integral` is considered a dynamic integer in CuTe. - -* Static integers (or "compile-time integers") are instantiations of types like `std::integral_constant`. These types encode the value as a `static constexpr` member. They also support casting to their underlying dynamic types, so they can be used in expressions with dynamic integers. CuTe defines its own CUDA-compatibe static integer types `cute::C` along with overloaded math operators so that math on static integers results in static integers. CuTe defines shortcut aliases `Int<1>`, `Int<2>`, `Int<3>` and `_1`, `_2`, `_3` as conveniences, which you should see often within examples. - -CuTe attempts to handle static and dynamic integers identically. In the examples that follow, all dynamic integers could be replaced with static integers and vice versa. When we say "integer" in CuTe, we almost always mean a static OR dynamic integer. - -CuTe provides a number of traits to work with integers. -* `cute::is_integral`: Checks whether `T` is a static or dynamic integer type. -* `cute::is_std_integral`: Checks whether `T` is a dynamic integer type. Equivalent to `std::is_integral`. -* `cute::is_static`: Checks whether `T` is an empty type (so instantiations cannot depend on any dynamic information). Equivalent to `std::is_empty`. -* `cute::is_constant`: Checks that `T` is a static integer AND its value is equivalent to `N`. - -See the [`integral_constant` implementations](../../../include/cute/numeric/integral_constant.hpp) for more information. - -### Tuple - -A tuple is a finite ordered list of zero or more elements. -The [`cute::tuple` class](../../../include/cute/container/tuple.hpp) behaves like `std::tuple`, but works on device and host. It imposes restrictions on its template arguments and strips down the implementation for performance and simplicity. 
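As a small illustration of these two building blocks, here is a minimal sketch; it assumes `#include <cute/tensor.hpp>` and `using namespace cute;`, and the variable names are only for illustration:

```cpp
auto two   = Int<2>{};             // a static (compile-time) integer
auto eight = two * Int<4>{};       // math on static integers yields another static integer
static_assert(decltype(eight)::value == 8, "still known at compile time");

auto t = cute::make_tuple(eight, 3);  // a cute::tuple mixing a static and a dynamic integer
```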
- -### IntTuple - -CuTe defines the IntTuple concept as either an integer, or a tuple of IntTuples. Note the recursive definition. -In C++, we define [operations on `IntTuple`](../../../include/cute/int_tuple.hpp). - -Examples of `IntTuple`s include: -* `int{2}`, the dynamic integer 2. -* `Int<3>{}`, the static integer 3. -* `make_tuple(int{2}, Int<3>{})`, the tuple of dynamic-2, and static-3. -* `make_tuple(uint16_t{42}, make_tuple(Int<1>{}, int32_t{3}), Int<17>{})`, the tuple of dynamic-42, tuple of static-1 and dynamic-3, and static-17. - -CuTe reuses the `IntTuple` concept for many different things, -including Shape, Stride, Step, and Coord -(see [`include/cute/layout.hpp`](../../../include/cute/layout.hpp)). - -Operations defined on `IntTuple`s include the following. - -* `rank(IntTuple)`: The number of elements in an `IntTuple`. A single integer has rank 1, and a tuple has rank `tuple_size`. - -* `get(IntTuple)`: The `I`th element of the `IntTuple`, with `I < rank`. For single integers, `get<0>` is just that integer. - -* `depth(IntTuple)`: The number of hierarchical `IntTuple`s. A single integer has depth 0, a tuple of integers has depth 1, a tuple that contains a tuple of integers has depth 2, etc. - -* `size(IntTuple)`: The product of all elements of the `IntTuple`. - -We write `IntTuple`s with parentheses to denote the hierarchy. For example, `6`, `(2)`, `(4,3)`, and `(3,(6,2),8)` are all `IntTuple`s. - -### Shapes and Strides - -Both `Shape` and `Stride` are `IntTuple` concepts. - -### Layout - -A `Layout` is a tuple of (`Shape`, `Stride`). -Semantically, it implements a mapping from -any coordinate within the Shape to an index via the Stride. - -### Tensor - -A `Layout` can be composed with data -- e.g., a pointer or an array -- to create a `Tensor`. The index generated by the `Layout` is used to subscript an iterator to retrieve the appropriate data. For details on `Tensor`, please refer to the -[`Tensor` section of the tutorial](./03_tensor.md). - -## Layout Creation and Use - -A `Layout` is a pair of `IntTuple`s: the `Shape` and the `Stride`. The first element defines the abstract *shape* of the `Layout`, and the second element defines the *strides*, which map from coordinates within the shape to the index space. - -We define many operations on `Layout`s analogous to those defined on `IntTuple`. - -* `rank(Layout)`: The number of modes in a `Layout`. Equivalent to the tuple size of the `Layout`'s shape. - -* `get(Layout)`: The `I`th sub-layout of the `Layout`, with `I < rank`. - -* `depth(Layout)`: The depth of the `Layout`'s shape. A single integer has depth 0, a tuple of integers has depth 1, a tuple of tuples of integers has depth 2, etc. - -* `shape(Layout)`: The shape of the `Layout`. - -* `stride(Layout)`: The stride of the `Layout`. - -* `size(Layout)`: The size of the `Layout` function's domain. Equivalent to `size(shape(Layout))`. - -* `cosize(Layout)`: The size of the `Layout` function's codomain (not necessarily the range). Equivalent to `A(size(A) - 1) + 1`. - -### Hierarchical access functions - -`IntTuple`s and `Layout`s can be arbitrarily nested. -For convenience, we define versions of some of the above functions -that take a sequence of integers, instead of just one integer. -This makes it possible to access elements -inside of nested `IntTuple` or `Layout` more easily. -For example, we permit `get(x)`, where `I...` is a "C++ parameter pack" that denotes zero or more (integer) template arguments. These hierarchical access functions include the following. 
- -* `get(x) := get(...(get(get(x)))...)`. Extract the `IN`th of the ... of the `I1`st of the `I0`th element of `x`. - -* `rank(x) := rank(get(x))`. The rank of the `I...`th element of `x`. - -* `depth(x) := depth(get(x))`. The depth of the `I...`th element of `x`. - -* `shape(x) := shape(get(x))`. The shape of the `I...`th element of `x`. - -* `size(x) := size(get(x))`. The size of the `I...`th element of `x`. - -In the following examples, you'll see use of `size<0>` and `size<1>` to determine loops bounds for the 0th and 1st mode of a layout or tensor. - -### Constructing a Layout - -A `Layout` can be constructed in many different ways. -It can include any combination of compile-time (static) integers -or run-time (dynamic) integers. - -```c++ -Layout s8 = make_layout(Int<8>{}); -Layout d8 = make_layout(8); - -Layout s2xs4 = make_layout(make_shape(Int<2>{},Int<4>{})); -Layout s2xd4 = make_layout(make_shape(Int<2>{},4)); - -Layout s2xd4_a = make_layout(make_shape (Int< 2>{},4), - make_stride(Int<12>{},Int<1>{})); -Layout s2xd4_col = make_layout(make_shape(Int<2>{},4), - LayoutLeft{}); -Layout s2xd4_row = make_layout(make_shape(Int<2>{},4), - LayoutRight{}); - -Layout s2xh4 = make_layout(make_shape (2,make_shape (2,2)), - make_stride(4,make_stride(2,1))); -Layout s2xh4_col = make_layout(shape(s2xh4), - LayoutLeft{}); -``` - -The `make_layout` function returns a `Layout`. -It deduces the types of the function's arguments and returns a `Layout` with the appropriate template arguments. -Similarly, the `make_shape` and `make_stride` functions -return a `Shape` resp. `Stride`. -CuTe often uses these `make_*` functions -due to restrictions around constructor template argument deduction (CTAD) and to avoid having to repeat static or dynamic integer types. - -When the `Stride` argument is omitted, it is generated from the provided `Shape` with `LayoutLeft` as default. The `LayoutLeft` tag constructs strides as an exclusive prefix product of the `Shape` from left to right, without regard to the `Shape`'s hierarchy. This can be considered a "generalized column-major stride generation". The `LayoutRight` tag constructs strides as an exclusive prefix product of the `Shape` from right to left, without regard to the `Shape`'s hierarchy. For shapes of depth one, this can be considered a "row-major stride generation", but for hierarchical shapes the resulting strides may be surprising. For example, the strides of `s2xh4` above could be generated with `LayoutRight`. - -Calling `print` on each layout above results in the following - -``` -s8 : _8:_1 -d8 : 8:_1 -s2xs4 : (_2,_4):(_1,_2) -s2xd4 : (_2,4):(_1,_2) -s2xd4_a : (_2,4):(_12,_1) -s2xd4_col : (_2,4):(_1,_2) -s2xd4_row : (_2,4):(4,_1) -s2xh4 : (2,(2,2)):(4,(2,1)) -s2xh4_col : (2,(2,2)):(_1,(2,4)) -``` - -The `Shape:Stride` notation is used quite often for `Layout`. The `_N` notation is shorthand for a static integer while other integers are dynamic integers. Observe that both `Shape` and `Stride` may be composed of both static and dynamic integers. - -Also note that the `Shape` and `Stride` are assumed to be *congruent*. That is, `Shape` and `Stride` have the same tuple profiles. For every integer in `Shape`, there is a corresponding integer in `Stride`. This can be asserted with -```cpp -static_assert(congruent(my_shape, my_stride)); -``` - -### Using a Layout - -The fundamental use of a `Layout` is to map between coordinate space(s) defined by the `Shape` and an index space defined by the `Stride`. 
For example, to print an arbitrary rank-2 layout in a 2-D table, we can write the function - -```c++ -template -void print2D(Layout const& layout) -{ - for (int m = 0; m < size<0>(layout); ++m) { - for (int n = 0; n < size<1>(layout); ++n) { - printf("%3d ", layout(m,n)); - } - printf("\n"); - } -} -``` - -which produces the following output for the above examples. - -``` -> print2D(s2xs4) - 0 2 4 6 - 1 3 5 7 -> print2D(s2xd4_a) - 0 1 2 3 - 12 13 14 15 -> print2D(s2xh4_col) - 0 2 4 6 - 1 3 5 7 -> print2D(s2xh4) - 0 2 1 3 - 4 6 5 7 -``` - -We can see static, dynamic, row-major, column-major, and hierarchical layouts printed here. The statement `layout(m,n)` provides the mapping of -the logical 2-D coordinate (m,n) to the 1-D index. - -Interestingly, the `s2xh4` example isn't row-major or column-major. Furthermore, it has three modes but is still interpreted as rank-2 and we're using a 2-D coordinate. Specifically, `s2xh4` has a 2-D multi-mode in the second mode, but we're still able to use a 1-D coordinate for that mode. More on this in the next section, but first we can generalize this another step. Let's use a 1-D coordinate and treat all of the modes of each layout as a single multi-mode. For instance, the following `print1D` function - -```c++ -template -void print1D(Layout const& layout) -{ - for (int i = 0; i < size(layout); ++i) { - printf("%3d ", layout(i)); - } -} -``` - -produces the following output for the above examples. - -``` -> print1D(s2xs4) - 0 1 2 3 4 5 6 7 -> print1D(s2xd4_a) - 0 12 1 13 2 14 3 15 -> print1D(s2xh4_col) - 0 1 2 3 4 5 6 7 -> print1D(s2xh4) - 0 4 2 6 1 5 3 7 -``` - -Any multi-mode of a layout, including the entire layout itself, can accept a 1-D coordinate. More on this in the following sections. - -CuTe provides more printing utilities for visualizing Layouts. The `print_layout` function produces a formatted 2-D table of the Layout's mapping. - -```text -> print_layout(s2xh4) -(2,(2,2)):(4,(2,1)) - 0 1 2 3 - +---+---+---+---+ - 0 | 0 | 2 | 1 | 3 | - +---+---+---+---+ - 1 | 4 | 6 | 5 | 7 | - +---+---+---+---+ -``` - -The `print_latex` function generates LaTeX that can be compiled with `pdflatex` into a color-coded vector graphics image of the same 2-D table. - -### Vector Layouts - -We define a vector as any `Layout` with `rank == 1`. -For example, the layout `8:1` can be interpreted as an 8-element vector whose indices are contiguous. - -``` -Layout: 8:1 -Coord : 0 1 2 3 4 5 6 7 -Index : 0 1 2 3 4 5 6 7 -``` - -Similarly, -the layout `8:2` can be interpreted as an 8-element vector where the indices of the elements are strided by `2`. - -``` -Layout: 8:2 -Coord : 0 1 2 3 4 5 6 7 -Index : 0 2 4 6 8 10 12 14 -``` - -By the above rank-1 definition, we *also* interpret layout `((4,2)):((2,1))` as a vector, since its shape is rank-1. The inner shape looks like a 4x2 row-major matrix, but the extra pair of parenthesis suggest we can interpret those two modes as a 1-D 8-element vector. The strides tell us that the first `4` elements are strided by `2` and then there are `2` of those first elements strided by `1`. - -``` -Layout: ((4,2)):((2,1)) -Coord : 0 1 2 3 4 5 6 7 -Index : 0 2 4 6 1 3 5 7 -``` - -We can see the second set of `4` elements are duplicates of the first `4` with an extra stride of `1`. - -Consider the layout `((4,2)):((1,4))`. Again, it's `4` elements strided by `1` and then `2` of those first elements strided by `4`. 
- -``` -Layout: ((4,2)):((1,4)) -Coord : 0 1 2 3 4 5 6 7 -Index : 0 1 2 3 4 5 6 7 -``` - -As a function from integers to integers, it's identical to `8:1`. It's the identity function. - -### Matrix examples - -Generalizing, we define a matrix as any `Layout` that is rank-2. For example, - -``` -Shape : (4,2) -Stride: (1,4) - 0 4 - 1 5 - 2 6 - 3 7 -``` - -is a 4x2 column-major layout with stride-1 down the columns and stride-4 across the rows, and - -``` -Shape : (4,2) -Stride: (2,1) - 0 1 - 2 3 - 4 5 - 6 7 -``` - -is a 4x2 row-major layout with stride-2 down the columns and stride-1 across the rows. Majorness is simply which mode has stride-1. - -Just like the vector layouts, each of the modes of the matrix can also be split into *multi-modes*. -This lets us express more layouts beyond just row-major and column-major. For example, - -``` -Shape: ((2,2),2) -Stride: ((4,1),2) - 0 2 - 4 6 - 1 3 - 5 7 -``` - -is also logically 4x2, with stride-2 across the rows but a multi-stride down the columns. The first `2` elements down the column have a stride of `4` and then there is a copy of those with stride-1. Since this layout is logically 4x2, -like the column-major and row-major examples above, -we can _still_ use 2-D coordinates to index into it. - -## Layout Concepts - -In this section, we'll introduce the coordinate sets that `Layout`s accept and how the coordinate mappings and index mappings are computed. - -### Layout compatibility - -We say that layout A is *compatible* with layout B if the shape of A is compatible with the shape of B. -Shape A is compatible with shape B if - -* the size of A is equal to the size of B and -* all coordinates within A are valid coordinates within B. - -For example: -* Shape 24 is NOT compatible with Shape 32. -* Shape 24 is compatible with Shape (4,6). -* Shape (4,6) is compatible with Shape ((2,2),6). -* Shape ((2,2),6) is compatible with Shape ((2,2),(3,2)). -* Shape 24 is compatible with Shape ((2,2),(3,2)). -* Shape 24 is compatible with Shape ((2,3),4). -* Shape ((2,3),4) is NOT compatible with Shape ((2,2),(3,2)). -* Shape ((2,2),(3,2)) is NOT compatible with Shape ((2,3),4). -* Shape 24 is compatible with Shape (24). -* Shape (24) is NOT compatible with Shape 24. -* Shape (24) is NOT compatible with Shape (4,6). - -That is, *compatible* is a weak partial order on Shapes as it is reflexive, antisymmetric, and transitive. - -### Layouts Coordinates - -With the notion of compatibility above, we emphasize that every `Layout` accepts multiple kinds of coordinates. Every `Layout` accepts coordinates for any `Shape` that is compatible with it. CuTe provides mappings between these sets of coordinates via a colexicographical order. - -Thus, all Layouts provide two fundamental mappings: - -* the map from an input coordinate to the corresponding natural coordinate via the `Shape`, and -* the map from a natural coordinate to the index via the `Stride`. - -#### Coordinate Mapping - -The map from an input coordinate to a natural coordinate is the application of a colexicographical order (reading right to left, instead of "lexicographical," which reads left to right) within the `Shape`. - -Take the shape `(3,(2,3))`, for example. This shape has three coordinate sets: the 1-D coordinates, the 2-D coordinates, and the natural (h-D) coordinates. 
- -| 1-D | 2-D | Natural | | 1-D | 2-D | Natural | -| ----- | ------- | ----------- |-| ----- | ------- | ----------- | -| `0` | `(0,0)` | `(0,(0,0))` | | `9` | `(0,3)` | `(0,(1,1))` | -| `1` | `(1,0)` | `(1,(0,0))` | | `10` | `(1,3)` | `(1,(1,1))` | -| `2` | `(2,0)` | `(2,(0,0))` | | `11` | `(2,3)` | `(2,(1,1))` | -| `3` | `(0,1)` | `(0,(1,0))` | | `12` | `(0,4)` | `(0,(0,2))` | -| `4` | `(1,1)` | `(1,(1,0))` | | `13` | `(1,4)` | `(1,(0,2))` | -| `5` | `(2,1)` | `(2,(1,0))` | | `14` | `(2,4)` | `(2,(0,2))` | -| `6` | `(0,2)` | `(0,(0,1))` | | `15` | `(0,5)` | `(0,(1,2))` | -| `7` | `(1,2)` | `(1,(0,1))` | | `16` | `(1,5)` | `(1,(1,2))` | -| `8` | `(2,2)` | `(2,(0,1))` | | `17` | `(2,5)` | `(2,(1,2))` | - -Each coordinate into the shape `(3,(2,3))` has two *equivalent* coordinates and all equivalent coordinates map to the same natural coordinate. To emphasize again, because all of the above coordinates are valid inputs, a Layout with Shape `(3,(2,3))` can be used as if it is a 1-D array of 18 elements by using the 1-D coordinates, a 2-D matrix of 3x6 elements by using the 2-D coordinates, or a h-D tensor of 3x(2x3) elements by using the h-D (natural) coordinates. - -The previous 1-D print demonstrates how CuTe identifies 1-D coordinates with a colexicographical ordering of 2-D coordinates. Iterating from `i = 0` to `size(layout)` and indexing into our layout with the single integer coordinate `i`, traverses the 2-D coordinates in this "generalized-column-major" order, even if the layout maps coordinates to indices in a row-major or more complex fashion. - -The function `cute::idx2crd(idx, shape)` is responsible for the coordinate mapping. It will take any coordinate within the shape and compute the equivalent natural coordinate for that shape. -```cpp -auto shape = Shape<_3,Shape<_2,_3>>{}; -print(idx2crd( 16, shape)); // (1,(1,2)) -print(idx2crd(_16{}, shape)); // (_1,(_1,_2)) -print(idx2crd(make_coord( 1,5), shape)); // (1,(1,2)) -print(idx2crd(make_coord(_1{},5), shape)); // (_1,(1,2)) -print(idx2crd(make_coord( 1,make_coord(1, 2)), shape)); // (1,(1,2)) -print(idx2crd(make_coord(_1{},make_coord(1,_2{})), shape)); // (_1,(1,_2)) -``` - -#### Index Mapping - -The map from a natural coordinate to an index is performed by taking the inner product of the natural coordinate with the `Layout`'s `Stride`. - -Take the layout `(3,(2,3)):(3,(12,1))`, for example. Then a natural coordinate `(i,(j,k))` will result in the index `i*3 + j*12 + k*1`. The indices this layout computes are shown in the 2-D table below where `i` is used as the row coordinate and `(j,k)` is used as the column coordinate. - -``` - 0 1 2 3 4 5 <== 1-D col coord - (0,0) (1,0) (0,1) (1,1) (0,2) (1,2) <== 2-D col coord (j,k) - +-----+-----+-----+-----+-----+-----+ - 0 | 0 | 12 | 1 | 13 | 2 | 14 | - +-----+-----+-----+-----+-----+-----+ - 1 | 3 | 15 | 4 | 16 | 5 | 17 | - +-----+-----+-----+-----+-----+-----+ - 2 | 6 | 18 | 7 | 19 | 8 | 20 | - +-----+-----+-----+-----+-----+-----+ -``` - -The function `cute::crd2idx(c, shape, stride)` is responsible for the index mapping. It will take any coordinate within the shape, compute the equivalent natural coordinate for that shape (if it is not already), and compute the inner product with the strides. 
-```cpp -auto shape = Shape <_3,Shape< _2,_3>>{}; -auto stride = Stride<_3,Stride<_12,_1>>{}; -print(crd2idx( 16, shape, stride)); // 17 -print(crd2idx(_16{}, shape, stride)); // _17 -print(crd2idx(make_coord( 1, 5), shape, stride)); // 17 -print(crd2idx(make_coord(_1{}, 5), shape, stride)); // 17 -print(crd2idx(make_coord(_1{},_5{}), shape, stride)); // _17 -print(crd2idx(make_coord( 1,make_coord( 1, 2)), shape, stride)); // 17 -print(crd2idx(make_coord(_1{},make_coord(_1{},_2{})), shape, stride)); // _17 -``` - -## Layout Manipulation - -### Sublayouts - -Sublayouts can be retrieved with `layout` -```cpp -Layout a = Layout>>{}; // (4,(3,6)):(1,(4,12)) -Layout a0 = layout<0>(a); // 4:1 -Layout a1 = layout<1>(a); // (3,6):(4,12) -Layout a10 = layout<1,0>(a); // 3:4 -Layout a11 = layout<1,1>(a); // 6:12 -``` -or `select` -```cpp -Layout a = Layout>{}; // (2,3,5,7):(1,2,6,30) -Layout a13 = select<1,3>(a); // (3,7):(2,30) -Layout a01 = select<0,1,3>(a); // (2,3,7):(1,2,30) -Layout a2 = select<2>(a); // (5):(6) -``` -or `take` -```cpp -Layout a = Layout>{}; // (2,3,5,7):(1,2,6,30) -Layout a13 = take<1,3>(a); // (3,5):(2,6) -Layout a14 = take<1,4>(a); // (3,5,7):(2,6,30) -// take<1,1> not allowed. Empty layouts not allowed. -``` - -### Concatenation - -A `Layout` can be provided to `make_layout` to wrap and concatenate -```cpp -Layout a = Layout<_3,_1>{}; // 3:1 -Layout b = Layout<_4,_3>{}; // 4:3 -Layout row = make_layout(a, b); // (3,4):(1,3) -Layout col = make_layout(b, a); // (4,3):(3,1) -Layout q = make_layout(row, col); // ((3,4),(4,3)):((1,3),(3,1)) -Layout aa = make_layout(a); // (3):(1) -Layout aaa = make_layout(aa); // ((3)):((1)) -Layout d = make_layout(a, make_layout(a), a); // (3,(3),3):(1,(1),1) -``` -or can be combined with `append`, `prepend`, or `replace`. -```cpp -Layout a = Layout<_3,_1>{}; // 3:1 -Layout b = Layout<_4,_3>{}; // 4:3 -Layout ab = append(a, b); // (3,4):(1,3) -Layout ba = prepend(a, b); // (4,3):(3,1) -Layout c = append(ab, ab); // (3,4,(3,4)):(1,3,(1,3)) -Layout d = replace<2>(c, b); // (3,4,4):(1,3,3) -``` - -### Grouping and flattening - -Layout modes can be grouped with `group` and flattened with `flatten`. -```cpp -Layout a = Layout>{}; // (_2,_3,_5,_7):(_1,_2,_6,_30) -Layout b = group<0,2>(a); // ((_2,_3),_5,_7):((_1,_2),_6,_30) -Layout c = group<1,3>(b); // ((_2,_3),(_5,_7)):((_1,_2),(_6,_30)) -Layout f = flatten(b); // (_2,_3,_5,_7):(_1,_2,_6,_30) -Layout e = flatten(c); // (_2,_3,_5,_7):(_1,_2,_6,_30) -``` -Grouping, flattening, and reordering modes allows the reinterpretation of tensors in place as matrices, matrices as vectors, vectors as matrices, etc. - -### Slicing - -`Layout`s can be sliced, but slicing is more appropriate to perform on `Tensor`s. See the [`Tensor` section](./03_tensor.md) for slicing details. - -## Summary - -* The `Shape` of a `Layout` defines its coordinate space(s). - - * Every `Layout` has a 1-D coordinate space. - This can be used to iterate over the coordinate spaces in a colexicographical order. - - * Every `Layout` has a R-D coordinate space, - where R is the rank of the layout. - The colexicographical enumeration of the R-D coordinates - correspond to the 1-D coordinates above. - - * Every `Layout` has an h-D (natural) coordinate space where h is "hierarchical." These are ordered colexicographically and the enumeration of that order corresponds to the 1-D coordinates above. A natural coordinate is *congruent* to the `Shape` so that each element of the coordinate has a corresponding element of the `Shape`. 
- -* The `Stride` of a `Layout` maps coordinates to indices. - - * The inner product of the elements of the natural coordinate with the elements of the `Stride` produces the resulting index. - -For each `Layout` there exists an integral `Shape` that is that compatible with that `Layout`. Namely, that integral shape is `size(layout)`. We can then observe that - -> Layouts are functions from integers to integers. - -If you're familiar with the C++23 feature `mdspan`, -this is an important difference between -`mdspan` layout mappings and CuTe `Layout`s. In CuTe, `Layout` is a first class citizen, is natively hierarchical to naturally represent functions beyond row-major and column-major, and can similarly be indexed with a hierarchy of coordinates. -(`mdspan` layout mappings can represent hierarchical functions as well, -but this requires defining a custom layout.) -Input coordinates for an `mdspan` must have the same shape as the `mdspan`; -a multidimensional `mdspan` does not accept 1-D coordinates. diff --git a/executor/op-mem-cuda/docs/02_layout_algebra.md b/executor/op-mem-cuda/docs/02_layout_algebra.md deleted file mode 100644 index fbadc132..00000000 --- a/executor/op-mem-cuda/docs/02_layout_algebra.md +++ /dev/null @@ -1,574 +0,0 @@ -# CuTe Layout Algebra - -CuTe provides an "algebra of `Layout`s" to support combining layouts in different ways. This algebra includes operations such as - -* `Layout` functional composition, -* a notion of `Layout` "product" to reproduce one layout according to another, and -* a notion of `Layout` "divide" to split one layout according to another. - -Common utilities for building complicated layouts from simpler ones depend on the `Layout` product. Common utilities for partitioning layouts (of data, for example) across other layouts (of threads, for example) depend on the `Layout` divide. All of these utilities rely on the functional composition of `Layout`s. - -In this section, we'll build up the tools of the `Layout` algebra and explain some of these core operations in detail. - -## Coalesce - -In the previous section, we summarized `Layout`s with -> Layouts are functions from integers to integers. - -The `coalesce` operation is a "simplify" on functions from integers to integers. If we only care about input integers, then we can manipulate the shape and number of modes of the `Layout` without changing it as a function. The only thing `coalesce` can't change is the `Layout`'s `size`. - -More specifically, you can find the checked post-conditions in [the `coalesce` unit test](../../../test/unit/cute/core/coalesce.cpp), which we'll reproduce here: -```cpp -// @post size(@a result) == size(@a layout) -// @post depth(@a result) <= 1 -// @post for all i, 0 <= i < size(@a layout), @a result(i) == @a layout(i) -Layout coalesce(Layout const& layout) -``` - -For example, - -```cpp -auto layout = Layout>, - Stride<_1,Stride<_6,_2>>>{}; -auto result = coalesce(layout); // _12:_1 -``` - -where we can see the result has fewer modes and is "simpler." Indeed, this could save us a few operations in the coordinate mapping and index mapping (if those are performed dynamically). - -So, how do we get there? - -* We've already seen that column-major `Layout`s like `(_2,_4):(_1,_2)` act identically to `_8:_1` for 1-D coordinates. -* Modes with size static-1 will always produce a natural coordinate of static-0. They can be ignored no matter the stride. - -Generalizing, consider a layout with just two integral modes, s0:d0 and s1:d1. 
Denote the result of coalescing this layout as s0:d0 ++ s1:d1. Then, there are four cases: - -1. `s0:d0 ++ _1:d1 => s0:d0`. Ignore modes with size static-1. -2. `_1:d0 ++ s1:d1 => s1:d1`. Ignore modes with size static-1. -3. `s0:d0 ++ s1:s0*d0 => s0*s1:d0`. If the second mode's stride is the product of the first mode's size and stride, then they can be combined. -4. `s0:d0 ++ s1:d1 => (s0,s1):(d0,d1)`. Else, nothing can be done and they must be treated separately. - -That's it! We can flatten any layout and apply the above binary operation to each pair of adjacent modes in order to "coalesce" the modes of the layout. - -### By-mode Coalesce - -Obviously, sometimes we do care about the shape of our `Layout`, but would still like to coalesce. For example, I have a 2-D `Layout` and I would like the result to remain 2-D. - -For this reason, there's an overload of `coalesce` that takes an additional parameter -```cpp -// Apply coalesce at the terminals of trg_profile -Layout coalesce(Layout const& layout, IntTuple const& trg_profile) -``` - -which can be used as follows - -```cpp -auto a = Layout>, - Stride<_1,Stride<_6,_2>>>{}; -auto result = coalesce(a, Step<_1,_1>{}); // (_2,_6):(_1,_2) -// Identical to -auto same_r = make_layout(coalesce(layout<0>(a)), - coalesce(layout<1>(a))); -``` - -This function is recursing into `Step<_1,_1>{}` and applying `coalesce` to the corresponding sublayout whenever it sees an integer (the values don't matter, they're just flags) rather than a tuple. - -> This theme of defining an operation that treats a `Layout` as a "1-D" function from integers to integers and then generalizing to use it for an arbitrarily shaped layout will be a common one! - -## Composition - -Functional composition of `Layout`s is the core of CuTe and is used in just about every higher-level operation. - -Starting again from the observation that `Layout`s are just functions from integers to integers, we can define functional composition that results in another `Layout`. First, an example. - -```text -Functional composition, R := A o B -R(c) := (A o B)(c) := A(B(c)) - -Example -A = (6,2):(8,2) -B = (4,3):(3,1) - -R( 0) = A(B( 0)) = A(B(0,0)) = A( 0) = A(0,0) = 0 -R( 1) = A(B( 1)) = A(B(1,0)) = A( 3) = A(3,0) = 24 -R( 2) = A(B( 2)) = A(B(2,0)) = A( 6) = A(0,1) = 2 -R( 3) = A(B( 3)) = A(B(3,0)) = A( 9) = A(3,1) = 26 -R( 4) = A(B( 4)) = A(B(0,1)) = A( 1) = A(1,0) = 8 -R( 5) = A(B( 5)) = A(B(1,1)) = A( 4) = A(4,0) = 32 -R( 6) = A(B( 6)) = A(B(2,1)) = A( 7) = A(1,1) = 10 -R( 7) = A(B( 7)) = A(B(3,1)) = A(10) = A(4,1) = 34 -R( 8) = A(B( 8)) = A(B(0,2)) = A( 2) = A(2,0) = 16 -R( 9) = A(B( 9)) = A(B(1,2)) = A( 5) = A(5,0) = 40 -R(10) = A(B(10)) = A(B(2,2)) = A( 8) = A(2,1) = 18 -R(11) = A(B(11)) = A(B(3,2)) = A(11) = A(5,1) = 42 -``` - -The absolutely amazing observation is that the function `R(c) = k` defined above can be written down as another `Layout` - -``` -R = ((2,2),3):((24,2),8) -``` - -AND - -``` -compatible(B, R) -``` - -That is, every coordinate of `B` can also be used as a coordinate of `R`. This is an expected property of functional composition because `B` defines the *domain* of `R`. - -You can find many examples and checked post-conditions in [the `composition` unit test](../../../test/unit/cute/core/composition.cpp). The post-conditions are precisely as we just stated. 
-```cpp -// @post compatible(@a layout_b, @a result) -// @post for all i, 0 <= i < size(@a layout_b), @a result(i) == @a layout_a(@a layout_b(i))) -Layout composition(LayoutA const& layout_a, LayoutB const& layout_b) -``` - -### Computing Composition - -First, a few observations: - -* `B = (B_0, B_1, ...)`. A layout can be expressed as the concatenation of its sublayouts. - -* `A o B = A o (B_0, B_1, ...) = (A o B_0, A o B_1, ...)`. When `B` is injective, composition is left-distributive with concatenation. - -With the above, we can assume without loss of generality that `B = s:d` is a layout with integral shape and stride. We can also assume that `A` is a flattened, coalesced layout. - -When `A` is integral, `A = a:b`, the result is rather trivial: `R = A o B = a:b o s:d = s:(b*d)`. But when `A` is multimodal, we need to be more careful. - -Put into words, `A o B = A o s:d`, for integral `s` and `d` means that we want (1) every `d`th element of `A`, and then (2) keep the first `s` of those strided elements. - -1. Every `d`th element of `A` can be computed by "dividing out" the first `d` elements from the shape of `A`. For an array of integers representing the shape, this is computed as -```cpp -void shape_div(int* shapeA, int N, int& strideB) { - for (int i = 0; i < N; ++i) { - assert(shapeA[i] % strideB == 0 or - strideB % shapeA[i] == 0); - int new_shape = ceil_div(shapeA[i], strideB); - int new_stride = ceil_div(strideB, shapeA[i]); - shapeA[i] = new_shape; - strideB = new_stride; - } -} -``` -which progressively "removes" the first `strideB` elements from `shapeA` starting from the left. For example, -* `(6,2) / 2 => (3,2)` -* `(6,2) / 3 => (2,2)` -* `(6,2) / 6 => (1,2)` -* `(6,2) / 12 => (1,1)` -* `(3,6,2,8) / 6 => (1,3,2,8)` -* `(3,6,2,8) / 9 => (1,2,2,8)` -* `(42,16,3) / 2 => (21,16,3)` -* `(42,16,3) / 6 => ( 7,16,3)` - -As you may have noticed, we can only divide shapes by certain values and get a sensible result. This is called the **divisibility condition** and is enforced by the `assert` in the above code and statically checked in CuTe when possible. - -2. The first `s` elements of the strided `A` layout can be computed by "modding out" the first `s` elements from the shape of `A`. For an array of integers representing the shape, this is computed as -```cpp -void shape_mod(int* shapeA, int N, int& shapeB) { - for (int i = 0; i < N; ++i) { - assert(shapeA[i] % shapeB == 0 or - shapeB % shapeA[i] == 0); - int new_shapeA = min(shapeA[i], shapeB); - int new_shapeB = ceil_div(shapeB, shapeA[i]); - shapeA[i] = new_shapeA; - shapeB = new_shapeB; - } -} -``` -which progressibly "keeps" the first `shapeB` elements from `shapeA` starting from the left. For example, -* `(6,2) % 2 => (2,1)` -* `(6,2) % 3 => (3,1)` -* `(6,2) % 6 => (6,1)` -* `(6,2) % 12 => (6,2)` -* `(3,6,2,8) % 6 => (3,2,1,1)` -* `(3,6,2,8) % 9 => (3,3,1,1)` -* `(1,2,2,8) % 2 => (1,2,1,1)` -* `(1,2,2,8) % 16 => (1,2,2,4)` - -Again, this operation must satisfy the divisibility condition to yield a sensible result. This is enforced by the `assert` in the above code and statically checked in CuTe when possible. - -Clearly, CuTe does not use arrays to store shapes or strides and the above code is for explication only. CuTe works with shapes and strides as `IntTuple`s and the implementation is expressed as algorithmic `fold`s which carefully account for static and dynamic integers. - -#### Example 1 -- Reshape a layout into a matrix - -`20:2 o (5,4):(4,1)`. Composition formulation. 
- -This describes interpreting the layout `20:2` -as a 5x4 matrix in a row-major order. - -1. ` = 20:2 o (5:4,4:1)`. Layout `(5,4):(4,1)` as concatenation of sublayouts. - -2. ` = (20:2 o 5:4, 20:2 o 4:1)`. Left distributivity. - - * `20:2 o 5:4 => 5:8`. Trivial case. - * `20:2 o 4:1 => 4:2`. Trivial case. - -3. ` = (5:8, 4:2)`. Composed Layout as concatenation of sublayouts. - -4. ` = (5,4):(8,2)`. Final composed layout. - -#### Example 2 -- Reshape a layout into a matrix - -`(10,2):(16,4) o (5,4):(1,5)` - -This describes interpreting the layout `(10,2):(16,4)` -as a 5x4 matrix in a column-major order. - -1. ` = (10,2):(16,4) o (5:1,4:5)`. Layout `(5,4):(1,5)` as concatenation of sublayouts. - -2. ` = ((10,2):(16,4) o 5:1, (10,2):(16,4) o 4:5)`. Left distributivity. - - * `(10,2):(16,4) o 5:1 => (5,1):(16,4)`. Mod out the shape `5`. - * `(10,2):(16,4) o 4:5 => (2,2):(80,4)`. Div out the stride `5`. - -3. ` = ((5,1):(16,4), (2,2):(80,4))`. Composed Layout as concatenation of sublayouts. - -4. ` = (5:16, (2,2):(80,4))`. By-mode coalesce. - -5. ` = (5,(2,2))):(16,(80,4))`. Final composed layout. - -We get exactly this result with CuTe -if we use compile-time shapes and strides. -The following C++ code prints `(_5,(_2,_2)):(_16,(_80,_4))`. - -```cpp -Layout a = make_layout(make_shape (Int<10>{}, Int<2>{}), - make_stride(Int<16>{}, Int<4>{})); -Layout b = make_layout(make_shape (Int< 5>{}, Int<4>{}), - make_stride(Int< 1>{}, Int<5>{})); -Layout c = composition(a, b); -print(c); -``` - -If we use dynamic integers, the following C++ code prints `((5,1),(2,2)):((16,4),(80,4))`. - -```cpp -Layout a = make_layout(make_shape (10, 2), - make_stride(16, 4)); -Layout b = make_layout(make_shape ( 5, 4), - make_stride( 1, 5)); -Layout c = composition(a, b); -print(c); -``` - -The results may _look_ different but are the mathematically the same. The 1s in the shape don't affect the layout as a mathematical function from 1-D coordinates to integers or as a function from 2-D coordinates to integers. In the dynamic case, CuTe can not coalesce the dynamic size-1 modes to "simplify" the layout due to the static rank and type of the tuples containing them. - -### By-mode Composition - -Similar to by-mode `coalesce` and building up to a generic tiling operation, sometimes we do care about the shape of the `A` layout and would still like to apply `composition` to individual modes. For example, I have a 2-D `Layout` and would like some sublayout of the elements down the columns and another sublayout of elements across the rows. - -For this reason, `composition` also works when its second parameter -- the `B` -- is a `Tiler`. In general, a tiler is a layout or a tuple-of-layouts (note the generalization on `IntTuple`), which can be used as follows -```cpp -// (12,(4,8)):(59,(13,1)) -auto a = make_layout(make_shape (12,make_shape ( 4,8)), - make_stride(59,make_stride(13,1))); -// <3:4, 8:2> -auto tiler = make_tile(Layout<_3,_4>{}, // Apply 3:4 to mode-0 - Layout<_8,_2>{}); // Apply 8:2 to mode-1 - -// (_3,(2,4)):(236,(26,1)) -auto result = composition(a, tiler); -// Identical to -auto same_r = make_layout(composition(layout<0>(a), get<0>(tiler)), - composition(layout<1>(a), get<1>(tiler))); -``` -We often use the `` notation to distinguish `Tiler`s from the concatenation-of-sublayouts notation `(LayoutA, LayoutB, ...)` that we used previously. - -The `result` in the above code can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below. -

-[figure: composition1.png]
- -For convenience, CuTe also interprets `Shape`s as a tiler as well. A `Shape` is interpreted as tuple-of-layouts-with-stride-1: -```cpp -// (12,(4,8)):(59,(13,1)) -auto a = make_layout(make_shape (12,make_shape ( 4,8)), - make_stride(59,make_stride(13,1))); -// (8, 3) -auto tiler = make_shape(Int<3>{}, Int<8>{}); -// Equivalent to <3:1, 8:1> -// auto tiler = make_tile(Layout<_3,_1>{}, // Apply 3:1 to mode-0 -// Layout<_8,_1>{}); // Apply 8:1 to mode-1 - -// (_3,(4,2)):(59,(13,1)) -auto result = composition(a, tiler); -``` -where `result` can be depicted as the 3x8 sublayout of the original layout highlighted in the figure below. -

-[figure: composition2.png]
- -## Composition Tilers - -In summary, a `Tiler` is one of the following objects. -1. A `Layout`. -2. A tuple of `Tiler`s. -3. A `Shape`, which will be interpreted as a tiler of `Layout`s with stride-1. - -Any of the above can be used as the second argument in `composition`. With (1), we think of the `composition` as between two functions from integers to integers, no matter the ranks of the layouts. With (2) and (3), the `composition` is performed on each pair of corresponding modes of `A` and `B`, until case (1) is found. - -This allows composition to be applied by-mode to retrieve arbitrary sublayouts of specified modes of a tensor ("Give me the 3x5x8 subblock of this MxNxL tensor") but also allows entire tiles of data to be reshaped and reordered as if they were 1-D vectors ("Reorder this 8x16 block of data into a 32x4 block using this weird order of elements"). We will see the by-mode cases appear often when we are tiling for threadblocks in examples that follow. We will see 1-D reshaping and reordering when we want to apply arbitrary partitioning patterns for threads and values in MMAs in examples that follow. - -## Complement - -Before getting to "product" and "divide," we need one more operation. We can think of `composition` as a layout `B` that is "selecting" certain coordinates from another layout `A`. But what about the coordinates that aren't "selected"? To implement generic tiling, we want to be able to select arbitrary elements -- the tile -- and to describe the layout of those tiles -- the leftovers, or the "rest." - -The `complement` of a layout attempts to find another layout that represents the "rest" -- the elements that aren't touched by the layout. - -You can find many examples and checked post-conditions in [the `complement` unit test](../../../test/unit/cute/core/complement.cpp). The post-conditions include -```cpp -// @post cosize(make_layout(@a layout_a, @a result))) >= size(@a cotarget) -// @post cosize(@a result) >= round_up(size(@a cotarget), cosize(@a layout_a)) -// @post for all i, 1 <= i < size(@a result), -// @a result(i-1) < @a result(i) -// @post for all i, 1 <= i < size(@a result), -// for all j, 0 <= j < size(@a layout_a), -// @a result(i) != @a layout_a(j) -Layout complement(LayoutA const& layout_a, Shape const& cotarget) -``` -That is, the complement `R` of a layout `A` with respect to a Shape (IntTuple) `M` satisfies the following properties. -1. The size (and cosize) of `R` is *bounded* by `size(M)`. -2. `R` is *ordered*. That is, the strides of `R` are positive and increasing. This means that `R` is unique. -3. `A` and `R` have *disjoint* codomains. `R` attempts to "complete" the codomain of `A`. - -The `cotarget` parameter above is most commonly an integer -- you can see we only use `size(cotarget)` above. However, sometimes it is useful to specify an integer that has static properties. For example, `28` is a dynamic integer and `(_4,7)` is a shape with size `28` that is statically known to be divisible by `_4`. Both will produce the same `complement` mathematically, but the extra information can used by `complement` to preserve the staticness of the result as much as possible. - -### Complement Examples - -`complement` is most effective on static shapes and strides, so consider all integers below to be static. Similar examples for dynamic shapes and strides as well as IntTuple `cotarget` can be found in [the unit test](../../../test/unit/cute/core/complement.cpp). - -* `complement(4:1, 24)` is `6:4`. Note that `(4,6):(1,4)` has cosize `24`. 
The layout `4:1` is effectively repeated 6 times with `6:4`. - -* `complement(6:4, 24)` is `4:1`. Note that `(6,4):(4,1)` has cosize `24`. The "hole" in `6:4` is filled with `4:1`. - -* `complement((4,6):(1,4), 24)` is `1:0`. Nothing needs to be appended. - -* `complement(4:2, 24)` is `(2,3):(1,8)`. Note that `(4,(2,3)):(2,(1,8))` has cosize `24`. The "hole" in `4:2` is filled with `2:1` first, then everything is repeated 3 times with `3:8`. - -* `complement((2,4):(1,6), 24)` is `3:2`. Note that `((2,4),3):((1,6),2)` has cosize `24` and produces unique indices. - -* `complement((2,2):(1,6), 24)` is `(3,2):(2,12)`. Note that `((2,2),(3,2)):((1,6),(2,12))` has cosize `24` and produces unique indices. - -
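The same examples can be checked directly in code; a minimal sketch, assuming `#include <cute/tensor.hpp>` and `using namespace cute;`:

```cpp
// cotarget is passed as a static integer so the results stay static.
print(complement(Layout<_4,_1>{}, Int<24>{}));                        // 6:4
print(complement(Layout<_4,_2>{}, Int<24>{}));                        // (2,3):(1,8)
print(complement(Layout<Shape<_2,_2>, Stride<_1,_6>>{}, Int<24>{}));  // (3,2):(2,12)
```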

-[figure: complement1.png]
-As a visualization, the above figure depicts the codomain of the last example. The image of the original layout `(2,2):(1,6)` is colored in gray. The complement effectively "repeats" the original layout (displayed in the other colors) such that the codomain size of the result is `24`. The complement `(3,2):(2,12)` can be viewed as the "layout of the repetition."
-
-## Division (Tiling)
-
-Finally, we can define the division of a `Layout` by another `Layout`. Functions that divide a layout into components are useful as a basis for tiling and partitioning layouts.
-
-In this section, we'll define `logical_divide(Layout, Layout)`, which again considers all `Layout`s as 1-D functions from integers to integers, and then use that definition to create multidimensional `Layout` divides.
-
-Informally, `logical_divide(A, B)` splits a layout `A` into two modes -- in the first mode are all elements pointed to by `B` and in the second mode are all elements not pointed to by `B`.
-
-Formally, this can be written as
-
-$A \oslash B := A \circ (B,B^*)$
-
-and implemented as
-```cpp
-template <class LShape, class LStride, class TShape, class TStride>
-auto logical_divide(Layout<LShape,LStride> const& layout,
-                    Layout<TShape,TStride> const& tiler)
-{
-  return composition(layout, make_layout(tiler, complement(tiler, size(layout))));
-}
-```
-Note that this is defined only in terms of concatenation, composition, and complement.
-
-So what is that?
-
-> in the first mode are all elements pointed to by `B`
-
-This is clearly composition, `A o B`.
-
-> in the second mode are all elements not pointed to by `B`
-
-The elements NOT pointed to by `B` sounds like a complement, `B*`, up to the size of `A`. As we've seen above in the `complement` section, this can be described as the "layout of the repetition of `B`." If `B` is the "tiler", then `B*` is the layout of the tiles.
-
-### Logical Divide 1-D Example
-
-Consider tiling the 1-D layout `A = (4,2,3):(2,1,8)` with the tiler `B = 4:2`. Informally, this means that we have a 1-D vector of 24 elements in some storage order defined by `A` and we want to extract tiles of 4 elements strided by 2.
-
-This is computed in the three steps described in the implementation above (a short code sketch follows the list).
-* Complement of `B = 4:2` under `size(A) = 24` is `B* = (2,3):(1,8)`.
-* Concatenation of `(B,B*) = (4,(2,3)):(2,(1,8))`.
-* Composition of `A = (4,2,3):(2,1,8)` with `(B,B*)` is then `((2,2),(2,3)):((4,1),(2,8))`.
-
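A minimal sketch of this example, assuming `#include <cute/tensor.hpp>` and `using namespace cute;`:

```cpp
auto A = Layout<Shape<_4,_2,_3>, Stride<_2,_1,_8>>{};  // (4,2,3):(2,1,8)
auto B = Layout<_4,_2>{};                              // the tile 4:2
auto R = logical_divide(A, B);
print(R);  // ((2,2),(2,3)):((4,1),(2,8)) -- mode-0 is the tile, mode-1 iterates over the tiles
// layout<0>(R) equals composition(A, B); layout<1>(R) equals composition(A, complement(B, size(A))).
```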

-[figure: divide1.png]
-The above figure depicts `A` as a 1-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are six of those tiles in `A` shown by each of the colors. After the divide, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.
-
-### Logical Divide 2-D Example
-
-Using the `Tiler` concept defined above, this immediately generalizes to multidimensional tiling. The below example simply applies `logical_divide` by-mode to the cols and rows of a 2-D layout using a `Tiler`.
-
-Similar to the 2-D composition example above, consider a 2-D layout `A = (9,(4,8)):(59,(13,1))` to which we want to apply `3:3` down the columns (mode-0) and `(2,4):(1,8)` across the rows (mode-1). This means the tiler can be written as `B = <3:3, (2,4):(1,8)>`.
-

-[figure: divide2.png]
-The above figure depicts `A` as a 2-D layout with the elements pointed to by `B` highlighted in gray. The layout `B` describes our "tile" of data, and there are twelve of those tiles in `A` shown by each of the colors. After the divide, the first mode of each mode of the result is the tile of data and the second mode of each mode iterates over each tile. In that sense, this operation can be viewed as a kind of `gather` operation or as simply a permutation on the rows and cols.
-
-Note that the first mode of each mode of the result is the sublayout `(3,(2,4)):(177,(13,2))` and is precisely the result we would have received if we had applied `composition` instead of `logical_divide`.
-
-### Zipped, Tiled, Flat Divides
-
-It's easy to see the tiles when they are highlighted in the images above, but working with them can still be awkward. How would you slice out the `3`rd tile or the `7`th tile or the `(1,2)`th tile so you could continue working on it?
-
-Enter the convenience flavors of `logical_divide`. Suppose we have a `Layout` and a `Tiler` of some shape, then each operation will apply `logical_divide`, but potentially rearrange the modes into more convenient forms.
-```text
-Layout Shape : (M, N, L, ...)
-Tiler Shape  : <TileM, TileN>
-
-logical_divide : ((TileM,RestM), (TileN,RestN), L, ...)
-zipped_divide  : ((TileM,TileN), (RestM,RestN,L,...))
-tiled_divide   : ((TileM,TileN), RestM, RestN, L, ...)
-flat_divide    : (TileM, TileN, RestM, RestN, L, ...)
-```
-
-For example, the `zipped_divide` function applies `logical_divide`, and then gathers the "subtiles" into a single mode and the "rest" into a single mode.
-```cpp
-// A: shape is (9,32)
-auto layout_a = make_layout(make_shape (Int< 9>{}, make_shape (Int< 4>{}, Int<8>{})),
-                            make_stride(Int<59>{}, make_stride(Int<13>{}, Int<1>{})));
-// B: shape is (3,8)
-auto tiler = make_tile(Layout<_3,_3>{},          // Apply 3:3 to mode-0
-                       Layout<Shape<_2,_4>,      // Apply (2,4):(1,8) to mode-1
-                              Stride<_1,_8>>{});
-
-// ((TileM,RestM), (TileN,RestN)) with shape ((3,3), (8,4))
-auto ld = logical_divide(layout_a, tiler);
-// ((TileM,TileN), (RestM,RestN)) with shape ((3,8), (3,4))
-auto zd = zipped_divide(layout_a, tiler);
-```
-Then, the offset to the `3`rd tile is `zd(0,3)`. The offset to the `7`th tile is `zd(0,7)`. The offset to the `(1,2)`th tile is `zd(0,make_coord(1,2))`. The tile itself always has layout `layout<0>(zd)`. Indeed, it is always the case that
-
-`layout<0>(zipped_divide(a, b)) == composition(a, b)`.
-
-We note that `logical_divide` preserves the *semantics* of the modes while permuting the elements within those modes -- the `M`-mode of layout `A` is still the `M`-mode of the result and the `N`-mode of layout `A` is still the `N`-mode of the result.
-
-This is not the case with `zipped_divide`. The mode-0 in the `zipped_divide` result is the `Tile` itself (of whatever rank the `Tiler` was) and mode-1 is the layout of those tiles. It doesn't always make sense to plot these as 2-D layouts, because the `M`-mode is now more aptly the "tile-mode" and the `N`-mode is more aptly the "rest-mode". Regardless, we still can plot the resulting layout as 2-D as shown below.
-

*(figure: divide3.png)*

- -We've kept each tile as its color in the previous images for clarity. Clearly, iterating across tiles is now equivalent to iterating across a row of this layout and iterating over elements within a tile is equivalent to iterating down a column of this layout. As we'll see in the `Tensor` section, this can be used to great effect in partitioning within or across tiles of data. - -## Product (Tiling) - -Finally, we can define the product of a Layout by another Layout. In this section, we'll define `logical_product(Layout, Layout)`, which again considers all `Layout`s as 1-D functions from integers to integers, and then use that definition to create multidimensional `Layout` products. - -Informally, `logical_product(A, B)` results in a two mode layout where the first mode is the layout `A` and the second mode is the layout `B` but with each element replaced by a "unique replication" of layout `A`. - -Formally, this can be written as - -$A \otimes B := (A, A^* \circ B)$ - -and implemented in CuTe as -```cpp -template -auto logical_product(Layout const& layout, - Layout const& tiler) -{ - return make_layout(layout, composition(complement(layout, size(layout)*cosize(tiler)), tiler)); -} -``` -Note that this is defined only in terms of concatenation, composition, and complement. - -So what is that? - -> where the first mode is the layout `A` - -This is clearly just a copy of `A`. - -> the second mode is the layout `B` but with each element replaced by a "unique replication" of layout `A`. - -The "unique replication" of layout `A` sounds like complement, `A*`, up to the cosize of `B`. As we've seen in the `complement` section, this can be described as the "layout of the repetition of `A`". If `A` is the "tile", then `A*` is the layout of repetitions that are available for `B`. - -### Logical Product 1-D Example - -Consider reproducing the 1-D layout `A = (2,2):(4,1)` according to `B = 6:1`. Informally, this means that we have a 1-D layout of 4 elements defined by `A` and we want to reproduce it 6 times. - -This is computed in the three steps described in the implementation above. -* Complement of `A = (2,2):(4,1)` under `6*4 = 24` is `A* = (2,3):(2,8)`. -* Composition of `A* = (2,3):(2,8)` with `B = 6:1` is then `(2,3):(2,8)`. -* Concatenation of `(A,A* o B) = ((2,2),(2,3)):((4,1),(2,8))`. - -
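The same three steps can be reproduced directly. A minimal sketch, assuming `<cute/tensor.hpp>` (the function name is illustrative):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// Reproduce the tile A = (2,2):(4,1) six times according to B = 6:1
void logical_product_1d_example()
{
  auto A   = Layout<Shape<_2,_2>, Stride<_4,_1>>{};   // the tile
  auto B   = Layout<_6,_1>{};                         // the number and order of repetitions
  auto AxB = logical_product(A, B);                   // ((2,2),(2,3)):((4,1),(2,8))
  print(AxB);
}
```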

*(figure: product1.png)*

The above figure depicts `A` and `B` as 1-D layouts. The layout `B` describes the number and order of repetitions of `A`, and they are colored for clarity. After the product, the first mode of the result is the tile of data and the second mode of the result iterates over each tile.

Note that the result is identical to the result of the 1-D Logical Divide example.

Of course, we can change the number and order of the tiles in the product by changing `B`.
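A sketch of that variation, again assuming `<cute/tensor.hpp>` (the figure below and the paragraph after it describe the result):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// Same tile A, reproduced according to B = (4,2):(2,1): eight tiles in a different order
void logical_product_reordered_example()
{
  auto A   = Layout<Shape<_2,_2>, Stride<_4,_1>>{};
  auto B   = Layout<Shape<_4,_2>, Stride<_2,_1>>{};
  auto AxB = logical_product(A, B);
  print(AxB);
}
```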

*(figure: product2.png)*

- -For example, in the above image with `B = (4,2):(2,1)`, there are 8 repeated tiles instead of 6 and the tiles are in a different order. - -### Logical Product 2-D Example - -We can use the by-mode `tiler` strategies previously developed to write multidimensional products as well. - -

*(figure: product2d.png)*

The above image demonstrates the use of a `tiler` to apply `logical_product` by-mode. Despite this **not being the recommended approach**, the result is a rank-2 layout consisting of a 2x5 row-major block that is tiled across a 3x4 column-major arrangement.

The reason **this is not the recommended approach** is that the `tiler B` in the above expression is highly unintuitive. In fact, it requires perfect knowledge of the shape and strides of `A` in order to construct. We would like to express "Tile Layout `A` according to Layout `B`" in a way that makes `A` and `B` independent and is much more intuitive.

#### Blocked and Raked Products

The `blocked_product(LayoutA, LayoutB)` and `raked_product(LayoutA, LayoutB)` are rank-sensitive transformations on top of 1-D `logical_product` that let us express the more intuitive `Layout` products we most often want.

A key observation in the implementation of these functions is the set of compatibility post-conditions of `logical_product`:
```
// @post rank(result) == 2
// @post compatible(layout_a, layout<0>(result))
// @post compatible(layout_b, layout<1>(result))
```

Because `A` is always compatible with mode-0 of the result and `B` is always compatible with mode-1 of the result, if we make `A` and `B` the same rank then we can "reassociate" like modes after the product. That is, the "column" mode in `A` can be combined with the "column" mode in `B`, the "row" mode in `A` with the "row" mode in `B`, and so on.

This is exactly what `blocked_product` and `raked_product` do, and it is why they are called rank-sensitive. Unlike other CuTe functions that take `Layout` arguments, these care about the top-level rank of the arguments so that each mode can be reassociated after the `logical_product`.
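As a concrete sketch, assuming `<cute/tensor.hpp>` (the 2x5 row-major tile and 3x4 column-major arrangement match the figures discussed below):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// A 2x5 row-major tile arranged according to a 3x4 column-major "layout of tiles"
void blocked_and_raked_example()
{
  auto tile    = Layout<Shape<_2,_5>, Stride<_5,_1>>{};   // 2x5 row-major
  auto arrange = Layout<Shape<_3,_4>, Stride<_1,_3>>{};   // 3x4 column-major
  auto blocked = blocked_product(tile, arrange);          // tiles appear as contiguous blocks
  auto raked   = raked_product  (tile, arrange);          // tiles are interleaved ("raked"/cyclic)
  print(blocked);
  print(raked);
}
```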

*(figure: productblocked2d.png)*

- -The above image shows the same result as the `tiler` approach, but with much more intuitive arguments. A 2x5 row-major layout is arranged as a tile in a 3x4 column-major arrangement. Also note that `blocked_product` went ahead and `coalesced` mode-0 for us. - -Similarly, `raked_product` combines the modes slightly differently. Instead of the resulting "column" mode being constructed from the `A` "column" mode then the `B` "column" mode, the resulting "column" mode is constructed from the `B` "column" mode then the `A` "column" mode. - -

*(figure: productraked2d.png)*

- -This results in the "tile" `A` now being interleaved or "raked" with the "layout-of-tiles" `B` instead of appearing as blocks. Other references call this a "cyclic distribution." - -### Zipped and Tiled Products - -Similar to `zipped_divide` and `tiled_divide`, the `zipped_product` and `tiled_product` simply rearrange the modes that result from a by-mode `logical_product`. - -```text -Layout Shape : (M, N, L, ...) -Tiler Shape : - -logical_product : ((M,TileM), (N,TileN), L, ...) -zipped_product : ((M,N), (TileM,TileN,L,...)) -tiled_product : ((M,N), TileM, TileN, L, ...) -flat_product : (M, N, TileM, TileN, L, ...) -``` diff --git a/executor/op-mem-cuda/docs/03_tensor.md b/executor/op-mem-cuda/docs/03_tensor.md deleted file mode 100644 index f2412d11..00000000 --- a/executor/op-mem-cuda/docs/03_tensor.md +++ /dev/null @@ -1,393 +0,0 @@ -# CuTe Tensors - -This document describes `Tensor`, CuTe's core container that deploys the `Layout` concepts previously described. - -Fundamentally, a `Tensor` represents a multidimensional array. `Tensor`s abstracts away the details of how the array's elements are organized and how the array's elements are stored. This lets users write algorithms that access multidimensional arrays generically and potentially specialize algorithms on a `Tensor`s traits. For example, the rank of the `Tensor` can be dispatched against, the `Layout` of data can be inspected, and the type of data can be verified. - -A `Tensor` is represented by two template parameters: `Engine` and `Layout`. -For a description of `Layout`, please refer to [the `Layout` section](./01_layout.md). -The `Tensor` presents the same shape and access operators as the `Layout` and uses the result of the `Layout` computation to -offset and dereference a random-access iterator held by the `Engine`. -That is, the layout of the data is provided by `Layout` and the actual data is provided by the iterator. Such data can live in any kind of memory -- global memory, shared memory, register memory -- or can even be transformed or generated on the fly. - -## Fundamental operations - -CuTe `Tensor` provides container-like operations for accessing elements. - -* `.data()`. The iterator this `Tensor` holds. - -* `.size()`. The total logical size of this `Tensor`. - -* `.operator[](Coord)`. Access the element corresponding to the logical coordinate `Coord`. - -* `.operator()(Coord)`. Access the element corresponding to the logical coordinate `Coord`. - -* `.operator()(Coords...)`. Access the element corresponding to the logical coordinate `make_coord(Coords...)`. - -CuTe `Tensor` provides a similar core of hierarchical operations as `Layout`. - -* `rank(Tensor)`. The rank of the `I...`th mode of the `Tensor`. - -* `depth(Tensor)`. The depth of the `I...`th mode of the `Tensor`. - -* `shape(Tensor)`. The shape of the `I...`th mode of the `Tensor`. - -* `size(Tensor)`. The size of the `I...`th mode of the `Tensor`. - -* `layout(Tensor)`. The layout of the `I...`th mode of the `Tensor`. - -* `tensor(Tensor)`. The subtensor corresponding to the the `I...`th mode of the `Tensor`. - -## Tensor Engines - -The `Engine` concept is a wrapper for an iterator or array of data. -It uses a stripped-down interface of `std::array` to present the iterator. - -```c++ -using iterator = // The iterator type -using value_type = // The iterator value-type -using reference = // The iterator reference-type -iterator begin() // The iterator -``` - -In general, users do not need to construct `Engine`s on their own. 
When a `Tensor` is constructed, -the appropriate engine -- often `ArrayEngine`, `ViewEngine`, or -`ConstViewEngine` -- will be constructed. - -### Tagged Iterators - -Any random-access iterator can be used to construct a `Tensor`, but -users can also "tag" any iterator with a memory space -- -e.g., to indicate this iterator is accessing global memory or shared memory. -This is done by calling `make_gmem_ptr(g)` or `make_gmem_ptr(g)` to tag `g` as a global memory iterator, -and `make_smem_ptr(s)` or `make_smem_ptr(s)` to tag `s` as a shared memory iterator. - -Tagging memory makes it possible for CuTe's `Tensor` algorithms -to use the fastest implementation for the specific kind(s) of memory. -When calling very specific operations with `Tensor`s, it also allows those -operators to verify the tags against what is expected. -For example, some kinds of optimized copy operations require -the source of the copy to be global memory -and the destination of the copy to be shared memory. -Tagging makes it possible for CuTe to dispatch -to those copy operations and/or verify against those copy operations. - -## Tensor Creation - -`Tensor`s can be constructed as owning or nonowning. - -"Owning" `Tensor`s behave like `std::array`. -When you copy the `Tensor`, you (deep-)copy its elements, -and the `Tensor`'s destructor deallocates the array of elements. - -"Nonowning" `Tensor`'s behave like a (raw) pointer. -Copying the `Tensor` doesn't copy the elements, -and destroying the `Tensor` doesn't deallocate the array of elements. - -This has implications for developers of generic `Tensor` algorithms. -For example, input `Tensor` parameters of a function -should be passed by referece or const reference, -because passing a `Tensor` by value -may or may not make a deep copy of the `Tensor`'s elements. - -### Nonowning Tensors - -A `Tensor` is usually a nonowning view of existing memory. -Nonowning `Tensor`s are created by calling `make_tensor` -with two arguments: a random-access iterator, and the `Layout` or arguments to construct a `Layout`. - -Here are some examples of creating `Tensor`s -that are nonowning views of existing memory. - -```cpp -float* A = ...; - -// Untagged pointers -Tensor tensor_8 = make_tensor(A, make_layout(Int<8>{})); // Construct with Layout -Tensor tensor_8s = make_tensor(A, Int<8>{}); // Construct with Shape -Tensor tensor_8d2 = make_tensor(A, 8, 2); // Construct with Shape and Stride - -// Global memory (static or dynamic layouts) -Tensor gmem_8s = make_tensor(make_gmem_ptr(A), Int<8>{}); -Tensor gmem_8d = make_tensor(make_gmem_ptr(A), 8); -Tensor gmem_8sx16d = make_tensor(make_gmem_ptr(A), make_shape(Int<8>{},16)); -Tensor gmem_8dx16s = make_tensor(make_gmem_ptr(A), make_shape ( 8 ,Int<16>{}), - make_stride(Int<16>{},Int< 1>{})); - -// Shared memory (static or dynamic layouts) -Layout smem_layout = make_layout(make_shape(Int<4>{},Int<8>{})); -__shared__ float smem[decltype(cosize(smem_layout))::value]; // (static-only allocation) -Tensor smem_4x8_col = make_tensor(make_smem_ptr(smem), smem_layout); -Tensor smem_4x8_row = make_tensor(make_smem_ptr(smem), shape(smem_layout), LayoutRight{}); -``` - -As shown, users wrap the pointer by identifying its memory space: -e.g., global memory (via `make_gmem_ptr` or `make_gmem_ptr`) or shared memory (via `make_smem_ptr` or `make_smem_ptr`). -`Tensor`s that view existing memory can have either static or dynamic `Layout`s. 
- -Calling `print` on all of the above tensors displays -``` -tensor_8 : ptr[32b](0x7f42efc00000) o _8:_1 -tensor_8s : ptr[32b](0x7f42efc00000) o _8:_1 -tensor_8d2 : ptr[32b](0x7f42efc00000) o 8:2 -gmem_8s : gmem_ptr[32b](0x7f42efc00000) o _8:_1 -gmem_8d : gmem_ptr[32b](0x7f42efc00000) o 8:_1 -gmem_8sx16d : gmem_ptr[32b](0x7f42efc00000) o (_8,16):(_1,_8) -gmem_8dx16s : gmem_ptr[32b](0x7f42efc00000) o (8,_16):(_16,_1) -smem_4x8_col : smem_ptr[32b](0x7f4316000000) o (_4,_8):(_1,_4) -smem_4x8_row : smem_ptr[32b](0x7f4316000000) o (_4,_8):(_8,_1) -``` - -which displays the pointer type along with any memory space tags, the pointer's `value_type` width, the raw pointer address, and the associated `Layout`. - -### Owning Tensors - -A `Tensor` can also be an owning array of memory. -Owning `Tensor`s are created by calling `make_tensor`, -where `T` is the type of each element of the array, and -a `Layout` or arguments to construct a `Layout`. -The array is allocated analogously to `std::array` and, therefore, owning `Tensor`s must be constructed with a `Layout` that has static shapes and static strides. -CuTe does not perform dynamic memory allocation in `Tensor`s as it is not a common or performant operation within CUDA kernels. - -Here are some examples of creating owning `Tensor`s. - -```c++ -// Register memory (static layouts only) -Tensor rmem_4x8_col = make_tensor(Shape<_4,_8>{}); -Tensor rmem_4x8_row = make_tensor(Shape<_4,_8>{}, - LayoutRight{}); -Tensor rmem_4x8_pad = make_tensor(Shape <_4, _8>{}, - Stride<_32,_2>{}); -Tensor rmem_4x8_like = make_tensor_like(rmem_4x8_pad); -``` - -The `make_tensor_like` function makes an owning Tensor of register memory with the same value type and shape as its input `Tensor` argument and attempts to use the same order of strides as well. - -Calling `print` on each of the above tensors produces similar output - -``` -rmem_4x8_col : ptr[32b](0x7fff48929460) o (_4,_8):(_1,_4) -rmem_4x8_row : ptr[32b](0x7fff489294e0) o (_4,_8):(_8,_1) -rmem_4x8_pad : ptr[32b](0x7fff489295e0) o (_4,_8):(_32,_2) -rmem_4x8_like : ptr[32b](0x7fff48929560) o (_4,_8):(_8,_1) -``` - -and we can see that each pointer address is unique indicating that each `Tensor` is a unique array-like allocation. - -## Accessing a Tensor - -Users can access the elements of a `Tensor` via `operator()` and `operator[]`, -which take `IntTuple`s of logical coordinates. - -When users access a `Tensor`, -the `Tensor` uses its `Layout` to map the logical coordinate -to an offset that can be accessed by the iterator. -You can see this in `Tensor`'s implementation of `operator[]`. - -```c++ -template -decltype(auto) operator[](Coord const& coord) { - return data()[layout()(coord)]; -} -``` - -For example, we can read and write to `Tensor`s using natural coordinates, using the variadic `operator()`, or the container-like `operator[]`. 
- -```c++ -Tensor A = make_tensor(Shape ,Int<13>>{}, - Stride, _64>{}); -float* b_ptr = ...; -Tensor B = make_tensor(b_ptr, make_shape(13, 20)); - -// Fill A via natural coordinates op[] -for (int m0 = 0; m0 < size<0,0>(A); ++m0) - for (int m1 = 0; m1 < size<0,1>(A); ++m1) - for (int n = 0; n < size<1>(A); ++n) - A[make_coord(make_coord(m0,m1),n)] = n + 2 * m0; - -// Transpose A into B using variadic op() -for (int m = 0; m < size<0>(A); ++m) - for (int n = 0; n < size<1>(A); ++n) - B(n,m) = A(m,n); - -// Copy B to A as if they are arrays -for (int i = 0; i < A.size(); ++i) - A[i] = B[i]; -``` - -## Tiling a Tensor - -Many of the [`Layout` algebra operations](https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/02_layout_algebra.md) can also be applied to `Tensor`. -```cpp - composition(Tensor, Tiler) -logical_divide(Tensor, Tiler) - zipped_divide(Tensor, Tiler) - tiled_divide(Tensor, Tiler) - flat_divide(Tensor, Tiler) -``` -The above operations allows arbitrary subtensors to be "factored out" of `Tensor`s. This very commonly used in tiling for threadgroups, tiling for MMAs, and reodering tiles of data for threads. - -Note that the `_product` operations are not implemented for `Tensor`s as those would -often produce layouts with increased codomain sizes, which means the `Tensor` would -require accessing elements unpredictably far outside its previous bounds. `Layout`s can be -used in products, but not `Tensor`s. - -## Slicing a Tensor - -Whereas accessing a `Tensor` with a coordinate will return an element of that tensor, -slicing a `Tensor` will return a subtensor of all the elements in the sliced mode(s). - -Slices are performed through the same `operator()` -that are used for accessing an individual element. -Passing in `_` (the underscore character, an instance of the `cute::Underscore` type) -has the same effect as `:` (the colon character) in Fortran or Matlab: -retain that mode of the tensor as if no coordinate had been used. - -Slicing a tensor performs two operations, -* the `Layout` is evaluated on the partial coordinate and the resulting offset is accumulated into the iterator -- the new iterator points to the start of the new tensor. -* the `Layout` modes cooresponding to `_`-elements of the coordinate are used to construct a new layout. -Together, the new iterator and the new layout construct the new tensor. - -```cpp -// ((_3,2),(2,_5,_2)):((4,1),(_2,13,100)) -Tensor A = make_tensor(ptr, make_shape (make_shape (Int<3>{},2), make_shape ( 2,Int<5>{},Int<2>{})), - make_stride(make_stride( 4,1), make_stride(Int<2>{}, 13, 100))); - -// ((2,_5,_2)):((_2,13,100)) -Tensor B = A(2,_); - -// ((_3,_2)):((4,1)) -Tensor C = A(_,5); - -// (_3,2):(4,1) -Tensor D = A(make_coord(_,_),5); - -// (_3,_5):(4,13) -Tensor E = A(make_coord(_,1),make_coord(0,_,1)); - -// (2,2,_2):(1,_2,100) -Tensor F = A(make_coord(2,_),make_coord(_,3,_)); -``` - -

*(figure: slice.png)*

- -In the image above, a `Tensor` is sliced in various ways and the subtensors generated by those slices are highlighted within the original tensor. Note that tensor `C` and `D` contain the same elements, but have different ranks and shapes due to the use of `_` versus the use of `make_coord(_,_)`. In each case, the rank of the result is equal to the number of `Underscore`s in the slicing coordinate. - -## Partitioning a Tensor - -To implement generic partitioning of a `Tensor`, we apply composition or tiling followed by a slicing. This can be performed in many ways, but we have found three ways that are particularly useful: inner-partitioning, outer-partitioning, and TV-layout-partitioning. - -### Inner and outer partitioning - -Let's take a tiled example and look at how we can slice it in useful ways. - -```cpp -Tensor A = make_tensor(ptr, make_shape(8,24)); // (8,24) -auto tiler = Shape<_4,_8>{}; // (_4,_8) - -Tensor tiled_a = zipped_divide(A, tiler); // ((_4,_8),(2,3)) -``` - -Suppose that we want to give each threadgroup one of these 4x8 tiles of data. Then we can use our threadgroup coordinate to index into the second mode. - -```cpp -Tensor cta_a = tiled_a(make_coord(_,_), make_coord(blockIdx.x, blockIdx.y)); // (_4,_8) -``` - -We call this an *inner-partition* because it keeps the inner "tile" mode. This pattern of applying a tiler and then slicing out that tile by indexing into the remainder mode is common and has been wrapped into its own function `inner_partition(Tensor, Tiler, Coord)`. You'll often see `local_tile(Tensor, Tiler, Coord)` which is just another name for `inner_partition`. The `local_tile` partitioner is very often applied at the threadgroup level to partition tensors into tiles across threadgroups. - -Alternatively, suppose that we have 32 threads and want to give each thread one element of these 4x8 tiles of data. Then we can use our thread to index into the first mode. - -```cpp -Tensor thr_a = tiled_a(threadIdx.x, make_coord(_,_)); // (2,3) -``` - -We call this an *outer-partition* because it keeps the outer "rest" mode. This pattern of applying a tiler and then slicing into that tile by indexing into the tile mode is common and has been wrapped into its own function `outer_partition(Tensor, Tiler, Coord)`. Sometimes you'll see `local_partition(Tensor, Layout, Idx)`, which is a rank-sensitive wrapper around `outer_partition` that transforms the `Idx` into a `Coord` using the inverse of the `Layout` and then constructs a `Tiler` with the same top-level shape of the `Layout`. This allows the user to ask for a row-major, column-major, or arbitrary layout of threads with a given shape that can be used to partition into a tensor. - -To see how these partitioning patterns are used, see the [introductory GEMM tutorial](./0x_gemm_tutorial.md). - -### Thread-Value partitioning - -Another common partitioning strategy is called a thread-value partitioning. In this pattern, we construct a `Layout` that represents the mapping of all threads (or any parallel agent) and all values that each thread will receive to coordinates of the target data. With `composition` the target data layout is transformed according to our TV-layout and then we can simply slice into the thread-mode of the result with our thread index. 
```cpp
// Construct a TV-layout that maps 8 thread indices and 4 value indices
// to 1D coordinates within a 4x8 tensor
//   (T8,V4) -> (M4,N8)
auto tv_layout = Layout<Shape <Shape <_2,_4>,Shape <_2, _2>>,
                        Stride<Stride<_8,_1>,Stride<_4,_16>>>{};   // (8,4)

// Construct a 4x8 tensor with any layout
Tensor A = make_tensor<float>(Shape<_4,_8>{}, LayoutRight{});      // (4,8)
// Compose A with the tv_layout to transform its shape and order
Tensor tv = composition(A, tv_layout);                             // (8,4)
// Slice so each thread has 4 values in the shape and order that the tv_layout prescribes
Tensor v  = tv(threadIdx.x, _);                                    // (4)
```

*(figure: tv_layout.png)*

- -The above image is a visual representation of the above code. An arbitrary 4x8 layout of data is composed with a specific 8x4 TV-layout that represents a partitioning pattern. The result of the composition is on the right where each threads' values are arranged across each row. The bottom layout depicts the inverse TV layout which shows the mapping of 4x8 logical coordinates to the thread id and value id they will be mapped to. - -To see how these partitioning patterns are constructed and used, see the [tutorial on building MMA Traits](./0t_mma_atom.md). - -## Examples - -### Copy a subtile from global memory to registers - -The following example copies rows of a matrix (with any `Layout`) -from global memory to register memory, -then executes some algorithm `do_something` -on the row that lives in register memory. - -```c++ -Tensor gmem = make_tensor(ptr, make_shape(Int<8>{}, 16)); // (_8,16) -Tensor rmem = make_tensor_like(gmem(_, 0)); // (_8) -for (int j = 0; j < size<1>(gmem); ++j) { - copy(gmem(_, j), rmem); - do_something(rmem); -} -``` - -This code does not need to know anything about the `Layout` of `gmem` -other than that it is rank-2 and that the first mode has a static size. -The following code checks both of those conditions at compile time. - -```c++ -CUTE_STATIC_ASSERT_V(rank(gmem) == Int<2>{}); -CUTE_STATIC_ASSERT_V(is_static(gmem))>{}); -``` - -Extending this example using the tiling utilities detailed in [the `Layout` algebra section](./02_layout_algebra.md), we can copy an arbitrary subtile of a tensor using almost the same code as above. - -```c++ -Tensor gmem = make_tensor(ptr, make_shape(24, 16)); // (24,16) - -auto tiler = Shape<_8,_4>{}; // 8x4 tiler -//auto tiler = Tile, Layout<_4,_2>>{}; // 8x4 tiler with stride-3 and stride-2 -Tensor gmem_tiled = zipped_divide(gmem, tiler); // ((_8,_4),Rest) -Tensor rmem = make_tensor_like(gmem_tiled(_, 0)); // ((_8,_4)) -for (int j = 0; j < size<1>(gmem_tiled); ++j) { - copy(gmem_tiled(_, j), rmem); - do_something(rmem); -} -``` - -This applies a statically shaped `Tiler` to the global memory `Tensor`, creates an register `Tensor` that is compatible with the shape of that tile, then loops through each tile to copy it into memory and `do_something`. - -## Summary - -* `Tensor` is defined as an `Engine` and a `Layout`. - - * `Engine` is an iterator that can be offset and dereferenced. - * `Layout` defines the logical domain of the tensor and maps coordinates to offsets. - -* Tile a `Tensor` using the same methods for tiling `Layout`s. - -* Slice a `Tensor` to retrieve subtensors. - -* Partitioning is tiling and/or composition followed by slicing. diff --git a/executor/op-mem-cuda/docs/04_algorithms.md b/executor/op-mem-cuda/docs/04_algorithms.md deleted file mode 100644 index f427e5ef..00000000 --- a/executor/op-mem-cuda/docs/04_algorithms.md +++ /dev/null @@ -1,223 +0,0 @@ -# CuTe Tensor algorithms - -This section summarizes the interfaces and implementations -of common numerical algorithms performed on `Tensor`s. - -The implementation of these algorithms may be found in the -[include/cute/algorithm/](../../../include/cute/algorithm/) -directory. - -## `copy` - -CuTe's `copy` algorithm copies the elements of a source `Tensor` -into the elements of a destination `Tensor`. -The various overloads of `copy` can be found in -[`include/cute/algorithm/copy.hpp`](../../../include/cute/algorithm/copy.hpp). 
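In its simplest form, the two-parameter overload just copies every element of the source into the destination. A minimal device-code sketch, assuming `<cute/tensor.hpp>` (names and shapes are illustrative):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

// Copy an 8-element view of global memory into an owning register-backed tensor
__device__ void copy_example(float const* gptr)
{
  Tensor src = make_tensor(make_gmem_ptr(gptr), Int<8>{});   // non-owning global-memory view
  Tensor dst = make_tensor<float>(Int<8>{});                 // owning register array
  copy(src, dst);                                            // dst(i) = src(i) for all i
}
```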
- -### Interface and specialization opportunities - -A `Tensor` encapsulates the data type, data location, -and possibly also the shape and stride of the tensor at compile time. -As a result, `copy` can and does dispatch, -based on the types of its arguments, -to use any of various synchronous or asynchronous hardware copy instructions. - -The `copy` algorithm has two main overloads. -The first just takes the source `Tensor` and the destination `Tensor`. - -```c++ -template -CUTE_HOST_DEVICE -void -copy(Tensor const& src, - Tensor & dst); -``` - -The second takes those two parameters, plus a `Copy_Atom`. - -```c++ -template -CUTE_HOST_DEVICE -void -copy(Copy_Atom const& copy_atom, - Tensor const& src, - Tensor & dst); -``` - -The two-parameter `copy` overload picks a default implementation -based only on the types of the two `Tensor` parameters. -The `Copy_Atom` overload lets callers override that default -by specifying a nondefault `copy` implementation. - -### Parallelism and synchronization depend on parameter types - -Either the default implementation or -the implementation selected by a `Copy_Atom` overload -may use none or all available parallelism, -and may have a variety of synchronization semantics. -The behavior depends on `copy`'s parameter types. -Users are expected to figure this out based on their knowledge -of the architecture on which they are running. -(Developers often write a custom optimized kernel -for each GPU architecture.) - -The `copy` algorithm may be sequential per thread, -or it may be parallel across some collection of threads -(e.g., a block or cluster). - -If `copy` is parallel, -then the collection of participating threads -may need synchronization before any thread in the collection -may assume that the copy operation has completed. -For example, if the participating threads form a thread block, -then users must invoke `__syncthreads()` -or the Cooperative Groups equivalent -before they may use the results of `copy`. - -The `copy` algorithm may use asynchronous copy instructions, -such as `cp.async`, or its C++ interface `memcpy_async`. -In that case, users will need to perform -the additional synchronization appropriate to that underlying implementation -before they may use the results of the `copy` algorithm. -[The CuTe GEMM tutorial example](../../../examples/cute/tutorial/) -shows one such synchronization method. -More optimized GEMM implementations use pipelining techniques -to overlap asynchronous `copy` operations with other useful work. - -### A generic copy implementation - -A simple example of a generic `copy` implementation -for any two `Tensor`s looks like this. - -```c++ -template -CUTE_HOST_DEVICE -void -copy(Tensor const& src, // Any logical shape - Tensor & dst) // Any logical shape -{ - for (int i = 0; i < size(dst); ++i) { - dst(i) = src(i); - } -} -``` - -This generic `copy` algorithm addresses both `Tensor`s -with 1-D logical coordinates, thus traversing both `Tensor`s -in a logical column-major order. -Some reasonable architecture-independent optimizations -would include the following. - -1. If the two `Tensor`s have known memory spaces with optimized - access instructions (like `cp.async`), then dispatch to the - custom instruction. - -2. The two `Tensor`s have static layouts and it can be proven -that element vectorization is valid -- for example, four `ld.global.b32`s -can be combined into a single `ld.global.b128` -- then vectorize the source - and destinations tensors. - -3. 
If possible, validate that the copy instruction to be used is - appropriate for the source and destination tensors. - -CuTe's optimized copy implementations can do all of these. - -## `copy_if` - -CuTe's `copy_if` algorithm lives in the same header as `copy`, -[`include/cute/algorithm/copy.hpp`](../../../include/cute/algorithm/copy.hpp). -The algorithm takes source and destination `Tensor` parameters like `copy`, -but it also takes a "predication `Tensor`" -with the same shape as the input and output. -Elements of the source `Tensor` are only copied -if the corresponding predication `Tensor` element is nonzero. - -For details on why and how to use `copy_if`, -please refer to the -["predication" section of the tutorial](./0y_predication.md). - -## `gemm` - -### What `gemm` computes - -The `gemm` algorithm takes three `Tensor`s, A, B, and C. -What it does depends on the number of modes -that its `Tensor` parameters have. -We express these modes using letters. - -* V indicates a "vector," a mode of independent elements. - -* M and N indicate the number of rows resp. columns - of the matrix result C of the BLAS's GEMM routine. - -* K indicates the "reduction mode" of GEMM, - that is, the mode along which GEMM sums. - Please see the [GEMM tutorial](./0x_gemm_tutorial.md) for details. - -We list the modes of the input `Tensor`s A and B, -and the output `Tensor` C, -using a notation `(...) x (...) => (...)`. -The two leftmost `(...)` describe A and B (in that order), -and the `(...)` to the right of the `=>` describes C. - -1. `(V) x (V) => (V)`. The element-wise product of vectors: Cv += Av Bv. Dispatches to FMA or MMA. - -2. `(M) x (N) => (M,N)`. The outer product of vectors: Cmn += Am Bn. Dispatches to (4) with V=1. - -3. `(M,K) x (N,K) => (M,N)`. The product of matrices: Cmn += Amk Bnk. Dispatches to (2) for each K. - -4. `(V,M) x (V,N) => (V,M,N)`. The batched outer product of vectors: Cvmn += Avm Bvn. Optimizes for register reuse and dispatches to (1) for each M, N. - -5. `(V,M,K) x (V,N,K) => (V,M,N)`. The batched product of matrices: Cvmn += Avmk Bvnk. Dispatches to (4) for each K. - -Please refer to the [GEMM tutorial](./0x_gemm_tutorial.md) -for an overview of CuTe's convention for ordering the modes. -For example, if K appears, it always appears rightmost ("outermost"). -If V appears, it always appears leftmost ("innermost"). - -### Dispatch to optimized implementations - -Just like with `copy`, CuTe's implementations of `gemm` -uses its `Tensor` arguments' types to dispatch -to an appropriately optimized implementation. -Also like `copy`, `gemm` takes an optional `MMA_Atom` parameter -that lets callers override the default `FMA` instruction -that CuTe would select based on the `Tensor` arguments' types. - -For more information on `MMA_Atom` and on specialization of `gemm` -for different architectures, please refer to the -[MMA section of the tutorial](./0t_mma_atom.md). - -## `axpby` - -The `axpby` algorithm lives in the header file -[`include/cute/algorithm/axpby.hpp`](../../../include/cute/algorithm/axpby.hpp). -It assigns to $y$ the result of $\alpha x + \beta y$, -where $\alpha$ and $\beta$ are scalars and $x$ and $y$ are `Tensor`s. -The name stands for "Alpha times X Plus Beta times Y," -and is a generalization of the original BLAS "AXPY" routine -("Alpha times X Plus Y"). - -## `fill` - -The `fill` algorithm lives in the header file -[`include/cute/algorithm/fill.hpp`](../../../include/cute/algorithm/fill.hpp). 
-It overwrites the elements of its `Tensor` output argument -with a given scalar value. - -## `clear` - -The `clear` algorithm lives in the header file -[`include/cute/algorithm/clear.hpp`](../../../include/cute/algorithm/clear.hpp). -It overwrites the elements of its `Tensor` output argument with zeros. - -## Other algorithms - -CuTe provides other algorithms. -Their header files can be found in the -[`include/cute/algorithm`](../../../include/cute/algorithm) -directory. diff --git a/executor/op-mem-cuda/docs/0t_mma_atom.md b/executor/op-mem-cuda/docs/0t_mma_atom.md deleted file mode 100644 index 6f285c26..00000000 --- a/executor/op-mem-cuda/docs/0t_mma_atom.md +++ /dev/null @@ -1,527 +0,0 @@ -# CuTe's support for Matrix Multiply-Accumulate instructions - -In this file, we explain in detail how we support our GPUs' -Matrix Multiply-Accumulate (MMA) hardware instructions in CuTe. - -MMAs are architecture-specific. -Different generations of GPU architectures -introduce different sets of MMA instructions. -However, CuTe features such as `Layout` -makes it possible to expose MMAs for use in generic CUDA C++ code. -We accomplish this in multiple steps. - -1. We wrap each MMA's PTX instruction in an "Operation" struct. - -2. For each Operation struct, we define a "Traits" struct - that defines all of the meta-information needed to use the Operation. - -3. Combining the above, an "Atom" is the combination of the PTX Operation struct with the - meta-information Traits struct and provides methods to construct - `cute::Tensor` "fragments" for that Operation and to use that Operation - on existing `cute::Tensor`s. - -4. Combining potentially multiple Atoms, a "TiledMMA" provides utilities for building - more complex partitioning patterns by creating layouts and interleavings of Atoms. - -## CuTe MMA Atoms - -CuTe exposes each MMA to generic CUDA C++ code as a pair of structs: -an "Operation" struct, -and an `MMA_Traits` struct templated on the Operation struct type. - -An "Operation" struct exposes the PTX instruction -for that specific operation. -It defines the arguments and interface it expects. -Operation structs have minimal software dependencies -- -they do not use layouts, tensors, or non-standard numeric data types -- and -describe only the physical inputs and outputs to the instruction. -Different structs have different names -that describe what the MMA instruction does. -We will explain the naming scheme below. - -A corresponding `MMA_Traits` struct specialization -defines meta-information about the Operation, -such as the logical compute types, the logical shape of the operation, -and the `Layout`s of threads and values within the operation. -The `MMA_Traits` struct takes the Operation as a template parameter. -CuTe specializes `MMA_Traits` for each Operation type that it supports. - -Together, these two types comprise an "Atom" that decouples the complexity of thread and data layouts from the call site of the PTX instruction. The Atom's Traits struct exposes information that is relevant to a single MMA operation, no matter the granularity at which it operates. - -CuTe MMA atoms expose the semantics of a single MMA operation. -This is true regardless of the hardware level at which the MMA operates. -CuTe supports MMA atoms that operate at a variety of hardware levels, -including - -* a single thread (e.g., fused multiply-add (FMA) instruction); - -* a quadpair (Volta); - -* a single warp (Ampere); and - -* a warpgroup (Hopper). 
- -### Operation structs - -#### Location of files - -CuTe provides its Operations structs in the -[`include/cute/arch`](../../../include/cute/arch) -directory, in header files starting with `mma`. - -#### Operation struct's name - -A CuTe Operation struct's name principally encodes the PTX instruction it wraps. -These often include - -* its first supported architecture, - -* the M, N, and K dimensions that it accepts, - -* the types that it takes, and - -* the arrangement of the A and B inputs. - -For example, the Volta section below will refer to the -`SM70_8x8x4_F32F16F16F32_NT` Operation struct defined in -[`include/cute/arch/mma_sm70.hpp`](../../../include/cute/arch/mma_sm70.hpp). - -* "SM70" refers to Volta. - -* "8x8x4" refers to M = 8, N = 8, and K = 4, - the dimensions of the MMA operation that the quadpair performs - (see below). This is reflected in the PTX as `.m8n8k4.`. - -* "F32F16F16F32" refers to the element types - of the four matrix operands A, B, C, and D. - An MMA computes D = C + A * B, - so we read the types from left to right: - D is F32 (`float`), A is F16 (half), - B is F16 (half), and C is F32 (`float`). This is reflected in the PTX instruction name as `.f32.f16.f16.f32`. - -* "NT" means that the PTX instruction is designed for inputs A as M-major (not transposed, column-major) - and inputs B as N-major (transposed, row-major). This is reflected in the PTX instruction name as `.col.row.`. - -#### Contents - -An Operation struct has the following members. - -##### Type aliases - -An Operation struct has four public type aliases: -`DRegisters`, `ARegisters`, `BRegisters`, and `CRegisters`. -For example, the `SM70_8x8x4_F32F16F16F32_NT` Operation struct defined in -[`include/cute/arch/mma_sm70.hpp`](../../../include/cute/arch/mma_sm70.hpp) -defines these as follows. - -```c++ -using DRegisters = float[8]; -using ARegisters = uint32_t[2]; -using BRegisters = uint32_t[2]; -using CRegisters = float[8]; -``` - -This shows how many values each thread will pass into the PTX instruction -for each of the matrices A, B, C, and D. For this Operation, -each thread passes 8 F32 values each for C and D (hence `float[8]`), -and 4 F16 values each for A and B (hence `uint32_t[2]`; -the instruction packs two 16-bit F16 values -in each of the two 32-bit `uint32_t` values). - -##### `fma` static member device function - -An operation struct defines a public `static void fma` function. -It is marked with the `CUTE_HOST_DEVICE` macro, -which adds the `__host__ __device__` annotations. -Different Operations define `fma` to take different numbers of arguments, -depending on the PTX MMA instruction. -The implementation protects use of the PTX instruction with a macro, -and raises an `assert` if `fma` is called when the macro is not defined. -This ensures that tests and examples that use this Operation in an Atom -can still compile, even if the PTX instruction is not available. - -### Traits - -#### Location of files - -CuTe provides its Traits structs in the -[`include/cute/atom`](../../../include/cute/atom) -directory, in header files starting with `mma_traits`. - -#### Contents - -An `MMA_Traits` specialization defines the following public type aliases. 
- -* `ValTypeD`: Logical compute type of the D matrix - -* `ValTypeA`: Logical compute type of the A matrix - -* `ValTypeB`: Logical compute type of the B matrix - -* `ValTypeC`: Logical compute type of the C matrix - -* `Shape_MNK`: Logical MxNxK shape of the MMA operation - -* `ThrID`: Logical thread mapping within the single MMA operation - (specifying the thread, quadpair, warp, or warpgroup view) - -* `ALayout`: Mapping of (thread,value) pairs to coordinates in the MxK A matrix - -* `BLayout`: Mapping of (thread,value) pairs to coordinates in the NxK B matrix - -* `CLayout`: Mapping of (thread,value) pairs to coordinates in the MxN C matrix - -#### Example - -The specialization of MMA_Traits for the -`SM70_8x8x4_F32F16F16F32_NT` Operation lives in the header file -[`include/cute/atom/mma_traits_sm70.hpp`](../../../include/cute/atom/mma_traits_sm70.hpp). -It looks like this. - -```c++ -template <> -struct MMA_Traits -{ - using ValTypeD = float; - using ValTypeA = half_t; - using ValTypeB = half_t; - using ValTypeC = float; - - using Shape_MNK = Shape<_8,_8,_4>; - using ThrID = SM70_QuadPair; - using ALayout = SM70_8x4_Col; - using BLayout = SM70_8x4_Col; - using CLayout = SM70_8x8_32b; -}; -``` - -The next section will explain these type aliases in detail. - -## Volta - -This and the following sections show examples of how to construct MMA atoms. -We don't try to explain this for all GPU architectures and MMAs. -Instead, we use selected examples to illustrate the process -of developing new atoms. - -Volta architecture implements an HMMA instruction where a group of 8 threads called a quadpair (QP) collaborate to share data and perform an 8x8x4 (fp32 or fp16) matrix multiply-accumulate. (since a warp is 32 threads wide, it would perform an MMA across 4 QPs for a tile size of 16x16x4). - -We first take a look at how we would take the ISA semantics of thread and data partitioning for the HMMA instruction, and encode it in a Traits struct. The HMMA NT instruction has the thread-data layout: - -

*(figure: HMMA.8x8x4.NT.png)*

- -### Types - -The HMMA NT above uses types: - -```cpp - using ValTypeD = float; - using ValTypeA = half_t; - using ValTypeB = half_t; - using ValTypeC = float; -``` - -The rest of the `MMA_Traits` will be described in units of these types. - -### Shape - -The HMMA NT above has shape 8x8x4: - -```cpp - // Logical shape of the MMA - using Shape_MNK = Shape <_8,_8,_4>; -``` - -### Thread ID - -If the 32 threads in a warp are logically indexed by [0 ... 31], then the above image contains threads [0,1,2,3]U[16,17,18,19]. These threads make up the 0th quadpair. We can write a thread mapping that maps eight logical thread ids [0,1,2,3,4,5,6,7] of the MMA to a quadpair thread index [0,1,2,3]U[16,17,18,19] of a warp. The layout function has 4 elements with a stride of 1 and 2 of those with a stride of 16. With this, we write a layout that represents a quadpair: - -```cpp - // Mapping from (logical thread id) -> (thread idx) - using ThrID = Layout, - Stride<_1,_16>>; -``` - -Again, this layout function maps the logical thread id [0,8) of the MMA operation onto the quadpair thread index [0,4)U[16,20) of a warp. - -### Accumulator Mapping - -Let us look at exactly how the 8 threads within a QP are mapped to the A, B and C matrices. For the C and D matrices, the above image is broken down a bit more below. On the left is shown the whole QP level view, and on the right is shown the values owned by just thread 0. - -

*(figure: HMMA.8x8x4.quadpair.C.png)*

- -The metainformation of this single instruction level view is what we want to encode in CuTe. Specifically, the QP level view in this diagram corresponds to the four MMA traits for [SM70_F32F16F16F32](../../../include/cute/arch/mma_sm70.hpp). These structs contain the `Element` types, the `Shape_MNK`, and the `ThrID` mapping we constructed above. Now, let us take a look at the definition of `CLayout`, the thread-data layout of accumulators. The job of `CLayout` is to construct a mapping between the `(logical_thr_id, logical_val_id)` and `(m, n)` coordinate in the C matrix which can then be used to build up more complicated layouts and operations like the 16x16x4 WMMA. - -We can start constructing a `CLayout` from the picture above. As with any CuTe layout, it is a pair of `Shape` and corresponding `Stride`. Let us just look at the shape for now. We know that the HMMA uses 8 threads each of which own 8 values. Therefore, the shape of our mapping must have a size of 8 along two modes. With this, we have - -```cpp - // (T8,V8) -> (m,n) - using CLayout = Layout, - Stride<_?, _?>; // Stride to be filled in below -``` - -This is not to be confused with the logical 8x8 shape of the C matrix. This is 8-threads by 8-values. We now want to map those to (m,n) coordinates. Since CuTe layouts return indices rather than coordinates, we choose a column-major encoding of the (m,n) coordinates: - -``` -(logical_thr_id, logical_val_id) -> (m, n) == m + n * M -``` - -With this in place, we can start thinking about how to construct the strides in `CLayout`. Let's begin by looking at the strides between threads. Note that -* `(T0,V0)` is located at `(m,n) = (0,0) = 0` -* `(T1,V0)` is located at `(m,n) = (1,0) = 1` -* `(T2,V0)` is located at `(m,n) = (0,2) = 16` -* `(T3,V0)` is located at `(m,n) = (1,2) = 17` -* `(T4,V0)` is located at `(m,n) = (4,0) = 4` -* `(T5,V0)` is located at `(m,n) = (5,0) = 5` -* `(T6,V0)` is located at `(m,n) = (4,2) = 20` -* `(T7,V0)` is located at `(m,n) = (5,2) = 21` - -where `T4`,`T5`,`T6`,`T7` are the 4th,5th,6th,7th logical thread id of the MMA corresponding to thread indices of 16,17,18,19 of the warp (recorded in the `ThrID` mapping!). - -We note that the pattern can be transcribed to a layout. We can find the position of the 8 threads via - -```cpp - using CLayout = Layout, _8>, - Stride, _?>; -``` - -With the exact same approach, we can construct the stride along the `logical value id` mode. -* `(T0,V0)` is located at `(m,n) = (0,0) = 0` -* `(T0,V1)` is located at `(m,n) = (0,1) = 8` -* `(T0,V2)` is located at `(m,n) = (2,0) = 2` -* `(T0,V3)` is located at `(m,n) = (2,1) = 10` -* `(T0,V4)` is located at `(m,n) = (0,4) = 32` -* `(T0,V5)` is located at `(m,n) = (0,5) = 40` -* `(T0,V6)` is located at `(m,n) = (2,4) = 34` -* `(T0,V7)` is located at `(m,n) = (2,5) = 42` - -We note that this pattern can also be transcribed to a layout. We can find the position of the 8 values via - -```cpp - // (T8,V8) -> (m,n) - using CLayout = Layout, Shape <_2,_2, _2>>, - Stride, Stride<_8,_2,_32>>>; -``` - -And that's all! We can verify that each `(tid,vid)` coordinate in this layout is reliably mapped to the correct (encoded) `(m,n)` coordinate. - -In the case of F16 accumulators, the layout is way less complex. Each row of accumulators `(m, :)` is held by a single thread, which makes the layout: - -```cpp - using CLayout = Layout, - Stride<_1,_8>>; -``` - -### A and B Layout Mapping - -A and B matrix layouts depend on whether the sources are transposed or not. 
The diagram below shows the thread ID to data ownership map for A and B matrices in the case of NT and TN transposes. - -

*(figure: HMMA.8x8x4.quadpair.AB.png)*

- -Let's look at the TN layout for A matrix first (right side in the diagram). Again, there are the same 8 logical threads, but each threads owns only 4 elements this time. The shape of `ALayout` will then be `Shape<_8, _4>`. As for the strides, we again need a similar mapping between `(m, k) == m + k * M`. Looking down the `M` mode, we go from `(T0, V0)` to `(T1, V0)` which is a stride of 1 for all 8 threads. For the `K` mode, as we go across, we go from `(T0, V0)` to `(T0, V1)`, which makes a stride of 8 for all 4 values. Therefore, the A layout is: - -```cpp - // (T8,V4) -> (m,k) - using ALayout = Layout, - Stride<_1,_8>>; -``` - -Source B layout is constructed similarly for the TN HMMA, except that we want write it as `(N,K)` rather than `(K,N)` for convenience. For the strides, as we go across the `N` mode, we go from `(T0, V0)` to `(T1, V0)`, making this a stride of 1 for all 8 threads. As we go down the `K` mode, `(T0, V0)` to `(T0, V1)` which is a stride of 8 for all 4 values. So the B layout is the same as A: - -```cpp - // (T8,V4) -> (n,k) - using BLayout = Layout, - Stride<_1,_8>>; -``` - -The layouts in the case of NT are a bit more complicated (left side of the diagram). Going down the `M` mode of `A`, we see the four values of `T0` first and then we see the four values of `T4`. This means we first have a stride of 1 for 4 values, followed by a stride of 4 from `T0` to `T4`. So we have two sub-strides along the `M` mode. For the `K` mode, as we go across, we simply increment the `thr_id`, keeping `val_id` the same, making the stride 8 for 4 threads. This makes the A layout: - -```cpp - // (T8,V4) -> (m,k) - using ALayout = Layout,_4>, - Stride,_1>>; -``` - -With the `(N,K)` ordering for B, the layout is the same. - -```cpp - // (T8,V4) -> (n,k) - using BLayout = Layout,_4>, - Stride,_1>>; -``` - -For the NN and TT transposes, they are simply combinations of the two layouts we have seen for A and B so far. - -## Hopper - -Now, we are ready to take a look at the much larger GMMA operation (Group MMA) first introduced with Hopper architecture. These MMA instructions operate at the granularity of 128 threads (4 warps), which are collectively referred to as a warpgroup. - -### Thread ID - -In the case of Hopper GMMAs, the thread IDs are assigned based on the simple 1D contiguous layout, which makes `thrID` trivial: - -```cpp -using ThrID = Layout<_128, _1>; -``` - -### Accumulator Mapping - -Accumulators are mapped hierarchically in GMMA, starting from the concept of a core matrix and building up to a layout for the whole C matrix tile. Let's look at this core matrix first. We only consider fp16 accumulators here, but extensions of fp32 accumulators as trivial as we will see later. - -Each core matrix has the layout as shown in the diagram below. -

*(figure: gmma_coremat_cd_fp16.png)*

- -As in the Volta examples, the thread IDs are logical only, and which of the four warps they belong to in the warpgroup is not important. - -Then GMMA tiles this core matrix first vertically along the M mode, and then repeats that column of core matrices along the N mode to construct the full MxN tile. This tiling is shown in the image below. - -

*(figure: gmma_wg_n_slice.png)*

- -With this image, we are again ready to start building the `CLayout` for `SM90_64x128x16_F16F16F16F16_TN` atom. Same as before, we are constructing a mapping between the `(logical_thr_id, logical_val_id) -> (m, n)` coordinate spaces. - -To begin, let's follow the first few threads and values. We immediately see that they are arranged along the `N`-mode with pairs of values and four threads. This gives us - -```cpp -// (T128,V4) -> (M64,N8) -using CLayout = Layout, Shape < _2, ...>>, - Stride, Stride<_64, ...>>>; -``` - -To complete the first 8x8 core matrix, the four threads repeat eight times down the `M`-mode: - -```cpp -// (T128,V4) -> (M64,N8) -using CLayout = Layout, Shape < _2, ...>>, - Stride, Stride<_64, ...>>>; -``` - -Then, as we go to the next core matrix, we wrap back again to `T0`, but this time to `(T0, V2)`. - -```cpp -// (T128,V4) -> (M64,N8) -using CLayout = Layout, Shape < _2, _2>>, - Stride, Stride<_64, _8>>>; -``` - -Finally, we get this entire pattern repeating four times, once for each warp, down the `M`-mode starting at `(m,n) = (16,0) = 16`. where two core matrices that belong to the same warp are stacked on top of each other. This makes the size of the final sub-mode of M 4. As for the stride, this time we go to `(T32, V0)`, which makes it a stride of 32. - -```cpp -// (T128,V4) -> (M64,N8) -using CLayout = Layout, Shape < _2, _2>>, - Stride, Stride<_64, _8>>>; -``` - -This is the full `CLayout` for 64x8 accumulators. The GMMA instructions include 64xN variants with `N = [16,32,64,128,256]` where this 64x8 pattern is repeated giving each thread additional values. As this starts at `(m,n) = (0,8) = 512`, this is easy to account for in our `CLayout`. For example, the 64x128 `CLayout` is - -```cpp -// (T128,V64) -> (M64,N128) -using CLayout = Layout, Shape < _2, _2, _16>>, - Stride, Stride<_64, _8, _512>>>; -``` - -where we see 16 copies of the 64x8 tile. - -### A and B Layout Mapping - -GMMA atoms that consume A and B sources directly from shared memory are a bit interesting. The GMMA Descriptor is constructed on an entire tile of A and/or B data in shared memory rather than being partitioned by threads. That is, every thread sees the entire tile of data and the tile is not reordered so that the descriptor can be constructed on it. In `ALayout` form, this can be expressed - -```cpp -// (T128,V64x8) -> (M64,K16) -using ALayout = Layout>, - Stride< _0, Stride< _1,_64>>>; -``` - -That is, all threads are mapped the to `(m,k) = (0,0) = 0` element and the values (and shape of the values) remains unchanged. The GMMA Descriptor Constructor can then inspect the `(M,K)` layout of this data and create an appropriate GMMA Descriptor or produce an error message saying the data is in an invalid layout for GMMA. - -## `TiledMMA`s - -We can make more complex patterns by combining and interleaving multiple atoms. - -Let's start with `SM70_8x8x4_F32F16F16F32_NT`. -```cpp -MMA_Atom mma = MMA_Atom{}; -print_latex(mma); -``` -

*(figure: HMMA.8x8x4.NT_Atom.png)*

- -The above is equivalent to -```cpp - TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{}, - Layout>{}, // Layout of Atoms - Tile<_8,_8,_4>{}); // Tiler - print_latex(mma); -``` -as it is a single atom and has a natural tile size of 8x8x4. - -We can create an object akin to a WMMA by using four of these quadpair MMAs: -```cpp - TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{}, - Layout, - Stride<_2,_1>>{}); // 2x2 n-major layout of Atoms - print_latex(mma); -``` -

*(figure: HMMA.8x8x4.NT_2x2.png)*

-This `TiledMMA` replicates the `MMA_Atom` across threads as we can see the `T4` and `T8` and `T12` threads in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the atom's partitioning pattern for a new quadpair and this replication follows a `(2,2):(2,1)` layout. - -The above represents a 16x16x4 MMA now, but we can immediately expand this "tile size" up to 32x32x4 instead: -```cpp - TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{}, - Layout, - Stride<_2,_1>>{}, // 2x2 n-major layout of Atoms - Tile<_32,_32,_4>{}); // 32x32x4 tiler - print_latex(mma); -``` -

*(figure: HMMA.8x8x4.NT_2x2_32x32x4.png)*

-This `TiledMMA` replicates the previous `TiledMMA` across values instead of threads. We can see the `T0V8` and `T16V8` and `T8V8` values in the `C`-matrix that were not used before. Each quadrant of the `C`-matrix is a replica of the previous `TiledMMA`'s partitioning pattern for a new set of values. - -Continuing, we see that there are eight values that `T0` receives from the `A`-matrix. Those reads occur at coordinates -``` -T0V0 => ( 0,0) -T0V1 => ( 1,0) -T0V2 => ( 2,0) -T0V3 => ( 3,0) -T0V4 => (16,0) -T0V5 => (17,0) -T0V6 => (18,0) -T0V7 => (19,0) -``` -which are separate, but we might prefer them to be next to each other. That is we would like to permute the `M`-mode to create another valid `TiledMMA`. - -```cpp - TiledMMA mma = make_tiled_mma(SM70_8x8x4_F32F16F16F32_NT{}, - Layout, - Stride<_2,_1>>{}, // 2x2 n-major layout of Atoms - Tile, - Stride<_1,_8,_4>>, // Permutation on M, size 32 - _32, // Permutation on N, size 32 identity - _4>{}); // Permutation on K, size 4 identity - print_latex(mma); -``` -

*(figure: HMMA.8x8x4.NT_2x2_32Mx32x4.png)*

- -That layout `(4,4,2):(1,8,4)` is read like a scatter permutation, telling the m-coords of the original image where to go in the new image. -``` -old m-coord: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 -new m-coord: 0 1 2 3 8 9 10 11 16 17 18 19 24 25 26 27 4 5 6 7 12 13 14 15 20 21 22 23 28 29 30 31 -``` -This permutes only the M-mode (in `A` and `C` accordingly) and brings the access of all threads to be contiguous in m-coordinates in the `A`-matrix. This is convenient when designing layouts for shared memory or registers, for example. The MMA instructions contained within the image above are now effectively interleaved in the logical m-coordinates. Of course, permutations in the N-mode and K-mode are also valid. - -To see how these `TiledMMA`s are used to partition data tensors, see the [`0x_gemm_tutorial.md`](./0x_gemm_tutorial.md). diff --git a/executor/op-mem-cuda/docs/0x_gemm_tutorial.md b/executor/op-mem-cuda/docs/0x_gemm_tutorial.md deleted file mode 100644 index 008bc528..00000000 --- a/executor/op-mem-cuda/docs/0x_gemm_tutorial.md +++ /dev/null @@ -1,538 +0,0 @@ -# CuTe dense matrix-matrix multiply tutorial - -In this section, we review -[these examples](../../../examples/cute/tutorial/), -which demonstrate a few self-contained, single-file dense matrix-matrix multiply implementations using only CuTe. - -## `sgemm_1.cu` - -The simplest of the tutorial examples covers the basics of partitioning the global memory into tiles across the CTAs (also called threadblocks in CUDA), partitioning the data tiles across the threads of each CTA, and writing a mainloop using `cute::copy` and `cute::gemm`. - -### High-level interface - -We'll start with the kernel entry point `gemm_device` at the top of the file. - -```c++ -template -__global__ static -__launch_bounds__(decltype(size(CThreadLayout{}))::value) -void -gemm_device(ProblemShape shape_MNK, CtaTiler cta_tiler, - TA const* A, AStride dA, ASmemLayout sA_layout, AThreadLayout tA, - TB const* B, BStride dB, BSmemLayout sB_layout, BThreadLayout tB, - TC * C, CStride dC, CSmemLayout , CThreadLayout tC, - Alpha alpha, Beta beta) -``` - -There are many template parameters, let's quickly review them and then go into more depth on their uses. - -* `ProblemShape`. The MxNxK problem shape of this matrix multiply. - -* `CtaTiler`. A CuTe [tiler concept](./02_layout_algebra.md#composition-tilers) that determines how to extract a tile of data from the problem shape. - -* `TA const* A`, `TB const* B`, `TC* C`. The types and pointers to the A, B, and C data, respectively. - -* `AStride`, `BStride`, `CStride`. The layout strides corresponding to the `ProblemShape` for each A, B, and C. - -* `ASmemLayout`, `BSmemLayout`, `CSmemLayout`. The layouts, if needed, of shared memory to use for staging A-data, B-data, and C-data within each CTA. - -* `AThreadLayout`, `BThreadLayout`, `CThreadLayout`. The layouts of threads to be used in partitioning each stage. - -* `Alpha alpha`, `Beta beta`. The types and values of the scalar constants to compute GEMM: `C = alpha * A * B + beta * C`. - -### The Full Tensors: Shapes, Strides, and Data - -Most GEMM interfaces list the matrices' dimensions -in the order M, N, K. CuTe also uses this convention, but packages them -into a single `IntTuple`. In this example, they are dynamic values -defined at the top of the `gemm_nt` and `gemm_tn` host functions -that invoke the device kernel. 
-```cpp - // Define shapes (dynamic) - auto M = int(m); - auto N = int(n); - auto K = int(k); - auto prob_shape = make_shape(M, N, K); // (M, N, K) -``` - -Inside the kernel, the problem shape is checked against the preconditions and then used to construct each of the full matrices. -```cpp - // Preconditions - CUTE_STATIC_ASSERT_V(rank(shape_MNK) == Int<3>{}); // (M, N, K) - - CUTE_STATIC_ASSERT_V(congruent(select<0,2>(shape_MNK), dA)); // dA strides for shape MK - CUTE_STATIC_ASSERT_V(congruent(select<1,2>(shape_MNK), dB)); // dB strides for shape NK - CUTE_STATIC_ASSERT_V(congruent(select<0,1>(shape_MNK), dC)); // dC strides for shape MN - - // Represent the full tensors - Tensor mA = make_tensor(make_gmem_ptr(A), select<0,2>(shape_MNK), dA); // (M,K) - Tensor mB = make_tensor(make_gmem_ptr(B), select<1,2>(shape_MNK), dB); // (N,K) - Tensor mC = make_tensor(make_gmem_ptr(C), select<0,1>(shape_MNK), dC); // (M,N) -``` -The appropriate modes of the `Shape` are selected to construct each of the tensors. The preconditions make sure that for every integer in the `Shape` there is a corresponding integer in the associated `Stride`. - -Note that the comment after B says `(N,K)` rather than `(K,N)`. -This means that B is treated as an NxK matrix instead of a KxN matrix as is typical within BLAS and most other matrix-matrix multiplications. -CuTe follows the convention that the semantics of matrix modes is -`(M,K)` for `A`, `(N,K)` for `B`, and `(M,N)` for `C`, which we try to record in comments everywhere. - -For each of the `(M,K)`, `(N,K)`, and `(M,N)` tensors, the `gemm_nt` and `gemm_tn` construct the strides those tensors will use. In `gemm_nt` the strides are defined as -```cpp - // Define NT strides (mixed) - auto dA = make_stride(Int<1>{}, ldA); // (dM, dK) - auto dB = make_stride(Int<1>{}, ldB); // (dN, dK) - auto dC = make_stride(Int<1>{}, ldC); // (dM, dN) -``` -and in `gemm_tn` the strides are defined as -```cpp - // Define TN strides (mixed) - auto dA = make_stride(ldA, Int<1>{}); // (dM, dK) - auto dB = make_stride(ldB, Int<1>{}); // (dN, dK) - auto dC = make_stride(Int<1>{}, ldC); // (dM, dN) -``` - -#### Aside: M-major, N-major, K-major - -We've found that the BLAS convention of using "non-transposed" (N) and "transposed" (T) flags in conjunction with the mode conventions of `MxK * KxN` to confuse the core issue of "what layout does this matrix use" and "in which mode does my matrix have a stride-1?". Indeed, the answer to those questions can always be found by inspecting the CuTe `Layout`. - -Instead of row-major or column-major (or Transposed -and Not-Transposed), we have found it much more convenient to say that a matrix is "M-major" if it is stride-1 in the M-mode, "N-major" if it is stride-1 in the N-mode, or "K-major" if it is stride-1 in the K-mode. Furthermore, knowing that matrix multiply always performs a reduction in the K-mode, it is very convenient from a software perspective to always have the K-mode in the same place and adopt the mode convention `MxK * NxK`. Implementations will always reduce over the second mode (the K mode) of both input matrices and leads to cases where implementations can treat both input matrices the same way. - -How do we translate this into the BLAS user's experience? 
- -| BLAS | A Majorness | A Layout | B Majorness | B Layout | -| --- | --- | --- | --- | --- | -| NT | M-major | `(M,K):(1,ldA)` | N-major | `(N,K):(1,ldB)` | -| TN | K-major | `(M,K):(ldA,1)` | K-major | `(N,K):(ldB,1)` | -| NN | M-major | `(M,K):(1,ldA)` | K-major | `(N,K):(ldB,1)` | -| TT | K-major | `(M,K):(ldA,1)` | N-major | `(N,K):(1,ldB)` | - -Regardless, we'll still use the BLAS "NT" and "TN" notations for high-level descriptions of kernels when it's appropriate. - -### CTA Partitioning - -Now that we have the representations of the full matrices, it's time to tile them and split up the work! - -At the highest level, the work is distributed across CTAs. In principle, each CTA's tile could come from the input tensors in many different ways. Many [CuTe `Tiler`s](./02_layout_algebra.md#composition-tilers) could be used to tile the data, but for these cases it is sufficient to simply use the shape of the desired CTA tile. -```cpp - // Define CTA tile sizes (static) - auto bM = Int<128>{}; - auto bN = Int<128>{}; - auto bK = Int< 8>{}; - auto cta_tiler = make_shape(bM, bN, bK); // (BLK_M, BLK_N, BLK_K) -``` - -Once the tiler has been defined, we can use it to tile and partition the tensors across the CTAs. - -```cpp - // Get the appropriate blocks for this threadblock - auto cta_coord = make_coord(blockIdx.x, blockIdx.y, _); // (m,n,k) - Tensor gA = local_tile(mA, cta_tiler, cta_coord, Step<_1, X,_1>{}); // (BLK_M,BLK_K,k) - Tensor gB = local_tile(mB, cta_tiler, cta_coord, Step< X,_1,_1>{}); // (BLK_N,BLK_K,k) - Tensor gC = local_tile(mC, cta_tiler, cta_coord, Step<_1,_1, X>{}); // (BLK_M,BLK_N) -``` - -First, the CTA coordinate is created. -* The `m`-coordinate of this tile is given by `blockIdx.x`. -* The `n`-coordinate of this tile is given by `blockIdx.y`. -* The `k`-coordinate of this tile is unspecified -- we want all of the tiles in `K` so the coordinate is `_`, the `Underscore` value, to keep that mode. - -Then, `local_tile` is used to remove the modes of the tiler and coord corresponding to the `X`s. That is, the `Step<_1, X,_1>` is just shorthand for -```cpp - // Use select<0,2> to use only the M- and K-modes of the tiler and coord - Tensor gA = local_tile(mA, select<0,2>(cta_tiler), select<0,2>(cta_coord)); -``` -This `local_tile` is simply shorthand for -1. apply the tiler via [`zipped_divide`](./02_layout_algebra.md#zipped-tiled-flat-divides) -```cpp -// ((BLK_M,BLK_K),(m,k)) -Tensor gA_mk = zipped_divide(mA, select<0,2>(cta_tiler)); -``` -2. apply the coord to the second mode, the "Rest" mode, to extract out the correct tiles for this CTA. -```cpp -// (BLK_M,BLK_K,k) -Tensor gA = gA_mk(make_coord(_,_), select<0,2>(cta_coord)); -``` -Because the projections of the tiler and coord are symmetric and the two steps (apply a tiler and then slice into the rest-mode to produce a partition) are so common, they are wrapped together into the projective `local_tile` interface. - -For tensor `A`, we are left with a rank-3 tensor of shape `(BLK_M,BLK_K,k)`. The first two modes are precisely the modes of the CTA tile and the last mode indexes over all of the tiles that will be reduced by this CTA. In the mainloop section below, this mode is iterated over via the `k_tile` loop. - -### SMEM tensors - -The shared memory layouts that are used to hold the tiles of data for A and B are also passed in as the parameters `ASmemLayout sA_layout` and `BSmemLayout sB_layout`. 
- -These are defined in `gemm_nt` as -```c++ - // Define the smem layouts (static) - auto sA = make_layout(make_shape(bM, bK)); // (m,k) -> smem_idx; m-major - auto sB = make_layout(make_shape(bN, bK)); // (n,k) -> smem_idx; n-major -``` -which produces simple M-major and N-major layouts. In `gemm_tn` these are -```cpp - // Define the smem layouts (static) - auto sA = make_layout(make_shape(bM,bK), LayoutRight{}); // (m,k) -> smem_idx; k-major - auto sB = make_layout(make_shape(bN,bK), LayoutRight{}); // (n,k) -> smem_idx; k-major -``` -which produces simple K-major layouts. - -As is evident, these smem layouts can be almost anything. Inside the kernel, they are checked for only two properties: the shared memory layouts are static and they are the same top-level shape as the `CtaTiler`. - -```cpp - // Preconditions - static_assert(is_static::value); - static_assert(is_static::value); - static_assert(is_static::value); - - CUTE_STATIC_ASSERT_V(size<0>(ASmemLayout{}) == size<0>(cta_tiler)); // BLK_M - CUTE_STATIC_ASSERT_V(size<0>(CSmemLayout{}) == size<0>(cta_tiler)); // BLK_M - CUTE_STATIC_ASSERT_V(size<0>(BSmemLayout{}) == size<1>(cta_tiler)); // BLK_N - CUTE_STATIC_ASSERT_V(size<1>(CSmemLayout{}) == size<1>(cta_tiler)); // BLK_N - CUTE_STATIC_ASSERT_V(size<1>(ASmemLayout{}) == size<2>(cta_tiler)); // BLK_K - CUTE_STATIC_ASSERT_V(size<1>(BSmemLayout{}) == size<2>(cta_tiler)); // BLK_K -``` - -Use of static layouts has a few advantages. -* Static layouts let us statically allocate shared memory as shown below. -* Static layouts are often more efficient and allow CuTe to dispatch to optimized implementations. -* Static layouts makes it easier to prove correctness of the algorithm and provide checks like the above -- the smem layout sizes are the same as the CTA tile sizes. - -As stated, the shared memory layouts can be anything that satisfy those conditions. Optimizing kernels like these is often performed by finding a good shared memory layout that provides good access patterns for both the writes to and the reads from shared memory. This includes the ability to vectorize reads and writes as well as avoid shared memory bank conflicts. - -With the static smem layouts, the `gemm_device` kernel can allocate the required shared memory and create the smem `Tensor`s. - -```cpp - // Shared memory buffers - __shared__ TA smemA[cosize_v]; - __shared__ TB smemB[cosize_v]; - Tensor sA = make_tensor(make_smem_ptr(smemA), sA_layout); // (BLK_M,BLK_K) - Tensor sB = make_tensor(make_smem_ptr(smemB), sB_layout); // (BLK_N,BLK_K) -``` - -Note how the shared memory allocation depends only on the data type and the layout. What's a `cosize`? Because a `Layout` is a function, we can speak of its domain and codomain. The `size` of a layout is the size of its domain and the `cosize` of a layout is the size of its codomain. If we want to allocate an array for which all the offsets produced by a layout are valid, then we can use the `cosize` of the layout as the length of the array (in units of elements). - -### Copy partitioning - -The kernel now has tiles of global memory by applying the `CtaTiler` to the full tensors and it also has tiles of shared memory by allocating appropriately. We now want to create an efficient way to copy one tile of global memory to our tile of shared memory. A trivial way to do this would be to use a single thread and copy each element. 
-```cpp -if (thread0()) { - Tensor gA0 = gA(_,_,0); // (BLK_M,BLK_K), the 0th tile - for (int i = 0; i < size(sA); ++i) { - sA(i) = gA0(i); - } -} -``` -This would work, but we have lots of threads to use inside this CTA, so let's use them! - -If we partition the two tiles of data across the threads in the CTA, then each thread can copy its own subtensor of data. There are lots of ways this partitioning could occur, however. - -The `gemm_nt` function defines two layouts of *threads* as -```c++ - // Define thread layouts (static) - auto tA = make_layout(make_shape(Int<32>{},Int<8>{})); // (m,k) -> thr_idx - auto tB = make_layout(make_shape(Int<32>{},Int<8>{})); // (n,k) -> thr_idx -``` -and the `gemm_tn` functions defines two layouts of *threads* as -```c++ - // Define thread layouts (static) - auto tA = make_layout(make_shape(Int<32>{},Int<8>{}), LayoutRight{}); // (m,k) -> thr_idx; k-major - auto tB = make_layout(make_shape(Int<32>{},Int<8>{}), LayoutRight{}); // (n,k) -> thr_idx; k-major -``` -Both cases happen to use 32x8 threads, which will be used to partition a 128x8 tile of gmem and smem data into a 4x1 subtensor for each thread. The only difference here is that `gemm_nt` uses M-major and N-major threads to match the order of data in global memory and `gemm_tn` uses K-major threads to match the order of data in global memory. - -Again, the conditions on the thread layouts are checked inside the kernel. -```cpp - static_assert(is_static::value); - static_assert(is_static::value); - - CUTE_STATIC_ASSERT_V(size(tA) == size(tB)); // NumThreads - - CUTE_STATIC_ASSERT_V(size<0>(cta_tiler) % size<0>(tA) == Int<0>{}); // BLK_M / THR_M - CUTE_STATIC_ASSERT_V(size<2>(cta_tiler) % size<1>(tA) == Int<0>{}); // BLK_K / THR_K - CUTE_STATIC_ASSERT_V(size<1>(cta_tiler) % size<0>(tB) == Int<0>{}); // BLK_N / THR_N - CUTE_STATIC_ASSERT_V(size<2>(cta_tiler) % size<1>(tB) == Int<0>{}); // BLK_K / THR_K -``` - -These thread layouts are then used to partition the global memory tensors data and shared memory tensors -```cpp - Tensor tAgA = local_partition(gA, tA, threadIdx.x); // (THR_M,THR_K,k) - Tensor tAsA = local_partition(sA, tA, threadIdx.x); // (THR_M,THR_K) - - Tensor tBgB = local_partition(gB, tB, threadIdx.x); // (THR_N,THR_K,k) - Tensor tBsB = local_partition(sB, tB, threadIdx.x); // (THR_N,THR_K) - - CUTE_STATIC_ASSERT_V(size<0>(tAgA) == size<0>(tAsA)); // THR_M - CUTE_STATIC_ASSERT_V(size<1>(tAgA) == size<1>(tAsA)); // THR_K - CUTE_STATIC_ASSERT_V(size<0>(tBgB) == size<0>(tBsB)); // THR_N - CUTE_STATIC_ASSERT_V(size<1>(tBgB) == size<1>(tBsB)); // THR_K -``` -where `local_partition` is a lot like `local_tile`, except the coordinate slices into the tile-mode (the first mode) of the `zipped_divide` rather than the rest-mode (the second mode). That is, each thread gets one element of data assigned to it per thread tile and that thread tile is repeated to cover the entire data tile. - -The naming convention `tAsA` is pretty typical across CuTe and CUTLASS. This is read as "Partitioning pattern `tA` applied to tensor `sA`". In the next section, we'll see a different partitioner applied to `sA` to produce `tCsA`. By applying the same partitioning pattern, `tA`, to tensors `sA` and `gA`, we preserve the *logical consistency* of those tensors (checked by the assertions above) where logical elements between the two tensors correspond despite any differences in their data layouts. 
When used in `cute::copy`, for example, this naming convention let's us lexically verify that the two tensors are using the same partitioning pattern. - -With the data partitioned across the threads, *every thread* can now participate in the copy by writing -```cpp -copy(tAgA(_,_,0), tAsA); -``` -because every thread owns a different subtensor of the tile that will be copied. - -### Math partitioning - -The kernel now has tiles of shared memory copied in from global memory. We now want to create an efficient way to compute and accumulate the matrix product on that tile of shared memory. A trivial way to do this would be to use a single thread and compute directly. -```cpp -if (thread0()) { - for (int m = 0; m < size<0>(gC); ++m) { - for (int n = 0; n < size<1>(gC); ++n) { - for (int k = 0; k < size<1>(sA); ++k) { - gC(m,n) += sA(m,k) * sB(n,k); - } - } - } -} -``` -This would work, but we have lots of threads to use inside this CTA, so let's use them! - -If we partition the output tile `gC` across the threads in the CTA, then each thread can compute its own subtensor. There are lots of ways this partitioning could occur, however. - -The `gemm_nt` and `gemm_tn` functions define one more layout of *threads*: -```cpp - // Define thread layouts (static) - auto tC = make_layout(make_shape(Int<16>{}, Int<16>{})); // (m,n) -> thr_idx; m-major -``` -This is a m-major 16x16 layout of threads which will be used to partition a 128x128 tile of `C`-data, resulting in each thread computing its own 8x8 subtensor of `gC`. - -Again, the conditions on the thread layouts are checked inside the kernel. -```cpp - static_assert(is_static::value); - - CUTE_STATIC_ASSERT_V(size(tC) == size(tA)); // NumThreads - - CUTE_STATIC_ASSERT_V(size<0>(cta_tiler) % size<0>(tC) == Int<0>{}); // BLK_M / THR_M - CUTE_STATIC_ASSERT_V(size<1>(cta_tiler) % size<1>(tC) == Int<0>{}); // BLK_N / THR_N -``` - -These thread layouts are then used to partition the tiles of data in global memory and shared memory -```cpp - // Partition sA (M,K) by the rows of tC - Tensor tCsA = local_partition(sA, tC, threadIdx.x, Step<_1, X>{}); // (THR_M,BLK_K) - // Partition sB (N,K) by the cols of tC - Tensor tCsB = local_partition(sB, tC, threadIdx.x, Step< X,_1>{}); // (THR_N,BLK_K) - // Partition gC (M,N) by the tile of tC - Tensor tCgC = local_partition(gC, tC, threadIdx.x, Step<_1,_1>{}); // (THR_M,THR_N) - - // Allocate the accumulators -- same shape/layout as the partitioned data - Tensor tCrC = make_tensor_like(tCgC); // (THR_M,THR_N) - - CUTE_STATIC_ASSERT_V(size<0>(tCrC) == size<0>(tCgC)); // THR_M - CUTE_STATIC_ASSERT_V(size<0>(tCrC) == size<0>(tCsA)); // THR_M - CUTE_STATIC_ASSERT_V(size<1>(tCrC) == size<1>(tCgC)); // THR_N - CUTE_STATIC_ASSERT_V(size<1>(tCrC) == size<0>(tCsB)); // THR_N - CUTE_STATIC_ASSERT_V(size<1>(tCsA) == size<1>(tCsB)); // BLK_K -``` -where we've used the same projection-style interface to avoid applying the `N`-mode of `tC` to the `(BLK_M,BLK_K)` shape of `sA` and avoid applying the `M`-mode of `tC` to the `(BLK_N,BLK_K)` shape of `sB`. - -

- (figure: tC_partitioning.png)

-This diagram shows a `tC` layout, highlights two threads in green and blue, shows the projections of the `tC` layout, and finally highlights the subtensors within `sA`, `sB`, and `gC` that `tCsA`, `tCsB`, and `tCgC` represent. - -With the data partitioned across the threads, *every thread* can now participate in the compute step by writing -```cpp -gemm(tCsA, tCsB, tCrC); -``` -because every thread owns different subtensors of the data to be computed. - -### Mainloop - -The mainloop iterates over tiles of global memory, reads those tiles into shared memory, and then performs the matrix-multiply and accumulates into the accumulators. - -```c++ -// TUTORIAL: Example of a very simple compute mainloop -// copy(.) operates on the global and shared memory via the tA|tB partitioning -// gemm(.) operates on the shared and register memory via the tC partitioning - -auto K_TILE_MAX = size<2>(tAgA); - -for (int k_tile = 0; k_tile < K_TILE_MAX; ++k_tile) -{ - // Copy gmem to smem with tA|tB thread-partitioned tensors - copy(tAgA(_,_,k_tile), tAsA); // A (THR_M,THR_K) -> (THR_M,THR_K) - copy(tBgB(_,_,k_tile), tBsB); // B (THR_N,THR_K) -> (THR_N,THR_K) - - cp_async_fence(); // Label the end of (potential) cp.async instructions - cp_async_wait<0>(); // Sync on all (potential) cp.async instructions - __syncthreads(); // Wait for all threads to write to smem - - // Compute gemm on tC thread-partitioned smem - gemm(tCsA, tCsB, tCrC); // (THR_M,THR_N) += (THR_M,BLK_K) * (THR_N,BLK_K) - __syncthreads(); // Wait for all threads to read from smem -} -``` - -We can see that `k_tile` iterates over each tile of data, the `cute::copy` is performed for the current `k_tile` using the `tA` and `tB` thread-partitioned tensors, and the `cute::gemm` is computed for that current `k_tile` using the `tC` thread-partitioned tensors. Synchronization is provided so that this kernel works on any architecture. - -## `sgemm_2.cu` - -An example that uses more complex `TiledMMA` and `TiledCopy` to perform partitioning in place of the `tA`, `tB`, and `tC` thread layouts. With this example, we try to emphasize that the shared memory layouts, the partitioning patterns, and the PTX instruction to use in each stage can be specified independently. - -### TiledCopy - -First, we can replace the `tA` partitioning and `tB` partitioning with `TiledCopy` partitioning, which provides for more complex partitioning patterns and checked dispatch to specific copy instructions. - -As a first example, lets look at the `TiledCopy` that `gemm_nt` generates. -```cpp - TiledCopy copyA = make_tiled_copy(Copy_Atom, TA>{}, // Atom: Copy TAs as if they were uint128_t - Layout>{}, // Thr layout 32x8 m-major - Layout>{}); // Val layout 4x1 m-major - print_latex(copyA); -``` -The easiest way to see what this `TiledCopy` does is to look at the partition pattern in LaTeX. -

- (figure: TiledCopyA.png)

-On the left is the source-tensor partitioning and on the right is the destination-tensor partitioning. The partition patterns are the same for this case, but there exist PTX instructions which require different patterns in the source and destination. The diagram shows that each thread reads 4x1 `TA` elements and there are 32x8 threads. The `UniversalCopy` forces the instruction to use a 128-bit copy instruction. If the partition (of `sA` or `gA` in this case) does not result in 4 `TA` elements that cannot be vectorized to a 128-bit load/store, then CuTe will statically fail with an error message to that effect. - -To use the `TiledCopy`, the kernel writes -```cpp - ThrCopy thr_copy_a = copy_a.get_slice(threadIdx.x); - Tensor tAgA = thr_copy_a.partition_S(gA); // (CPY,CPY_M,CPY_K,k) - Tensor tAsA = thr_copy_a.partition_D(sA); // (CPY,CPY_M,CPY_K) - // Allocate registers same shape/layout as partitioned data - Tensor tArA = make_fragment_like(tAsA); // (CPY,CPY_M,CPY_K) -``` -which applies the source-tensor partitioning to `gA` via `partition_S` and applies the destination-tensor partitioning to `sA` via `partition_D`. The first mode, `CPY`, of the result tensors hold all of the elements that a single instruction will consume. In this case, that mode should have size-4 since there are four `TA=float` elements in a single 128-bit `uint128_t`. - -Once the partition has been performed, we can execute the `copy` on the thread-partitioned tensors using the provided instruction in `copy_a`. -```cpp -cute::copy(copy_a, tAgA, tArA); -``` - -### TiledMMA - -Next, we can replace the `tC` partitioning with `TiledMMA` partitioning, which provides for more complex partitioning patterns and checked dispatch to specific MMA instructions. - -As a first example, lets look at the `TiledMMA` that `gemm_nt` generates. -```cpp - TiledMMA mmaC = make_tiled_mma(UniversalFMA{}, - Layout>{}); // 16x16x1 UniversalFMA - print_latex(mmaC); -``` -The easiest way to see what this `TiledMMA` does is to look at the partition pattern in LaTeX. -

- (figure: TiledMmaC.png)

-On the left is the A-tensor partitioning, on the top is the B-tensor partitioning, and in the middle is the C-tensor partitioning.Because the `UniversalFMA` is a 1x1x1 MMA instruction, a 16x16x1 tiling of them results in a 16x16x1 `TiledMMA`. Other MMA instructions will have different threads involved and have different instruction sizes. In this case, all threads will read a single element from `A`, `B`, and `C` each. - -To use the `TiledMMA`, the kernel writes -```cpp - ThrMMA thr_mma = mma.get_slice(threadIdx.x); - Tensor tCsA = thr_mma.partition_A(sA); // (MMA,MMA_M,MMA_K) - Tensor tCsB = thr_mma.partition_B(sB); // (MMA,MMA_N,MMA_K) - Tensor tCgC = thr_mma.partition_C(gC); // (MMA,MMA_M,MMA_N) - // Allocate the accumulators -- same size as the projected data - Tensor tCrC = thr_mma.make_fragment_C(tCgC); // (MMA,MMA_M,MMA_N) -``` -which applies the A-tensor partitioning to `sA` via `partition_A`, applies the B-tensor partitioning to `sB` via `partition_B`, and applies the C-tensor partitioning to `gC` via `partition_C`. The first mode, `MMA`, of the result tensors hold all of the elements that a single instruction will consume. In this case, that mode should have size-1 since `UniversalFMA` is a 1x1x1 MMA, but in general the size of the first mode can vary and not even be the same across `tCsA`, `tCsB`, and `tCgC` depending on the MMA. - -Once the partition has been performed, we can execute the `gemm` on the thread-partitioned tensors using the provided instruction in `mma`. -```cpp -cute::gemm(mma, tCsA, tCsB, tCrC); -``` - -### Other changes - -In this version, we have also updated the shared memory layouts for `gemm_tn` from K-major to -```cpp - // Define the smem layouts (static) - auto sA = make_layout(make_shape ( bM, bK), - make_stride(Int<1>{}, bM+Int<1>{})); // (m,k) -> smem_idx; padded m-major - auto sB = make_layout(make_shape ( bN, bK), - make_stride(Int<1>{}, bN+Int<1>{})); // (n,k) -> smem_idx; padded n-major -``` -which produces M-major and N-major layouts, but they are padded to avoid shared memory bank conflicts. This simply improves the access pattern to and from shared memory and no other changes in the kernel are required. - -## `sgemm_sm70.cu` - -An example that uses an optimized mainloop for Volta SM70 architectures that pipelines shared memory and register memory. - -## `sgemm_sm80.cu` - -An example that uses an optimized mainloop for Ampere SM80 architectures that explicitly pipelines shared memory using asynchronous reads from global memory. - -## Next steps - -All of the above examples assume that the CTA tile size divides the problem size so that global memory loads do no need to be predicated. The -[predication section of the tutorial](./0y_predication.md) -explains what to do if a matrix tiling -doesn't perfectly divide the matrix. - -## GETT as GEMM - -"GETT" here stands for "general(ized) tensor times tensor," a tensor contraction. - -CuTe permits matrices to have nested `Layout`s. -This means that we can fold a `Tensor` into a "matrix" by grouping modes according to their categories. - -As a result, we can implement GETT by using -our existing GEMM implementation. Included below is a launcher like `gemm_nt` that uses the same device kernel contained in `sgemm_1.cu` to compute a GETT with two m-modes. -```cpp -// Setup params for a GETT with two m-modes. -// The A and C tensors are assumed to be m0-major. -// Calls sgemm_1.cu's gemm_device<<<>>> without modification. 
-template -void -gett(int m0, int m1, int n, int k, - Alpha alpha, - TA const* A, int ldAm1, int ldAk, // m0-major - TB const* B, int ldBk, - Beta beta, - TC * C, int ldCm1, int ldCn, // m0-major - cudaStream_t stream = 0) -{ - using namespace cute; - - // Define shapes (dynamic) - auto M = make_shape(m0, m1); // (m0,m1)-multimode M - auto N = int(n); - auto K = int(k); - auto prob_shape = make_shape(M, N, K); // (M, N, K) - - // Define NT strides (mixed) - auto dA = make_stride(make_stride(Int<1>{}, ldAm1), ldAk); // (dM, dK) - auto dB = make_stride(Int<1>{}, ldB); // (dN, dK) - auto dC = make_stride(make_stride(Int<1>{}, ldCm1), ldCn); // (dM, dN) - - // Define CTA tile sizes (static) - auto bM = Shape<_64, _2>{}; // Take _64 elements from m0 and _2 elements from m1 - auto bN = Int<128>{}; - auto bK = Int< 8>{}; - auto cta_tiler = make_shape(bM, bN, bK); // (BLK_M, BLK_N, BLK_K) - - // Define the smem layouts (static) - auto sA = make_layout(make_shape(bM, bK)); // (m,k) -> smem_idx; m-major - auto sB = make_layout(make_shape(bN, bK)); // (n,k) -> smem_idx; n-major - auto sC = make_layout(make_shape(bM, bN)); // (m,n) -> smem_idx; m-major - - // Define the thread layouts (static) - auto tA = make_layout(make_shape(Int<32>{}, Int< 8>{})); // (m,k) -> thr_idx - auto tB = make_layout(make_shape(Int<32>{}, Int< 8>{})); // (n,k) -> thr_idx - auto tC = make_layout(make_shape(Int<16>{}, Int<16>{})); // (m,n) -> thr_idx - - dim3 dimBlock(size(tC)); - dim3 dimGrid(size(ceil_div(M, bM)), - size(ceil_div(N, bN))); - gemm_device<<>> - (prob_shape, cta_tiler, - A, dA, sA, tA, - B, dB, sB, tB, - C, dC, sC, tC, - alpha, beta); -} -``` -Note that the only changes are the definition of shape `M`, the definition of strides `dA` and `dC`, and the definition of the CTA Tiler `bM`. The above uses a multimodel problem shape `M = (m0,m1)` and a multimodal CTA Tiler `bM = <_64,_2>` to change which portion of the global memory tensors `A` and `C` each CTA will be responsible for computing. - -Similar examples can be found for CUTLASS 3.x kernels that are based on CuTe, such as [this Hopper GETT example](../../../examples/51_hopper_gett). diff --git a/executor/op-mem-cuda/docs/0y_predication.md b/executor/op-mem-cuda/docs/0y_predication.md deleted file mode 100644 index f764508b..00000000 --- a/executor/op-mem-cuda/docs/0y_predication.md +++ /dev/null @@ -1,217 +0,0 @@ -# Predication: What to do when tiling isn't perfect - -The [GEMM tutorial](./0x_gemm_tutorial.md) shows how -we compute a matrix-matrix multiply -by iterating over tiles of the input matrices and output matrix. -The examples all assume that the tiles fit evenly into the matrices, -with no remainder. -What do we do if this is not the case? -For example, we might want to tile a 41 x 55 matrix into 4 x 8 tiles, -but 41 / 4 is 10 remainder 1, and 55 / 8 is 6 remainder 7. -What do we do with those "leftover" parts of the matrix? - -Another way to say this, is that `logical_divide` -(CuTe's way of tiling layouts) "rounds up." -For example, if `N` is the layout (1000, 1) and `B` is the layout (128, 1), -then `logical_divide(N, B)` is the layout ((128, 8), (1, 128)). -This effectively rounds up the original shape N = 1000 -into an 128 x 8 matrix (as if N = 1024). -What about those last 24 elements, -that aren't part of the original data? - -The idiomatic CuTe way to solve this problem is through "predication." 
-Rather than trying to reason about the "remainder tiles," -CuTe instead rounds up, but only tries to access data in each tile -that are part of the matrix. -This corresponds well with how our GPUs optimize: -branches without warp divergence are relatively fast. -It also matches the usual CUDA idiom -when dividing N work items in 1-D fashion over B thread blocks: -first test if "my thread" is out of bounds before doing work. - -There are a few ways to figure out -which elements need to be predicated. -In-kernel GEMMs like to do this in the following way. - -```c++ -// Create the predicate tensor -Layout idA = make_layout(shape(A)); // e.g. 1000:1 -Layout idAB = logical_divide(idA, B); // e.g. (128,8):(1,128) - -Tensor pred = make_tensor(shape(idAB)); -for (int i = 0; i < size(pred); ++i) { - pred(i) = idAB(i) < size(A); -} - -// ... intervening code ... - -// Use the predicate tensor. c is some coordinate. -// This code would likely live inside some algorithm. -if (pred(c)) { copy(idAB(c), smem(c)); } -``` - -The general procedure is that we - -1. create an "identity" layout (`Layout idA = make_layout(shape(A))`, - in the above example) with the same shape as our original data; - -2. repeat the same tiling/partitioning/slicing (possibly rounding up) - on that identity layout (`Layout idAB = logical_divide(idA, B)`); - -3. create a "predicate tensor" by comparing the coordinates - of that reference layout with the bounds of the original layout; - and then - -4. use the predicate tensor to mask off accesses to out-of-bounds elements. - -For example, suppose that we've partitioned A and B tiles -across threads as follows. - -```c++ -Tensor tAgA = local_partition(gA, tA, thread_idx); // (THR_M,THR_K,k) -Tensor tAsA = local_partition(sA, tA, thread_idx); // (THR_M,THR_K,PIPE) - -Tensor tBgB = local_partition(gB, tB, thread_idx); // (THR_N,THR_K,k) -Tensor tBsB = local_partition(sB, tB, thread_idx); // (THR_N,THR_K,PIPE) -``` - -`tAgA` and `tBgB` partition the global A resp. B matrices over threads, -and `tAsA` and `tBsB` partition the shared memory tiles of A resp. B over threads. - -The following code creates predicate tensors -corresponding to `tAgA` and `tBgB`. -They will be computed once in the prologue. -and will be used to mask off instructions in the inner loop. - -```c++ -Tensor tApA = make_tensor(make_shape (size<0>(tAgA), size<1>(tAgA)), - make_stride( Int<1>{}, Int<0>{})); -Tensor tBpB = make_tensor(make_shape (size<0>(tBgB), size<1>(tBgB)), - make_stride( Int<1>{}, Int<0>{})); -``` - -We're only thread-parallelizing over the leftmost (row) dimension, -so we only need to predicate over the leftmost dimension. -Thus, we can make the rightmost (column) stride zero, -since we will never actually address the rightmost dimension. - -The following code creates "two-dimensional identity tensors" -that map coordinates (m,k) -> (m,k) -for the tile of data within the thread block. - -```c++ -Tensor cA = make_identity_tensor(make_shape(size<0>(sA), size<1>(sA))); // (BLK_M,BLK_K) -> (blk_m,blk_k) -Tensor cB = make_identity_tensor(make_shape(size<0>(sB), size<1>(sB))); // (BLK_N,BLK_K) -> (blk_n,blk_k) -``` - -The following lines then tile and partition -the two reference tensors -in exactly the same way the data were tiled and partitioned -into `tAsA` and `tBsB`. 
- -```c++ -Tensor tAcA = local_partition(cA, tA, thread_idx); -Tensor tBcB = local_partition(cB, tB, thread_idx); -``` - -Tiling and partitioning affect the offset and domain, -but not the codomain of the tensors, -so we're left with tensors that map `(thr_m,thr_k) -> (m,k)` -where `(thr_m,thr_k)` is this particular thread's subtensor of the tile -and `(m,k)` is the original codomain: a coordinate into the original tile. - -The unrolled loops in the code below then compare -the m- and n-coordinates of those tensors with our known maximums -to mask off elements we are not allowed to access. - -```c++ -Tensor cA = make_identity_tensor(make_shape(size<0>(sA), size<1>(sA))); // (BLK_M,BLK_K) -> (blk_m,blk_k) -Tensor tAcA = local_partition(cA, tA, thread_idx); - -Tensor cB = make_identity_tensor(make_shape(size<0>(sB), size<1>(sB))); // (BLK_N,BLK_K) -> (blk_n,blk_k) -Tensor tBcB = local_partition(cB, tB, thread_idx); - -// Populate -CUTE_UNROLL -for (int m = 0; m < size<0>(tApA); ++m) { - tApA(m,0) = get<0>(tAcA(m,0)) < m_max_coord; -} -CUTE_UNROLL -for (int n = 0; n < size<0>(tBpB); ++n) { - tBpB(n,0) = get<0>(tBcB(n,0)) < n_max_coord; -} -``` - -Those last `for` loops fill in the two predicate tensors. -In this case, we only need to predicate over the leftmost dimension, -so we only address `(m,0)` resp. `(n,0)`. - -We can then use the predicate tensors in `copy_if` -to copy only the elements for which the corresponding -predicate tensor elements are nonzero. - -```c++ -// Prefetch k_tile=0, gate these on k_residue as well -CUTE_UNROLL -for (int k = 0; k < size<1>(tAsA); ++k) { - if (get<1>(tAcA(0,k)) >= -k_residue) { // some other condition on the column index - copy_if(tApA, tAgA(_,k,0), tAsA(_,k,0)); - } -} - -CUTE_UNROLL -for (int k = 0; k < size<1>(tBsB); ++k) { - if (get<1>(tBcB(0,k)) >= -k_residue) { // some other condition on the column index - copy_if(tBpB, tBgB(_,k,0), tBsB(_,k,0)); - } -} -``` - -Here are some advantages of this "reference tensor" approach. - -1. It doesn't depend on the layout/strides of the tensor - being predicated, just the logical bounds being imposed. - -2. The partitioning stage can be anything. - -3. It naturally extends to any-dimensional predication. - -4. It's a natural generalization of a typical CUDA 1-D - parallel vector access pattern, - which computes an access index `k` - (e.g., as `blockDim.x * blockIdx.x + threadIdx.x`) - and then predicates access to the vector's `k`-th element - on whether `k` is in bounds. - -As an example of (3), the epilogue predication does exactly the same thing, - -```c++ -// Repeat with a tensor of coordinates for predication -Tensor cC = make_identity_tensor(make_shape(size<0>(gC), size<1>(gC))); -Tensor tCcC = thr_mma.partition_C(cC); - -const bool isBetaZero = (beta == 0); - -CUTE_UNROLL -for (int i = 0; i < size(tCrC); ++i) { - if (elem_less(tCcC(i), make_coord(m_max_coord,n_max_coord))) { - tCgC(i) = isBetaZero ? alpha * tCrC(i) : alpha * tCrC(i) + beta * tCgC(i); - } -} -``` - -but with the mma responsible for the tiling/partitioning `tCcC` -so that the reference subtensor matches the accumulator's subtensor. -Then, the reference subtensor is predicated against the `if` bounds -(in both m- and n-coordinates) inside the `for` loop. - -Another way to explain this is that we don't modify the tiles -to give you the "right" extents so that you never overrun. -Instead, we let you query the original coordinate -to see if that coordinate overruns. 
-This avoids all branching and variable/dynamic loop bounds -(thus maintaining load balance and synchronicity, -both very important in-kernel) in favor of predication. -It's also general enough to extend to all ranks, -all layouts of threads and data, -and all tiling/partitioning patterns. diff --git a/executor/op-mem-cuda/docs/0z_tma_tensors.md b/executor/op-mem-cuda/docs/0z_tma_tensors.md deleted file mode 100644 index 6afd895d..00000000 --- a/executor/op-mem-cuda/docs/0z_tma_tensors.md +++ /dev/null @@ -1,233 +0,0 @@ -# CuTe TMA Tensors - -Along your travels, you may find strange looking CuTe Tensors that are printed as something like -``` -ArithTuple(0,_0,_0,_0) o ((_128,_64),2,3,1):((_1@0,_1@1),_64@1,_1@2,_1@3) -``` -What is an `ArithTuple`? Are those tensor strides? What do those mean? What is this for? - -This documentation intends to answer those questions and introduce some of the more advanced features of CuTe. - -# Introduction to TMA instructions - -The Tensor Memory Accelerator (TMA) is a set of instructions for copying possibly multidimensional arrays between global and shared memory. TMA was introduced in the Hopper architecture. A single TMA instruction can copy an entire tile of data all at once. As a result, the hardware no longer needs to compute individual memory addresses and issue a separate copy instruction for each element of the tile. - -To accomplish this, the TMA instruction is given a *TMA descriptor*, which is a packed representation of a multidimensional tensor in global memory with 1, 2, 3, 4, or 5 dimensions. The TMA descriptor holds - -* the base pointer of the tensor; - -* the data type of the tensor's elements (e.g., `int`, `float`, `double`, or `half`); - -* the size of each dimension; - -* the stride within each dimension; and - -* other flags representing the smem box size, smem swizzling patterns, and out-of-bounds access behavior. - -This descriptor must be created on the host before kernel execution. -It is shared between all thread blocks that will be issuing TMA instructions. -Once inside the kernel, the TMA is executed with the following parameters: - -* pointer to the TMA descriptor; - -* pointer to the SMEM; and - -* coordinates into the GMEM tensor represented within the TMA descriptor. - -For example, the interface for TMA-store with 3-D coordinates looks like this. - -```cpp -struct SM90_TMA_STORE_3D { - CUTE_DEVICE static void - copy(void const* const desc_ptr, - void const* const smem_ptr, - int32_t const& crd0, int32_t const& crd1, int32_t const& crd2) { - // ... invoke CUDA PTX instruction ... - } -}; -``` - -We observe that the TMA instruction does not directly consume pointers to global memory. Indeed, the global memory pointer is contained in the descriptor, is considered constant, and is NOT a separate parameter to the TMA instruction. Instead, the TMA consumes TMA coordinates into the TMA's view of global memory that is defined in the TMA descriptor. - -That means that an ordinary CuTe Tensor that stores a GMEM pointer and computes offsets and new GMEM pointers is useless to the TMA. - -What do we do? - -# Building a TMA Tensor - -## Implicit CuTe Tensors - -All CuTe Tensors are compositions of Layouts and Iterators. An ordinary global memory tensor's iterator is its global memory pointer. However, a CuTe Tensor's iterator doesn't have to be a pointer; it can be any random-access iterator. - -One example of such an iterator is a *counting iterator*. -This represents a possibly infinite sequence of integers that starts at some value. 
-We call the members of this sequence *implicit integers*, -because the sequence is not explicitly stored in memory. -The iterator just stores its current value. - -We can use a counting iterator to create a tensor of implicit integers, -```cpp -Tensor A = make_tensor(counting_iterator(42), make_shape(4,5)); -print_tensor(A); -``` -which outputs -``` -counting_iter(42) o (4,5):(_1,4): - 42 46 50 54 58 - 43 47 51 55 59 - 44 48 52 56 60 - 45 49 53 57 61 -``` -This tensor maps logical coordinates to on-the-fly computed integers. Because it's still a CuTe Tensor, it can still be tiled and partitioned and sliced just like a normal tensor by accumulating integer offsets into the iterator. - -But the TMA doesn't consume pointers or integers, it consumes coordinates. Can we make a tensor of implicit TMA -coordinates for the TMA instruction to consume? If so, then we could presumably also tile and partition and slice that tensor of coordinates so that we would always have the right TMA coordinate to give to the instruction. - -## ArithTupleIterators and ArithTuples - -First, we build a `counting_iterator` equivalent for TMA coordinates. It should support - -* dereference to a TMA coordinate, and - -* offset by another TMA coordinate. - -We'll call this an `ArithmeticTupleIterator`. It stores a coordinate (a tuple of integers) that is represented as an `ArithmeticTuple`. The `ArithmeticTuple` is simply a (public subclass of) `cute::tuple` that has an overloaded `operator+` so that it can be offset by another tuple. The sum of two tuples is the tuple of the sum of the elements. - -Now similar to `counting_iterator(42)` we can create an implicit "iterator" (but without increment or other common iterator operations) over tuples that can be dereferenced and offset by other tuples -```cpp -ArithmeticTupleIterator citer_1 = make_inttuple_iter(42, Int<2>{}, Int<7>{}); -ArithmeticTupleIterator citer_2 = citer_1 + make_tuple(Int<0>{}, 5, Int<2>{}); -print(*citer_2); -``` -which outputs -``` -(42,7,_9) -``` - -A TMA Tensor can use an iterator like this to store the current TMA coordinate "offset". The "offset" here is in quotes because it's clearly not a normal 1-D array offset or pointer. - -In summary, one creates a TMA descriptor for the *whole global memory tensor*. The TMA descriptor defines a view into that tensor and the instruction takes TMA coordinates into that view. In order to generate and track those TMA coordinates, we define an implicit CuTe Tensor of TMA coordinates that can be tiled, sliced, and partitioned the exact same way as an ordinary CuTe Tensor. - -We can now track and offset TMA coordinates with this iterator, but how do we get CuTe Layouts to generate non-integer offsets? - -## Strides aren't just integers - -Ordinary tensors have a layout that maps -a logical coordinate `(i,j)` into a 1-D linear index `k`. -This mapping is the inner-product of the coordinate with the strides. - -TMA Tensors hold iterators of TMA coordinates. -Thus, a TMA Tensor's Layout must map a logical coordinate -to a TMA coordinate, rather than to a 1-D linear index. - -To do this, we can abstract what a stride is. Strides need not be integers, but rather any algebraic object that supports inner-product with the integers (the logical coordinate). The obvious choice is the `ArithmeticTuple` we used earlier since they can be added to each other, but this time additionally equipped with an `operator*` so it can also be scaled by an integer. 
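To make the abstraction concrete, here is a minimal, standalone C++17 sketch — not CuTe's `ArithmeticTuple`; the `Vec2` and `coord_dot_stride` names are purely illustrative — of a stride type that supports addition and integer scaling. Once strides behave this way, the familiar inner product of a logical coordinate with the strides produces a coordinate rather than a single integer.

```cpp
#include <cstdio>

// Illustrative only: a 2-component "stride" that can be added and scaled by an int.
struct Vec2 { int x = 0, y = 0; };
constexpr Vec2 operator+(Vec2 a, Vec2 b) { return {a.x + b.x, a.y + b.y}; }
constexpr Vec2 operator*(int s, Vec2 v)  { return {s * v.x, s * v.y}; }

// Inner product of the logical coordinate (i,j) with the strides (d0,d1).
// With int strides this would be the usual 1-D index; with Vec2 strides it is a coordinate.
constexpr Vec2 coord_dot_stride(int i, int j, Vec2 d0, Vec2 d1) { return i * d0 + j * d1; }

int main() {
  constexpr Vec2 e0{1, 0};                      // plays the role of 1@0
  constexpr Vec2 e1{0, 1};                      // plays the role of 1@1
  Vec2 a = coord_dot_stride(2, 3, e0, e1);      // strides (1@0,1@1) -> (2,3)
  Vec2 b = coord_dot_stride(2, 3, e1, e0);      // strides (1@1,1@0) -> (3,2)
  std::printf("(%d,%d) (%d,%d)\n", a.x, a.y, b.x, b.y);
}
```

Swapping the two strides reverses the produced coordinate, which is exactly the behavior described for the `(1@1,1@0)` strides below.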
- -### Aside: Integer-module strides - -A group of objects that support addition between elements and product between elements and integers is called an integer-module. - -Formally, an integer-module is an abelian group `(M,+)` equipped with `Z*M -> M`, where `Z` are the integers. That is, an integer-module `M` is -a group that supports inner products with the integers. -The integers are an integer-module. -Rank-R tuples of integers are an integer-module. - -In principle, layout strides may be any integer-module. - -### Basis elements - -CuTe's basis elements live in the header file `cute/numeric/arithmetic_tuple.hpp`. -To make it easy to create `ArithmeticTuple`s that can be used as strides, CuTe defines normalized basis elements using the `E` type alias. "Normalized" means that the scaling factor of the basis element is the compile-time integer 1. - -| C++ object | Description | String representation | -| --- | --- | --- | -| `E<>{}` | `1` | `1` | -| `E<0>{}` | `(1,0,...)` | `1@0` | -| `E<1>{}` | `(0,1,0,...)` | `1@1` | -| `E<0,1>{}` | `((0,1,0,...),0,...)` | `1@1@0` | -| `E<1,0>{}` | `(0,(1,0,...),0,...)` | `1@0@1` | - -The "description" column in the above table -interprets each basis element as an infinite tuple of integers, -where all the tuple's entries not specified by the element's type are zero. -We count tuple entries from left to right, starting with zero. -For example, `E<1>{}` has a 1 in position 1: `(0,1,0,...)`. -`E<3>{}` has a 1 in position 3: `(0,0,0,1,0,...)`. - -Basis elements can be *nested*. -For instance, in the above table, `E<0,1>{}` means that -in position 0 there is a `E<1>{}`: `((0,1,0,...),0,...)`. - -Basis elements can be *scaled*. -That is, they can be multiplied by an integer *scaling factor*. -For example, in `5*E<1>{}`, the scaling factor is `5`. -`5*E<1>{}` prints as `5@1` and means `(0,5,0,...)`. -The scaling factor commutes through any nesting. -For instance, `5*E<0,1>{}` prints as `5@1@0` -and means `((0,5,0,...),0,...)`. - -Basis elements can also be added together, -as long as their hierarchical structures are compatible. -For example, `3*E<0>{} + 4*E<1>{}` results in `(3,4,0,...)`. -Intuitively, "compatible" means that -the nested structure of the two basis elements -matches well enough to add the two elements together. - -### Linear combinations of strides - -Layouts work by taking the inner product -of the natural coordinate with their strides. -For strides made of integer elements, e.g., `(1,100)`, -the inner product of the input coordinate `(i,j)` -and the stride is `i + 100j`. -Offsetting an "ordinary" tensor's pointer and this index -gives the pointer to the tensor element at `(i,j)`. - -For strides of basis elements, we still compute the inner product of the natural coordinate with the strides. -For example, if the stride is `(1@0,1@1)`, -then the inner product of the input coordinate `(i,j)` -with the strides is `i@0 + j@1 = (i,j)`. -That translates into the (TMA) coordinate `(i,j)`. -If we wanted to reverse the coordinates, -then we could use `(1@1,1@0)` as the stride. -Evaluating the layout would give `i@1 + j@0 = (j,i)`. - -A linear combination of basis elements -can be interpreted as a possibly multidimensional and hierarchical coordinate. -For instance, `2*2@1@0 + 3*1@1 + 4*5@1 + 7*1@0@0` -means `((0,4,...),0,...) + (0,3,0,...) + (0,20,0,...) + ((7,...),...) = ((7,4,...),23,...)` -and can be interpreted as the coordinate `((7,4),23)`. - -Thus, linear combinations of these strides can be used to generate TMA coordinates. 
-These coordinates, in turn, can be used to offset TMA coordinate iterators. - -## Application to TMA Tensors - -Now we can build CuTe Tensors like the one seen in the introduction. - -```cpp -Tensor a = make_tensor(make_inttuple_iter(0,0), - make_shape ( 4, 5), - make_stride(E<0>{}, E<1>{})); -print_tensor(a); - -Tensor b = make_tensor(make_inttuple_iter(0,0), - make_shape ( 4, 5), - make_stride(E<1>{}, E<0>{})); -print_tensor(b); -``` -prints -``` -ArithTuple(0,0) o (4,5):(_1@0,_1@1): - (0,0) (0,1) (0,2) (0,3) (0,4) - (1,0) (1,1) (1,2) (1,3) (1,4) - (2,0) (2,1) (2,2) (2,3) (2,4) - (3,0) (3,1) (3,2) (3,3) (3,4) - -ArithTuple(0,0) o (4,5):(_1@1,_1@0): - (0,0) (1,0) (2,0) (3,0) (4,0) - (0,1) (1,1) (2,1) (3,1) (4,1) - (0,2) (1,2) (2,2) (3,2) (4,2) - (0,3) (1,3) (2,3) (3,3) (4,3) -``` - - diff --git a/executor/op-mem-mps/.gitignore b/executor/op-mem-mps/.gitignore new file mode 100644 index 00000000..93ad0a0e --- /dev/null +++ b/executor/op-mem-mps/.gitignore @@ -0,0 +1,5 @@ +build/ +.DS_Store +*.xcuserdata/ +*.xcworkspace/ +*.xcodeproj/ diff --git a/executor/op-mem-mps/CMakeLists.txt b/executor/op-mem-mps/CMakeLists.txt new file mode 100644 index 00000000..9b361f5a --- /dev/null +++ b/executor/op-mem-mps/CMakeLists.txt @@ -0,0 +1,102 @@ +cmake_minimum_required(VERSION 3.15...3.29) +project(deepx-executor-mps LANGUAGES CXX OBJCXX) + +# 设置 C++ 标准 +set(CMAKE_CXX_STANDARD 17) +set(CMAKE_CXX_STANDARD_REQUIRED True) +set(CMAKE_BUILD_TYPE Debug) + +# 仅支持 macOS +if(NOT APPLE) + message(FATAL_ERROR "op-mem-mps 仅支持 macOS") +endif() + +# 包含头文件目录 +include_directories(src) + +add_subdirectory(../cpp-common common) + +# 源文件 +file(GLOB_RECURSE DEEPX_SOURCES "src/deepx/*.cpp" "src/deepx/*.mm" "src/deepx/*.hpp" "src/deepx/*.h") +file(GLOB_RECURSE CLIENT_SOURCES "src/client/*.cpp" "src/client/*.mm") + +find_library(METAL_FRAMEWORK Metal REQUIRED) +find_library(MPS_FRAMEWORK MetalPerformanceShaders REQUIRED) +find_library(FOUNDATION_FRAMEWORK Foundation REQUIRED) + +find_package(yaml-cpp REQUIRED) +set(YAMLCPP_LIB "") +if (TARGET yaml-cpp::yaml-cpp) + set(YAMLCPP_LIB yaml-cpp::yaml-cpp) +else() + set(YAMLCPP_LIB yaml-cpp) +endif() + +add_library(deepx_mps SHARED + ${DEEPX_SOURCES} +) + +# --- Offline compile Metal shaders into default.metallib --- +set(METAL_SRC_DIR ${CMAKE_CURRENT_SOURCE_DIR}/src/deepx/tensorfunc) +set(METAL_SOURCES + ${METAL_SRC_DIR}/elementwise_miaobyte.metal +) + +set(METAL_AIR_DIR ${CMAKE_CURRENT_BINARY_DIR}/metal_air) +set(METAL_LIB ${CMAKE_CURRENT_BINARY_DIR}/default.metallib) + +option(DEEPX_MPS_OFFLINE_METAL "Build .metallib at build time (requires Xcode/metal tools)" ON) + +if (DEEPX_MPS_OFFLINE_METAL) + execute_process( + COMMAND xcrun -sdk macosx -find metal + OUTPUT_VARIABLE METALC + OUTPUT_STRIP_TRAILING_WHITESPACE + RESULT_VARIABLE METALC_RV + ERROR_QUIET + ) + execute_process( + COMMAND xcrun -sdk macosx -find metallib + OUTPUT_VARIABLE METALLIB + OUTPUT_STRIP_TRAILING_WHITESPACE + RESULT_VARIABLE METALLIB_RV + ERROR_QUIET + ) + + if (METALC_RV EQUAL 0 AND METALLIB_RV EQUAL 0) + add_custom_command( + OUTPUT ${METAL_LIB} + COMMAND ${CMAKE_COMMAND} -E make_directory ${METAL_AIR_DIR} + COMMAND ${METALC} -c ${METAL_SRC_DIR}/elementwise_miaobyte.metal -o ${METAL_AIR_DIR}/elementwise_miaobyte.air + COMMAND ${METALLIB} ${METAL_AIR_DIR}/elementwise_miaobyte.air -o ${METAL_LIB} + DEPENDS ${METAL_SOURCES} + VERBATIM + ) + + add_custom_target(deepx_metal_kernels ALL DEPENDS ${METAL_LIB}) + add_dependencies(deepx_mps deepx_metal_kernels) + else() + message(WARNING "Metal offline tools not 
found (xcrun metal/metallib). Install Xcode or CLT, or set -DDEEPX_MPS_OFFLINE_METAL=OFF.") + endif() +endif() +# ---------------------------------------------------------- + +target_link_libraries(deepx_mps + PUBLIC + deepx_common + ${YAMLCPP_LIB} + ${METAL_FRAMEWORK} + ${MPS_FRAMEWORK} + ${FOUNDATION_FRAMEWORK} +) + +add_executable(${PROJECT_NAME} ${CLIENT_SOURCES}) + +target_link_libraries(${PROJECT_NAME} + PRIVATE + deepx_mps +) + +# 测试 +add_subdirectory(test/tensorfunc) + diff --git a/executor/op-mem-mps/README.md b/executor/op-mem-mps/README.md new file mode 100644 index 00000000..39df974a --- /dev/null +++ b/executor/op-mem-mps/README.md @@ -0,0 +1,21 @@ +# op-mem-mps + +基于 Apple Metal / MPS 的 DeepX 执行器原型(macOS)。 + +## 依赖 + +- macOS +- Xcode Command Line Tools +- CMake 3.15+ + +## 构建 + +```bash +./build.sh +``` + +## 运行 + +```bash +./build/deepx-executor-mps +``` diff --git a/executor/op-mem-mps/build.sh b/executor/op-mem-mps/build.sh new file mode 100755 index 00000000..415fdc3a --- /dev/null +++ b/executor/op-mem-mps/build.sh @@ -0,0 +1,8 @@ +#!/usr/bin/env bash +set -euo pipefail + +mkdir -p build +cd build +rm -rf ./* +cmake .. +cmake --build . -j$(sysctl -n hw.ncpu) diff --git a/executor/op-mem-mps/src/client/main.mm b/executor/op-mem-mps/src/client/main.mm new file mode 100644 index 00000000..2a16c768 --- /dev/null +++ b/executor/op-mem-mps/src/client/main.mm @@ -0,0 +1,15 @@ +#include +#include "deepx/mps_context.hpp" +#include "deepx/mps_device.hpp" + +int main() +{ + deepx::mps::MpsContext ctx; + auto info = deepx::mps::get_default_device_info(); + + std::cout << "MPS device: " << info.name + << " (available=" << (info.supports_mps ? "true" : "false") << ")\n"; + std::cout << "Context valid: " << (ctx.is_valid() ? "true" : "false") << "\n"; + + return ctx.is_valid() ? 
0 : 1; +} diff --git a/executor/op-mem-mps/src/deepx/dtype_mps.hpp b/executor/op-mem-mps/src/deepx/dtype_mps.hpp new file mode 100644 index 00000000..a10858a0 --- /dev/null +++ b/executor/op-mem-mps/src/deepx/dtype_mps.hpp @@ -0,0 +1,32 @@ + +#include + +#include "deepx/dtype.hpp" + + +namespace deepx +{ + // Map C++ scalar type -> Precision (used by tensor constructors and tensorlife helpers) + template + inline constexpr Precision precision() + { + if constexpr (std::is_same_v) + return Precision::Float64; + else if constexpr (std::is_same_v) + return Precision::Float32; + else if constexpr (std::is_same_v) + return Precision::Int64; + else if constexpr (std::is_same_v) + return Precision::Int32; + else if constexpr (std::is_same_v) + return Precision::Int16; + else if constexpr (std::is_same_v) + return Precision::Int8; + else if constexpr (std::is_same_v) + return Precision::Bool; + else if constexpr (std::is_same_v) + return Precision::String; + else + return Precision::Any; + } +} \ No newline at end of file diff --git a/executor/op-mem-mps/src/deepx/mps_context.hpp b/executor/op-mem-mps/src/deepx/mps_context.hpp new file mode 100644 index 00000000..8f5d6e70 --- /dev/null +++ b/executor/op-mem-mps/src/deepx/mps_context.hpp @@ -0,0 +1,34 @@ +#pragma once + +#include + +#ifdef __OBJC__ +#import +@protocol MTLDevice; +@protocol MTLCommandQueue; +#endif + +namespace deepx::mps +{ +class MpsContext +{ +public: + MpsContext(); + bool is_valid() const; + std::string device_name() const; + +#ifdef __OBJC__ + id device() const; + id command_queue() const; +#endif + +private: +#ifdef __OBJC__ + id device_{nil}; + id command_queue_{nil}; +#else + void *device_{nullptr}; + void *command_queue_{nullptr}; +#endif +}; +} diff --git a/executor/op-mem-mps/src/deepx/mps_context.mm b/executor/op-mem-mps/src/deepx/mps_context.mm new file mode 100644 index 00000000..be4a1484 --- /dev/null +++ b/executor/op-mem-mps/src/deepx/mps_context.mm @@ -0,0 +1,40 @@ +#import +#import + +#include "deepx/mps_context.hpp" + +namespace deepx::mps +{ +MpsContext::MpsContext() +{ + device_ = MTLCreateSystemDefaultDevice(); + if (device_) + { + command_queue_ = [device_ newCommandQueue]; + } +} + +bool MpsContext::is_valid() const +{ + return device_ != nil; +} + +std::string MpsContext::device_name() const +{ + if (!device_) + { + return "none"; + } + return std::string([[device_ name] UTF8String]); +} + +id MpsContext::device() const +{ + return device_; +} + +id MpsContext::command_queue() const +{ + return command_queue_; +} +} diff --git a/executor/op-mem-mps/src/deepx/mps_device.hpp b/executor/op-mem-mps/src/deepx/mps_device.hpp new file mode 100644 index 00000000..1235fbda --- /dev/null +++ b/executor/op-mem-mps/src/deepx/mps_device.hpp @@ -0,0 +1,14 @@ +#pragma once + +#include + +namespace deepx::mps +{ +struct DeviceInfo +{ + std::string name; + bool supports_mps{false}; +}; + +DeviceInfo get_default_device_info(); +} diff --git a/executor/op-mem-mps/src/deepx/mps_device.mm b/executor/op-mem-mps/src/deepx/mps_device.mm new file mode 100644 index 00000000..d67972d1 --- /dev/null +++ b/executor/op-mem-mps/src/deepx/mps_device.mm @@ -0,0 +1,25 @@ +#import +#import +#import + +#include "deepx/mps_device.hpp" + +namespace deepx::mps +{ +DeviceInfo get_default_device_info() +{ + id device = MTLCreateSystemDefaultDevice(); + DeviceInfo info; + + if (!device) + { + info.name = "none"; + info.supports_mps = false; + return info; + } + + info.name = std::string([[device name] UTF8String]); + info.supports_mps = true; + 
return info;
+}
+}
diff --git a/executor/op-mem-mps/src/deepx/tensorfunc/elementwise_common.hpp b/executor/op-mem-mps/src/deepx/tensorfunc/elementwise_common.hpp
new file mode 100644
index 00000000..ce03f0b8
--- /dev/null
+++ b/executor/op-mem-mps/src/deepx/tensorfunc/elementwise_common.hpp
@@ -0,0 +1,94 @@
+#ifndef DEEPX_TENSORFUNC_ELEMENTWISE_COMMON_HPP
+#define DEEPX_TENSORFUNC_ELEMENTWISE_COMMON_HPP
+
+#if defined(__APPLE__)
+    #include <TargetConditionals.h>
+#endif
+
+#include <cstddef>
+#include <cstdint>
+#include <stdexcept>
+
+#include "deepx/tensor.hpp"
+#include "deepx/tensorfunc/mps_common.hpp"
+
+namespace deepx::tensorfunc::detail
+{
+    template <typename T>
+    inline void assert_same_shape(const Tensor<T> &A, const Tensor<T> &B, const Tensor<T> &C)
+    {
+        if (A.shape.size != B.shape.size || A.shape.size != C.shape.size ||
+            A.shape.shape != B.shape.shape || A.shape.shape != C.shape.shape)
+        {
+            throw std::invalid_argument("shape mismatch");
+        }
+    }
+
+    template <typename T>
+    inline void add_cpu(const Tensor<T> &A, const Tensor<T> &B, Tensor<T> &C)
+    {
+        for (int64_t i = 0; i < A.shape.size; ++i)
+        {
+            C.data[i] = A.data[i] + B.data[i];
+        }
+    }
+}
+
+namespace deepx::mps::kernels
+{
+#if defined(__APPLE__) && TARGET_OS_OSX && defined(__OBJC__)
+    inline deepx::mps::common::MetalKernelRuntime &elementwise_runtime()
+    {
+        static deepx::mps::common::MetalKernelRuntime rt("elementwise_miaobyte.metal");
+        return rt;
+    }
+
+    inline bool add_f32(const float *a, const float *b, float *c, int64_t n)
+    {
+        return elementwise_runtime().dispatch_binary_1d("add_f32", a, b, c, static_cast<uint32_t>(n), sizeof(float));
+    }
+
+#if defined(__FLT16_MANT_DIG__)
+    inline bool add_f16(const _Float16 *a, const _Float16 *b, _Float16 *c, int64_t n)
+    {
+        return elementwise_runtime().dispatch_binary_1d("add_f16", a, b, c, static_cast<uint32_t>(n), 2);
+    }
+#endif
+
+    inline bool add_i8(const int8_t *a, const int8_t *b, int8_t *c, int64_t n)
+    {
+        return elementwise_runtime().dispatch_binary_1d("add_i8", a, b, c, static_cast<uint32_t>(n), sizeof(int8_t));
+    }
+
+    inline bool add_i16(const int16_t *a, const int16_t *b, int16_t *c, int64_t n)
+    {
+        return elementwise_runtime().dispatch_binary_1d("add_i16", a, b, c, static_cast<uint32_t>(n), sizeof(int16_t));
+    }
+
+    inline bool add_i32(const int32_t *a, const int32_t *b, int32_t *c, int64_t n)
+    {
+        return elementwise_runtime().dispatch_binary_1d("add_i32", a, b, c, static_cast<uint32_t>(n), sizeof(int32_t));
+    }
+
+    inline bool add_i64(const int64_t *a, const int64_t *b, int64_t *c, int64_t n)
+    {
+        return elementwise_runtime().dispatch_binary_1d("add_i64", a, b, c, static_cast<uint32_t>(n), sizeof(int64_t));
+    }
+
+#else
+
+    // Non-Apple or non-Objective-C++ translation units get stubs that return false,
+    // so callers fall back to the CPU implementation.
+    inline bool add_f32(const float *, const float *, float *, int64_t) { return false; }
+
+#if defined(__FLT16_MANT_DIG__)
+    inline bool add_f16(const _Float16 *, const _Float16 *, _Float16 *, int64_t) { return false; }
+#endif
+
+    inline bool add_i8(const int8_t *, const int8_t *, int8_t *, int64_t) { return false; }
+    inline bool add_i16(const int16_t *, const int16_t *, int16_t *, int64_t) { return false; }
+    inline bool add_i32(const int32_t *, const int32_t *, int32_t *, int64_t) { return false; }
+    inline bool add_i64(const int64_t *, const int64_t *, int64_t *, int64_t) { return false; }
+
+#endif
+} // namespace deepx::mps::kernels
+
+#endif // DEEPX_TENSORFUNC_ELEMENTWISE_COMMON_HPP
diff --git a/executor/op-mem-mps/src/deepx/tensorfunc/elementwise_miaobyte.hpp b/executor/op-mem-mps/src/deepx/tensorfunc/elementwise_miaobyte.hpp
new file mode 100644
index 00000000..ecc2057c
--- /dev/null
+++ b/executor/op-mem-mps/src/deepx/tensorfunc/elementwise_miaobyte.hpp
@@ -0,0 +1,62 @@
+#ifndef DEEPX_TENSORFUNC_ELEMENTWISE_MIAOBYTE_HPP
+#define DEEPX_TENSORFUNC_ELEMENTWISE_MIAOBYTE_HPP
+
+#include <cstdint>
+#include <type_traits>
+
+#include "deepx/tensor.hpp"
+#include "deepx/tensorfunc/authors.hpp"
+#include "deepx/tensorfunc/elementwise_common.hpp"
+#include "deepx/tensorfunc/elementwise.hpp"
+
+namespace deepx::tensorfunc
+{
+    template <typename T>
+    struct addDispatcher<miaobyte, T>
+    {
+        static void add(const Tensor<T> &A, const Tensor<T> &B, Tensor<T> &C)
+        {
+            detail::assert_same_shape(A, B, C);
+
+#if defined(__APPLE__) && TARGET_OS_OSX
+            // Try the Metal path for supported dtypes. Current tensors are host-backed,
+            // so this does staging copies (correctness first). If Metal is unavailable,
+            // fall back to the CPU implementation below.
+            bool ok = false;
+            if constexpr (std::is_same_v<T, float>)
+            {
+                ok = deepx::mps::kernels::add_f32(A.data, B.data, C.data, A.shape.size);
+            }
+#if defined(__FLT16_MANT_DIG__)
+            else if constexpr (std::is_same_v<T, _Float16>)
+            {
+                ok = deepx::mps::kernels::add_f16(A.data, B.data, C.data, A.shape.size);
+            }
+#endif
+            else if constexpr (std::is_same_v<T, int8_t>)
+            {
+                ok = deepx::mps::kernels::add_i8(A.data, B.data, C.data, A.shape.size);
+            }
+            else if constexpr (std::is_same_v<T, int16_t>)
+            {
+                ok = deepx::mps::kernels::add_i16(A.data, B.data, C.data, A.shape.size);
+            }
+            else if constexpr (std::is_same_v<T, int32_t>)
+            {
+                ok = deepx::mps::kernels::add_i32(A.data, B.data, C.data, A.shape.size);
+            }
+            else if constexpr (std::is_same_v<T, int64_t>)
+            {
+                ok = deepx::mps::kernels::add_i64(A.data, B.data, C.data, A.shape.size);
+            }
+
+            if (ok)
+            {
+                return;
+            }
+#endif
+            detail::add_cpu(A, B, C);
+        }
+    };
+}
+
+#endif // DEEPX_TENSORFUNC_ELEMENTWISE_MIAOBYTE_HPP
diff --git a/executor/op-mem-mps/src/deepx/tensorfunc/elementwise_miaobyte.metal b/executor/op-mem-mps/src/deepx/tensorfunc/elementwise_miaobyte.metal
new file mode 100644
index 00000000..bee51557
--- /dev/null
+++ b/executor/op-mem-mps/src/deepx/tensorfunc/elementwise_miaobyte.metal
@@ -0,0 +1,58 @@
+#include <metal_stdlib>
+using namespace metal;
+
+// miaobyte elementwise add kernels (specialized per dtype)
+
+kernel void add_f32(device const float* A [[buffer(0)]],
+                    device const float* B [[buffer(1)]],
+                    device float* C [[buffer(2)]],
+                    constant uint& n [[buffer(3)]],
+                    uint gid [[thread_position_in_grid]])
+{
+    if (gid < n) { C[gid] = A[gid] + B[gid]; }
+}
+
+kernel void add_f16(device const half* A [[buffer(0)]],
+                    device const half* B [[buffer(1)]],
+                    device half* C [[buffer(2)]],
+                    constant uint& n [[buffer(3)]],
+                    uint gid [[thread_position_in_grid]])
+{
+    if (gid < n) { C[gid] = A[gid] + B[gid]; }
+}
+
+kernel void add_i8(device const char* A [[buffer(0)]],
+                   device const char* B [[buffer(1)]],
+                   device char* C [[buffer(2)]],
+                   constant uint& n [[buffer(3)]],
+                   uint gid [[thread_position_in_grid]])
+{
+    if (gid < n) { C[gid] = (char)(A[gid] + B[gid]); }
+}
+
+kernel void add_i16(device const short* A [[buffer(0)]],
+                    device const short* B [[buffer(1)]],
+                    device short* C [[buffer(2)]],
+                    constant uint& n [[buffer(3)]],
+                    uint gid [[thread_position_in_grid]])
+{
+    if (gid < n) { C[gid] = (short)(A[gid] + B[gid]); }
+}
+
+kernel void add_i32(device const int* A [[buffer(0)]],
+                    device const int* B [[buffer(1)]],
+                    device int* C [[buffer(2)]],
+                    constant uint& n [[buffer(3)]],
+                    uint gid [[thread_position_in_grid]])
+{
+    if (gid < n) { C[gid] = A[gid] + B[gid]; }
+}
+
+kernel void add_i64(device const long* A [[buffer(0)]],
+                    device const long* B [[buffer(1)]],
+                    device long* C [[buffer(2)]],
+                    constant uint& n [[buffer(3)]],
+                    uint gid [[thread_position_in_grid]])
+{
+    if (gid < n) { C[gid] = A[gid] + B[gid]; }
+}
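
> 编辑者注 / editor's note: the dispatcher in `elementwise_miaobyte.hpp` tries the Metal kernels above and falls back to `detail::add_cpu`, so the two paths can be cross-checked directly. Below is a minimal smoke-test sketch (the file name is hypothetical; it assumes the repo's `src/` include path as in the test CMakeLists further below, and that the file is compiled as Objective-C++ so the Metal path is active — otherwise the stub returns false and only the CPU path runs).

```cpp
// hypothetical test/tensorfunc/0_add_f32_smoke.mm
#include <cstdio>
#include <vector>

#include "deepx/tensorfunc/elementwise_common.hpp"

int main()
{
    const int64_t n = 1024;
    std::vector<float> a(n, 1.25f), b(n, 2.5f), c(n, 0.0f);

    // Metal path; the stub in elementwise_common.hpp returns false when unavailable.
    const bool used_metal = deepx::mps::kernels::add_f32(a.data(), b.data(), c.data(), n);
    if (!used_metal)
    {
        for (int64_t i = 0; i < n; ++i) c[i] = a[i] + b[i]; // CPU fallback, same as add_cpu
    }

    // Cross-check against the plain CPU result.
    int64_t mismatches = 0;
    for (int64_t i = 0; i < n; ++i)
    {
        if (c[i] != a[i] + b[i]) ++mismatches;
    }
    std::printf("metal=%d mismatches=%lld\n", used_metal ? 1 : 0, static_cast<long long>(mismatches));
    return mismatches == 0 ? 0 : 1;
}
```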
diff --git a/executor/op-mem-mps/src/deepx/tensorfunc/init_miaobyte.hpp b/executor/op-mem-mps/src/deepx/tensorfunc/init_miaobyte.hpp
new file mode 100644
index 00000000..66a1345f
--- /dev/null
+++ b/executor/op-mem-mps/src/deepx/tensorfunc/init_miaobyte.hpp
@@ -0,0 +1,23 @@
+#ifndef DEEPX_TENSORFUNC_INIT_MIAOBYTE_HPP
+#define DEEPX_TENSORFUNC_INIT_MIAOBYTE_HPP
+
+#include <algorithm>
+
+#include "deepx/tensor.hpp"
+#include "deepx/tensorfunc/init.hpp"
+#include "deepx/tensorfunc/authors.hpp"
+
+namespace deepx::tensorfunc
+{
+    // fill/constant
+    template <typename T>
+    struct constantDispatcher<miaobyte, T>
+    {
+        static void constant(Tensor<T> &tensor, const T value)
+        {
+            std::fill(tensor.data, tensor.data + tensor.shape.size, value);
+        }
+    };
+}
+
+#endif // DEEPX_TENSORFUNC_INIT_MIAOBYTE_HPP
diff --git a/executor/op-mem-mps/src/deepx/tensorfunc/io_miaobyte.hpp b/executor/op-mem-mps/src/deepx/tensorfunc/io_miaobyte.hpp
new file mode 100644
index 00000000..f5324b5e
--- /dev/null
+++ b/executor/op-mem-mps/src/deepx/tensorfunc/io_miaobyte.hpp
@@ -0,0 +1,63 @@
+#ifndef DEEPX_TENSORFUNC_IO_MIAOBYTE_HPP
+#define DEEPX_TENSORFUNC_IO_MIAOBYTE_HPP
+
+#include <memory>
+#include <stdexcept>
+#include <string>
+#include <utility>
+
+#include "deepx/tensor.hpp"
+#include "stdutil/vector.hpp"
+#include "stdutil/print.hpp"
+#include "stdutil/fs.hpp"
+#include "deepx/tensorfunc/authors.hpp"
+#include "deepx/tensorfunc/io.hpp"
+#include "deepx/tensorfunc/tensorlife_miaobyte.hpp"
+
+namespace deepx::tensorfunc
+{
+    // generic specialization
+    template <typename T>
+    struct printDispatcher<miaobyte, T>
+    {
+        static void print(const Tensor<T> &t, const std::string &f = "")
+        {
+            stdutil::print(t.shape.shape, t.data, t.shape.dtype, f);
+        }
+    };
+
+    // full specialization for void
+    template <>
+    struct printDispatcher<miaobyte, void>
+    {
+        static void print(const Tensor<void> &t, const std::string &f = "")
+        {
+            stdutil::print(t.shape.shape, t.data, t.shape.dtype, f);
+        }
+    };
+
+    // load
+    template <typename T>
+    std::pair<std::string, std::shared_ptr<Tensor<T>>> load(const std::string &path)
+    {
+        // load the shape
+        std::pair<std::string, Shape> shape_name = Shape::loadShape(path);
+        Shape shape = shape_name.second;
+        std::string tensor_name = shape_name.first;
+
+        // check that T matches shape.dtype
+        if (shape.dtype != precision<T>())
+        {
+            throw std::runtime_error("load<" + precision_str(precision<T>()) + "> mismatch: file dtype is " + precision_str(shape.dtype));
+        }
+
+        std::shared_ptr<Tensor<T>> tensor = std::make_shared<Tensor<T>>(New<T>(shape.shape));
+        tensor->loader(path + ".data", tensor->data, tensor->shape.size);
+        return std::make_pair(tensor_name, tensor);
+    }
+
+}
+#endif // DEEPX_TENSORFUNC_IO_MIAOBYTE_HPP
\ No newline at end of file
diff --git a/executor/op-mem-mps/src/deepx/tensorfunc/mps_common.hpp b/executor/op-mem-mps/src/deepx/tensorfunc/mps_common.hpp
new file mode 100644
index 00000000..43a2148d
--- /dev/null
+++ b/executor/op-mem-mps/src/deepx/tensorfunc/mps_common.hpp
@@ -0,0 +1,192 @@
+#ifndef DEEPX_TENSORFUNC_MPS_COMMON_HPP
+#define DEEPX_TENSORFUNC_MPS_COMMON_HPP
+
+#if defined(__APPLE__)
+    #include <TargetConditionals.h>
+#endif
+
+#include <cstddef>
+#include <cstdint>
+#include <cstring>
+#include <memory>
+#include <string>
+#include <unordered_map>
+
+#if defined(__APPLE__) && TARGET_OS_OSX
+    #if defined(__OBJC__)
+        #import <Foundation/Foundation.h>
+        #import <Metal/Metal.h>
+    #endif
+#endif
+
+namespace deepx::mps::common
+{
+#if defined(__APPLE__) && TARGET_OS_OSX && defined(__OBJC__)
+    class MetalKernelRuntime final
+    {
+    public:
+        MetalKernelRuntime(const char *metal_source_basename)
+            : metal_source_basename_(metal_source_basename ? metal_source_basename : "")
+        {
+            device_ = MTLCreateSystemDefaultDevice();
+            queue_ = device_ ? [device_ newCommandQueue] : nil;
+        }
+
+        bool valid() const { return device_ != nil && queue_ != nil; }
+
+        bool dispatch_binary_1d(const char *kernel_fn,
+                                const void *a,
+                                const void *b,
+                                void *c,
+                                uint32_t n,
+                                size_t elem_bytes)
+        {
+            if (!valid() || !kernel_fn)
+            {
+                return false;
+            }
+
+            @autoreleasepool
+            {
+                NSError *error = nil;
+                id<MTLComputePipelineState> pso = pipeline(kernel_fn, &error);
+                if (!pso)
+                {
+                    return false;
+                }
+
+                const size_t bytes = static_cast<size_t>(n) * elem_bytes;
+                id<MTLBuffer> bufA = [device_ newBufferWithBytes:a length:bytes options:MTLResourceStorageModeShared];
+                id<MTLBuffer> bufB = [device_ newBufferWithBytes:b length:bytes options:MTLResourceStorageModeShared];
+                id<MTLBuffer> bufC = [device_ newBufferWithLength:bytes options:MTLResourceStorageModeShared];
+                id<MTLBuffer> bufN = [device_ newBufferWithBytes:&n length:sizeof(n) options:MTLResourceStorageModeShared];
+                if (!bufA || !bufB || !bufC || !bufN)
+                {
+                    return false;
+                }
+
+                id<MTLCommandBuffer> cmd = [queue_ commandBuffer];
+                id<MTLComputeCommandEncoder> enc = [cmd computeCommandEncoder];
+                [enc setComputePipelineState:pso];
+                [enc setBuffer:bufA offset:0 atIndex:0];
+                [enc setBuffer:bufB offset:0 atIndex:1];
+                [enc setBuffer:bufC offset:0 atIndex:2];
+                [enc setBuffer:bufN offset:0 atIndex:3];
+
+                const NSUInteger w = pso.maxTotalThreadsPerThreadgroup;
+                const MTLSize threadsPerThreadgroup = MTLSizeMake(w, 1, 1);
+                const MTLSize threadsPerGrid = MTLSizeMake(n, 1, 1);
+                [enc dispatchThreads:threadsPerGrid threadsPerThreadgroup:threadsPerThreadgroup];
+                [enc endEncoding];
+                [cmd commit];
+                [cmd waitUntilCompleted];
+
+                std::memcpy(c, [bufC contents], bytes);
+                return true;
+            }
+        }
+
+    private:
+        id<MTLLibrary> library(NSError **error)
+        {
+            if (library_)
+            {
+                return library_;
+            }
+
+            NSFileManager *fm = [NSFileManager defaultManager];
+
+            // Offline metallib (built into build dir as default.metallib)
+            NSArray<NSString *> *libCandidates = @[
+                @"default.metallib",
+                @"build/default.metallib",
+                @"executor/op-mem-mps/build/default.metallib",
+            ];
+
+            for (NSString *rel in libCandidates)
+            {
+                NSString *path = [[fm currentDirectoryPath] stringByAppendingPathComponent:rel];
+                if (![fm fileExistsAtPath:path])
+                {
+                    continue;
+                }
+                NSURL *url = [NSURL fileURLWithPath:path];
+                library_ = [device_ newLibraryWithURL:url error:error];
+                if (library_)
+                {
+                    return library_;
+                }
+            }
+
+            // Fallback: compile from .metal source at runtime.
+            // We try both the basename and the common repo-relative paths.
+            const std::string src = metal_source_basename_.empty() ? std::string("elementwise_miaobyte.metal") : metal_source_basename_;
+            NSArray<NSString *> *srcCandidates = @[
+                [NSString stringWithUTF8String:src.c_str()],
+                [@"src/deepx/tensorfunc" stringByAppendingPathComponent:[NSString stringWithUTF8String:src.c_str()]],
+                [@"executor/op-mem-mps/src/deepx/tensorfunc" stringByAppendingPathComponent:[NSString stringWithUTF8String:src.c_str()]],
+            ];
+
+            for (NSString *rel in srcCandidates)
+            {
+                NSString *path = [[fm currentDirectoryPath] stringByAppendingPathComponent:rel];
+                if (![fm fileExistsAtPath:path])
+                {
+                    continue;
+                }
+                NSError *readErr = nil;
+                NSString *metalSource = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:&readErr];
+                if (!metalSource)
+                {
+                    continue;
+                }
+                MTLCompileOptions *opts = [MTLCompileOptions new];
+                library_ = [device_ newLibraryWithSource:metalSource options:opts error:error];
+                if (library_)
+                {
+                    return library_;
+                }
+            }
+
+            return nil;
+        }
+
+        id<MTLComputePipelineState> pipeline(const char *kernel_fn, NSError **error)
+        {
+            // Cache pipelines by function name (std::string key, so lookups do not
+            // depend on NSString pointer identity).
+            auto it = pipeline_cache_.find(kernel_fn);
+            if (it != pipeline_cache_.end())
+            {
+                return it->second;
+            }
+
+            id<MTLLibrary> lib = library(error);
+            if (!lib)
+            {
+                return nil;
+            }
+
+            NSString *fnName = [NSString stringWithUTF8String:kernel_fn];
+            id<MTLFunction> fn = [lib newFunctionWithName:fnName];
+            if (!fn)
+            {
+                return nil;
+            }
+
+            id<MTLComputePipelineState> pso = [device_ newComputePipelineStateWithFunction:fn error:error];
+            if (pso)
+            {
+                pipeline_cache_.emplace(kernel_fn, pso);
+            }
+            return pso;
+        }
+
+        std::string metal_source_basename_;
+        id<MTLDevice> device_ = nil;
+        id<MTLCommandQueue> queue_ = nil;
+        id<MTLLibrary> library_ = nil;
+        std::unordered_map<std::string, id<MTLComputePipelineState>> pipeline_cache_;
+    };
+#endif
+} // namespace deepx::mps::common
+
+#endif // DEEPX_TENSORFUNC_MPS_COMMON_HPP
diff --git a/executor/op-mem-mps/src/deepx/tensorfunc/tensorlife_miaobyte.hpp b/executor/op-mem-mps/src/deepx/tensorfunc/tensorlife_miaobyte.hpp
new file mode 100644
index 00000000..304bc72d
--- /dev/null
+++ b/executor/op-mem-mps/src/deepx/tensorfunc/tensorlife_miaobyte.hpp
@@ -0,0 +1,73 @@
+#ifndef DEEPX_TENSORFUNC_TENSORLIFE_MIAOBYTE_HPP
+#define DEEPX_TENSORFUNC_TENSORLIFE_MIAOBYTE_HPP
+
+#include <algorithm>
+#include <string>
+#include <vector>
+
+#include "stdutil/fs.hpp"
+#include "deepx/tensor.hpp"
+#include "deepx/dtype_mps.hpp"
+#include "deepx/tensorfunc/tensorlife.hpp"
+
+namespace deepx::tensorfunc
+{
+    template <typename T>
+    static T *newFn(int size)
+    {
+        return new T[size];
+    }
+
+    template <typename T>
+    static void freeFn(T *data)
+    {
+        delete[] data;
+    }
+
+    template <typename T>
+    static void copyFn(T *src, T *dest, int size)
+    {
+        std::copy(src, src + size, dest);
+    }
+
+    template <typename T>
+    static void saveFn(T *data, size_t size, const std::string &path)
+    {
+        auto *udata = reinterpret_cast<unsigned char *>(data);
+        size_t bytes = size * sizeof(T);
+        stdutil::save(udata, bytes, path);
+    }
+
+    template <typename T>
+    static void loadFn(const std::string &path, T *data, int size)
+    {
+        auto *udata = reinterpret_cast<unsigned char *>(data);
+        size_t bytes = size * sizeof(T);
+        stdutil::load(path, udata, bytes);
+    }
+
+    template <typename T>
+    Tensor<T> New(const std::vector<int> &shapedata)
+    {
+        Shape shape(shapedata);
+        shape.dtype = precision<T>();
+
+        Tensor<T> tensor(shape);
+        tensor.deleter = freeFn<T>;
+        tensor.copyer = copyFn<T>;
+        tensor.newer = newFn<T>;
+        tensor.saver = saveFn<T>;
+        tensor.loader = loadFn<T>;
+
+        tensor.data = newFn<T>(shape.size);
+        return tensor;
+    }
+
+    template <typename T>
+    void copy(const Tensor<T> &src, Tensor<T> &dst)
+    {
+        dst.shape = src.shape;
+        dst.copyer(src.data, dst.data, src.shape.size);
+    }
+}
+
+#endif // DEEPX_TENSORFUNC_TENSORLIFE_MIAOBYTE_HPP
diff --git a/executor/op-mem-mps/swift.md b/executor/op-mem-mps/swift.md
index d5f99422..68fec32c 100644
--- a/executor/op-mem-mps/swift.md
+++ b/executor/op-mem-mps/swift.md
@@ -1,3 +1,13 @@
-# todo
+# op-mem-mps
-支持apple mps芯片
\ No newline at end of file
+The MPS/Metal executor project for macOS has been initialized; it currently provides a minimal runnable skeleton and device detection.
+
+## Contents
+
+- CMake configuration and build scripts
+- MPS device and context initialization (Objective-C++)
+- Minimal example entry point
+
+## Entry point
+
+- See the README for build and run instructions
\ No newline at end of file
diff --git a/executor/op-mem-mps/test/tensorfunc/1_fill.cpp b/executor/op-mem-mps/test/tensorfunc/1_fill.cpp
new file mode 100644
index 00000000..1acc10e0
--- /dev/null
+++ b/executor/op-mem-mps/test/tensorfunc/1_fill.cpp
@@ -0,0 +1,21 @@
+#include <iostream>
+#include <vector>
+
+#include "deepx/tensor.hpp"
+#include "deepx/dtype.hpp"
+#include "deepx/tensorfunc/authors.hpp"
+#include "deepx/tensorfunc/tensorlife_miaobyte.hpp"
+#include "deepx/tensorfunc/init_miaobyte.hpp"
+#include "deepx/tensorfunc/io_miaobyte.hpp"
+
+using namespace deepx;
+using namespace deepx::tensorfunc;
+
+int main()
+{
+    auto t = New<float>({2, 3});
+    constant<miaobyte>(t, 3.5f);
+    print<miaobyte>(t, "%2.3f");
+
+    return 0;
+}
diff --git a/executor/op-mem-mps/test/tensorfunc/2_add.cpp b/executor/op-mem-mps/test/tensorfunc/2_add.cpp
new file mode 100644
index 00000000..9c1ff4d0
--- /dev/null
+++ b/executor/op-mem-mps/test/tensorfunc/2_add.cpp
@@ -0,0 +1,26 @@
+#include <iostream>
+#include <vector>
+
+#include "deepx/tensor.hpp"
+#include "deepx/dtype.hpp"
+#include "deepx/tensorfunc/authors.hpp"
+#include "deepx/tensorfunc/tensorlife_miaobyte.hpp"
+#include "deepx/tensorfunc/init_miaobyte.hpp"
+#include "deepx/tensorfunc/elementwise_miaobyte.hpp"
+#include "deepx/tensorfunc/io_miaobyte.hpp"
+
+using namespace deepx;
+using namespace deepx::tensorfunc;
+
+int main()
+{
+    auto a = New<float>({2, 2});
+    auto b = New<float>({2, 2});
+    auto c = New<float>({2, 2});
+
+    constant<miaobyte>(a, 1.25f);
+    constant<miaobyte>(b, 2.5f);
+    add<miaobyte>(a, b, c);
+    print<miaobyte>(c, "%2.3f");
+    return 0;
+}
diff --git a/executor/op-mem-mps/test/tensorfunc/CMakeLists.txt b/executor/op-mem-mps/test/tensorfunc/CMakeLists.txt
new file mode 100644
index 00000000..902027fc
--- /dev/null
+++ b/executor/op-mem-mps/test/tensorfunc/CMakeLists.txt
@@ -0,0 +1,10 @@
+add_executable(1_fill 1_fill.cpp)
+target_link_libraries(1_fill deepx_mps)
+
+add_executable(2_add 2_add.cpp)
+target_link_libraries(2_add deepx_mps)
+
+if(APPLE)
+    set_source_files_properties(1_fill.cpp PROPERTIES LANGUAGE OBJCXX)
+    set_source_files_properties(2_add.cpp PROPERTIES LANGUAGE OBJCXX)
+endif()
diff --git a/executor/op-mem-mps/vibe.md b/executor/op-mem-mps/vibe.md
new file mode 100644
index 00000000..0634e967
--- /dev/null
+++ b/executor/op-mem-mps/vibe.md
@@ -0,0 +1,132 @@
+# op-mem-mps design proposal (Vibe)
+
+> Goal: provide a Metal Performance Shaders (MPS) based executor for macOS / Apple Silicon with the same interfaces and behavior as the existing executors.
+
+## 1. Goals and scope
+
+### 1.1 Goals
+- Provide an MPS backend executor (op-mem-mps) that plugs into deepx's Tensor / TF / Op system.
+- Stay compatible with the existing network communication and task scheduling flow (UDP server + TF factory).
+- Minimum viable: device detection, memory allocation, tensor lifecycle, and a small set of operators (init/io/elementwise).
+
+### 1.2 Non-goals
+- No refactoring of a cross-platform abstraction layer.
+- Not full operator coverage; phase one implements only a core subset.
+
+---
+
+## 2. Overall architecture
+
+### 2.1 Modules
+- client: entry point, network service, TF registration and scheduling
+- mem: MPS device/context and buffer management
+- tensorfunc: operator implementations (organized by author and precision)
+- tf: TF wrappers and parameter binding
+
+### 2.2 Key components
+1. MPSDevice
+   - Enumerate and select MTLDevice
+   - Support Apple Silicon as well as Intel + AMD (where available)
+
+2. MPSContext
+   - Own the MTLCommandQueue
+   - Unified command buffer lifecycle
+
+3. MPSBuffer / MemBase
+   - Unified tensor memory management
+   - Bridges to deepx Tensor
+
+---
+
+## 3. Data flow and execution
+
+front(py) -> UDP -> TFFactory -> TF.run -> tensorfunc -> MPS -> output
+
+- TF handles parameter parsing and scheduling
+- tensorfunc implements the operators
+- mem handles storage and synchronization
+
+---
+
+## 4. Directory and code layout (proposed)
+
+executor/op-mem-mps/
+  src/
+    client/
+      main.mm
+      tfs.cpp
+    deepx/
+      mem/
+      tf/
+      tensorfunc/
+        mps_device.{hpp,mm}
+        mps_context.{hpp,mm}
+
+---
+
+## 5. API design and contract
+
+### 5.1 Consistent with the existing executors
+- register_all(TfFactory&)
+- TF::run(shared_ptr<MemBase>, string &error)
+- MemBase::gettensor()
+
+### 5.2 MPS constraints
+- Results must be read back synchronously after the command buffer is committed (see the sketch at the end of this document)
+- Use MTLStorageModeShared throughout so the CPU can read buffers (phase one)
+
+---
+
+## 6. Priority roadmap (MVP -> v1)
+
+### MVP
+- Device detection + MPSContext
+- Tensor lifecycle (new, delete)
+- init (ones, zeros, arange)
+- elementwise (add, mul)
+
+### v1
+- matmul
+- reduce
+- changeshape
+
+---
+
+## 7. Build and dependencies
+
+- CMake + Objective-C++
+- Dependencies:
+  - Metal / MetalPerformanceShaders
+  - yaml-cpp (same as the other executors)
+
+---
+
+## 8. Risks and constraints
+
+- MPS has limited support for some precisions (e.g. int8)
+- Metal buffers in CPU-shared storage mode have limited performance
+- Asynchronous execution and synchronization points are complex to manage
+
+---
+
+## 9. Milestones
+
+| Phase | Scope | Time |
+|------|------|------|
+| MVP | device + memory + basic operators | 2 weeks |
+| v1 | coverage of common operators | 4 weeks |
+
+---
+
+## 10. Validation strategy
+
+- Compare outputs against op-mem-ompsimd
+- Unit tests + regression of the front-end examples
+
+---
+
+## 11. Open questions
+
+1. What is the minimum operator set to support?
+2. Are custom Metal shader kernels allowed?
+3. May MPSGraph be used to compose operators?
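
> 编辑者注 / editor's note: to make the section 5.2 contract concrete, here is a minimal Objective-C++ sketch of the phase-one pattern that `MetalKernelRuntime::dispatch_binary_1d` in mps_common.hpp already follows: shared-storage buffers, encode, commit, blocking wait, then CPU readback. The helper name `add_f32_blocking` and the pre-created `device`/`queue`/`pso` arguments are assumptions for illustration, not part of the existing API.

```objc
#import <Metal/Metal.h>
#include <cstring>

// Sketch of the phase-one synchronization rule from section 5.2, in isolation.
// `device`, `queue`, and `pso` (a pipeline built for the add_f32 kernel) are
// assumed to be created elsewhere, e.g. the way MetalKernelRuntime does it.
static bool add_f32_blocking(id<MTLDevice> device,
                             id<MTLCommandQueue> queue,
                             id<MTLComputePipelineState> pso,
                             const float *a, const float *b, float *c, uint32_t n)
{
    const size_t bytes = n * sizeof(float);
    id<MTLBuffer> bufA = [device newBufferWithBytes:a length:bytes options:MTLResourceStorageModeShared];
    id<MTLBuffer> bufB = [device newBufferWithBytes:b length:bytes options:MTLResourceStorageModeShared];
    id<MTLBuffer> bufC = [device newBufferWithLength:bytes options:MTLResourceStorageModeShared];
    id<MTLBuffer> bufN = [device newBufferWithBytes:&n length:sizeof(n) options:MTLResourceStorageModeShared];
    if (!bufA || !bufB || !bufC || !bufN) return false;

    id<MTLCommandBuffer> cmd = [queue commandBuffer];
    id<MTLComputeCommandEncoder> enc = [cmd computeCommandEncoder];
    [enc setComputePipelineState:pso];
    [enc setBuffer:bufA offset:0 atIndex:0];
    [enc setBuffer:bufB offset:0 atIndex:1];
    [enc setBuffer:bufC offset:0 atIndex:2];
    [enc setBuffer:bufN offset:0 atIndex:3];
    [enc dispatchThreads:MTLSizeMake(n, 1, 1)
        threadsPerThreadgroup:MTLSizeMake(pso.maxTotalThreadsPerThreadgroup, 1, 1)];
    [enc endEncoding];

    [cmd commit];
    [cmd waitUntilCompleted];               // the synchronization point required by section 5.2
    std::memcpy(c, [bufC contents], bytes); // CPU readback works because storage mode is Shared
    return true;
}
```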