+
+*Figure 5: The top 1% of gradients can contribute over 85% of the total gradient norm.*
+
+---
+
+## ZenFlow Design
+
+ZenFlow is designed around three key ideas that separate critical from non-critical gradient updates while minimizing communication bottlenecks. Together, they break the tight coupling between GPU and CPU computation to create a **stall-free** pipeline.
+
+### Idea 1: Importance-Aware Top-k Gradient Update
+
+Not all gradients are equally impactful for training. ZenFlow introduces an **importance-aware** design that prioritizes updates for the top-k most significant gradients. These gradients are updated directly on the GPU, exploiting its high compute throughput. This **shrinks the per-step gradient traffic** between GPU and CPU by nearly **50%**, roughly halving the communication load.
+
+The remaining gradients, which contribute less to the model's learning, are batched and applied asynchronously on the CPU. These updates are **deferred** until enough gradients have accumulated, reducing their impact on training speed.
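
As a rough sketch, the split between critical and deferred gradients might look like the following. This is a NumPy stand-in, not ZenFlow's API: the function name, the 1% ratio, and per-column norms as the importance score are illustrative assumptions.

```python
import numpy as np

def split_by_importance(grad, k_ratio=0.01):
    """Split a 2-D gradient into top-k "hot" columns (fast GPU path)
    and the remaining "cold" columns (deferred CPU path)."""
    col_norms = np.linalg.norm(grad, axis=0)   # importance proxy per column
    k = max(1, int(k_ratio * grad.shape[1]))
    topk_cols = np.argsort(col_norms)[-k:]     # indices of the top-k columns
    mask = np.zeros(grad.shape[1], dtype=bool)
    mask[topk_cols] = True
    return grad[:, mask], grad[:, ~mask], topk_cols

rng = np.random.default_rng(0)
g = rng.standard_normal((512, 400))            # toy gradient: 512 rows, 400 columns
hot, cold, idx = split_by_importance(g)        # 1% of 400 columns -> 4 hot columns
```

The hot slice would be applied immediately on the GPU; the cold slice is what gets batched for the asynchronous CPU update described above.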
+
+### Idea 2: Bounded-Asynchronous CPU Accumulation
+
+ZenFlow’s **asynchronous accumulation** keeps the CPU busy while the GPU performs other computations. We apply a bounded **accumulation window** to the non-critical gradients, letting them accumulate over several iterations before they are applied; the bound also limits gradient staleness. This lets ZenFlow process **multiple rounds of gradient updates** concurrently, eliminating the idle time typically spent waiting for the CPU optimizer.
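
A minimal simulation of such a window, under illustrative assumptions (plain SGD stands in for the CPU optimizer, and the window size and learning rate are made up):

```python
import numpy as np

W = 4                         # accumulation window, in GPU steps (bounds staleness)
lr = 0.1                      # illustrative learning rate
param = np.zeros(3)           # a toy "non-critical" parameter slice
acc = np.zeros_like(param)    # CPU-side accumulation buffer

for step in range(8):
    grad = np.ones(3)         # stand-in for this step's non-critical gradient
    acc += grad               # cheap accumulate; no optimizer step yet
    if (step + 1) % W == 0:   # window full: one deferred CPU update
        param -= lr * acc     # apply the summed gradient in a single step
        acc[:] = 0.0
```

Eight steps with `W = 4` trigger just two deferred CPU updates instead of eight, which is what gives the CPU room to run concurrently with GPU compute.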
+
+By carefully coordinating CPU updates with GPU execution, ZenFlow **fully hides CPU execution** behind GPU computation—ensuring that GPUs remain actively utilized, avoiding stalls, and **maximizing hardware efficiency**.
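
A back-of-envelope check shows when the CPU step is fully hidden; all timings below are made-up numbers for illustration, not measurements:

```python
t_gpu_step = 2.0     # assumed seconds per GPU forward+backward pass
t_cpu_update = 6.0   # assumed seconds per deferred CPU optimizer step
W = 4                # accumulation window, in GPU steps

# The CPU update overlaps W GPU steps, so it is fully hidden whenever
# t_cpu_update <= W * t_gpu_step; otherwise the difference shows up as a stall.
hidden = t_cpu_update <= W * t_gpu_step
stall = max(0.0, t_cpu_update - W * t_gpu_step)
```

With these numbers the CPU has 8 s of GPU work to hide behind a 6 s update, so the GPU never waits.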
+
+### Idea 3: Lightweight Gradient Selection
+
+A key challenge in distributed training is **selecting important gradients** without introducing prohibitive communication and GPU memory costs. Traditional systems rely on global synchronization (via `AllGather`) to gather full gradients, which can become a major bottleneck in multi-GPU settings.
+
+ZenFlow solves this with a **lightweight gradient proxy**: instead of transferring full gradients, ZenFlow approximates each column's importance with its **per-column gradient norm**. Communicating this compact summary (e.g., squared norms) instead of the gradients themselves reduces communication volume by more than **4,000×**, with nearly no loss in accuracy.
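
The savings follow directly from the shapes involved: for a `(rows × cols)` gradient, the proxy is `cols` floats instead of `rows × cols`, a factor-of-`rows` reduction (a layer with 4,096 rows gives ~4,096×, in line with the >4,000× figure). A sketch with an illustrative, hypothetical layer shape:

```python
import numpy as np

rows, cols = 4096, 1024                       # illustrative layer shape
grad = np.random.default_rng(1).standard_normal((rows, cols)).astype(np.float32)

proxy = np.sum(grad * grad, axis=0)           # per-column squared norms: `cols` floats
full_payload = grad.size                      # floats moved if full gradients were gathered
proxy_payload = proxy.size                    # floats moved for the compact proxy
reduction = full_payload / proxy_payload      # equals `rows`: 4096x here
```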
+
+This approach allows ZenFlow to **scale efficiently across GPUs**, without high memory or communication overhead, and it supports **dynamic gradient selection** as the model evolves.
+
+### Putting It All Together: ZenFlow’s Zero-Stall Pipeline
+
+
+*Figure 6: ZenFlow’s stall-free pipeline overlaps CPU updates and transfers with multi-step GPU compute.*
+
+1. **Forward/Backward Pass on GPU:** ZenFlow processes the forward and backward passes on the GPU, immediately updating the **top-k gradients** on the GPU without waiting for the CPU.
+
+2. **Gradient Transfer to CPU:** While the GPU is busy, gradients from the current iteration (or previous ones) are transferred to the CPU over a dedicated PCIe stream. This is done in parallel with GPU computation, without causing any GPU wait time.
+
+3. **CPU Update:** Once a batch of non-critical gradients has accumulated, the CPU performs the update asynchronously. This update typically spans multiple GPU iterations, but is hidden behind GPU work, making it virtually invisible to the overall pipeline.
+
+4. **Double Buffering:** ZenFlow uses **double buffering** to manage the newly updated parameters. When the CPU update completes, the new parameters are transferred back to the GPU and swapped in as fast as a pointer flip: no need to reload the entire model or relaunch kernels.
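
The pointer flip in step 4 can be sketched in a few lines; buffer layout and names here are illustrative, not ZenFlow's internals:

```python
buffers = [[1.0, 1.0], [0.0, 0.0]]   # two parameter buffers; buffers[0] starts as live
live = 0                             # index of the buffer the GPU currently reads

def cpu_finish_update(new_params):
    """CPU side: write refreshed parameters into the staging buffer,
    then publish them by flipping the live index (an O(1) swap)."""
    global live
    staging = 1 - live
    buffers[staging][:] = new_params
    live = staging                   # GPU picks up the new buffer next step

cpu_finish_update([0.9, 0.9])
```

Because publishing is just an index flip, the GPU keeps reading a consistent buffer the whole time and never pauses for a bulk parameter copy.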
+
+By constantly **overlapping GPU computation with CPU-side work**, ZenFlow transforms the traditional compute → wait → update cycle into a continuous, **stall-free pipeline**.
+
+---
+
+## Getting Started: Try out DeepSpeed-ZenFlow
+
+To try out DeepSpeed-ZenFlow, please refer to the [ZenFlow tutorial](https://github.com/deepspeedai/DeepSpeedExamples/blob/master/training/DeepSpeed-ZenFlow/README.md) in our DeepSpeedExamples repo.
+
+---
+
+## Citation
+
+```bibtex
+@article{lan2025zenflow,
+  title   = {ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates},
+  author  = {Tingfeng Lan and Yusen Wu and Bin Ma and Zhaoyuan Su and Rui Yang and Tekin Bicer and Masahiro Tanaka and Olatunji Ruwase and Dong Li and Yue Cheng},
+  journal = {arXiv preprint arXiv:2505.12242},
+  year    = {2025}
+}
+```
+
+---
+
+## Acknowledgements
+
+This work is the result of a close collaboration between the University of Virginia (UVA), the University of California, Merced (UC Merced), Argonne National Laboratory (ANL), and the DeepSpeed team.
+
+The contributors include [Tingfeng Lan](https://antlera.github.io/), [Yusen Wu](https://joshwoo2003.github.io/), [Zhaoyuan Su](https://alexsssu.github.io/), [Rui Yang](https://ruiyang00.github.io/), and [Yue Cheng](https://tddg.github.io/) from UVA; [Bin Ma](https://www.linkedin.com/in/bin-ma-ba665b182/) and [Dong Li](https://faculty.ucmerced.edu/dong-li/) from UC Merced; [Tekin Bicer](https://www.anl.gov/profile/tekin-bicer) from ANL; and [Olatunji Ruwase](https://www.linkedin.com/in/tunji-ruwase-088952/) and [Masahiro Tanaka](https://www.linkedin.com/in/masahiro-tanaka-77482926/) from the DeepSpeed team. We especially thank Olatunji Ruwase and Masahiro Tanaka for their early feedback, insightful discussions, and open-source community support.
diff --git a/blogs/deepspeed-zenflow/images/zenflow-example.png b/blogs/deepspeed-zenflow/images/zenflow-example.png
new file mode 100644
index 000000000000..316e8123eccf
Binary files /dev/null and b/blogs/deepspeed-zenflow/images/zenflow-example.png differ
diff --git a/blogs/deepspeed-zenflow/images/zenflow-gradients.png b/blogs/deepspeed-zenflow/images/zenflow-gradients.png
new file mode 100644
index 000000000000..017d5e7ba0a7
Binary files /dev/null and b/blogs/deepspeed-zenflow/images/zenflow-gradients.png differ
diff --git a/blogs/deepspeed-zenflow/images/zenflow-logo.png b/blogs/deepspeed-zenflow/images/zenflow-logo.png
new file mode 100644
index 000000000000..1e6021d36e98
Binary files /dev/null and b/blogs/deepspeed-zenflow/images/zenflow-logo.png differ
diff --git a/blogs/deepspeed-zenflow/images/zenflow-no-overlap.png b/blogs/deepspeed-zenflow/images/zenflow-no-overlap.png
new file mode 100644
index 000000000000..7995d8d4daa0
Binary files /dev/null and b/blogs/deepspeed-zenflow/images/zenflow-no-overlap.png differ
diff --git a/blogs/deepspeed-zenflow/images/zenflow-overview.png b/blogs/deepspeed-zenflow/images/zenflow-overview.png
new file mode 100644
index 000000000000..c6d4e41132a8
Binary files /dev/null and b/blogs/deepspeed-zenflow/images/zenflow-overview.png differ
diff --git a/blogs/deepspeed-zenflow/images/zenflow-workflow.png b/blogs/deepspeed-zenflow/images/zenflow-workflow.png
new file mode 100644
index 000000000000..6f704f7a48ec
Binary files /dev/null and b/blogs/deepspeed-zenflow/images/zenflow-workflow.png differ
diff --git a/blogs/deepspeed-zenflow/images/zero-offload-stall.png b/blogs/deepspeed-zenflow/images/zero-offload-stall.png
new file mode 100644
index 000000000000..f68f4421af33
Binary files /dev/null and b/blogs/deepspeed-zenflow/images/zero-offload-stall.png differ