Computer Science Self-Learning Notes

📌 Overview

This repository archives the notes and materials from my computer science self-learning journey. I currently focus on LLM/VLM inference engines and GPU/NPU computing, so I have gathered technical blogs for AI-infra beginners and MLSys papers for researchers.

🔍 Contents:

  • 📚 Learning Notes
  • 📚 Technical Blogs
  • 📚 Papers
  • 📚 Learning Projects

In addition, I have published some technical blogs of my own; you can read them via the links below.

😊 Welcome to star this repository!

📚 Learning Notes

  • 🧱 Basic Knowledge
  • 🤖 AI
  • 🚀 Backend & Big Data
  • 🛠️ Tools
  • 🔗 Others

📚 Technical Blogs

📖 Basic Knowledge

| Title | Category | Author | Note | Rec | Read |
| --- | --- | --- | --- | --- | --- |
| The Illustrated Transformer | Transformer | @Jay Alammar | Detailed explanation of how Transformers work | ⭐️⭐️⭐️⭐️⭐️ | |
| The Illustrated GPT-2 (Visualizing Transformer Language Models) | Transformer | @Jay Alammar | Transformer inference process | ⭐️⭐️⭐️⭐️⭐️ | |
| 图文详解 LLM inference:KV Cache | KV Cache | @季叶 | | ⭐️⭐️⭐️ | |
| Mixture of Experts Explained | MoE | @HuggingFace Blog | MoE overview | ⭐️⭐️⭐️⭐️ | |
| MoE 并行负载均衡:EPLB 的深度解析与可视化 | MoE | @kaiyuan | | ⭐️⭐️⭐️ | |
| LLM 推理并行优化的必备知识 | Parallel Strategy | @kaiyuan | | | |
| 分布式推理优化思路 | Parallel Strategy | @kaiyuan | | | |
| The Ultra-Scale Playbook: Training LLMs on GPU Clusters | Parallel Strategy | @HuggingFace Blog | | | |
| 图解大模型计算加速系列:分离式推理架构 1,从 DistServe 谈起 | PD Disaggregation | @猛猿 | Detailed explanation of prefill/decode (PD) disaggregation | ⭐️⭐️⭐️⭐️ | |
| 图解大模型计算加速系列:分离式推理架构 2,模糊分离与合并边界的 chunked-prefills | Schedule | @猛猿 | | ⭐️⭐️⭐️⭐️ | |
| LLM 推理提速:Attention 与 FFN 分离方案解析 | AF Disaggregation | @kaiyuan | Detailed explanation of attention/FFN (AF) disaggregation | ⭐️⭐️⭐️ | |
| Step-3 AF 分离推理系统 vs Deepseek EP 推理系统,谁更好? | AF Disaggregation | @不归牛顿管的熊猫 | Trade-offs of AF disaggregation vs. large-scale EP | ⭐️⭐️ | |
| Step-3 推理系统:从 PD 分离到 AF 分离(AFD) | AF Disaggregation | @Yibo Zhu | Commentary from a Step-3 author | ⭐️⭐️ | |
| GPU 内存(显存)的理解与基本使用 | Hardware | @kaiyuan | | ⭐️⭐️⭐️⭐️ | |

📖 Dive into vLLM

| Title | Category | Author | Note | Rec | Read |
| --- | --- | --- | --- | --- | --- |
| Inside vLLM: Anatomy of a High-Throughput LLM Inference System | Overview | @vLLM Blog | Comprehensive deep dive into vLLM | ⭐️⭐️⭐️⭐️⭐️ | |
| vLLM V1 整体流程|从请求到算子执行 | Architecture | @SSS不知-道 | vLLM inference flow | ⭐️⭐️⭐️⭐️⭐️ | |
| 图解 vLLM V1 系列 1:整体流程 | Architecture | @猛猿 | | ⭐️⭐️⭐️ | |
| 图解 vLLM V1 系列 2:Executor-Workers 架构 | Architecture | @猛猿 | | ⭐️⭐️⭐️ | |
| 图解 vLLM V1 系列 3:KV Cache 初始化 | KV Cache | @猛猿 | | ⭐️⭐️⭐️ | |
| 图解 vLLM V1 系列 4:加载模型权重 | Model | @猛猿 | | ⭐️⭐️ | |
| vLLM 模型权重加载:使用 setattr | Model | @风之魔术师 | | ⭐️⭐️ | |
| ColumnParallelLinear 和 RowParallelLinear | Model | @风之魔术师 | | ⭐️⭐️ | |
| 图解 vLLM V1 系列 5:调度器策略 | Scheduler | @猛猿 | | ⭐️⭐️⭐️⭐️ | |
| Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU | Platform | @The Ascend Team on vLLM | vLLM hardware plugin mechanism | ⭐️⭐️⭐️ | |
| vLLM 算力多样性|Platform 插件与 CustomOp | Platform | @SSS不知-道 | | ⭐️⭐️⭐️⭐️ | |
| vLLM 算子开发流程:“保姆级”详细记录 | Kernel | @DefTruth | | ⭐️⭐️⭐️⭐️⭐️ | |
| Introduction to torch.compile and How It Works with vLLM | Graph | @vLLM Blog | | ⭐️⭐️ | |
| vLLM torch.compile Integration | Graph | @Jiangyun Zhu | Writing custom compilation passes | ⭐️⭐️⭐️ | |
| vLLM 为什么没在 Prefill 阶段支持 Cuda Graph? | Graph | @kaiyuan | | ⭐️⭐️⭐️ | |
| vLLM 显存管理详解 | Memory | @kaiyuan | | ⭐️⭐️⭐️⭐️ | |
| vLLM DP 特性与演进方案分析 | Parallel Strategy | @kaiyuan | | ⭐️⭐️⭐️⭐️ | |
| LLM 推理数据并行负载均衡(DPLB)浅析 | Parallel Strategy | @kaiyuan | | ⭐️⭐️⭐️ | |
| vLLM PD 分离方案浅析 | PD Disaggregation | @kaiyuan | | ⭐️⭐️⭐️ | |
| vLLM PD 分离 KV Cache 传递机制详解与演进分析 | PD Disaggregation | @kaiyuan | | ⭐️⭐️⭐️ | |
| vLLM 结构化输出|Guided Decoding (V0) | Guided Decoding | @SSS不知-道 | | ⭐️⭐️⭐️ | |
| vLLM 结构化输出|Guided Decoding (V1) | Guided Decoding | @SSS不知-道 | | ⭐️⭐️⭐️ | |
| vLLM 多模态推理|卷积计算加速 | Multi-Modal | @SSS不知-道 | | ⭐️⭐️ | |

📖 Dive into PyTorch

| Title | Category | Author | Note | Rec | Read |
| --- | --- | --- | --- | --- | --- |
| PyTorch 显存管理介绍与源码解析(一) | Memory | @kaiyuan | | ⭐️⭐️⭐️⭐️ | |
| PyTorch 显存可视化与 Snapshot 数据分析 | Memory | @kaiyuan | | ⭐️⭐️⭐️⭐️ | |

📖 CUDA Programming

| Title | Category | Author | Note | Rec | Read |
| --- | --- | --- | --- | --- | --- |
| CUDA 内核优化策略 | Performance | @Zhang | | ⭐️⭐️⭐️ | |
| 从啥也不会到 CUDA GEMM 优化 | Performance | @猛猿 | | ⭐️⭐️⭐️⭐️ | |

📖 Communication

| Title | Category | Author | Note | Rec | Read |
| --- | --- | --- | --- | --- | --- |
| NCCL: Collective Operations | Collective Communication | @NVIDIA Developer | Common collective-communication operations | ⭐️⭐️⭐️⭐️⭐️ | |
| 一文读懂|RDMA 原理 | Network | @Linux内核库 | | ⭐️⭐️⭐️ | |

📖 Multi-Modality

| Title | Category | Author | Note | Rec | Read |
| --- | --- | --- | --- | --- | --- |
| 多模态技术梳理:ViT 系列 | ViT | @姜富春 | Survey of ViT research | ⭐️⭐️⭐️ | |
| ViT 论文速读 | ViT | @Zhang | | ⭐️⭐️ | |
| LLaVA 系列模型结构详解 | ViT | @Zhang | | ⭐️⭐️⭐️ | |

📖 Dive into Qwen

| Title | Category | Author | Note | Rec | Read |
| --- | --- | --- | --- | --- | --- |
| 多模态技术梳理:Qwen-VL 系列 | VL | @姜富春 | | ⭐️⭐️⭐️⭐️ | |
| Qwen2-VL 源码解读:从准备一条样本到模型生成全流程图解 | VL | @姜富春 | | ⭐️⭐️⭐️⭐️⭐️ | |
| 万字长文图解 Qwen2.5-VL 实现细节 | VL | @猛猿 | | ⭐️⭐️⭐️⭐️⭐️ | |

📖 Dive into DeepSeek

| Title | Category | Author | Note | Rec | Read |
| --- | --- | --- | --- | --- | --- |
| DeepSeek 技术解读(1)- 彻底理解 MLA(Multi-Head Latent Attention) | Attention | @姜富春 | | ⭐️⭐️⭐️⭐️ | |
| DeepSeek 技术解读(2)- MTP(Multi-Token Prediction)的前世今生 | Parallel Decoding | @姜富春 | | ⭐️⭐️⭐️⭐️ | |
| DeepSeek 技术解读(3)- MoE 的演进之路 | MoE | @姜富春 | | ⭐️⭐️⭐️⭐️ | |

📖 Development

| Title | Category | Author | Note | Rec | Read |
| --- | --- | --- | --- | --- | --- |
| LLM Inference 高效 Debug 方法汇总 | Debug | @CarryPls | vLLM debugging tips | ⭐️⭐️ | |
| 推理性能优化:GPU/NPU Profiling 阅读引导 | Profiling | @kaiyuan | | ⭐️⭐️⭐️⭐️ | |

📚 Papers

Refer to How to Read a Paper to master a practical and efficient three-pass method for reading research papers.

Clarification for symbols in the following tables:

  • ✅: The first pass that gives you a general idea about the paper.
  • ✅ ✅: The second pass that lets you grasp the paper's content, but not its details.
  • ✅ ✅ ✅: The third pass that helps you understand the paper in depth.

📖 LLM Backbone

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | 2023/12 | arXiv | Mamba | link | |
| Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality | 2024/05 | arXiv | | | |

📖 LLM Inference Survey

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |

📖 Framework

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | 2023/09 | arXiv | vLLM | | ✅ ✅ ✅ |
| SGLang: Efficient Execution of Structured Language Model Programs | 2023/12 | arXiv | SGLang | | |
| A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency | 2025/05 | arXiv | | | |

📖 Schedule

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |

📖 Speculative Decoding

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| Blockwise Parallel Decoding for Deep Autoregressive Models | 2018/11 | arXiv | | | |
| Fast Inference from Transformers via Speculative Decoding | 2022/11 | arXiv | | link | |
| Accelerating Large Language Model Decoding with Speculative Sampling | 2023/02 | arXiv | | | |
| SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification | 2023/05 | arXiv | | link | |
| Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding | 2024/01 | arXiv | | | |
| Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | 2024/01 | arXiv | | link | |
| EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty | 2024/01 | arXiv | | link | |
| Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | 2024/02 | arXiv | | link | |
| Accelerating Production LLMs with Combined Token/Embedding Speculators | 2024/04 | arXiv | | | |
| Better & Faster Large Language Models via Multi-token Prediction | 2024/04 | arXiv | | | |
| Optimizing Speculative Decoding for Serving Large Language Models Using Goodput | 2024/06 | arXiv | | | |
| Scaling Speculative Decoding with Lookahead Reasoning | 2025/06 | arXiv | | | |

📖 Guided Decoding

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| Robust Text-to-SQL Generation with Execution-Guided Decoding | 2018/07 | arXiv | | | |
| Efficient Guided Generation for Large Language Models | 2023/07 | arXiv | Outlines | | ✅ ✅ |
| XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models | 2024/11 | arXiv | XGrammar | | |
| Pre3: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation | 2025/06 | arXiv | | | |

📖 Long Sequence Processing

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |

📖 Memory Offloading

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache | 2024/01 | arXiv | | | |
| ProMoE: Fast MoE-based LLM Serving using Proactive Caching | 2024/10 | arXiv | | | |

📖 Large Scale Serving

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications | 2025/03 | arXiv | | | |
| Serving Large Language Models on Huawei CloudMatrix384 | 2025/06 | arXiv | | | |

📖 Load Balancing

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection | 2024/11 | arXiv | | | |
| Pro-Prophet: A Systematic Load Balancing Method for Efficient Parallel Training of Large-scale MoE Models | 2024/11 | arXiv | | | |

📖 KVCache Store

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | 2024/05 | arXiv | LMCache | | |
| Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | 2024/07 | arXiv | Mooncake | | |

📖 Disaggregated Architecture

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| Splitwise: Efficient generative LLM inference using phase splitting | 2023/11 | arXiv | splitwise-sim | link | |
| DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | 2024/01 | arXiv | DistServe | | |
| MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism | 2025/04 | arXiv | | link | ✅ ✅ |
| Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding | 2025/07 | arXiv | Step3, StepMesh | link | ✅ ✅ ✅ |
| xDeepServe: Model-as-a-Service on Huawei CloudMatrix384 | 2025/08 | arXiv | | | |

📖 Elasticity and Fault Tolerance

| Title | Date | arXiv | GitHub | Note | Read |
| --- | --- | --- | --- | --- | --- |
| ServerlessLLM: Low-Latency Serverless Inference for Large Language Models | 2024/01 | arXiv | ServerlessLLM | link | ✅ ✅ |
| Expert-as-a-Service: Towards Efficient, Scalable, and Robust Large-scale MoE Serving | 2025/09 | arXiv | | link | ✅ ✅ |
| ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training | 2025/10 | arXiv | | link | ✅ ✅ |
| MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs | 2025/10 | arXiv | | link | ✅ ✅ |
| From Models to Operators: Rethinking Autoscaling Granularity for Large Generative Models | 2025/11 | arXiv | | link | |

📚 Learning Projects

| Project | Category | Author/Organization | About |
| --- | --- | --- | --- |
| llm-action | LLM | @liguodongiot | Shares LLM technical principles and hands-on experience (LLM engineering and application deployment). |
| awesomeMLSys | MLSys | @GPU MODE | An ML Systems onboarding list. |
| InfraTech | MLSys | @CalvinXKY | AI-infra knowledge and code exercises: intros to the PyTorch/vLLM/SGLang frameworks ⚡️, performance acceleration 🚀, LLM fundamentals 🧠, AI hardware and software 🔧, and more. |
| AI-Infra-from-Zero-to-Hero | MLSys | @HuaizhengZhang | 🚀 Awesome System for Machine Learning ⚡️ AI System Papers and Industry Practice. ⚡️ System for Machine Learning, LLM (Large Language Model), GenAI (Generative AI). 🍻 OSDI, NSDI, SIGCOMM, SoCC, MLSys, etc. 🗃️ Llama3, Mistral, etc. 🧑‍💻 Video Tutorials. |
| resource-stream | CUDA | @GPU MODE | GPU programming related news and material links. |
| BasicCUDA | CUDA | @CalvinXKY | A tutorial for CUDA & PyTorch. |

©️ Citation

@misc{cs-self-learning-2023,
  title  = {cs-self-learning},
  url    = {https://github.com/shen-shanshan/cs-self-learning},
  note   = {Open-source software available at https://github.com/shen-shanshan/cs-self-learning},
  author = {shen-shanshan},
  year   = {2023}
}

📜 License

MIT License; find more details here.

⭐ Star History

Star History Chart
