Notes and practical notebooks from the MIT 6.5940 (Fall 2023) lectures: TinyML and Efficient Deep Learning Computing.
This course introduces efficient deep learning computing techniques that enable powerful deep learning applications on resource-constrained devices. The main focus is on achieving maximal performance with minimal resource consumption.
Upon completion of this course, you will be able to:
- Shrink and Accelerate Models: Master techniques like Pruning, Quantization (INT8/INT4), and Knowledge Distillation to dramatically reduce model size and inference latency (a pruning-and-quantization sketch follows this list).
- Design Efficient Architectures: Utilize Neural Architecture Search (NAS), specifically Once-for-All (OFA), to automatically design hardware-aware networks.
- Master LLM Efficiency: Apply Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA and QLoRA for efficient adaptation of multi-billion parameter models.
- Optimize Distributed Systems: Implement Data, Pipeline, and Tensor Parallelism for efficient training of models that exceed single-GPU memory.
- Deploy to the Edge (TinyML): Design models and system software (MCUNet, TinyEngine) capable of running complex AI on microcontrollers with Kilobytes of RAM.
- Explore Future Computing: Understand the fundamentals of Quantum Machine Learning (QML) and implement Noise Mitigation techniques for current NISQ hardware.
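As a small taste of the first bullet, here is a minimal sketch of magnitude pruning followed by post-training INT8 dynamic quantization using PyTorch's built-in utilities; the toy model and the 50% sparsity ratio are illustrative assumptions, not the course's TinyEngine pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for a real network (hypothetical, for illustration only).
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Magnitude pruning: zero out the 50% smallest-magnitude weights per linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the sparsity into the weight tensor

# Post-training dynamic quantization: weights stored in INT8,
# activations quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 784)
print(quantized(x).shape)  # torch.Size([1, 10])
```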
Prerequisites:
- Programming: Strong proficiency in Python 3.
- Frameworks: Experience with PyTorch (the primary framework) or TensorFlow.
- Math: Comfort with Linear Algebra, Calculus, and Probability.
- Deep Learning: Familiarity with standard deep learning concepts (CNNs, RNNs, basic optimizers).
| Lecture | Topic (notes) | Slide | Notebook | Reference |
|---|---|---|---|---|
| L1 | Introduction | Slides | — | Video |
| L2 | Basics of Deep Learning | Slides | L02_NN_Basics.ipynb | Video |
| Lecture | Topic (notes) | Slide | Notebook | Reference |
|---|---|---|---|---|
| L12 | Transformer and LLM | Slides | — | Video |
| L13 | Efficient LLM Deployment | Slides | — | Video |
| L14 | LLM Post Training | Slides | — | Video |
| L15 | Long Context LLM | Slides | — | Video |
| L16 | Vision Transformer | Slides | L16_LLM_QLoRA_Finetuning.ipynb | Video |
| L17 | GAN, Video, and Point Cloud | Slides | — | Video |
| L18 | Diffusion Model | Slides | — | Video |
| Lecture | Topic (notes) | Slide | Notebook | Reference |
|---|---|---|---|---|
| L19 | Distributed Training (Part I) | Slides | — | Video |
| L20 | Distributed Training (Part II) | Slides | — | Video |
| L21 | On-Device Training and Transfer Learning | Slides | — | Video |
| Lecture | Topic (notes) | Slide | Notebook | Reference |
|---|---|---|---|---|
| L22 | Course Summary + Quantum ML I | Slides | — | Video |
| L23 | Quantum Machine Learning II | Slides | L23_QML_Noise_Mitigation.ipynb | Video |
| L24 | Final Project Presentation | Slides | — | Video |
| L25 | Final Project Presentation | Slides | — | Video |
| L26 | Final Project Presentation | Slides | — | Video |
All lab exercises are designed to provide hands-on experience with real-world frameworks:
- LLM Deployment: Hands-on experience deploying and running QLoRA-tuned LLMs (e.g., Llama-2) directly on a local GPU or CPU (a QLoRA fine-tuning sketch follows this list).
- TinyML: Using the TinyEngine and TensorFlow Lite Micro frameworks for model deployment in simulated microcontroller environments.
- QML: Using Qiskit and PennyLane to build, train, and mitigate noise in variational quantum circuits (a small PennyLane sketch also follows).
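For a sense of what the LLM lab involves, below is a minimal QLoRA-style sketch using the Hugging Face transformers, peft, and bitsandbytes libraries; the model name, rank, and target modules are placeholder assumptions, not the lab's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

# Load the frozen base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config
)

# Attach small trainable LoRA adapters; only these are updated during fine-tuning.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, as in the LoRA paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```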
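And for the QML lab, a minimal PennyLane sketch of building and training a variational circuit on a noiseless simulator; the two-qubit ansatz is an illustrative assumption, and noise mitigation itself is covered in the L23 notebook.

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def circuit(params):
    # Tiny variational ansatz: parameterized rotations plus one entangling gate.
    qml.RY(params[0], wires=0)
    qml.RY(params[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

params = np.array([0.1, 0.2], requires_grad=True)
opt = qml.GradientDescentOptimizer(stepsize=0.3)
for _ in range(50):
    params = opt.step(circuit, params)  # gradient descent on the expectation value
print(circuit(params))  # approaches -1 as <Z_0> is minimized
```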
For more advanced work, the course provides a set of state-of-the-art research challenges in efficient ML to explore.
- Goal: Address the challenge of efficient video analysis by leveraging the Temporal Shift Module (TSM), which captures temporal relationships without adding computational cost.
- Description: TSM shifts part of the channels along the temporal dimension, enabling information exchange among neighboring frames. Projects could involve changing the backbone (e.g., from MobileNetV2) or applying TSM to a new video task such as fall detection; a sketch of the shift operation follows.
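To make the idea concrete, here is a minimal sketch of the core shift operation, assuming a (batch, frames, channels, H, W) activation layout and the 1/8 shift fraction used in the TSM paper.

```python
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels along the temporal axis.

    x has shape (N, T, C, H, W). The first C//fold_div channels are shifted
    one frame backward, the next C//fold_div one frame forward, and the rest
    stay in place. The shift adds zero FLOPs (it only moves memory), yet
    lets every frame exchange information with its neighbors.
    """
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift left: frame t sees frame t+1
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift right: frame t sees frame t-1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # untouched channels
    return out

video = torch.randn(2, 8, 64, 56, 56)  # (batch, frames, channels, H, W)
print(temporal_shift(video).shape)     # torch.Size([2, 8, 64, 56, 56])
```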
- Goal: Accelerate image editing in deep generative models by avoiding re-synthesis of unedited regions.
- Description: SIGE (Sparse Inference GEnerator) is a sparse engine that caches and reuses feature maps from the original image so that only the edited regions are regenerated. The project focuses on integrating SIGE with Stable Diffusion XL (SDXL) to assess, and potentially achieve, more significant speedups; a sketch of the dirty-tile bookkeeping follows.
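SIGE's engine itself is more involved, but the core bookkeeping can be illustrated: find which tiles the edit actually touched, so only those need re-synthesis while cached features cover the rest. A minimal sketch, with the tile size and threshold as assumed parameters:

```python
import torch

def edited_tile_mask(original: torch.Tensor, edited: torch.Tensor,
                     tile: int = 16, threshold: float = 1e-3) -> torch.Tensor:
    """Return a boolean (H//tile, W//tile) mask of tiles changed by an edit.

    Only True tiles need re-synthesis; False tiles can reuse feature maps
    cached from the original image, which is the idea behind SIGE.
    """
    diff = (original - edited).abs().amax(dim=0)  # per-pixel change, (H, W)
    h, w = diff.shape
    tiles = diff.reshape(h // tile, tile, w // tile, tile).amax(dim=(1, 3))
    return tiles > threshold

orig = torch.rand(3, 64, 64)
edit = orig.clone()
edit[:, 20:30, 20:30] += 0.5  # a small, local user edit
mask = edited_tile_mask(orig, edit)
print(f"{mask.sum().item()} of {mask.numel()} tiles need recomputation")
```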
- Goal: Achieve high-throughput, real-time serving of low-precision quantized LLMs (such as INT4) in cloud-based settings.
- Description: The project centers on implementing an online, real-time serving system with the QServe library, which uses the QoQ (W4A8KV4) quantization algorithm. The final objective is an online Gradio demo that serves these highly efficient quantized LLMs; a minimal demo sketch follows.
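As a starting point for the demo part, a minimal Gradio sketch; the generate function is a placeholder standing in for a call into the QServe runtime:

```python
import gradio as gr

def generate(message: str, history) -> str:
    # Placeholder: the real project would forward the prompt to a QServe
    # server hosting a W4A8KV4-quantized LLM and stream back its tokens.
    return f"(quantized-LLM reply to: {message})"

demo = gr.ChatInterface(fn=generate, title="INT4 LLM serving demo")
demo.launch()
```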
Full documentation and project details can be found here.
- Course YouTube Series: EfficientML.ai Course | 2023 Fall | MIT 6.5940
- Course Slides
- Final project list (2023-2024): EfficientML.ai Project Ideas
Special thanks to:
- Professor Song Han (MIT/HAN Lab) for his tremendous effort and passion in developing the EfficientML.ai framework and for making this cutting-edge research accessible to everyone.
- Yifan Lu for his dedication to making the course homework and lab materials publicly available to the community (All Homeworks Labs Accessible).