A framework for dynamically converting Transformer-based models into Mixtures of Experts (MoE) architectures.
The model/ directory contains the core implementation for converting standard transformer models (specifically LLaMA-based models) into dynamic Mixture of Experts architectures.
- `__init__.py`: Exports the main classes and sets up the device (CUDA if available, CPU otherwise).
- main.py: Entry point for the application, with a simple interface to run MoE conversion experiments.
- moe.py: Contains the `MixtureOfExperts` class, a general-purpose MoE implementation with top-k routing.
- expert.py: Implements `ExpertLayer`, the individual expert modules used within MoE layers.
- llama_moe.py: Contains `LLaMaMoEBlock`, a modified LLaMA transformer block with MoE replacing the standard FFN.
- run_experiment.py: Orchestrates experiments for converting standard models to MoE, with layer-by-layer testing.
- utils.py: Utility functions for loading models and other helper functions.
- pyproject.toml & poetry.lock: Package dependency management for the project.
The `MixtureOfExperts` class is the core MoE implementation (see the sketch after this list). It:
- Uses a gating/routing mechanism to direct tokens to the most relevant experts
- Supports top-k routing (can select multiple experts per token)
- Maintains the same interface as a standard feed-forward neural network
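Conceptually, such a layer can look like the sketch below. This is an illustrative sketch only; the class, parameter names, and expert architecture are assumptions, not necessarily what moe.py uses.

```python
# Minimal MoE layer with top-k routing -- illustrative sketch, not the repo's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfExperts(nn.Module):
    """Drop-in FFN replacement: routes each token to its top-k experts."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router produces one logit per expert for every token.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size) -- same interface as a standard FFN.
        gate_logits = self.router(x)                      # (B, S, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)              # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```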
`LLaMaMoEBlock` is a drop-in replacement for standard LLaMA transformer blocks (see the sketch after this list) that:
- Preserves the original attention mechanism
- Replaces the feed-forward network with an MoE layer
- Maintains compatibility with the original model's interface
- Initializes expert weights from the original model's parameters
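The sketch below shows one way such a block can be built, assuming a Hugging Face LlamaDecoderLayer-like layer exposing `self_attn`, `mlp`, `input_layernorm`, and `post_attention_layernorm`; all names and the forward signature are illustrative assumptions, not the repo's actual code.

```python
# Illustrative sketch of wrapping an existing LLaMA decoder layer with an MoE FFN.
import copy

import torch
import torch.nn as nn


class LLaMaMoEBlock(nn.Module):
    """Keeps the original attention path and swaps the FFN for an MoE layer."""

    def __init__(self, original_layer: nn.Module, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Preserve the original attention mechanism and norms unchanged.
        self.self_attn = original_layer.self_attn
        self.input_layernorm = original_layer.input_layernorm
        self.post_attention_layernorm = original_layer.post_attention_layernorm
        self.top_k = top_k
        hidden_size = original_layer.mlp.gate_proj.in_features
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert starts as a copy of the original FFN, so the converted
        # block initially reproduces the source model's behaviour.
        self.experts = nn.ModuleList(
            [copy.deepcopy(original_layer.mlp) for _ in range(num_experts)]
        )

    def forward(self, hidden_states, **attn_kwargs):
        # Attention sub-block, identical to the original layer.
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        hidden_states = self.self_attn(hidden_states, **attn_kwargs)[0]
        hidden_states = residual + hidden_states

        # MoE sub-block in place of the original feed-forward network.
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        weights, indices = self.router(hidden_states).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        moe_out = torch.zeros_like(hidden_states)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e
                if mask.any():
                    moe_out[mask] += (
                        weights[..., slot][mask].unsqueeze(-1) * expert(hidden_states[mask])
                    )
        return (residual + moe_out,)
```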
The run_experiment.py file provides the following (a sketch of the conversion loop follows the list):
- Automatic conversion of standard models to MoE architectures
- Layer-by-layer testing to ensure model stability
- Options to control how many and which layers to convert
- Fallback mechanisms to revert problematic conversions
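A rough sketch of the layer-by-layer convert, test, and revert loop is shown below; the import path, helper names, and stability test are assumptions for illustration, not the module's actual API.

```python
# Illustrative convert -> test -> revert loop over a LLaMA-style decoder stack.
import torch

from model.llama_moe import LLaMaMoEBlock  # assumed import path


def convert_layers(model, tokenizer, num_experts=4, top_k=2, conversion_frequency=4):
    layers = model.model.layers  # LLaMA-style list of decoder layers
    for i in range(0, len(layers), conversion_frequency):
        original = layers[i]
        layers[i] = LLaMaMoEBlock(
            original, num_experts=num_experts, top_k=top_k
        ).to(device=model.device, dtype=model.dtype)
        if not sanity_check(model, tokenizer):
            # Fallback: revert this layer if the converted model misbehaves.
            layers[i] = original
    return model


@torch.no_grad()
def sanity_check(model, tokenizer, prompt="Hello, world"):
    """Crude stability test: the converted model should still produce finite logits."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    return torch.isfinite(model(**inputs).logits).all().item()
```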
To convert a model to use MoE:
```python
from model.run_experiment import run_moe_conversion

model, tokenizer = run_moe_conversion(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    num_experts=4,            # Number of experts per MoE layer
    top_k=2,                  # How many experts to route each token to
    conversion_frequency=4    # Convert every 4th layer
)
```

The framework automatically handles model loading, conversion, testing, and can optionally save the converted model.
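If you want to persist the result yourself, the standard Hugging Face `save_pretrained` calls should work (the output directory name below is just an example):

```python
# Persist the converted model and tokenizer; "tinyllama-moe" is an arbitrary example path.
model.save_pretrained("tinyllama-moe")
tokenizer.save_pretrained("tinyllama-moe")
```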
- Python 3.10 or higher
- Poetry (dependency management)
- Install Poetry (if not already installed):

  ```bash
  curl -sSL https://install.python-poetry.org | python3 -
  ```

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/dynamic-moe.git
  cd dynamic-moe
  ```

- Install dependencies:

  ```bash
  cd model
  poetry install
  ```
It is recommended to run this project in a GPU-accelerated environment. See this guide to launch and SSH into a GPU-accelerated EC2 instance: https://docs.google.com/document/d/1Hky7NQRuBwpyDnrvl4j9ksPi_ORabHdHzkXRHhqhxZg/edit?tab=t.0
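Before launching an experiment, it can be worth confirming that PyTorch can see the GPU:

```python
import torch

# Prints True on a correctly configured GPU instance; the framework falls back to CPU otherwise.
print(torch.cuda.is_available())
```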
To run the default conversion experiment:
```bash
cd model
poetry run python -m main
```

This will:
- Load the TinyLlama-1.1B-Chat model
- Convert layers to MoE architecture
- Run tests to ensure stability
- Save the converted model