Plasticity-Aware Mixture of Experts for Learning Under QoE Shifts in Adaptive Video Streaming (IEEE TMM Accepted)
A reinforcement learning framework for adaptive video streaming that leverages Plasticity-Aware Mixture of Experts (PA-MoE) to handle Quality of Experience (QoE) shifts across different content types (documentary, live, news). This project implements various MoE architectures with PPO (Proximal Policy Optimization) to achieve robust bitrate adaptation under changing QoE preferences.
- Multiple MoE Architectures: Supports MLP, MoE, SparseMoE, and Plasticity-Aware MoE (PA-MoE) for both policy and value networks
- QoE Shift Handling: Adapts to different content types with varying QoE preferences (bitrate, rebuffering, smoothness)
- PPO-based Training: Implements PPO algorithm with GAE (Generalized Advantage Estimation) for stable learning
- Comprehensive Evaluation: Includes comparison with baseline methods (BOLA, Buffer-based, Rate-based, RobustMPC, Pensieve, Merina)
- Rich Visualizations: Generates performance plots including CDF, QoE components, bitrate-rebuffer trade-offs, and smoothness analysis
- Plasticity Mechanisms: Implements expert noise injection and dormant unit detection for better adaptation
- Installation
- Quick Start
- Project Structure
- Usage
- Configuration
- Results
- Architecture
- Citation
- License
- Python 3.9
- CUDA-capable GPU (optional, but recommended)
- Conda (recommended) or pip
1. Clone the repository

```bash
git clone https://github.com/tinyzqh/PA-MoE
cd PA-MoE
```

2. Create the conda environment

```bash
conda create -n pamoe python=3.9
conda activate pamoe
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

4. Install OptiVerse (for performance evaluation)

```bash
pip install git+https://github.com/tinyzqh/OptiVerse.git
```
Train a PA-MoE agent with the default configuration (`dmoe`, short for DistillMoE, is the legacy name of the Plasticity-Aware MoE):

```bash
export PYTHONPATH=.
python src/ppo.py --envs_model change --seed 1 --policy_type dmoe --value_type dmoe
```

Run the evaluation pipeline to compare different methods and generate plots:

```bash
python src/plots_perform/pipeline.py
```

This will generate performance plots in the `assert/` directory:

- `cdf_test.pdf`: Cumulative distribution function of QoE
- `qoe_bar_test.pdf`: QoE component breakdown
- `bitrate_rebuf_test.pdf`: Bitrate vs. rebuffering trade-off
- `smo_rebuffer_test.pdf`: Smoothness vs. rebuffering trade-off
- `bitrate_smo_test.pdf`: Bitrate vs. smoothness trade-off
```
PA-MoE/
├── src/
│   ├── ppo.py                    # Main training script
│   ├── configs/
│   │   └── ppo_config.py         # PPO configuration
│   ├── envs/
│   │   ├── StreamingEnv.py       # Adaptive streaming environment
│   │   ├── trace/                # Network trace data
│   │   └── envivio/              # Video chunk size data
│   ├── network/
│   │   ├── mlp/                  # MLP baseline
│   │   ├── moe/                  # Standard MoE
│   │   ├── sparse_moe/           # Sparse MoE
│   │   └── distill_moe/          # PA-MoE (DistillMoE)
│   ├── plots_perform/
│   │   ├── pipeline.py           # Evaluation pipeline
│   │   ├── agents/               # Agent implementations
│   │   │   ├── pamoe_agent.py
│   │   │   ├── ppo_agent.py
│   │   │   └── ...
│   │   └── weights/              # Pre-trained model weights
│   ├── plots/                    # Analysis plots
│   └── utils/                    # Utility functions
├── assert/                       # Generated plots and results
├── train.sh                      # Training script
├── requirements.txt              # Python dependencies
└── README.md
```
The training script supports various configurations:
```bash
python src/ppo.py \
    --envs_model change \
    --seed 1 \
    --policy_type dmoe \
    --value_type dmoe \
    --total_timesteps 2000000 \
    --learning_rate 1e-3 \
    --track
```

Key Parameters:

- `--envs_model`: Environment mode
  - `"normal"`: Fixed content type (documentary)
  - `"change"`: Varying content types (documentary → live → news → ...)
- `--policy_type` / `--value_type`: Network architecture
  - `"mlp"`: Multi-layer perceptron baseline
  - `"moe"`: Standard Mixture of Experts
  - `"smoe"`: Sparse MoE
  - `"dmoe"`: PA-MoE (DistillMoE with plasticity mechanisms)
- `--seed`: Random seed
- `--total_timesteps`: Total training timesteps
- `--learning_rate`: Learning rate
- `--track`: Enable wandb tracking
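The QoE shifts driving the `"change"` mode can be illustrated with the usual linear QoE model, where each content type weights bitrate, rebuffering, and smoothness differently. The weights below are hypothetical illustrations, not the coefficients used in `StreamingEnv.py`:

```python
def qoe_reward(bitrate, last_bitrate, rebuffer_sec, weights):
    """Linear QoE: quality reward minus rebuffering and smoothness penalties.

    weights = (w_quality, w_rebuffer, w_smooth) encodes a content type's
    QoE preference. Values here are illustrative only.
    """
    w_q, w_r, w_s = weights
    return (w_q * bitrate
            - w_r * rebuffer_sec
            - w_s * abs(bitrate - last_bitrate))

# A live-content preference might penalize rebuffering more heavily than a
# documentary preference (hypothetical numbers):
doc_qoe = qoe_reward(3.0, 2.0, 0.5, weights=(1.0, 4.3, 1.0))
live_qoe = qoe_reward(3.0, 2.0, 0.5, weights=(1.0, 8.0, 1.0))
```

The same chunk download thus yields different rewards under different preferences, which is exactly the non-stationarity the agent must adapt to.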
Training Modes:
- Normal mode: Train on a single content type (documentary)
- Change mode: Train with QoE shifts across different content types
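In either mode, training optimizes the standard PPO objective with GAE. As a minimal NumPy sketch of those two pieces (illustrative, not the repo's implementation; coefficient defaults follow `ppo_config.py`):

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one rollout.

    values carries one bootstrap entry: len(values) == len(rewards) + 1.
    """
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * gae_lambda * nonterminal * running
        adv[t] = running
    return adv

def ppo_loss(logp_new, logp_old, adv, v_pred, v_target, entropy,
             clip_coef=0.2, vf_coef=5.0, ent_coef=0.0):
    """Clipped-surrogate PPO objective: policy term + value term - entropy bonus."""
    ratio = np.exp(logp_new - logp_old)
    pg_loss = -np.minimum(
        ratio * adv,
        np.clip(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * adv,
    ).mean()
    v_loss = 0.5 * ((v_pred - v_target) ** 2).mean()
    return pg_loss + vf_coef * v_loss - ent_coef * entropy.mean()

# Tiny worked example: a 3-step rollout ending in a terminal state.
adv = gae(rewards=np.array([1.0, 1.0, 1.0]),
          values=np.array([0.5, 0.5, 0.5, 0.5]),
          dones=np.array([0.0, 0.0, 1.0]))
logp = np.zeros(2)
loss = ppo_loss(logp, logp, np.array([1.0, -1.0]),
                np.array([1.0, 1.0]), np.array([1.0, 1.0]), np.zeros(2))
```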
The evaluation pipeline (src/plots_perform/pipeline.py) compares multiple methods:
- Rule-based: Buffer-based, Rate-based, RobustMPC
- Learning-based: Pensieve (PPO), Merina (Meta-RL), PA-MoE
Results are saved as PDF files in the assert/ directory.
Modify `src/configs/ppo_config.py` to adjust hyperparameters:

```python
@dataclass
class Config:
    # Training
    total_timesteps: int = 2_000_000
    learning_rate: float = 1e-3
    num_steps: int = 2000
    update_epochs: int = 5

    # PPO
    clip_coef: float = 0.2
    vf_coef: float = 5
    ent_coef: float = 0.0
    gamma: float = 0.99
    gae_lambda: float = 0.95

    # Plasticity
    redo_tau: float = 0.5       # Dormant unit threshold
    gradient_tau: float = 0.05  # Gradient threshold
```

PA-MoE demonstrates superior performance in handling QoE shifts:
- Better QoE: Achieves higher average QoE across different content types
- Robust Adaptation: Maintains performance when QoE preferences change
- Efficient Expert Utilization: Activates relevant experts for different content types
The evaluation generates comprehensive visualizations:
- CDF of QoE: Distribution of QoE values across sessions
- QoE Components: Breakdown of bitrate, rebuffering, and smoothness rewards
- Bitrate-Rebuffer Trade-off: Performance in bitrate vs rebuffering space
- Smoothness-Rebuffer Trade-off: Performance in smoothness vs rebuffering space
- Bitrate-Smoothness Trade-off: Performance in bitrate vs smoothness space
PA-MoE extends standard MoE with plasticity mechanisms:
- Noisy Top-K Router: Routes inputs to experts with pseudo-noise for exploration
- Expert Noise Injection: Adds small noise to expert parameters to maintain plasticity
- Dormant Unit Detection: Identifies and reactivates dormant units using ReDo (Recycled Dormant) mechanism
- Gradient-based Adaptation: Monitors gradient information for expert utilization
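The dormant unit detection step can be sketched as a ReDo-style dormancy score: a unit whose activation magnitude, normalized by the layer average, falls below a threshold (`redo_tau` in the config) is flagged for reinitialization. This is a simplified illustration, not the repo's exact criterion:

```python
import numpy as np

def dormant_mask(activations, tau=0.5):
    """Flag dormant units in a layer from a batch of activations.

    activations: array of shape (batch, num_units).
    A unit is dormant when its mean absolute activation, divided by the
    layer-wide average, is below tau. Flagged units get reinitialized.
    """
    score = np.abs(activations).mean(axis=0)  # per-unit activity
    score = score / (score.mean() + 1e-8)     # normalize by layer mean
    return score < tau                        # True => dormant, recycle it

# Unit 0 never fires, units 1 and 2 are active:
acts = np.array([[0.0, 1.0, 2.0],
                 [0.0, 1.0, 2.0]])
mask = dormant_mask(acts, tau=0.5)
```

Recycling flagged units restores capacity that would otherwise be permanently lost after a QoE shift.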
- Router: Noisy Top-K gating network that selects experts
- Experts: Specialized networks for different QoE regimes
- Policy Head: Outputs action probabilities
- Value Head: Estimates state values
- Plasticity Mechanisms: Expert noise injection and dormant unit reactivation enable adaptation to QoE shifts
- Content-Aware Routing: Router learns to select appropriate experts for different content types
- Stable Training: Pseudo-noise in router maintains PPO ratio stability while enabling exploration
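The routing step above can be sketched as noisy top-k gating: perturb the router logits, keep the k highest-scoring experts, and renormalize their weights. This simplified version uses plain Gaussian noise and does not reproduce the repo's pseudo-noise scheme for PPO ratio stability:

```python
import numpy as np

def noisy_topk_gate(logits, noise_std, k=2, rng=None):
    """Select top-k experts from noised router logits.

    Returns the chosen expert indices and their softmax-renormalized
    gate weights. A sketch, not the repo's router implementation.
    """
    rng = rng or np.random.default_rng(0)
    noisy = logits + rng.normal(0.0, noise_std, size=logits.shape)
    topk = np.argsort(noisy)[-k:]              # indices of the k best experts
    w = np.exp(noisy[topk] - noisy[topk].max())  # stable softmax over top-k
    return topk, w / w.sum()

# With zero noise the gate is deterministic: experts 0 and 2 win here.
experts, weights = noisy_topk_gate(np.array([2.0, 0.1, 1.5, -0.3]),
                                   noise_std=0.0)
```

Only the selected experts run forward, so compute stays roughly constant as the expert pool grows.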
If you use this code in your research, please cite:
```bibtex
@article{he2025plasticity,
  title={Plasticity-Aware Mixture of Experts for Learning Under QoE Shifts in Adaptive Video Streaming},
  author={He, Zhiqiang and Liu, Zhi},
  journal={arXiv preprint arXiv:2504.09906},
  year={2025}
}
```

Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- OptiVerse for the video streaming environment
- The open-source RL community for PPO implementations
For questions or issues, please open an issue on GitHub.
Note: This project is part of research on adaptive video streaming with reinforcement learning. For detailed experimental results and analysis, please refer to the paper.



