@opooladz commented on Oct 4, 2025

No description provided.

opooladz and others added 5 commits October 3, 2025 21:19
…stack

Enhance GRPOTrainer with advanced RL features:
- Add asymmetric clipping, per-token weighting, length shaping
- Support multiple advantage estimators (mean, median, GAE)
- Implement KL/entropy guards and reference model resets
- Add group-normalized advantages and DAPO length shaping
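
A minimal sketch of the group-normalized advantages and asymmetric clipping listed above, using plain jax.numpy; the function names, argument names, and default clip values are illustrative assumptions, not the trainer's actual API:

```python
import jax.numpy as jnp

def group_normalized_advantages(rewards, eps=1e-6):
    """rewards: [num_groups, group_size], one group of sampled completions per prompt.
    Each completion's reward is normalized by its group's mean and std (GRPO-style)."""
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    return (rewards - mean) / (std + eps)

def asymmetric_clipped_objective(log_ratio, advantages, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with separate lower/upper clip ranges ("clip-higher", as in DAPO)."""
    ratio = jnp.exp(log_ratio)
    clipped = jnp.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return jnp.minimum(ratio * advantages, clipped * advantages)
```

The mean/median/GAE estimators and per-token weighting mentioned above would plug in where `advantages` is produced and in how the objective is reduced over tokens.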

Improve DPO metrics logging:
- Add KL divergence, entropy, and logit margin tracking
- Ensure consistent dashboard metrics across trainers

Documentation updates:
- Add RL_ALGORITHMS.md documenting all RL algorithms
- Add INFERENCE_ENGINES.md documenting vInference/vSurge/vWhisper
- Add RL_UPDATES.md with release notes
- Add VINEPPO_IMPLEMENTATION_PLAN.md for future work
- Update docs/trainers/grpo.md with new configuration options

Testing:
- Add tests/trainers/grpo_utils_test.py for utility functions

Add complete image diffusion stack to EasyDeL with 5 architectures, 2 trainers,
and MoE-based scaling following DeepSeek V2 patterns.

## New Modules (8,900+ lines)

### Architectures (5 models)
- **DiT**: Diffusion Transformer with adaptive LayerNorm (879 lines)
- **DiT-MoE**: Sparse MoE DiT with 64 routed + 2 shared experts (1,116 lines)
- **VAE**: Variational autoencoder for latent diffusion (1,189 lines)
- **UNet 2D**: Stable Diffusion UNet with cross-attention (2,186 lines)
- **Flux**: State-of-the-art transformer with RoPE (1,353 lines)

### Trainers (2 implementations)
- **Image Diffusion Trainer**: Rectified flow with velocity prediction (442 lines)
- **Stable Diffusion Trainer**: Full SD pipeline with VAE + text (1,343 lines)

## Key Features

### DiT-MoE (New!)
- Mixture of Experts following DeepSeek V2 architecture
- 64 routed experts + 2 shared experts (configurable)
- Top-k routing without auxiliary losses (see the sketch after this list)
- Expert parallelism support via ExpertColumnWiseAlt sharding
- 3x the parameters at the same compute as a dense DiT
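
As a rough illustration of the top-k routing plus always-on shared expert described in this list (softmax scoring per the DeepSeek V2 pattern; experts are simplified to single linear maps, and all names are assumptions rather than the module's real code):

```python
import jax
import jax.numpy as jnp

def moe_layer(x, router_w, expert_w, shared_w, top_k=6):
    """x: [n, d]; router_w: [d, E]; expert_w: [E, d, d]; shared_w: [d, d].
    Softmax scoring and top-k selection with no auxiliary load-balancing loss."""
    scores = jax.nn.softmax(x @ router_w, axis=-1)          # [n, E]
    top_scores, top_idx = jax.lax.top_k(scores, top_k)      # [n, top_k]
    w = expert_w[top_idx]                                   # gather chosen experts: [n, top_k, d, d]
    routed = jnp.einsum("nd,nkde->nke", x, w)               # apply each chosen expert
    routed = (top_scores[..., None] * routed).sum(axis=1)   # score-weighted combine
    return routed + x @ shared_w                            # shared expert is always active
```

Only the `top_k` gathered experts do work for a given token, which is where the "3x the parameters at the same compute" trade-off comes from.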

### Rectified Flow
- Velocity prediction formulation: v = data - noise (sketched after this list)
- Straight ODE paths for fast sampling
- Min-SNR gamma weighting (γ=5.0) for training stability
- Compatible with DDPM/DDIM schedulers
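
A minimal sketch of the rectified-flow objective with Min-SNR weighting described in this list. It follows the commit's `v = data - noise` convention (so the interpolation runs from noise at t=0 to data at t=1); the model call signature and the exact SNR normalization are assumptions:

```python
import jax
import jax.numpy as jnp

def rectified_flow_loss(model, params, data, key, gamma=5.0):
    """x_t = t * data + (1 - t) * noise, target velocity v = data - noise,
    MSE weighted by min(SNR, gamma) / SNR (Min-SNR-gamma)."""
    t_key, n_key = jax.random.split(key)
    t = jax.random.uniform(t_key, (data.shape[0],))
    noise = jax.random.normal(n_key, data.shape)
    t_b = t.reshape(-1, *([1] * (data.ndim - 1)))            # broadcast t over image dims
    x_t = t_b * data + (1.0 - t_b) * noise
    v_target = data - noise
    v_pred = model(params, x_t, t)                           # hypothetical model call
    snr = (t_b / jnp.maximum(1.0 - t_b, 1e-5)) ** 2          # SNR of the linear path
    weight = jnp.minimum(snr, gamma) / jnp.maximum(snr, 1e-8)
    return jnp.mean(weight * (v_pred - v_target) ** 2)
```

Because the velocity target is constant along the straight interpolation path, a few large Euler steps along `v` already give usable samples, which is the "straight ODE paths for fast sampling" point above.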

### Production Ready
- Full Flax nnx implementation with EasyDeLBaseModule
- @register_module and @register_config decorators
- Partition rules for distributed training
- Gradient checkpointing support

## Documentation
- DIT_MOE_README.md: Complete MoE-DiT guide (524 lines)
- DIFFUSION_COMPLETE_SUMMARY.md: Architecture overview (462 lines)
- IMAGE_DIFFUSION_README.md: DiT training guide (369 lines)
- examples/train_image_diffusion_dit.py: Training example (147 lines)

## Registry Updates
- easydel/modules/__init__.py: Added dit, dit_moe, flux, unet2d, vae
- easydel/trainers/__init__.py: Added image_diffusion and stable_diffusion trainers

Total: 10,177 lines added across 32 files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Upgrade DiT-MoE to use DeepSeek V3's superior MoE design with major improvements:

## V3 Improvements over V2

### Expert Scaling (4x more experts)
- **256 routed experts** (vs 64 in V2)
- **1 shared expert** (vs 2 in V2) - more capacity in routed experts
- **8 experts per token** (vs 6 in V2) for better quality

### Routing Innovations
- **Sigmoid scoring** (vs softmax in V2) for better expert utilization
- **Token-choice routing** (`noaux_tc`) - each token picks its top experts without auxiliary balancing losses (see the sketch after this list)
- **Group-limited routing**: 8 expert groups with top-4 selection
- **Higher scaling factor**: 2.5 (vs 1.0) for stronger expert contributions
- **Normalized top-k probabilities** for balanced load
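
A rough sketch of the sigmoid, group-limited, top-k selection these bullets describe (following the DeepSeek V3 recipe; the tensor names are assumptions, and the score-correction bias V3 uses for load balancing is omitted for brevity):

```python
import jax
import jax.numpy as jnp

def v3_route(router_logits, n_group=8, topk_group=4, top_k=8, scale=2.5):
    """router_logits: [n_tokens, 256]. Returns chosen expert ids and their weights."""
    scores = jax.nn.sigmoid(router_logits)                      # sigmoid scoring, not softmax
    n, e = scores.shape
    grouped = scores.reshape(n, n_group, e // n_group)          # [n, 8, 32]
    # Rank groups by the sum of their top-2 expert scores, keep the best `topk_group` groups.
    group_scores = jax.lax.top_k(grouped, 2)[0].sum(axis=-1)    # [n, 8]
    _, keep = jax.lax.top_k(group_scores, topk_group)           # [n, 4]
    group_mask = jnp.zeros_like(group_scores).at[
        jnp.arange(n)[:, None], keep].set(1.0)
    masked = (grouped * group_mask[..., None]).reshape(n, e)    # zero out dropped groups
    top_scores, top_idx = jax.lax.top_k(masked, top_k)          # 8 routed experts per token
    weights = top_scores / (top_scores.sum(axis=-1, keepdims=True) + 1e-20)
    return top_idx, scale * weights                             # normalized, scaled by 2.5
```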

### Performance Impact
- **~3.5% of experts active**: only 9/257 experts used per token (vs 12.1% in V2)
- **Better load balancing** through group-limited routing
- **No auxiliary losses** - V3's natural balance eliminates the need for router losses

## Changes

- easydel/modules/dit_moe/dit_moe_configuration.py: Update defaults to V3
- easydel/modules/dit_moe/modeling_dit_moe.py: Add sigmoid scoring + noaux_tc routing
- DIT_MOE_README.md: Update documentation to reflect V3 architecture

Total experts: 1 shared + 256 routed = 257 experts
Active per token: 1 shared + 8 routed = 9 experts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Triton (from eformer) only has wheels for Linux x86_64. Use platform markers
to auto-detect and conditionally install Triton based on the OS.

## Auto-Detection

```toml
override-dependencies = [
  "triton>=3.0.0; platform_system == 'Linux' and platform_machine == 'x86_64'",
]
```

## Behavior

- **Linux x86_64** (CUDA GPUs) → ✅ Triton installed
- **macOS** (Apple Silicon/Intel) → ⏭️  Triton skipped
- **Windows** → ⏭️  Triton skipped
- **Linux ARM** (Raspberry Pi, etc.) → ⏭️  Triton skipped
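
To check why the table above falls out of that single marker string, you can evaluate it on any machine with the `packaging` library (illustrative snippet, not part of the repo):

```python
from packaging.markers import Marker

marker = Marker("platform_system == 'Linux' and platform_machine == 'x86_64'")
# True on Linux/x86_64, False on macOS, Windows, and Linux ARM -
# the same condition uv uses to decide whether to install Triton.
print(marker.evaluate())
```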

## Installation

Just run `uv sync` on any platform - it auto-detects:
```bash
uv sync  # Works on Mac, Linux, Windows
```

No manual platform selection needed!

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Add comprehensive image diffusion documentation showcasing new capabilities:

## Image Diffusion Section

### DiT-MoE Example
- Training with 256 experts (DeepSeek V3 architecture)
- Rectified flow with velocity prediction
- Complete configuration example
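
For orientation only, a configuration along these lines might look like the sketch below; every class and field name here is hypothetical and should be checked against `easydel/modules/dit_moe/dit_moe_configuration.py`:

```python
# Hypothetical names throughout - the real fields live in dit_moe_configuration.py.
from easydel.modules.dit_moe import DiTMoEConfig  # assumed import path

config = DiTMoEConfig(
    n_routed_experts=256,        # DeepSeek V3-scale expert count
    n_shared_experts=1,
    num_experts_per_tok=8,       # 8 routed experts active per token
    n_group=8,                   # group-limited routing: 8 groups, top-4 kept
    topk_group=4,
    routed_scaling_factor=2.5,
    scoring_func="sigmoid",      # sigmoid scoring
    topk_method="noaux_tc",      # aux-loss-free token-choice routing
)
```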

### Stable Diffusion Example
- Text-to-image training with frozen CLIP
- VAE + UNet2D pipeline
- SNR weighting configuration

### Supported Architectures
- DiT: Patch-based transformer with adaptive LayerNorm
- DiT-MoE: Sparse MoE (256 routed experts, ~3.5% active per token)
- UNet2D: Classic SD with cross-attention
- Flux: State-of-the-art with RoPE
- VAE: Latent encoder/decoder (SD 1.x/2.x/SDXL)

### Key Features
- Rectified Flow with straight ODE paths
- Min-SNR weighting (γ=5.0) for stability
- Expert parallelism for distributed training
- Mixed precision (bfloat16/float16)

## Key Features Updates
- Listed 55+ models by category (LLMs, SSMs, Vision, Multimodal, MoE)
- Added Image Diffusion and Stable Diffusion trainers
- Highlighted 12 DPO algorithms in trainer list

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>