# Feature/image diffusion #230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open: opooladz wants to merge 5 commits into `erfanzar:main` from `opooladz:feature/image-diffusion`
…stack

Enhance GRPOTrainer with advanced RL features:
- Add asymmetric clipping, per-token weighting, and length shaping
- Support multiple advantage estimators (mean, median, GAE)
- Implement KL/entropy guards and reference-model resets
- Add group-normalized advantages and DAPO length shaping

Improve DPO metrics logging:
- Add KL divergence, entropy, and logit-margin tracking
- Ensure consistent dashboard metrics across trainers

Documentation updates:
- Add RL_ALGORITHMS.md documenting all RL algorithms
- Add INFERENCE_ENGINES.md documenting vInference/vSurge/vWhisper
- Add RL_UPDATES.md with release notes
- Add VINEPPO_IMPLEMENTATION_PLAN.md for future work
- Update docs/trainers/grpo.md with new configuration options

Testing:
- Add tests/trainers/grpo_utils_test.py for utility functions
Add a complete image diffusion stack to EasyDeL with 4 architectures, 2 trainers, and MoE-based scaling following DeepSeek V2 patterns.

## New Modules (8,900+ lines)

### Architectures
- **DiT**: Diffusion Transformer with adaptive LayerNorm (879 lines)
- **DiT-MoE**: Sparse MoE DiT with 64 routed + 2 shared experts (1,116 lines)
- **VAE**: Variational autoencoder for latent diffusion (1,189 lines)
- **UNet 2D**: Stable Diffusion UNet with cross-attention (2,186 lines)
- **Flux**: State-of-the-art transformer with RoPE (1,353 lines)

### Trainers (2 implementations)
- **Image Diffusion Trainer**: Rectified flow with velocity prediction (442 lines)
- **Stable Diffusion Trainer**: Full SD pipeline with VAE + text encoder (1,343 lines)

## Key Features

### DiT-MoE (New!)
- Mixture of Experts following the DeepSeek V2 architecture
- 64 routed experts + 2 shared experts (configurable)
- Top-k routing without auxiliary losses
- Expert-parallelism support via ExpertColumnWiseAlt sharding
- 3x the parameters at the same compute as dense DiT

### Rectified Flow
- Velocity-prediction formulation: v = data - noise
- Straight ODE paths for fast sampling
- Min-SNR gamma weighting (γ=5.0) for training stability
- Compatible with DDPM/DDIM schedulers

### Production Ready
- Full Flax nnx implementation with EasyDeLBaseModule
- @register_module and @register_config decorators
- Partition rules for distributed training
- Gradient-checkpointing support

## Documentation
- DIT_MOE_README.md: Complete MoE-DiT guide (524 lines)
- DIFFUSION_COMPLETE_SUMMARY.md: Architecture overview (462 lines)
- IMAGE_DIFFUSION_README.md: DiT training guide (369 lines)
- examples/train_image_diffusion_dit.py: Training example (147 lines)

## Registry Updates
- easydel/modules/__init__.py: Added dit, dit_moe, flux, unet2d, vae
- easydel/trainers/__init__.py: Added image_diffusion and stable_diffusion trainers

Total: 10,177 lines added across 32 files

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
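The rectified-flow objective summarized above (velocity prediction with v = data − noise along straight ODE paths) can be sketched in a few lines of NumPy. This is a minimal illustration of the formulation, not EasyDeL's trainer code; `predict_velocity` is a hypothetical stand-in for the model:

```python
import numpy as np

def rectified_flow_loss(data, noise, t, predict_velocity):
    """One rectified-flow training step: interpolate linearly between
    noise (t=0) and data (t=1), then regress the model onto the
    constant velocity v = data - noise stated above."""
    t = t.reshape(-1, 1)                     # broadcast time over features
    x_t = (1.0 - t) * noise + t * data       # straight-line (ODE) path
    v_target = data - noise                  # constant along the path
    v_pred = predict_velocity(x_t, t)
    return np.mean((v_pred - v_target) ** 2)

# Toy check: a predictor that returns the true velocity has zero loss.
rng = np.random.default_rng(0)
data, noise = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
t = rng.uniform(size=(4,))
print(rectified_flow_loss(data, noise, t, lambda x, tt: data - noise))  # 0.0
```

Because the velocity target is constant along the whole path, sampling can take large straight steps, which is what makes this formulation fast at inference time.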
Upgrade DiT-MoE to use DeepSeek V3's superior MoE design, with major improvements:

## V3 Improvements over V2

### Expert Scaling (4x more experts)
- **256 routed experts** (vs 64 in V2)
- **1 shared expert** (vs 2 in V2) - more capacity in routed experts
- **8 experts per token** (vs 6 in V2) for better quality

### Routing Innovations
- **Sigmoid scoring** (vs softmax in V2) for better expert utilization
- **Token-choice routing** (`noaux_tc`) - tokens choose experts naturally
- **Group-limited routing**: 8 expert groups with top-4 selection
- **Higher scaling factor**: 2.5 (vs 1.0) for stronger expert contributions
- **Normalized top-k probabilities** for balanced load

### Performance Impact
- **3.1% sparsity**: only 9/257 experts active (vs 12.1% in V2)
- **Better load balancing** through group-limited routing
- **No auxiliary losses** - V3's natural balance eliminates the need for router losses

## Changes
- easydel/modules/dit_moe/dit_moe_configuration.py: Update defaults to V3
- easydel/modules/dit_moe/modeling_dit_moe.py: Add sigmoid scoring + noaux_tc routing
- DIT_MOE_README.md: Update documentation to reflect the V3 architecture

Total experts: 1 shared + 256 routed = 257
Active per token: 1 shared + 8 routed = 9
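The routing innovations listed above can be illustrated with a per-token NumPy sketch: sigmoid scoring, group-limited selection, and normalized top-k weights with the 2.5 scaling factor. The group-scoring heuristic (sum of each group's top-2 expert scores) follows DeepSeek V3's published design; the function name and exact mechanics here are illustrative, not EasyDeL's implementation:

```python
import numpy as np

def group_limited_topk(logits, n_groups=8, topk_groups=4, top_k=8, scale=2.5):
    """Route one token: sigmoid scores (no softmax), keep only the best
    expert groups, pick top-k experts inside them, then normalize and
    scale the routing weights. No auxiliary load-balancing loss."""
    n_experts = logits.shape[-1]
    group_size = n_experts // n_groups
    scores = 1.0 / (1.0 + np.exp(-logits))            # sigmoid scoring
    groups = scores.reshape(n_groups, group_size)
    # rank groups by the sum of their top-2 expert scores
    group_scores = np.sort(groups, axis=-1)[:, -2:].sum(axis=-1)
    keep = np.argsort(group_scores)[-topk_groups:]    # best groups survive
    mask = np.zeros(n_groups, dtype=bool)
    mask[keep] = True
    masked = np.where(np.repeat(mask, group_size), scores, -np.inf)
    experts = np.argsort(masked)[-top_k:]             # top-k inside kept groups
    weights = scores[experts]
    weights = scale * weights / weights.sum()         # normalize, then scale
    return experts, weights

experts, weights = group_limited_topk(np.random.default_rng(0).normal(size=256))
print(experts.shape, weights.sum())  # 8 routed experts, weights summing to ~2.5
```

Restricting selection to a few groups bounds how many devices a token's experts can span, which is what makes this routing friendly to expert parallelism.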
Triton (from eformer) only ships wheels for Linux x86_64. Use platform markers to auto-detect the OS and install Triton conditionally.

## Auto-Detection

```toml
override-dependencies = [
  "triton>=3.0.0; platform_system == 'Linux' and platform_machine == 'x86_64'",
]
```

## Behavior
- **Linux x86_64** (CUDA GPUs) → ✅ Triton installed
- **macOS** (Apple Silicon/Intel) → ⏭️ Triton skipped
- **Windows** → ⏭️ Triton skipped
- **Linux ARM** (Raspberry Pi, etc.) → ⏭️ Triton skipped

## Installation

Just run `uv sync` on any platform - it auto-detects:

```bash
uv sync  # works on macOS, Linux, and Windows
```

No manual platform selection needed.
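For reference, the same condition the PEP 508 marker above expresses can be checked at runtime with the standard-library `platform` module; the helper name here is hypothetical, just to show what the marker evaluates to on each OS:

```python
import platform

def triton_wheel_available(system=None, machine=None):
    """Mirror of the dependency marker above: Triton wheels exist only
    for Linux on x86_64, so the dependency applies only there."""
    system = system or platform.system()    # e.g. 'Linux', 'Darwin', 'Windows'
    machine = machine or platform.machine() # e.g. 'x86_64', 'arm64'
    return system == "Linux" and machine == "x86_64"

print(triton_wheel_available("Linux", "x86_64"))   # True  -> installed
print(triton_wheel_available("Darwin", "arm64"))   # False -> skipped
```

`platform_system` and `platform_machine` in the marker are populated from exactly these `platform` calls, which is why the resolver can decide per-OS without any manual flags.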
Add comprehensive image diffusion documentation showcasing the new capabilities.

## Image Diffusion Section

### DiT-MoE Example
- Training with 256 experts (DeepSeek V3 architecture)
- Rectified flow with velocity prediction
- Complete configuration example

### Stable Diffusion Example
- Text-to-image training with frozen CLIP
- VAE + UNet2D pipeline
- SNR-weighting configuration

### Supported Architectures
- DiT: Patch-based transformer with adaptive LayerNorm
- DiT-MoE: Sparse MoE (256 experts, 3.1% sparsity)
- UNet2D: Classic SD with cross-attention
- Flux: State-of-the-art with RoPE
- VAE: Latent encoder/decoder (SD 1.x/2.x/SDXL)

### Key Features
- Rectified flow with straight ODE paths
- Min-SNR weighting (γ=5.0) for stability
- Expert parallelism for distributed training
- Mixed precision (bfloat16/float16)

## Key Features Updates
- Listed 55+ models by category (LLMs, SSMs, Vision, Multimodal, MoE)
- Added Image Diffusion and Stable Diffusion trainers
- Highlighted 12 DPO algorithms in the trainer list
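The Min-SNR weighting (γ=5.0) mentioned above clamps the per-timestep signal-to-noise ratio at γ so that easy, high-SNR timesteps stop dominating the loss. A minimal sketch, shown in the common epsilon-prediction form (conventions differ slightly for velocity prediction, and this is not the trainer's exact code):

```python
import numpy as np

def min_snr_weight(snr, gamma=5.0):
    """Min-SNR-gamma loss weight: min(SNR, gamma) / SNR.
    Timesteps with SNR <= gamma keep weight 1.0; higher-SNR
    (easier) timesteps are down-weighted proportionally.
    (The velocity-prediction variant divides by snr + 1 instead.)"""
    return np.minimum(snr, gamma) / snr

snr = np.array([0.1, 1.0, 5.0, 50.0])
print(min_snr_weight(snr))  # only the SNR=50 step is down-weighted, to 0.1
```

The effect is a flatter effective loss landscape across timesteps, which is the training-stability benefit the feature lists above refer to.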