Wan 2.7 represents the cutting-edge evolution in Alibaba Cloud's video generation series, succeeding the professional-grade Wan 2.6.
Building upon the "Reference-to-Video" (R2V) and native audio-visual synchronization of its predecessor, Wan 2.7 pushes the boundaries of cinematic AI, long-form storytelling, and visual consistency.
Wan 2.6 was a major milestone that moved beyond simple clips to professional-grade production:
- Reference-to-Video (R2V): Enabled "clone-level" character consistency by uploading a reference video, maintaining appearance and motion patterns.
- Smart Multi-Shot Storytelling: Automated the generation of multi-angle scenes (panoramic, close-ups, tracking shots) within a single 15-second 1080p clip.
- Native AV Sync: First in the series to offer native lip-syncing and synchronized environmental sound effects (SFX) that match on-screen motion.
- Dual-Subject Interaction: Supported high-fidelity interaction between up to two distinct characters.
Wan 2.7 refines the core architecture of 2.6, focusing on scaling, endurance, and granular control:
While Wan 2.6 capped at 15 seconds, Wan 2.7 is engineered for extended durations, targeting 30 to 60 seconds of continuous, high-fidelity video in a single generation pass without losing temporal coherence.
Beyond 2.6's R2V, Wan 2.7 introduces Persistent Character VAEs, allowing the model to "memorize" character features across an entire project. This enables full-length AI filmmaking with a consistent cast across separate scenes.
Wan 2.7 moves from simple generation to dynamic editing. It supports natural language instructions to modify existing footage (e.g., "Change the lighting to sunset" or "Add a futuristic helmet to the character"), preserving the original motion and structure while altering specific elements.
The updated Wan-VAE v4 significantly improves the simulation of complex physics, including realistic fluid dynamics (water, fire), cloth physics, and high-speed motion blur without the "warping" artifacts seen in earlier diffusion models.
Succeeding the basic SFX of 2.6, Wan 2.7 generates Spatial Binaural Audio that dynamically shifts based on the camera's position and the movement of objects in the scene.
| Feature | Wan 2.6 | Wan 2.7 (Target) |
|---|---|---|
| Max Duration | 15 Seconds | 30 - 60 Seconds |
| Resolution | 1080p @ 24fps | 1080p / 4K Upscaling |
| Consistency | R2V (Reference-based) | Global Persistent Embeddings |
| Audio | Native Lip-Sync & SFX | Full Binaural Spatial Audio |
| Editing | Static Inpainting | Dynamic Instruction-based Editing |
| License | Apache 2.0 (Open Weights) | Apache 2.0 (Open Weights) |
- AI-Driven Cinema: Creating consistent, long-form narrative content with a persistent cast.
- Professional Marketing: Generating 4K-ready advertisements with perfect product and character consistency.
- Social Media Storytelling: Rapid production of multi-shot, audio-synced content for TikTok and YouTube.
- Immersive Education: Visualizing historical or scientific events with spatial audio and realistic physics.
Disclaimer: Wan 2.7 is the latest iteration of the Wan series by Alibaba Cloud. This repository serves as a knowledge hub and documentation for the model's capabilities and implementation.