
🔖 Awesome-VLA-Post-Training

Awesome-VLA-Post-Training is a continuously updated collection of cutting-edge resources on the post-training of vision-language-action (VLA) systems. As embodied AI grows rapidly, this repository serves as a centralized hub for research updates, practical code, and implementation insights. Our goal is to enhance the ability of VLA agents to perceive, reason, and act within physical environments. Key focus areas include:

  • 🌏 Enhancing environmental perception
  • 🧠 Improving embodiment awareness
  • 📝 Deepening task comprehension and generalization
  • 🔧 Integrating and tuning multiple components

We welcome contributions from researchers and practitioners passionate about advancing VLA systems. Join us in building a structured, high-quality resource for the community!

  • [2025-06] Our paper, “Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends,” is now publicly available. (Paper)
  • [2025-10] 🔥 We released v2, adding 50+ new papers and benchmark comparisons.

⭐ Notable Works

This is a curated selection of influential papers, benchmarks, and projects that have made significant contributions to the field of VLA systems. These works provide foundational insights and state-of-the-art methods that inform current research directions.

  • [2022-12] RT-1: Robotics Transformer for real-world control at scale. (Paper, Website, Code)

  • [2023-07] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. (Paper, Website)

  • [2024-03] 3D-VLA: A 3D Vision-Language-Action Generative World Model. (Paper, Website, Code)

  • [2024-05] Octo: An Open-Source Generalist Robot Policy. (Paper, Website, Code)

  • [2024-06] OpenVLA: An Open-Source Vision-Language-Action Model. (Paper, Website, Code)

  • [2024-06] RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation. (Paper, Website, Code)

  • [2024-10] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation. (Paper, Website, Code)

  • [2024-10] π0: A Vision-Language-Action Flow Model for General Robot Control. (Paper, Website, Code)

  • [2024-10] GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. (Paper, Website)

  • [2024-11] CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation. (Paper, Website, Code)

  • [2025-03] Gemini Robotics: Bringing AI into the Physical World. (Paper, Website)

  • [2025-03] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. (Paper, Website, Code)

  • [2025-03] AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems. (Paper, Website, Code)

  • [2025-04] π0.5: a Vision-Language-Action Model with Open-World Generalization. (Paper, Website)

  • [2025-05] UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. (Paper, Code)

  • [2025-05] GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data. (Paper, Website, Code)

  • [2025-06] SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. (Paper, Code)

  • [2025-07] GR-3 Technical Report. (Paper, Website)


📐 Benchmark

Benchmark results of several validated vision-language-action models on LIBERO and CALVIN, two widely used simulation environments, are summarized below.

LIBERO

CALVIN


🌏 Enhancing Environmental Perception

This section explores methods that improve an agent’s ability to perceive and interpret its environment. It includes affordance-guided learning, which enables agents to understand actionable properties of objects; enhanced encoders tailored for manipulation tasks, allowing more precise feature extraction; and improved representation learning, which helps models build richer and more structured environmental understanding for downstream tasks.
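
As a concrete illustration of the affordance-guided flavor of these methods, the sketch below fuses a predicted per-pixel affordance map with visual features before they reach a policy head. It is a minimal toy example in PyTorch; the module sizes, fusion scheme, and class name are our own assumptions rather than any listed paper's architecture.

```python
# Minimal sketch of affordance-guided conditioning (illustrative only; the
# module names and shapes below are assumptions, not a specific paper's design).
import torch
import torch.nn as nn

class AffordanceConditionedEncoder(nn.Module):
    """Fuses a predicted affordance map with visual features before the policy head."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=3, padding=1)  # stand-in visual encoder
        self.affordance_head = nn.Conv2d(feat_dim, 1, kernel_size=1)      # per-pixel actionable-region logits
        self.fusion = nn.Conv2d(feat_dim + 1, feat_dim, kernel_size=1)    # concat-and-project fusion

    def forward(self, rgb: torch.Tensor):
        feats = torch.relu(self.backbone(rgb))
        affordance = torch.sigmoid(self.affordance_head(feats))           # [B, 1, H, W] affordance map
        fused = torch.relu(self.fusion(torch.cat([feats, affordance], dim=1)))
        return fused, affordance

# Usage: the fused features feed the downstream action head; the affordance map
# can additionally be supervised when actionable-region labels are available.
encoder = AffordanceConditionedEncoder()
fused, affordance = encoder(torch.randn(2, 3, 128, 128))
print(fused.shape, affordance.shape)  # [2, 64, 128, 128], [2, 1, 128, 128]
```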

Affordance-Guided Learning

  • [2024-01] Object-Centric Instruction Augmentation for Robotic Manipulation. (Paper)

  • [2024-02] RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis. (Paper, Website)

  • [2024-03] RT-H: Action Hierarchies Using Language. (Paper, Website)

  • [2024-06] A3VLM: Actionable Articulation-Aware Vision Language Model. (Paper, Code)

  • [2024-06] RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics. (Paper, Website, Code)

  • [2024-11] RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation. (Paper, Website)

  • [2024-12] Improving Vision-Language-Action Models via Chain-of-Affordance. (Paper)

  • [2025-01] OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints. (Paper, Website, Code)

  • [2025-04] RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics. (Paper)

  • [2025-04] A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation. (Paper)

  • [2025-04] ControlManip: Few-Shot Manipulation Fine-tuning via Object-centric Conditional Control. (Paper)

  • [2025-07] VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models. (Paper)

  • [2025-07] InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation. (Paper, Website, Code)

Enhanced Encoder for Manipulation

  • [2024-02] Task-Conditioned Adaptation of Visual Features in Multi-Task Policy Learning. (Paper, Website, Code)

  • [2024-03] Never-Ending Behavior-Cloning Agent for Robotic Manipulation. (Paper, Website)

  • [2024-06] Learning Efficient and Robust Language-conditioned Manipulation using Textual-Visual Relevancy and Equivariant Language Mapping. (Paper, Website, Code)

  • [2024-07] Theia: Distilling Diverse Vision Foundation Models for Robot Learning. (Paper, Website, Code)

  • [2024-09] TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation. (Paper, Website, Code)

  • [2024-10] M2Distill: Multi-Modal Distillation for Lifelong Imitation Learning. (Paper)

  • [2024-10] VIRT: Vision Instructed Transformer for Robotic Manipulation. (Paper)

  • [2024-11] RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics. (Paper, Website, Code)

  • [2025-02] ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model. (Paper)

  • [2025-03] MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation. (Paper)

  • [2025-03] A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning. (Paper, Code)

  • [2025-05] InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning. (Paper, Website)

  • [2025-05] ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge. (Paper, Website)

  • [2025-05] Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions. (Paper)

  • [2025-06] CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding. (Paper, Website, Code)

  • [2025-08] GeoVLA: Empowering 3D Representations in Vision-Language-Action Models. (Paper, Website)

Enhanced Representation for Manipulation

  • [2024-02] Vision-Language Models Provide Promptable Representations for Reinforcement Learning. (Paper, Website, Code)

  • [2024-03] Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics. (Paper)

  • [2024-05] Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control. (Paper)

  • [2024-12] TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies. (Paper, Website, Code)

  • [2025-01] SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model. (Paper, Website, Code)

  • [2025-02] BFA: Best-Feature-Aware Fusion for Multi-View Fine-grained Manipulation. (Paper)

  • [2025-02] VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation. (Paper, Website, Code)

  • [2025-02] VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation. (Paper)

  • [2025-02] ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration. (Paper, Website)

  • [2025-02] DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping. (Paper, Website, Code)

  • [2025-03] OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction. (Paper, Website, Code)

  • [2025-03] RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation. (Paper)

  • [2025-05] VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation. (Paper, Website)

  • [2025-05] 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks. (Paper, Website, Code)

  • [2025-05] ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation. (Paper, Website)

  • [2025-06] BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models. (Paper, Website, Code)

  • [2025-06] CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation. (Paper, Website, Code)

  • [2025-06] OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation. (Paper, Website)

  • [2025-07] Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding. (Paper)

  • [2025-07] VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback. (Paper, Website, Code)

  • [2025-08] GeoVLA: Empowering 3D Representations in Vision-Language-Action Models. (Paper, Website)

  • [2025-08] MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation. (Paper, Website, Code)

  • [2025-08] RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models. (Paper, Website, Code)

  • [2025-08] Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding. (Paper, Website)

  • [2025-08] OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing. (Paper, Website)


🧠 Improving Embodiment Awareness

Here we focus on helping agents better understand their own physical structure and capabilities. Topics include forward and inverse kinematics learning, which allows agents to model the relationship between joint movements and spatial positions, and action head design, which optimizes how high-level decisions are translated into low-level motor commands.
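
To make the action-head idea concrete, here is a minimal sketch of a chunked continuous action head that maps a pooled vision-language embedding to several future low-level actions. The embedding size, action dimension, chunk length, and MLP layout are illustrative assumptions, not a specific model's design.

```python
# Minimal sketch of a continuous action head on top of a (frozen) VLM embedding.
import torch
import torch.nn as nn

class ChunkedActionHead(nn.Module):
    """Maps a pooled vision-language embedding to a chunk of low-level actions."""
    def __init__(self, embed_dim: int = 1024, action_dim: int = 7, chunk: int = 8):
        super().__init__()
        self.chunk, self.action_dim = chunk, action_dim
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 512), nn.GELU(),
            nn.Linear(512, chunk * action_dim),  # predict `chunk` future actions at once
        )

    def forward(self, vlm_embedding: torch.Tensor) -> torch.Tensor:
        out = self.mlp(vlm_embedding)
        return out.view(-1, self.chunk, self.action_dim)  # [B, chunk, action_dim]

head = ChunkedActionHead()
actions = head(torch.randn(4, 1024))
print(actions.shape)  # torch.Size([4, 8, 7])
```

Predicting a short action chunk rather than a single step is one common way such heads trade off control frequency against temporal consistency.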

Forward and Inverse Kinematics Learning

  • [2023-10] Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning. (Paper, Website, Code)

  • [2024-10] Effective Tuning Strategies for Generalist Robot Manipulation Policies. (Paper)

  • [2024-12] Learning Novel Skills from Language-Generated Demonstrations. (Paper, Website, Code)

  • [2025-02] HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation. (Paper, Website, Code)

  • [2025-05] LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning. (Paper, Website, Code)

Action Head Design

  • [2023-10] TAIL: Task-specific Adapters for Imitation Learning with Large Pretrained Models. (Paper)

  • [2024-05] FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies. (Paper)

  • [2024-06] Grounding Multimodal Large Language Models in Actions. (Paper)

  • [2024-08] Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling. (Paper, Website, Code)

  • [2024-09] Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers. (Paper)

  • [2024-10] Vision-Language-Action Model and Diffusion Policy Switching Enables Dexterous Control of an Anthropomorphic Hand. (Paper)

  • [2024-12] Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression. (Paper, Website)

  • [2025-01] FAST: Efficient Action Tokenization for Vision-Language-Action Models. (Paper, Website)

  • [2025-01] Universal Actions for Enhanced Embodied Foundation Models. (Paper, Website, Code)

  • [2025-02] Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. (Paper, Website, Code)

  • [2025-03] Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding. (Paper)

  • [2025-03] Refined Policy Distillation: From VLA Generalists to RL Experts. (Paper)

  • [2025-03] HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model. (Paper, Website, Code)

  • [2025-03] Efficient Continual Adaptation of Pretrained Robotic Policy with Online Meta-Learned Adapters. (Paper)

  • [2025-03] Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy. (Paper, Website, Code)

  • [2025-07] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting. (Paper, Code)

  • [2025-07] VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers. (Paper, Website, Code)

  • [2025-08] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies. (Paper)


📝 Deepening Task Comprehension

This section covers methods that enable agents to better understand and generalize across tasks. Key areas include human–robot interaction, where agents learn to interpret and respond to human inputs effectively, and hierarchical task manipulation, which enables multi-step reasoning and planning by decomposing complex tasks into structured subtasks.
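
The toy sketch below illustrates the hierarchical pattern: a high-level planner decomposes a long-horizon instruction into subtasks, and a low-level policy executes each one. Both `plan_subtasks` and `low_level_policy` are hypothetical stand-ins; a real system would back them with a VLM/LLM planner and a trained VLA policy.

```python
# Minimal sketch of hierarchical task decomposition (illustrative; the functions
# below are hypothetical stand-ins, not a specific model's API).
from typing import List

def plan_subtasks(instruction: str) -> List[str]:
    """Hypothetical high-level planner: split a long-horizon instruction into subtasks.
    A real system would query a VLM/LLM planner here."""
    return [step.strip() for step in instruction.split(", then ")]

def low_level_policy(subtask: str, observation: dict) -> List[float]:
    """Hypothetical low-level VLA policy: return a placeholder 7-DoF action."""
    return [0.0] * 7

def run_episode(instruction: str, observation: dict) -> None:
    for subtask in plan_subtasks(instruction):           # high level: language decomposition
        action = low_level_policy(subtask, observation)  # low level: subtask-conditioned control
        print(f"{subtask!r} -> action {action}")

run_episode("pick up the red block, then place it in the tray", observation={})
```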

Human-Robot Interaction

  • [2023-10] What Matters to You? Towards Visual Representation Alignment for Robot Learning. (Paper)

  • [2024-05] Hummer: Towards Limited Competitive Preference Dataset. (Paper)

  • [2024-05] A Self-Correcting Vision-Language-Action Model for Fast and Slow System Manipulation. (Paper)

  • [2024-12] Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment. (Paper)

  • [2025-03] Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning. (Paper)

  • [2025-03] VLA Model-Expert Collaboration for Bi-directional Manipulation Learning. (Paper, Website)

  • [2025-03] RoboCopilot: Human-in-the-loop Interactive Imitation Learning for Robot Manipulation. (Paper)

  • [2025-04] Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction. (Paper, Website, Code)

  • [2025-06] Robotic Policy Learning via Human-assisted Action Preference Optimization. (Paper, Website, Code)

Hierarchical Task Manipulation

  • [2023-11] Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning. (Paper, Website, Code)

  • [2024-07] Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning. (Paper)

  • [2024-07] Robotic Control via Embodied Chain-of-Thought Reasoning. (Paper, Website, Code)

  • [2024-08] Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation. (Paper, Website, Code)

  • [2024-10] HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers. (Paper)

  • [2024-11] STEER: Flexible Robotic Manipulation via Dense Language Grounding. (Paper, Website)

  • [2024-11] GRAPE: Generalizing Robot Policy via Preference Alignment. (Paper, Website, Code)

  • [2024-11] CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision. (Paper, Website, Code)

  • [2024-12] Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning. (Paper, Website, Code)

  • [2024-12] RoboMatrix: A Skill-centric Hierarchical Framework for Scalable Robot Task Planning and Execution in Open-World. (Paper, Website, Code)

  • [2025-02] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete. (Paper, Website, Code)

  • [2025-02] Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models. (Paper, Website)

  • [2025-03] RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation. (Paper, Website, Code)

  • [2025-03] DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data. (Paper)

  • [2025-05] LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning. (Paper, Website, Code)

  • [2025-05] OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning. (Paper, Website, Code)

  • [2025-05] Pre-Trained Multi-Goal Transformers with Prompt Optimization for Efficient Online Adaptation. (Paper)

  • [2025-05] Training Strategies for Efficient Embodied Reasoning. (Paper, Website)

  • [2025-06] Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse. (Paper)

  • [2025-07] ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning. (Paper, Website)


🔧 Multiple Component Integration

Integrating various subsystems is essential for building robust VLA agents. This section includes reinforcement learning frameworks for continuous control and decision-making, visual interaction prediction for anticipating future outcomes based on perception, and strategies for active dataset processing to reduce the cost of adapting models to new environments or tasks.
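
As a rough illustration of reinforcement-learning fine-tuning on top of a pretrained policy, the sketch below runs a few REINFORCE-style policy-gradient updates. The toy policy network, random rollout, and reward signal are placeholders chosen for self-containment, not any listed method's training recipe.

```python
# Minimal sketch of RL fine-tuning of a pretrained policy head (illustrative
# REINFORCE-style update; the policy, rollout, and rewards are toy placeholders).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 4))  # stand-in for a pretrained head
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def rollout(batch: int = 32):
    """Hypothetical environment interaction returning log-probs of sampled actions and rewards."""
    states = torch.randn(batch, 16)
    dist = torch.distributions.Categorical(logits=policy(states))
    actions = dist.sample()
    rewards = torch.rand(batch)  # placeholder task rewards
    return dist.log_prob(actions), rewards

for step in range(3):  # a few toy gradient steps
    log_probs, rewards = rollout()
    advantages = rewards - rewards.mean()             # simple mean baseline
    loss = -(log_probs * advantages.detach()).mean()  # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```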

Reinforcement Learning

  • [2023-10] Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning. (Paper, Website, Code)

  • [2023-12] LiFT: Unsupervised Reinforcement Learning with Foundation Models as Teachers. (Paper)

  • [2024-01] Building Open-Ended Embodied Agent via Language-Policy Bidirectional Adaptation. (Paper, Code)

  • [2024-01] Improving Vision-Language-Action Model with Online Reinforcement Learning. (Paper)

  • [2024-01] Vintix: Action Model via In-Context Reinforcement Learning. (Paper, Website, Code)

  • [2024-02] ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy. (Paper, Website, Code)

  • [2024-02] A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards. (Paper, Website, Code)

  • [2024-02] Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum. (Paper)

  • [2024-02] Offline Actor-Critic Reinforcement Learning Scales to Large Models. (Paper)

  • [2024-05] PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning. (Paper, Website, Code)

  • [2024-07] Affordance-Guided Reinforcement Learning via Visual Prompting. (Paper)

  • [2024-09] FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning. (Paper, Website, Code)

  • [2024-09] Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving. (Paper)

  • [2024-09] Lifelong Autonomous Improvement of Navigation Foundation Models in the Wild. (Paper, Website, Code)

  • [2024-10] Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance. (Paper, Website, Code)

  • [2024-10] GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance. (Paper)

  • [2024-12] Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone. (Paper, Website, Code)

  • [2024-12] Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model. (Paper, Website, Code)

  • [2024-12] RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning. (Paper, Website, Code)

  • [2025-01] FDPP: Fine-tune Diffusion Policy with Human Preference. (Paper)

  • [2025-03] SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning. (Paper, Website, Code)

  • [2025-05] What Can RL Bring to VLA Generalization? An Empirical Study. (Paper, Website, Code)

  • [2025-05] VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning. (Paper, Code)

  • [2025-05] Interactive Post-Training for Vision-Language-Action Models. (Paper, Website, Code)

  • [2025-05] ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning. (Paper, Code)

  • [2025-05] RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback. (Paper)

  • [2025-06] Inference-Time Alignment via Hypothesis Reweighting. (Paper)

  • [2025-06] Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics. (Paper)

  • [2025-06] TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization. (Paper)

  • [2025-07] Behavioral Exploration: Learning to Explore via In-Context Adaptation. (Paper)

  • [2025-07] Reinforcement Learning with Action Chunking. (Paper)

  • [2025-08] CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning. (Paper)

  • [2025-09] VLA Model Post-Training via Action-Chunked PPO and Self Behavior Cloning. (Paper)

Visual Interaction Prediction

  • [2023-12] Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation. (Paper, Website, Code)

  • [2024-03] MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control. (Paper, Website, Code)

  • [2024-06] Learning Manipulation by Predicting Interaction. (Paper, Website, Code)

  • [2024-06] DISCO: Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting. (Paper, Website)

  • [2024-06] Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts. (Paper, Website)

  • [2024-07] VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation. (Paper, Code)

  • [2024-07] Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals. (Paper, Website, Code)

  • [2024-07] Generative Image as Action Models. (Paper, Website, Code)

  • [2024-08] GR-MG: Leveraging Partially-Annotated Data Via Multi-Modal Goal-Conditioned Policy. (Paper, Website, Code)

  • [2024-10] Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust. (Paper, Website, Code)

  • [2024-10] VIP: Vision Instructed Pre-training for Robotic Manipulation. (Paper, Website, Code)

  • [2024-10] GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. (Paper, Website)

  • [2024-12] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations. (Paper, Website, Code)

  • [2024-12] Moto: Latent Motion Token as the Bridging Language for Robot Manipulation. (Paper, Website, Code)

  • [2024-12] Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation. (Paper, Website, Code)

  • [2025-01] UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent. (Paper)

  • [2025-02] HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation. (Paper, Website, Code)

  • [2025-03] Unified Video Action Model. (Paper, Website, Code)

  • [2025-03] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models. (Paper, Website)

  • [2025-03] DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation. (Paper, Website, Code)

  • [2025-04] Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation. (Paper, Code)

  • [2025-05] FLARE: Robot Learning with Implicit World Modeling. (Paper, Website, Code)

  • [2025-06] WorldVLA: Towards Autoregressive Action World Model. (Paper, Code)

  • [2025-07] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge. (Paper, Website, Code)

  • [2025-07] EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos. (Paper, Website)

Active Dataset Processing

  • [2023-11] RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation. (Paper, Website, Code)

  • [2024-01] SWBT: Similarity Weighted Behavior Transformer with the Imperfect Demonstration for Robotic Manipulation. (Paper)

  • [2024-02] Transductive Active Learning: Theory and Applications. (Paper, Code)

  • [2024-03] Efficient Data Collection for Robotic Manipulation via Compositional Generalization. (Paper)

  • [2024-06] RVT-2: Learning Precise Manipulation from Few Demonstrations. (Paper, Website, Code)

  • [2024-07] Autonomous Improvement of Instruction Following Skills via Foundation Models. (Paper, Website, Code)

  • [2024-10] Active Fine-Tuning of Generalist Policies. (Paper)

  • [2024-10] Data Scaling Laws in Imitation Learning for Robotic Manipulation. (Paper, Website, Code)

  • [2025-02] RoboBERT: An End-to-end Multimodal Robotic Manipulation Model. (Paper)

  • [2025-02] DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control. (Paper, Website, Code)

  • [2025-02] DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning. (Paper, Website, Code)

  • [2025-03] DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data. (Paper)


📋 Survey

  • [2023-12] Foundation Models in Robotics: Applications, Challenges, and the Future. (Paper, Code)

  • [2023-12] Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis. (Paper)

  • [2024-02] Real-World Robot Applications of Foundation Models: A Review. (Paper)

  • [2024-05] A Survey on Vision-Language-Action Models for Embodied AI. (Paper, Code)

  • [2025-03] A Taxonomy for Evaluating Generalist Robot Policies. (Paper)

  • [2025-05] Vision-Language-Action Models: Concepts, Progress, Applications and Challenges. (Paper, Website)

  • [2025-07] A Survey on Vision-Language-Action Models: An Action Tokenization Perspective. (Paper, Code)


✒️ Contributing

We welcome contributions from the community! Whether it's adding new papers, sharing code, or improving documentation, your input helps make this a valuable resource for everyone!


📌 BibTeX

To cite this repository in your research, please use the following BibTeX entry:

@article{xiang2025parallels,
  title={Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends},
  author={Xiang, Tian-Yu and Jin, Ao-Qun and Zhou, Xiao-Hu and Gui, Mei-Jiang and Xie, Xiao-Liang and Liu, Shi-Qi and Wang, Shuang-Yi and Duan, Sheng-Bin and Xie, Fu-Chao and Wang, Wen-Kai and others},
  journal={arXiv preprint arXiv:2506.20966},
  year={2025}
}
