Next-gen ACT for real-world robotics: self-correcting, long-memory, and voice-command enabled.
Standard ACT (Action Chunking with Transformers) policies are strong imitators, but they replay learned action sequences open-loop, without checking whether each step succeeded; a robot may keep moving even after the cloth slips from its gripper. To address these limitations, we propose an enhanced ACT architecture that integrates feedback control, memory augmentation, and multimodal conditioning:
- Closed-Loop Execution: We introduce a PD controller between the policy output and robot actuators, enabling real-time error correction and improved robustness against disturbances (e.g., cloth slippage).
- Memory-Augmented Planning: An external memory module retains historical states and actions, allowing the policy to handle long-horizon tasks such as sequential folding or multi-item sorting.
- Language-Conditioned Control: By fusing CLIP-encoded text commands (e.g., "Pick the red sock") into the context vector, we enable task-aware policy switching and goal specification without retraining.
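The closed-loop idea can be sketched as a per-joint PD correction inserted between the policy's predicted target and the actuator command. This is a minimal illustration, not the actual controller used in the system; the gains `kp`, `kd`, and the timestep `dt` are placeholder values that would need tuning on real hardware.

```python
class PDController:
    """PD correction between policy output and actuator.

    The policy proposes a target joint position; the controller turns the
    tracking error and its rate of change into a corrective command, so a
    disturbance (e.g., the cloth slipping) is countered at every control
    step instead of being silently played through.
    Gains are illustrative, not tuned for any real robot.
    """

    def __init__(self, kp: float = 2.0, kd: float = 0.1, dt: float = 0.02):
        self.kp = kp
        self.kd = kd
        self.dt = dt
        self._prev_error = 0.0

    def step(self, target: float, measured: float) -> float:
        """Return the actuator command for one control tick."""
        error = target - measured
        d_error = (error - self._prev_error) / self.dt
        self._prev_error = error
        return self.kp * error + self.kd * d_error
```

In simulation, driving a simple integrator plant with this controller converges toward the policy's target, which is the property the closed loop adds over open-loop playback.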
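The external memory can be pictured as a fixed-capacity buffer of past (state, action) pairs that the policy queries when deciding the next chunk. The sketch below uses recency-based retrieval for simplicity; an attention-based read over the same buffer would be a drop-in replacement. Class and method names are illustrative, not from the described system.

```python
from collections import deque


class EpisodicMemory:
    """Fixed-capacity external memory of (state, action) pairs.

    Retaining history lets the policy track long-horizon progress, e.g.,
    which folds are already done in a sequential folding task. Oldest
    entries are evicted automatically once capacity is reached.
    """

    def __init__(self, capacity: int = 512):
        self.buffer = deque(maxlen=capacity)

    def write(self, state, action) -> None:
        """Append one transition; evicts the oldest if full."""
        self.buffer.append((state, action))

    def read_recent(self, k: int):
        """Return up to the k most recent (state, action) pairs, oldest first."""
        return list(self.buffer)[-k:]
```

A `deque` with `maxlen` keeps writes O(1) and makes the eviction policy explicit, which is why it is a common backing store for this kind of rolling episodic buffer.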
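Language conditioning amounts to injecting the command's text embedding into the policy's context before the transformer attends over it. The sketch below stands in for the CLIP text encoder with a deterministic placeholder (`fake_clip_text_embed`); a real system would call an actual CLIP text encoder, and the fusion-by-prepended-token scheme shown here is one common choice, assumed for illustration.

```python
import hashlib
import math


def fake_clip_text_embed(command: str, dim: int = 8) -> list[float]:
    """Placeholder for a CLIP text encoder: deterministic unit-norm vector.

    Only demonstrates the shape of the fusion step; it carries no semantics.
    """
    digest = hashlib.sha256(command.encode()).digest()
    v = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def condition_context(context_tokens: list[list[float]],
                      command: str) -> list[list[float]]:
    """Prepend the language embedding as an extra token in the context.

    The downstream transformer then attends to the command alongside the
    visual tokens, so switching tasks ("Pick the red sock" vs. "Fold the
    towel") changes behavior without retraining the backbone.
    """
    dim = len(context_tokens[0])
    return [fake_clip_text_embed(command, dim=dim)] + context_tokens
```

Prepending a token (rather than concatenating features channel-wise) keeps the visual tokens untouched and lets attention weights decide how strongly the command influences each action chunk.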
This architecture bridges the gap between learned visuomotor policies and classical control theory, paving the way toward more reliable and versatile robotic systems.