A curated list of papers, datasets, and tools for testing Embodied AI, Robotics, and Autonomous Systems using Large Models (LLMs/VLMs), organized from a Software Engineering (SE) perspective.
As Embodied AI systems (e.g., manipulators, humanoid robots) and Autonomous Systems (e.g., autonomous driving vehicles, or ADVs, and UAVs) become increasingly complex, ensuring their quality and safety is paramount.
This repository bridges the gap between AI and Software Engineering. It collects resources focusing on:
- Test Generation: Transforming natural language/logs into simulation tasks or fuzzing inputs.
- Test Oracles: Using LLMs/VLMs as judges for failure detection and semantic anomaly reasoning.
- Automated Repair: Self-evolving systems, code repair, and policy rectification.
- Infrastructure: Benchmarks, Sim-to-Real frameworks, and datasets.
- 1. Core: Embodied AI & Robotics Focus
- 2. Methodology Transfer: Autonomous Driving & UAVs
- 3. Test Levels & QA
- 4. Infrastructure
Research directly targeting Robotic Manipulation, Navigation, and Embodied Agents.
Using LLMs to generate valid task descriptions, scene graphs, or simulation environments.
- [CoRL '24] GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs
- [ICLR '24] GenSim: Generating Robotic Simulation Tasks via Large Language Models
- [CVPR '25] GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation
- [CVPR '25] GRS: Generating Robotic Simulation Tasks from Real-World Images
- [HRI '25] RCareGen: An Interface for Scene and Task Generation in RCareWorld
- [arXiv '25] RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization
- [LRA '24] Indoor and Outdoor 3D Scene Graph Generation Via Language-Enabled Spatial Ontologies
- [arXiv '24] SocRATES: Towards Automated Scenario-based Testing of Social Navigation Algorithms
Using VLMs/LLMs as "Test Oracles" to determine pass/fail or explain root causes.
- [ICLR '25] AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
- [IROS '24] DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment
- [arXiv '25] RAIDER: Tool-Equipped Large Language Model Agent for Robotic Action Issue Detection
- [CoRL '23] REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction
- [RSS '24] Real-Time Anomaly Detection and Reactive Planning with Large Language Models
- [arXiv '25] Evaluating Robot Policies in a World Model
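
The oracle idea above can be sketched in a few lines. This is a minimal illustration, not the method of any paper listed here: `llm_oracle`, `Verdict`, and `stub_judge` are hypothetical names, and the `judge` callable stands in for whatever LLM/VLM client you use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    explanation: str


def llm_oracle(task: str, observation: str, judge: Callable[[str], str]) -> Verdict:
    """Ask a judge model whether the observed outcome satisfies the task."""
    prompt = (
        "You are a test oracle for a robot.\n"
        f"Task: {task}\n"
        f"Observed outcome: {observation}\n"
        "Answer PASS or FAIL on the first line, then explain briefly."
    )
    reply = judge(prompt).strip()
    first_line, _, rest = reply.partition("\n")
    # Fail closed: anything other than an explicit PASS counts as a failure.
    passed = first_line.strip().upper() == "PASS"
    return Verdict(passed=passed, explanation=rest.strip())


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real model call so the sketch runs offline."""
    if "dropped" in prompt:
        return "FAIL\nThe object was dropped before reaching the target."
    return "PASS\nThe outcome matches the task."
```

Treating any malformed reply as a failure keeps the oracle conservative: an unparseable answer never turns into a silent pass.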
Applying Automated Program Repair (APR) concepts to robotic policies and plans.
- [IROS '24] Recover: A Neuro-Symbolic Framework for Failure Detection and Recovery
- [arXiv '25] INPROVF: Leveraging Large Language Models to Repair High-level Robot Controllers
- [arXiv '24] Creating and Repairing Robot Programs in Open-World Domains (RoboRepair)
- [ICASSP '25] Interactive Robot Action Replanning using Multimodal LLM Trained from Human Demonstration Videos
Mature testing methodologies from the ADV and UAV domains that can be transferred to Embodied AI.
Translating natural language requirements into formal test cases (DSL, Scenic, OpenSCENARIO).
- [CVPR '24] ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation
- [TSMC '24] LLMScenario: Large Language Model Driven Scenario Generation
- [TSE '25] TARGET: Traffic Rule-Based Test Generation via Validated LLM
- [arXiv '24] Generating Probabilistic Scenario Programs from Natural Language (ScenicNL)
- [arXiv '24] ChatSUMO: Automating Traffic Scenario Generation in Simulation of Urban Mobility
- [ASE '22] ADEPT: A Testing Platform for Simulated Autonomous Driving
- [arXiv '25] Text2Scenario: Text-Driven Scenario Generation for Autonomous Driving Test
- [arXiv '24] Generating Driving Simulations via Conversation
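
A common step in this kind of pipeline is validating the model's output before it reaches the simulator. The sketch below assumes a made-up JSON mini-format rather than Scenic or OpenSCENARIO; the `Scenario` fields, `ALLOWED_WEATHER`, and `parse_scenario` are all hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical vocabulary for the illustrative mini-format.
ALLOWED_WEATHER = {"clear", "rain", "fog"}


@dataclass(frozen=True)
class Scenario:
    weather: str
    ego_speed_kmh: float
    npc_count: int


def parse_scenario(llm_output: str) -> Scenario:
    """Parse and validate an LLM-generated scenario; raise ValueError if invalid."""
    data = json.loads(llm_output)  # json.JSONDecodeError is a ValueError subclass
    missing = {"weather", "ego_speed_kmh", "npc_count"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["weather"] not in ALLOWED_WEATHER:
        raise ValueError(f"unknown weather: {data['weather']!r}")
    speed = float(data["ego_speed_kmh"])
    if not 0 <= speed <= 200:
        raise ValueError(f"ego speed out of range: {speed}")
    npc_count = int(data["npc_count"])
    if npc_count < 0:
        raise ValueError(f"negative NPC count: {npc_count}")
    return Scenario(data["weather"], speed, npc_count)
```

Rejecting invalid scenarios early means a generation failure is not mistaken for a failure of the system under test.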
Fuzz testing and search-based generation of corner cases from the long tail of the scenario distribution.
- [arXiv '25] From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation
- [arXiv '25] LLM-attacker: Enhancing Closed-loop Adversarial Scenario Generation
- [arXiv '24] Realistic Corner Case Generation for Autonomous Vehicles with Multimodal LLM
- [ASE '24] LeGEND: A Top-Down Approach to Scenario Generation
- [arXiv '24] LMM-enhanced Safety-Critical Scenario Generation from Non-Accident Traffic Videos
- [AAAI '25] An LLM-Empowered Adaptive Evolutionary Algorithm for Multi-Component Deep Learning Systems
- [arXiv '24] SAFLiTe: Fuzzing Autonomous Systems via Large Language Models (UAV Focus)
Using Domain Knowledge Bases to enhance test validity.
- [arXiv '25] Driving-RAG: Driving Scenarios Embedding, Search, and RAG Applications
- [arXiv '25] Seeking to Collide: Online Safety-Critical Scenario Generation with Retrieval Augmented LLMs
- [Sci. Rep. '25] An intelligent guided troubleshooting method for aircraft based on HybirdRAG
- [ERTS '24] Test Suite Augmentation using Language Models - Applying RAG to Improve Robustness Verification
Using multiple LLM agents (Attacker, Defender, Judge) for system-level testing.
- [ASE '24] SoVAR: Build Generalizable Scenarios from Accident Reports
- [CVPR '24] Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents (ChatSim)
- [arXiv '24] ChatDyn: Language-Driven Multi-Actor Dynamics Generation
- [arXiv '25] CrashAgent: Crash Scenario Generation via Multi-modal Reasoning
- [arXiv '24] Multimodal Large Language Model Driven Scenario Testing (OmniTester)
Resources categorized by the Software Engineering testing hierarchy.
- [arXiv '24] Automated Control Logic Test Case Generation using Large Language Models (PLC/Industrial)
- [arXiv '25] Fine-grained Testing for Autonomous Driving Software: a Study on Autoware with LLM-driven Unit Testing
- [TSE '25] MirrorFuzz: Leveraging LLM and Shared Bugs for Deep Learning Framework APIs Fuzzing
- [arXiv '25] Interleaving Large Language Models for Compiler Testing
- [ISSTA '24] DiaVio: LLM-Empowered Diagnosis of Safety Violations in ADS Simulation Testing
- [arXiv '23] Semantic Anomaly Detection with Large Language Models
- [IROS '24] ODD-diLLMma: Driving Automation System ODD Compliance Checking using LLMs
- [arXiv '25] FixDrive: Automatically Repairing Autonomous Vehicle Driving Behaviour
- [arXiv '25] From Failures to Fixes: LLM-Driven Scenario Repair for Self-Evolving Autonomous Driving
- [LRA '24] "Don't Forget to Put the Milk Back!" Dataset for Enabling Embodied Agents to Detect Anomalous Situations
- [arXiv '24] CODA-LM: Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases
- [CVPRW '23] WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models
- [ACM TOSEM '23] Testing, Validation, and Verification of Robotic and Autonomous Systems: A Systematic Review
- [arXiv '25] Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
- [arXiv '25] Generative AI for Testing of Autonomous Driving Systems: A Survey
- [TSE '24] Software Testing With Large Language Models: Survey, Landscape, and Vision
Contributions are welcome! If you find a relevant paper that is missing (especially one focusing on Software Engineering aspects of Embodied AI testing), please submit a Pull Request.
Format:
- [Venue 'YY] Paper Title [[Paper](link)] [[Code](link)]
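
Before opening a Pull Request, an entry can be checked against this format with a short script. This is an illustrative sketch, not tooling shipped with the repository: `ENTRY_RE` and `parse_entry` are hypothetical names, and the pattern treats both links as optional, which is an assumption.

```python
import re

# Hypothetical checker for the entry format above; both links are treated as optional.
ENTRY_RE = re.compile(
    r"^- \[(?P<venue>[^\]]+?) '(?P<year>\d{2})\] "
    r"(?P<title>.+?)"
    r"(?: \[\[Paper\]\((?P<paper>[^)]+)\)\])?"
    r"(?: \[\[Code\]\((?P<code>[^)]+)\)\])?$"
)


def parse_entry(line: str):
    """Return the entry's fields as a dict, or None if the line is malformed."""
    match = ENTRY_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    # example.org URLs are placeholders, not real paper links.
    sample = "- [ICLR '24] GenSim [[Paper](https://example.org/p)] [[Code](https://example.org/c)]"
    print(parse_entry(sample))
```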
If you have any questions or suggestions, feel free to open an issue or contact the maintainer.