Paper: Position: LLMs Need a Bayesian Meta-Reasoning Framework for More Robust and Generalizable Reasoning (ICML 2025) [link]
Large Language Models (LLMs) face several fundamental challenges in reasoning and decision-making. Below are four key open problems that motivate the need for a meta-reasoning framework:
- 🚨 Lack of self-awareness in knowledge and ethics: LLMs often exhibit a strong “Feeling of Knowing” but lack crucial human-like cognitive attributes, such as “awareness of limitations” and “awareness of situation”.
- 🔗 Inflexible Strategy: LLMs lack a flexible reasoning strategy tailored to individual problems; for example, they overthink simple questions and coordinate diverse tools inefficiently across reasoning phases.
- 🎯 Reward Hacking: Reasoning agents exploit flaws in the reward function to achieve high scores without genuinely learning transferable reasoning patterns.
- 📚 Knowledge Updating: Current on-the-fly knowledge retrieval and fine-tuning fail to adequately address knowledge conflicts and resource inefficiency, especially when multi-source knowledge must be injected.
Large language models (LLMs) excel at pattern completion yet often struggle with reliable reasoning: they hallucinate, over-generalise, or overshoot their reward signals.
As shown in Figure 1, the paper proposes a Bayesian meta-reasoning framework that equips an LLM with four interacting modules.
Figure 1: Overview of the framework.
- 🧠 Self-Awareness
- 🔍 Monitoring
- ✅ Evaluation & Regulation
  - Critiques the completed reasoning process and corrects errors
  - Incorporates knowledge with help from surrogate samples sharing the latent reasoning process
  - Addresses 📚 Open 4
- 🔄 Meta-Reflection
We explain each component with supporting evidence from the existing literature, along with actionable insights for next-step research.
Goal:
- (1) Estimate the task's solvability before attempting it;
- (2) generate an initial reasoning strategy based on latent skills.
Figure 3: The Self-Awareness module.
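The solvability check in goal (1) can be approximated with a simple self-consistency probe: sample several answers and treat the agreement rate as confidence. This is a minimal illustrative sketch; `sample_answer` is a hypothetical stand-in for an actual LLM call.

```python
import random
from collections import Counter

def estimate_solvability(sample_answer, n_samples=10):
    """Self-consistency proxy for task solvability: sample several answers
    and use the agreement rate of the majority answer as confidence.
    `sample_answer` stands in for an LLM call (hypothetical stub)."""
    answers = [sample_answer() for _ in range(n_samples)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / n_samples

# Toy model: answers "42" about 80% of the time, "41" otherwise.
rng = random.Random(0)
stub = lambda: "42" if rng.random() < 0.8 else "41"
answer, confidence = estimate_solvability(stub, n_samples=100)
print(answer, confidence)
if confidence < 0.5:
    print("defer: task likely beyond current skills")
```

In a real system the threshold and sampling budget would themselves be calibrated, but the sketch shows the before-attempt gating that the module proposes.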
Existing studies focus on capability-awareness, i.e., confidence/uncertainty measurement, or mission-awareness.
Figure 4: Types of Uncertainty Quantification. From ACL25 Tutorial.
| Main Category | Sub-Category | Representative Papers |
|---|---|---|
| Capability-awareness | Uncertainty Estimation | Just Ask for Calibration (2023.05); Look Before You Leap (2023.07); Can LLMs Express Their Uncertainty? (2024.02) |
| | Knowledgeable Self-awareness | Self-RAG (2023.10); SeaKR (2024.06); KnowSelf (2025.04) |
| Mission-awareness | LLM Jailbreaking Defence | Situational Awareness (2023.09); Explain Jailbreaking (2024.06); Rapid Response (2024.11) |
| | Reward Hack Defence | Sleeper Agents (2024.01); Monitor Reasoning Models (2025.03); Verbalize Reward Hacking (2025.06) |
More papers on uncertainty quantification:
- LM-Polygraph: Uncertainty Estimation for Language Models. EMNLP 2023, 80+ citations.
  - TL;DR: a framework implementing a battery of state-of-the-art uncertainty estimation methods for LLMs on text-generation tasks.
- Logical Reasoning in Large Language Models: A Survey 2025
- Chain of Logic: Rule-Based Reasoning with Large Language Models 2024
- LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning 2025
- Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning Neurips 2024
Goal: Guide search with step-level rewards that are intrinsic, faithful, dynamic, and efficient, to alleviate the reward hacking problem.
Figure 4: The Monitoring module.
| Reward Type | Granularity | Representative Papers |
|---|---|---|
| Trained Reward Model | Outcome level | Ouyang et al., 2022; Meta et al., 2024; Qwen et al., 2024 |
| | Process level | Shao et al., 2024; Wang et al., 2024; Xie et al., 2024 |
| LLM as a Judge | Outcome level | Yuan et al., 2025; Whitehouse et al., 2025 |
| Verifiable Reward | Outcome level | Deepseek et al., 2025; Liu et al., 2023; Su et al., 2025; Dou et al., 2025 |
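The goal of guiding search with step-level rewards can be sketched as a beam search over partial reasoning chains scored by a process reward at every step, rather than a single outcome reward at the end. `expand` and `step_reward` below are toy stand-ins for a step proposer and a process reward model, not real implementations.

```python
import heapq

def guided_search(expand, step_reward, beam=2, depth=3):
    """Beam search over reasoning chains scored step by step.
    `expand(chain)` proposes candidate next steps; `step_reward(chain, step)`
    is a stub process reward model (assumption: higher = better step)."""
    frontier = [(0.0, [])]  # (negative cumulative reward, chain)
    for _ in range(depth):
        candidates = []
        for score, chain in frontier:
            for step in expand(chain):
                candidates.append((score - step_reward(chain, step), chain + [step]))
        frontier = heapq.nsmallest(beam, candidates)  # keep the `beam` best chains
    best_score, best_chain = min(frontier)
    return best_chain, -best_score

# Toy setting: steps are digits, and the process reward prefers increasing steps.
expand = lambda chain: [0, 1, 2]
reward = lambda chain, step: 1.0 if (not chain or step > chain[-1]) else 0.0
chain, total = guided_search(expand, reward, beam=2, depth=3)
print(chain, total)  # [0, 1, 2] 3.0
```

Because every step is scored, a chain that "hacks" the final answer while taking unjustified steps accumulates less reward than one whose intermediate steps are each verified, which is the intuition behind process-level rows in the table above.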
A self-play system, in which the evaluator is an evolving agent and internal signals serve as the reward, offers a promising alternative for its faithfulness, controllability, and efficiency.
🚀 Self-play system
- A survey on self-evolution of large language models.
- Self-Play Preference Optimization for Language Model Alignment
🚀 Using intrinsic representation as rewards
- Reasoning Models Don't Always Say What They Think 2025. Alignment Science Team, Anthropic
- Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration ICML25, spotlight
- Latent Space Chain-Of-Embedding Enables Output-Free Llm Self-Evaluation. ICLR25
- Learning to Reason without External Rewards 2025
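One common label-free instance of "intrinsic representation as reward" is the length-normalised sequence likelihood of the model's own generation. The sketch below illustrates the idea with toy token log-probabilities; real values would come from the decoder.

```python
import math

def intrinsic_confidence(token_logprobs):
    """Label-free reward from the model's own signals: the geometric-mean
    token probability (length-normalised sequence likelihood).
    Higher values indicate a more confident generation."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

confident = [-0.1, -0.2, -0.1]   # sharply peaked next-token distributions
uncertain = [-1.5, -2.0, -1.8]   # diffuse next-token distributions
r1 = intrinsic_confidence(confident)
r2 = intrinsic_confidence(uncertain)
print(r1 > r2)  # the confident chain earns the higher intrinsic reward
```

Such signals require no external verifier or annotation, which is why the works above explore them as training rewards, though they can of course be miscalibrated.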
🚀 Compress reasoning process in the latent space for efficiency
- Training Large Language Models to Reason in a Continuous Latent Space 2025, submitted to ICLR
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation 2025, submitted to EMNLP
- Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space, ICLR25
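The latent-space compression idea in this line of work can be caricatured as feeding the last hidden state back as the next input instead of decoding each step to tokens. `step_fn` below is a stand-in for one transformer forward pass (an assumption for illustration only).

```python
def latent_reasoning(step_fn, h0, n_thoughts=3):
    """Sketch of continuous latent reasoning: intermediate 'thoughts' stay
    as hidden states and are fed back directly, with no tokens emitted.
    `step_fn` stands in for one model forward pass (hidden -> hidden)."""
    h = h0
    trace = [h]
    for _ in range(n_thoughts):  # continuous thoughts: no decoding per step
        h = step_fn(h)
        trace.append(h)
    return h, trace              # decode to text only from the final state

# Toy dynamics: each latent step halves the distance to a fixed point 1.0.
step = lambda h: h + 0.5 * (1.0 - h)
final, trace = latent_reasoning(step, 0.0, n_thoughts=3)
print(final)  # 0.875
```

The efficiency gain comes from skipping the decode-re-encode round trip for every intermediate step; only the final state is verbalised.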
↳ Addresses 🔗 Open 2 and 🎯 Open 3.
Goal: Critique and refine the generated reasoning chain with the knowledge from surrogate samples.
Figure 5: The Evaluation and Regulation module.
| Aspect | Category / Technique | Representative Papers (chronological) | Key idea | Noted gaps / limits |
|---|---|---|---|---|
| Evaluation (generating feedback on a full reasoning chain) | Template-based verbal feedback | Self-Refine (Madaan 23); PromptAgent (Wang 24d); LLM-CF (Tyen 23); Self-Eval LargeLM (Huang 24b) | Use canned prompt templates that ask the model to critique its own output. | Feedback is often shallow and template rigidity limits coverage. |
| | Critic-model feedback | Step-Level Preference (Chen 24b); Self-Correct (Kumar 24); Learning-from-Mistakes (Tong 24); Math Error Loc (Uesato 22); REFINER (Paul 24); DARS (Li 25a) | Train a separate classifier/regressor that labels wrong steps. | High annotation cost; critics may localise but not explain errors clearly. |
| | Token-based back-tracking | Backtracking (Zhang 25a); Quiet-STaR (Zelikman 24) | Add special tokens such as [RESET] to let the model roll back and try again. | Still relies on the base model’s willingness to revise. |
| | Tool-assisted feedback | Code tools: Self-Edit (Zhang 23a), Self-Debug (Chen 24c), CRITIC (Gou 24), ReTool (Feng 25a), ToolRL (Qian 25). Search engines: ReAct (Yao 23b), Check-Facts (Peng 23), Search-R1 (Jin 25), R1-Searcher (Song 25). Logic / topology: Graph-Analyser (Zhang 23b) | Call external solvers (code interpreters, web searchers, graph tools) to ground or verify intermediate results. | Coverage limited to domains where reliable tools exist. |
| | Pattern-level / multi-instance feedback | Thought-Templates (Yang 25a); Meta-Buffer (Yang 24); Semantic-Symbol Prompts (Wang 24g) | Cluster similar queries, evaluate common error patterns rather than one-off instances. | Still handcrafted; relies on LLM compliance with structured templates. |
| Regulation (using feedback to repair reasoning) | Direct self-reflection prompting | Self-Refine (Madaan 23); Self-Reflection Makes LLMs Safer (Liu 24) | Feed critique back to the same model and ask it to revise its answer. | LLMs can be “stubborn” and ignore corrections. |
| | Gradient-through-text (TextGrad) | TextGrad (Yuksekgonul 25) | Treat natural-language feedback as a gradient signal and refine the prompt. | Requires differentiable proxy; early-stage research. |
| | Explicit error-correction trajectories | Quiet-STaR (Zelikman 24); Backtracking (Zhang 25a) | Train on pairs (wrong chain → fixed chain) so the model learns how to patch errors. | May over-fit to surface patterns. |
| | ‘Think-mode’ time-outs | DeepSeek-R1 “think” mode (Chen 25) | Insert a deliberate wait to encourage an extra round of internal checks before answering. | Can introduce over-thinking latency. |
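The evaluation-then-regulation loop shared by the rows above can be sketched generically: evaluate a full chain, and if a flaw is found, feed the feedback back for repair. Here `critique` and `revise` are hypothetical stubs for a critic model and a reviser, shown on a toy task.

```python
def evaluate_and_regulate(draft, critique, revise, max_rounds=3):
    """Generic critique-and-refine loop: `critique(chain)` returns None when
    the chain passes evaluation, otherwise a feedback message that
    `revise(chain, feedback)` uses to repair the chain (regulation)."""
    chain = draft
    for _ in range(max_rounds):
        feedback = critique(chain)
        if feedback is None:             # chain passes evaluation
            return chain
        chain = revise(chain, feedback)  # regulation: repair using feedback
    return chain                         # budget exhausted; return best effort

# Toy task: the "chain" is a list of numbers that must end up sorted.
critique = lambda c: "unsorted" if c != sorted(c) else None
revise = lambda c, fb: sorted(c)
result = evaluate_and_regulate([3, 1, 2], critique, revise)
print(result)  # [1, 2, 3]
```

The `max_rounds` cap matters in practice: as the table notes, models can be "stubborn" or over-think, so an unbounded loop can waste budget without converging.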
🚀 New benchmark with inter-sample error and feedback annotation
🚀 Meta-knowledge incorporation via sharing similar:
- Template
- Symbolic match
  - Meta-Reasoning: Semantics-Symbol Deconstruction for Large Language Models. ACL 2024 Findings
- Latent reasoning process
  - Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning. NeurIPS 2024
↳ Addresses 📚 Open 4.
Goal: Perform hierarchical Bayesian updates of knowledge priors (I, E) across tasks.
(a) Meta-prompt optimisation
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
- MetaICL: Learning to Learn In Context
- Meta-learning via Language Model In-context Tuning
(b) LoRA decomposition and combination for unseen tasks
(c) Bayesian Inverse Planning
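The hierarchical Bayesian update in the goal can be illustrated with the simplest conjugate case: a Beta-Bernoulli prior over a latent skill's reliability, where the posterior from one task family becomes the prior for the next. This is a toy sketch of the update pattern, not the paper's actual model.

```python
def update_skill_prior(prior, outcomes):
    """Conjugate Beta-Bernoulli update: `prior` is (alpha, beta) over a
    skill's success probability; `outcomes` is a list of 1 (success) /
    0 (failure) observations from one task family."""
    alpha, beta = prior
    alpha += sum(outcomes)                  # observed successes
    beta += len(outcomes) - sum(outcomes)   # observed failures
    return alpha, beta

def expected_reliability(prior):
    alpha, beta = prior
    return alpha / (alpha + beta)           # posterior mean of the Beta

prior = (1.0, 1.0)                                # uninformative prior
prior = update_skill_prior(prior, [1, 1, 0, 1])   # evidence from task family A
mean_a = expected_reliability(prior)
prior = update_skill_prior(prior, [1, 1])         # posterior reused on family B
mean_b = expected_reliability(prior)
print(round(mean_a, 3), round(mean_b, 3))
```

Carrying the posterior across task families is what makes the update hierarchical: each new task starts from an informed prior rather than from scratch.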
🚀 Mechanistic interpretability for safe training and adaptation
- Toward understanding and preventing misalignment generalization. OpenAI 2025
- Mechanistic Interpretability for AI Safety: A Review. TMLR 2024
- Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models. NeurIPS 2024
🚀 Hierarchical agentic framework
- LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning AAAI25
- AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving 2025
- Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization. ICLR 2025 workshop
- HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems 2025
If this work is helpful, please cite as:
@inproceedings{yan2025position,
title={Position: LLMs Need a Bayesian Meta-Reasoning Framework for More Robust and Generalizable Reasoning},
author={Hanqi Yan and Linhai Zhang and Jiazheng Li and Zhenyi Shen and Yulan He},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://kclpure.kcl.ac.uk/portal/en/publications/position-llms-need-a-bayesian-meta-reasoning-framework-for-more-r}
}