Paper: Position: LLMs Need a Bayesian Meta-Reasoning Framework for More Robust and Generalizable Reasoning (ICML 2025) [link]
Large Language Models (LLMs) face several fundamental challenges in reasoning and decision-making. Below are four key open problems that motivate the need for a meta-reasoning framework:
- 🚨 Lack of self-awareness in knowledge and ethics: LLMs often exhibit a strong “Feeling of Knowing” but lack crucial human-like cognitive attributes, such as “awareness of limitations” and “awareness of situation”.
- 🔗 Inflexible Strategy: LLMs lack a flexible reasoning strategy tailored to individual problems; for example, they overthink simple questions and coordinate diverse tools inefficiently across reasoning phases.
- 🎯 Reward Hacking: Reasoning agents exploit flaws in the reward function to achieve high scores without genuinely learning transferable reasoning patterns.
- 📚 Knowledge Updating: Current on-the-fly knowledge retrieval and fine-tuning fail to adequately address knowledge conflicts and resource inefficiency, especially when multi-source knowledge must be injected.
Large language models (LLMs) excel at pattern completion yet often struggle with reliable reasoning: they hallucinate, over-generalise, or overshoot their reward signals.
As shown in Figure 1, the paper proposes a Bayesian meta-reasoning framework that equips an LLM with four interacting modules.
Figure 1: Overview of the framework.
- 🧠 Self-Awareness
- 🔍 Monitoring
- ✅ Evaluation & Regulation
  - Critiques the completed reasoning process and corrects errors
  - Incorporates knowledge with help from surrogate samples sharing the latent reasoning process
  - Addresses 📚 Open 4
- 🔄 Meta-Reflection
We explain each component with supporting evidence from the existing literature, along with actionable insights for next-step research.
Goal:
- (1) Estimate the task's solvability before attempting it;
- (2) generate an initial reasoning strategy based on latent skills.
Figure 3: The Self-Awareness module.
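The solvability check in goal (1) can be approximated with a simple self-consistency probe: sample several answers and treat the agreement rate as confidence. This is a minimal illustrative sketch; `sample_answer` is a hypothetical stand-in for an actual LLM call.

```python
import random
from collections import Counter

def estimate_solvability(sample_answer, n_samples=10):
    """Self-consistency proxy for task solvability: sample several answers
    and use the agreement rate of the majority answer as confidence.
    `sample_answer` stands in for an LLM call (hypothetical stub)."""
    answers = [sample_answer() for _ in range(n_samples)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / n_samples

# Toy model: answers "42" about 80% of the time, "41" otherwise.
rng = random.Random(0)
stub = lambda: "42" if rng.random() < 0.8 else "41"
answer, confidence = estimate_solvability(stub, n_samples=100)
print(answer, confidence)
if confidence < 0.5:
    print("defer: task likely beyond current skills")
```

In a real system the threshold and sampling budget would themselves be calibrated, but the sketch shows the before-attempt gating that the module proposes.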
Existing studies focus on capability-awareness, i.e., confidence/uncertainty measurement, or mission-awareness.
Figure 4: Types of Uncertainty Quantification. From ACL25 Tutorial.
| Main Category | Sub-Category | Representative Papers |
|---|---|---|
| Capability-awareness | Uncertainty Estimation | Just Ask for Calibration (2023.05); Look Before You Leap (2023.07); Can LLMs Express Their Uncertainty? (2024.02) |
| | Knowledgeable Self-awareness | Self-RAG (2023.10); SeaKR (2024.06); KnowSelf (2025.04) |
| Mission-awareness | LLM Jailbreaking Defence | Situational Awareness (2023.09); Explain Jailbreaking (2024.06); Rapid Response (2024.11) |
| | Reward Hack Defence | Sleeper Agents (2024.01); Monitor Reasoning Models (2025.03); Verbalize Reward Hacking (2025.06) |
More papers on uncertainty quantification:
- LM-Polygraph: Uncertainty Estimation for Language Models. EMNLP 2023, 80+ citations.
  - TL;DR: a framework implementing a battery of state-of-the-art uncertainty estimation methods for LLMs on text-generation tasks.
- Logical Reasoning in Large Language Models: A Survey 2025
- Chain of Logic: Rule-Based Reasoning with Large Language Models 2024
- LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning 2025
- Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning Neurips 2024
Goal: Guide search with step-level rewards that are intrinsic, faithful, dynamic, and efficient, to alleviate the reward hacking problem.
Figure 4: The Monitoring module.
| Reward Type | Granularity | Representative Papers |
|---|---|---|
| Trained Reward Model | Outcome level | Ouyang et al., 2022; Meta et al., 2024; Qwen et al., 2024 |
| | Process level | Shao et al., 2024; Wang et al., 2024; Xie et al., 2024 |
| LLM as a Judge | Outcome level | Yuan et al., 2025; Whitehouse et al., 2025 |
| Verifiable Reward | Outcome level | Deepseek et al., 2025; Liu et al., 2023; Su et al., 2025; Dou et al., 2025 |
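The goal of guiding search with step-level rewards can be sketched as a beam search over partial reasoning chains scored by a process reward at every step, rather than a single outcome reward at the end. `expand` and `step_reward` below are toy stand-ins for a step proposer and a process reward model, not real implementations.

```python
import heapq

def guided_search(expand, step_reward, beam=2, depth=3):
    """Beam search over reasoning chains scored step by step.
    `expand(chain)` proposes candidate next steps; `step_reward(chain, step)`
    is a stub process reward model (assumption: higher = better step)."""
    frontier = [(0.0, [])]  # (negative cumulative reward, chain)
    for _ in range(depth):
        candidates = []
        for score, chain in frontier:
            for step in expand(chain):
                candidates.append((score - step_reward(chain, step), chain + [step]))
        frontier = heapq.nsmallest(beam, candidates)  # keep the `beam` best chains
    best_score, best_chain = min(frontier)
    return best_chain, -best_score

# Toy setting: steps are digits, and the process reward prefers increasing steps.
expand = lambda chain: [0, 1, 2]
reward = lambda chain, step: 1.0 if (not chain or step > chain[-1]) else 0.0
chain, total = guided_search(expand, reward, beam=2, depth=3)
print(chain, total)  # [0, 1, 2] 3.0
```

Because every step is scored, a chain that "hacks" the final answer while taking unjustified steps accumulates less reward than one whose intermediate steps are each verified, which is the intuition behind process-level rows in the table above.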
A self-play system, in which the evaluator is an evolving agent and internal signals serve as the reward, offers a promising alternative for its faithfulness, controllability, and efficiency.
🚀 Self-play system
- A survey on self-evolution of large language models.
- Self-Play Preference Optimization for Language Model Alignment
🚀 Using intrinsic representation as rewards
- Reasoning Models Don't Always Say What They Think 2025. Alignment Science Team, Anthropic
- Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration ICML25, spotlight
- Latent Space Chain-Of-Embedding Enables Output-Free Llm Self-Evaluation. ICLR25
- Learning to Reason without External Rewards 2025
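One common label-free instance of "intrinsic representation as reward" is the length-normalised sequence likelihood of the model's own generation. The sketch below illustrates the idea with toy token log-probabilities; real values would come from the decoder.

```python
import math

def intrinsic_confidence(token_logprobs):
    """Label-free reward from the model's own signals: the geometric-mean
    token probability (length-normalised sequence likelihood).
    Higher values indicate a more confident generation."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

confident = [-0.1, -0.2, -0.1]   # sharply peaked next-token distributions
uncertain = [-1.5, -2.0, -1.8]   # diffuse next-token distributions
r1 = intrinsic_confidence(confident)
r2 = intrinsic_confidence(uncertain)
print(r1 > r2)  # the confident chain earns the higher intrinsic reward
```

Such signals require no external verifier or annotation, which is why the works above explore them as training rewards, though they can of course be miscalibrated.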
🚀 Compress reasoning process in the latent space for efficiency
- Training Large Language Models to Reason in a Continuous Latent Space 2025, submitted to ICLR
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation 2025, submitted to EMNLP
- Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space, ICLR25
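The latent-space compression idea in this line of work can be caricatured as feeding the last hidden state back as the next input instead of decoding each step to tokens. `step_fn` below is a stand-in for one transformer forward pass (an assumption for illustration only).

```python
def latent_reasoning(step_fn, h0, n_thoughts=3):
    """Sketch of continuous latent reasoning: intermediate 'thoughts' stay
    as hidden states and are fed back directly, with no tokens emitted.
    `step_fn` stands in for one model forward pass (hidden -> hidden)."""
    h = h0
    trace = [h]
    for _ in range(n_thoughts):  # continuous thoughts: no decoding per step
        h = step_fn(h)
        trace.append(h)
    return h, trace              # decode to text only from the final state

# Toy dynamics: each latent step halves the distance to a fixed point 1.0.
step = lambda h: h + 0.5 * (1.0 - h)
final, trace = latent_reasoning(step, 0.0, n_thoughts=3)
print(final)  # 0.875
```

The efficiency gain comes from skipping the decode-re-encode round trip for every intermediate step; only the final state is verbalised.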
↳ Addresses 🔗 Open 2 and 🎯 Open 3.
Goal: Critique and refine the generated reasoning chain with the knowledge from surrogate samples.
Figure 5: The Evaluation and Regulation module.
| Aspect | Category / Technique | Representative Papers (chronological) | Key idea | Noted gaps / limits |
|---|---|---|---|---|
| Evaluation (generating feedback on a full reasoning chain) | Template-based verbal feedback | Self-Refine (Madaan 23); PromptAgent (Wang 24d); LLM-CF (Tyen 23); Self-Eval LargeLM (Huang 24b) | Use canned prompt templates that ask the model to critique its own output. | Feedback is often shallow and template rigidity limits coverage. |
| | Critic-model feedback | Step-Level Preference (Chen 24b); Self-Correct (Kumar 24); Learning-from-Mistakes (Tong 24); Math Error Loc (Uesato 22); REFINER (Paul 24); DARS (Li 25a) | Train a separate classifier/regressor that labels wrong steps. | High annotation cost; critics may localise but not explain errors clearly. |
| | Token-based back-tracking | Backtracking (Zhang 25a); Quiet-STaR (Zelikman 24) | Add special tokens such as [RESET] to let the model roll back and try again. | Still relies on the base model’s willingness to revise. |
| | Tool-assisted feedback | Code tools: Self-Edit (Zhang 23a), Self-Debug (Chen 24c), CRITIC (Gou 24), ReTool (Feng 25a), ToolRL (Qian 25). Search engines: ReAct (Yao 23b), Check-Facts (Peng 23), Search-R1 (Jin 25), R1-Searcher (Song 25). Logic / topology: Graph-Analyser (Zhang 23b) | Call external solvers (code interpreters, web searchers, graph tools) to ground or verify intermediate results. | Coverage limited to domains where reliable tools exist. |
| | Pattern-level / multi-instance feedback | Thought-Templates (Yang 25a); Meta-Buffer (Yang 24); Semantic-Symbol Prompts (Wang 24g) | Cluster similar queries, evaluate common error patterns rather than one-off instances. | Still handcrafted; relies on LLM compliance with structured templates. |
| Regulation (using feedback to repair reasoning) | Direct self-reflection prompting | Self-Refine (Madaan 23); Self-Reflection Makes LLMs Safer (Liu 24) | Feed critique back to the same model and ask it to revise its answer. | LLMs can be “stubborn” and ignore corrections. |
| | Gradient-through-text (TextGrad) | TextGrad (Yuksekgonul 25) | Treat natural-language feedback as a gradient signal and refine the prompt. | Requires differentiable proxy; early-stage research. |
| | Explicit error-correction trajectories | Quiet-STaR (Zelikman 24); Backtracking (Zhang 25a) | Train on pairs (wrong chain → fixed chain) so the model learns how to patch errors. | May over-fit to surface patterns. |
| | ‘Think-mode’ time-outs | DeepSeek-R1 “think” mode (Chen 25) | Insert a deliberate wait to encourage an extra round of internal checks before answering. | Can introduce over-thinking latency. |
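The evaluation-then-regulation loop shared by the rows above can be sketched generically: evaluate a full chain, and if a flaw is found, feed the feedback back for repair. Here `critique` and `revise` are hypothetical stubs for a critic model and a reviser, shown on a toy task.

```python
def evaluate_and_regulate(draft, critique, revise, max_rounds=3):
    """Generic critique-and-refine loop: `critique(chain)` returns None when
    the chain passes evaluation, otherwise a feedback message that
    `revise(chain, feedback)` uses to repair the chain (regulation)."""
    chain = draft
    for _ in range(max_rounds):
        feedback = critique(chain)
        if feedback is None:             # chain passes evaluation
            return chain
        chain = revise(chain, feedback)  # regulation: repair using feedback
    return chain                         # budget exhausted; return best effort

# Toy task: the "chain" is a list of numbers that must end up sorted.
critique = lambda c: "unsorted" if c != sorted(c) else None
revise = lambda c, fb: sorted(c)
result = evaluate_and_regulate([3, 1, 2], critique, revise)
print(result)  # [1, 2, 3]
```

The `max_rounds` cap matters in practice: as the table notes, models can be "stubborn" or over-think, so an unbounded loop can waste budget without converging.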
🚀 New benchmark with inter-sample error and feedback annotation
🚀 Meta-knowledge incorporation via sharing similar:
- Template
- Symbolic match
  - Meta-Reasoning: Semantics-Symbol Deconstruction for Large Language Models. ACL 2024 Findings
- Latent reasoning process
  - Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning. NeurIPS 2024
↳ Addresses 📚 Open 4.
Goal: Perform hierarchical Bayesian updates of knowledge priors (I, E) across tasks.
(a) Meta-prompt optimisation
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
- MetaICL: Learning to Learn In Context
- Meta-learning via Language Model In-context Tuning
(b) LoRA decomposition and combination for unseen tasks
(c) Bayesian Inverse Planning
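The hierarchical Bayesian update in the goal can be illustrated with the simplest conjugate case: a Beta-Bernoulli prior over a latent skill's reliability, where the posterior from one task family becomes the prior for the next. This is a toy sketch of the update pattern, not the paper's actual model.

```python
def update_skill_prior(prior, outcomes):
    """Conjugate Beta-Bernoulli update: `prior` is (alpha, beta) over a
    skill's success probability; `outcomes` is a list of 1 (success) /
    0 (failure) observations from one task family."""
    alpha, beta = prior
    alpha += sum(outcomes)                  # observed successes
    beta += len(outcomes) - sum(outcomes)   # observed failures
    return alpha, beta

def expected_reliability(prior):
    alpha, beta = prior
    return alpha / (alpha + beta)           # posterior mean of the Beta

prior = (1.0, 1.0)                                # uninformative prior
prior = update_skill_prior(prior, [1, 1, 0, 1])   # evidence from task family A
mean_a = expected_reliability(prior)
prior = update_skill_prior(prior, [1, 1])         # posterior reused on family B
mean_b = expected_reliability(prior)
print(round(mean_a, 3), round(mean_b, 3))
```

Carrying the posterior across task families is what makes the update hierarchical: each new task starts from an informed prior rather than from scratch.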
🚀 Mechanistic interpretability for safe training and adaptation
- Toward understanding and preventing misalignment generalization. OpenAI 2025
- Mechanistic Interpretability for AI Safety: A Review. TMLR 2024
- Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models. NeurIPS 2024
🚀 Hierarchical agentic framework
- LLM-Powered Decentralized Generative Agents with Adaptive Hierarchical Knowledge Graph for Cooperative Planning AAAI25
- AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving 2025
- Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization. ICLR 2025 workshop
- HALO: Hierarchical Autonomous Logic-Oriented Orchestration for Multi-Agent LLM Systems 2025
If this work is helpful, please cite as:
@inproceedings{yan2025position,
title={Position: LLMs Need a Bayesian Meta-Reasoning Framework for More Robust and Generalizable Reasoning},
author={Hanqi Yan and Linhai Zhang and Jiazheng Li and Zhenyi Shen and Yulan He},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://kclpure.kcl.ac.uk/portal/en/publications/position-llms-need-a-bayesian-meta-reasoning-framework-for-more-r}
}