hanqi-qi/LLM_MetaReasoning

Position: LLMs Need a Bayesian Meta-Reasoning Framework for More Robust and Generalizable Reasoning

Paper: Position: LLMs Need a Bayesian Meta-Reasoning Framework for More Robust and Generalizable Reasoning (ICML 2025) [link]


❓ 1. Open Problem

Large Language Models (LLMs) face several fundamental challenges in reasoning and decision-making. Below are four key open problems that motivate the need for a meta-reasoning framework:

  • 🚨 Lack of self-awareness in knowledge and ethics: LLMs often exhibit a strong “Feeling of Knowing” but lack crucial human-like cognitive attributes, such as “awareness of limitations” and “awareness of situation”.

  • 🔗 Inflexible Strategy: LLMs lack a flexible reasoning strategy for individual problems, e.g., overthinking simple questions and coordinating diverse tools inefficiently across different reasoning phases.

  • 🎯 Reward Hacking: Reasoning agents exploit flaws in the reward function to achieve high scores without genuinely learning transferable reasoning patterns.

  • 📚 Knowledge Updating: Current on-the-fly knowledge retrieval and fine-tuning fail to adequately address knowledge conflicts and resource inefficiency, especially when multi-source knowledge must be injected.


✨ 2. Overall Framework

Large language models (LLMs) excel at pattern-completion yet often struggle with reliable reasoning—they hallucinate, over-generalise, or overshoot their reward signals.
As shown in Figure 1, the paper proposes a Bayesian meta-reasoning framework that equips an LLM with four interacting modules.

Figure 1: Overview of the framework.

How can these modules alleviate the above limitations?

  • 🧠 Self-Awareness
    • Establishes a unified framework to measure task solvability before generating reasoning steps
    • Outputs an initial strategy F for adaptive reasoning
    • Addresses 🚨 Open 1 and 🔗 Open 2.
  • 🔍 Monitoring
    • Verifies each reasoning step, given the initial reasoning strategy
    • Evaluates the intermediate reasoning process with intrinsic and dynamic rewards
    • Addresses 🔗 Open 2 and 🎯 Open 3.
  • ✅ Evaluation & Regulation
    • Critiques the completed reasoning process and corrects errors
    • Incorporates knowledge from surrogate samples that share the latent reasoning process
    • Addresses 📚 Open 4.
  • 🔄 Meta-Reflection
    • Updates the model parameters and the initial reasoning strategy
    • Utilises meta-observations across multiple samples to resolve knowledge conflicts
    • Addresses 🔗 Open 2 and 📚 Open 4.
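The interaction of the four modules can be sketched as a single control loop. Everything below (function names, the toy solvability check, the step rewards) is a hypothetical illustration of the module interfaces, not the paper's implementation:

```python
# Hypothetical sketch of the four-module loop; module internals are
# toy placeholders standing in for LLM calls.

def self_awareness(task):
    # Estimate solvability, then pick an initial strategy F.
    solvable = len(task) > 0                 # toy solvability check
    strategy = "decompose" if solvable else "abstain"
    return solvable, strategy

def monitoring(step, strategy):
    # Score an intermediate step with an intrinsic reward in [0, 1].
    return 1.0 if strategy == "decompose" and step else 0.0

def evaluation_and_regulation(chain, rewards):
    # Critique the finished chain; drop steps with low reward.
    return [s for s, r in zip(chain, rewards) if r > 0.5]

def meta_reflection(history, strategy):
    # Update the prior strategy from meta-observations across samples.
    return strategy if history else "revise"

def meta_reason(task, steps):
    solvable, strategy = self_awareness(task)
    if not solvable:
        return [], strategy                  # abstain before reasoning
    rewards = [monitoring(s, strategy) for s in steps]
    chain = evaluation_and_regulation(steps, rewards)
    strategy = meta_reflection(chain, strategy)
    return chain, strategy
```

The key design point the sketch mirrors is ordering: solvability is checked before any reasoning step is generated, and the strategy is only updated after the full chain has been critiqued.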

🧩 3. Literature review with actionable insights

We ground each component in existing literature and distil actionable insights for next-step research.

🧠 3.1. Self-Awareness

Goal:

  • (1) Estimate the task solvability before attempting it;
  • (2) Generate an initial reasoning strategy based on latent skills.

Figure 3: The Self-Awareness module.

Related work:

Existing studies focus on capability-awareness (i.e., confidence/uncertainty measurement) or mission-awareness.

Figure 4: Types of Uncertainty Quantification. From ACL25 Tutorial.

| Main Category | Sub-Category | Representative Papers |
|---|---|---|
| Capability-awareness | Uncertainty Estimation | Just Ask for Calibration (2023.05); Look Before You Leap (2023.07); Can LLMs Express Their Uncertainty? (2024.02) |
| Capability-awareness | Knowledgeable Self-awareness | Self-RAG (2023.10); SeaKR (2024.06); KnowSelf (2025.04) |
| Mission-awareness | LLM Jailbreaking Defence | Situational Awareness (2023.09); Explain Jailbreaking (2024.06); Rapid Response (2024.11) |
| Mission-awareness | Reward Hack Defence | Sleeper Agents (2024.01); Monitor Reasoning Models (2025.03); Verbalize Reward Hacking (2025.06) |

More papers in Uncertainty quantification:

Actionable insights

⚠️ Needed: a unified framework that integrates multi-aspect task solvability, including factors beyond knowledge boundaries and ethical considerations, such as prioritising efficiency or addressing constraints for specific user groups (e.g., teenagers).

⚠️ Current methods lack adaptability in measuring and selecting diverse latent skills.
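As a toy illustration of capability-awareness, a common proxy for task solvability is the predictive entropy of the model's distribution over sampled answers; the threshold and the probability vectors below are illustrative assumptions, not values from the paper:

```python
import math

# Hypothetical solvability gate: treat high predictive entropy over
# candidate answers as "beyond the model's knowledge boundary".

def predictive_entropy(probs):
    # Shannon entropy (in nats) of a distribution over sampled answers.
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_attempt(probs, threshold=0.7):
    # Attempt the task only when the model is sufficiently certain.
    return predictive_entropy(probs) < threshold

confident = [0.9, 0.05, 0.05]   # mass on one answer -> low entropy
uncertain = [0.34, 0.33, 0.33]  # near-uniform -> high entropy
```

Sampling-based entropy of this kind is what black-box uncertainty estimators (e.g., the "Just Ask for Calibration" line of work) approximate without access to logits.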

🔍 3.2. Monitoring

Goal: Guide search with step-level rewards that are intrinsic, faithful, dynamic, and efficient, to alleviate the reward-hacking problem.

Figure 4: The Monitoring module.

Related work

| Reward Type | Granularity | Representative Papers |
|---|---|---|
| Trained Reward Model | Outcome level | Ouyang et al., 2022; Meta et al., 2024; Qwen et al., 2024 |
| Trained Reward Model | Process level | Shao et al., 2024; Wang et al., 2024; Xie et al., 2024 |
| LLM as a Judge | Outcome level | Yuan et al., 2025; Whitehouse et al., 2025 |
| Verifiable Reward | Outcome level | Deepseek et al., 2025; Liu et al., 2023; Su et al., 2025; Dou et al., 2025 |

Actionable insights

⚠️ Verifiable rewards, pre-trained reward models, and LLM-as-a-Judge all have notable limitations: they often overlook reasoning diversity, rely on expensive human annotation, are unreliable, or fail to adapt to changing environments.

A self-play system, in which the evaluator is an evolving agent and internal signals serve as rewards, offers a promising alternative for its faithfulness, controllability, and efficiency.

🚀 Self-play systems

🚀 Using intrinsic representations as rewards

🚀 Compressing the reasoning process in latent space for efficiency

Addresses 🔗 Open 2 and 🎯 Open 3.
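Step-level reward-guided search can be sketched as a greedy best-of-n selection at each depth. The intrinsic reward below is a toy stand-in (it just scores string length); in practice it would be an internal signal such as representation-based confidence:

```python
# Hypothetical sketch of step-level reward-guided search: at each
# depth, score the sampled candidate next steps and keep the one the
# (toy) intrinsic reward model ranks highest.

def intrinsic_reward(step):
    # Stand-in for an internal signal; here it simply prefers steps
    # of length 5 (an arbitrary toy criterion).
    return -abs(len(step) - 5)

def guided_search(candidate_steps_per_depth):
    chain = []
    for candidates in candidate_steps_per_depth:
        best = max(candidates, key=intrinsic_reward)
        chain.append(best)
    return chain
```

Because the reward is computed per step rather than only on the final answer, a flawed intermediate step can be rejected before it contaminates the rest of the chain, which is the lever this module uses against reward hacking.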


✅ 3.3. Evaluation and Regulation

Goal: Critique and refine the generated reasoning chain with the knowledge from surrogate samples.

Figure 5: The Evaluation and Regulation module.

Related work

| Aspect | Category / Technique | Representative Papers (chronological) | Key idea | Noted gaps / limits |
|---|---|---|---|---|
| Evaluation (generating feedback on a full reasoning chain) | Template-based verbal feedback | Self-Refine (Madaan 23); PromptAgent (Wang 24d); LLM-CF (Tyen 23); Self-Eval LargeLM (Huang 24b) | Use canned prompt templates that ask the model to critique its own output. | Feedback is often shallow and template rigidity limits coverage. |
| | Critic-model feedback | Step-Level Preference (Chen 24b); Self-Correct (Kumar 24); Learning-from-Mistakes (Tong 24); Math Error Loc (Uesato 22); REFINER (Paul 24); DARS (Li 25a) | Train a separate classifier/regressor that labels wrong steps. | High annotation cost; critics may localise but not explain errors clearly. |
| | Token-based back-tracking | Backtracking (Zhang 25a); Quiet-STaR (Zelikman 24) | Add special tokens such as [RESET] to let the model roll back and try again. | Still relies on the base model’s willingness to revise. |
| | Tool-assisted feedback | Code tools: Self-Edit (Zhang 23a); Self-Debug (Chen 24c); CRITIC (Gou 24); ReTool (Feng 25a); ToolRL (Qian 25). Search engines: ReAct (Yao 23b); Check-Facts (Peng 23); Search-R1 (Jin 25); R1-Searcher (Song 25). Logic / topology: Graph-Analyser (Zhang 23b) | Call external solvers (code interpreters, web searchers, graph tools) to ground or verify intermediate results. | Coverage limited to domains where reliable tools exist. |
| | Pattern-level / multi-instance feedback | Thought-Templates (Yang 25a); Meta-Buffer (Yang 24); Semantic-Symbol Prompts (Wang 24g) | Cluster similar queries, evaluate common error patterns rather than one-off instances. | Still handcrafted; relies on LLM compliance with structured templates. |
| Regulation (using feedback to repair reasoning) | Direct self-reflection prompting | Self-Refine (Madaan 23); Self-Reflection Makes LLMs Safer (Liu 24) | Feed critique back to the same model and ask it to revise its answer. | LLMs can be “stubborn” and ignore corrections. |
| | Gradient-through-text (TextGrad) | TextGrad (Yuksekgonul 25) | Treat natural-language feedback as a gradient signal and refine the prompt. | Requires differentiable proxy; early-stage research. |
| | Explicit error-correction trajectories | Quiet-STaR (Zelikman 24); Backtracking (Zhang 25a) | Train on pairs (wrong chain → fixed chain) so the model learns how to patch errors. | May over-fit to surface patterns. |
| | ‘Think-mode’ time-outs | DeepSeek-R1 “think” mode (Chen 25) | Insert a deliberate wait to encourage an extra round of internal checks before answering. | Can introduce over-thinking latency. |

Actionable insights

⚠️ To enable multi-sample/meta-level error analysis and correction, we need new benchmarks that include rich error and feedback annotations. This allows us to link samples with similar mistakes and let them inform each other’s solving processes.

🚀 New benchmark with inter-sample error and feedback annotation

⚠️ To incorporate relevant knowledge beyond embedding-based similarity of input questions, we can match samples that share latent skills or similar distilled reasoning patterns, such as templates, symbolic forms, or underlying causal graphs.

🚀 Meta-knowledge incorporation via sharing similar reasoning patterns (templates, symbolic forms, or causal graphs)

Addresses 📚 Open 4.
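The evaluation-regulation interplay can be sketched as a critique-and-revise loop in the spirit of Self-Refine; both functions below are toy stand-ins for LLM calls, and the `"??"` marker and string repair are purely illustrative:

```python
# Hypothetical critique-and-revise loop: a verifier flags the first
# faulty step, and the regulator patches it, for a bounded number of
# rounds so a "stubborn" model cannot loop forever.

def critique(chain):
    # Return the index of the first flagged step, or -1 if none.
    for i, step in enumerate(chain):
        if "??" in step:          # toy error marker
            return i
    return -1

def regulate(chain, max_rounds=3):
    for _ in range(max_rounds):
        i = critique(chain)
        if i < 0:
            return chain          # no errors left
        chain[i] = chain[i].replace("??", "4")  # toy repair
    return chain
```

The bounded `max_rounds` mirrors the "think-mode time-outs" row of the table: extra rounds of checking help, but unlimited ones trade correctness gains for over-thinking latency.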


🔄 3.4. Meta-Reflection

Goal: Perform hierarchical Bayesian updates of knowledge priors (I, E) across tasks.
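A minimal (single-level, non-hierarchical) instance of this kind of update is a conjugate Beta-Bernoulli posterior over a per-skill success probability; the skill name and the model choice are illustrative assumptions, not the paper's formulation of the priors (I, E):

```python
# Toy conjugate Bayesian update of a per-skill prior across tasks.

def update_skill_prior(alpha, beta, successes, failures):
    # Beta(alpha, beta) prior + Bernoulli task outcomes -> Beta posterior.
    return alpha + successes, beta + failures

def skill_mean(alpha, beta):
    # Posterior mean: expected probability the skill solves a task.
    return alpha / (alpha + beta)

# Start from a uniform prior over a hypothetical "algebra" skill ...
a, b = 1.0, 1.0
# ... then fold in 8 successes and 2 failures observed across tasks.
a, b = update_skill_prior(a, b, successes=8, failures=2)
```

A hierarchical version would additionally share statistics across related skills through a common hyper-prior, which is what lets meta-observations on one task family inform another.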

Related work

(a) Meta-prompt optimisation

(b) LoRA decomposition and combination for unseen tasks

(c) Bayesian Inverse Planning

Actionable insights

⚠️ Knowledge conflicts and inefficiency arise when updating multi-source knowledge.

🚀 Mechanistic interpretability for safe training and adaptation

🚀 Hierarchical agentic framework

Addresses 🔗 Open 2 and 📚 Open 4.


📖 Citation

If this work is helpful, please cite as:

@inproceedings{yan2025position,
      title={Position: LLMs Need a Bayesian Meta-Reasoning Framework for More Robust and Generalizable Reasoning},
      author={Yan, Hanqi and Zhang, Linhai and Li, Jiazheng and Shen, Zhenyi and He, Yulan},
      booktitle={Forty-second International Conference on Machine Learning},
      year={2025},
      url={https://kclpure.kcl.ac.uk/portal/en/publications/position-llms-need-a-bayesian-meta-reasoning-framework-for-more-r}
}
