RAF is a software framework developed to operationally evaluate compositional reasoning behavior in Large Language Models (LLMs) under controlled complexity scaling. The framework is designed to distinguish between cognitive failure (reasoning degradation under task complexity) and system failure (artifacts arising from infrastructure limits such as token truncation or parsing errors).
Rather than making claims about internal cognition, RAF focuses on observable, reproducible behavioral signals obtained through carefully structured evaluation pipelines.
RAF follows a decoupled, artifact-aware pipeline architecture to isolate reasoning variables and ensure measurement fidelity.
- **Problem Generator** (`test_builder.py`): Generates arithmetic reasoning tasks with linear operation scaling, ensuring monotonic and fine-grained complexity growth.
- **Evaluator** (`evaluator.py`): Orchestrates API calls across multiple model families with structured logging and failure handling.
- **Parser & Judge** (`response_parser.py`): Applies regex-based validation to separate malformed or truncated outputs from logically incorrect reasoning.
- **Metrics Calculator** (`calculator.py`): Computes R-CDS (Robust Compositional Decay Score) to capture non-linear performance collapses across complexity levels.
- **Monitoring Layer**: A lightweight Streamlit dashboard visualizes accuracy trends and decay behavior in real time.
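As a concrete illustration of the generator stage, the sketch below builds an arithmetic task with an exact operation count. The function and field names are hypothetical; the actual interface of `test_builder.py` may differ.

```python
import json
import random


def build_task(num_ops: int, seed: int | None = None) -> dict:
    """Build one arithmetic task whose expression contains exactly `num_ops`
    binary operations, so complexity grows linearly with the operation count."""
    rng = random.Random(seed)
    expr = str(rng.randint(1, 9))
    for _ in range(num_ops):
        op = rng.choice(["+", "-", "*"])
        expr = f"({expr} {op} {rng.randint(1, 9)})"
    return {
        "complexity": num_ops,       # linear complexity unit (operation count)
        "prompt": f"Compute step by step: {expr}",
        "expected": eval(expr),      # ground truth; safe here, expression is self-generated
    }


if __name__ == "__main__":
    # Emit one task per complexity level as JSONL, mirroring the framework's
    # structured-artifact persistence.
    with open("tasks.jsonl", "w") as f:
        for k in range(1, 30):
            f.write(json.dumps(build_task(k, seed=k)) + "\n")
```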
| Feature | Phase I & II (Diagnostic Scripts) | Phase III / v3.0 (Framework) |
|---|---|---|
| Logic Structure | Monolithic evaluation flow | Decoupled generator, evaluator, and analysis |
| Complexity Unit | Nesting depth | Linear operation count |
| Error Handling | Mixed logic & system failures | Artifact-aware parsing separation |
| Primary Metric | CDS (aggregate decay) | R-CDS (shape-aware decay) |
| Persistence | Console logs | Structured JSONL artifacts |
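For concreteness, the structured persistence amounts to one JSON record per model response. A sketch of such a record follows; the field names are hypothetical, not the exact schema used by `evaluator.py`.

```python
import json
from datetime import datetime, timezone

# Hypothetical per-response record; the actual schema used by evaluator.py may differ.
record = {
    "model": "llama-3.1-8b",
    "complexity": 17,                 # linear operation count of the task
    "prompt_id": "ops17-seed42",
    "raw_response": "... full model output ...",
    "parsed_answer": 128,
    "expected": 128,
    "failure_type": None,             # None | "truncation" | "parse_error" | "wrong_answer"
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# One JSON object per line keeps runs appendable and easy to re-analyze later.
with open("results.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```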
RAF evolved through three iterative phases focused on reducing experimental noise and improving interpretability.
| Feature | Phase I | Phase II | Phase III |
|---|---|---|---|
| Complexity Control | Nesting depth (0–10) | Reduced depth (0–5) | Linear ops (1–29) |
| Scaling Behavior | Non-monotonic jumps | Stabilized depth | Monotonic increments |
| Evaluation Metric | CDS | CDS | R-CDS |
| System Stability | Low | Medium | High |
| Model Scope | Single model | Single model | Multi-model |
- **Depth → Operations**: Nesting depth introduced hidden non-linear operation growth; linear operation counts enabled finer and more interpretable scaling.
- **Infrastructure Isolation**: Failures beyond certain depths were often attributable to serving-layer constraints rather than reasoning limits, motivating explicit artifact separation.
- **Shape-Aware Evaluation**: R-CDS was introduced to penalize sharp accuracy collapses that aggregate metrics fail to capture.
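To make the artifact-separation idea concrete, here is a minimal sketch of the kind of classification `response_parser.py` performs. The regexes, labels, and heuristics are assumptions for illustration, not the framework's exact rules.

```python
import re


def classify_response(raw: str, expected: int) -> str:
    """Separate system artifacts from genuine reasoning errors.

    Returns one of:
      "truncation"   - output cut off before any answer marker (system failure)
      "parse_error"  - no extractable numeric answer (system failure)
      "wrong_answer" - answer extracted but incorrect (reasoning failure)
      "correct"      - answer extracted and correct
    """
    text = raw.strip()

    # Heuristic truncation check: no answer marker and the text ends mid-sentence.
    if text and not re.search(r"answer|=", text, re.IGNORECASE) and not text.endswith((".", "!", "?")):
        return "truncation"

    # Take the last integer in the response as the candidate final answer.
    numbers = re.findall(r"-?\d+", text)
    if not numbers:
        return "parse_error"

    return "correct" if int(numbers[-1]) == expected else "wrong_answer"
```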
R-CDS combines overall performance with sensitivity to sudden degradation.
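One formulation with this shape (a sketch; the exact form implemented in `calculator.py` may differ) is

$$\text{R-CDS} = \text{AUC} \times (1 - D_{max})$$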
where:
- AUC is the area under the accuracy curve across increasing complexity levels (overall performance)
- $$D_{max}$$ is the maximum accuracy drop between consecutive complexity levels
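A minimal reference computation over per-level accuracies, following the formulation sketched above (the actual `calculator.py` may normalize or weight terms differently):

```python
def r_cds(accuracies: list[float]) -> float:
    """Robust Compositional Decay Score for accuracies ordered by complexity level.

    Combines overall performance (normalized area under the accuracy curve) with a
    penalty for the sharpest drop between consecutive complexity levels.
    """
    if len(accuracies) < 2:
        raise ValueError("need accuracies for at least two complexity levels")
    steps = list(zip(accuracies, accuracies[1:]))
    # Normalized trapezoidal area under the accuracy-vs-complexity curve, in [0, 1].
    auc = sum((a + b) / 2 for a, b in steps) / len(steps)
    # Largest single-step accuracy collapse between consecutive levels (D_max).
    d_max = max(max(a - b, 0.0) for a, b in steps)
    return auc * (1.0 - d_max)


# A sharp cliff is penalized more than a smooth decline with a similar average.
print(r_cds([1.0, 1.0, 0.95, 0.90, 0.60]))     # ends in a 0.30 cliff  -> ~0.64
print(r_cds([1.0, 0.925, 0.85, 0.775, 0.70]))  # smooth decline        -> ~0.79
```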
| Model | Ops Range | Accuracy Trend | Max Drop ($$D_{max}$$) | R-CDS |
|---|---|---|---|---|
| ChatGPT-OSS 120B | 1–19 | Stable plateau | 0.05 | 0.95 |
| ChatGPT-OSS 20B | 1–19 | Near-perfect plateau | 0.00 | 1.00 |
| LLaMA 3.1 8B | 1–29 | Gradual decay, sharp cliff | 0.30 | 0.70 |
Results are intended to demonstrate measurement behavior, not to serve as a definitive leaderboard.
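A summary like the one above could be derived directly from the persisted artifacts; a hypothetical aggregation step (assuming the JSONL schema sketched earlier) might look like:

```python
import pandas as pd

# Turn logged responses into per-model accuracy curves over complexity levels.
df = pd.read_json("results.jsonl", lines=True)
df["correct"] = df["failure_type"].isna() & (df["parsed_answer"] == df["expected"])
curves = df.groupby(["model", "complexity"])["correct"].mean().unstack("complexity")
print(curves.round(2))
```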
RAF demonstrates that careful system design and artifact-aware evaluation can reveal structured reasoning degradation patterns that are otherwise obscured by infrastructure noise. The framework emphasizes:
- Behavioral evaluation of reasoning under controlled complexity
- Separation of system artifacts from logical failures
- Shape-aware metrics for identifying non-linear performance collapse
- Scalable benchmarking infrastructure across model sizes
RAF is positioned as a measurement framework, not a claim about internal model cognition.
- Extend Information-Constrained Compositional Reasoning (ICCR) tasks as downstream benchmarks built on RAF.
- Introduce tighter information budgets and constrained query interfaces.
- Analyze query traces as first-class evaluation artifacts.
- Study robustness of planning behavior across model scales.
- Open-source task generators and evaluators for reproducible research.
Institution: Indian Institute of Information Technology Guwahati (IIITG)
Author: Hillol Pratim Kalita
Advisor: Prof. Ferdous A. Barbhuiya
Date: December 2025

