This repository contains the official implementation, data, and experiment materials for the paper “Cognitive Flow: An LLM-Automated Framework for Quantifying Reasoning Distillation.”
For theoretical background, experimental methodology, and complete visual results, please refer to the full paper.
The Cognitive Flow framework converts unstructured CoT text into a structured, quantifiable representation of reasoning style through four main stages:
The reasoning trace between `<think>` and `</think>` tags is extracted and divided into discrete reasoning steps using double-newline delimiters (`\n\n`).
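The extraction and segmentation above can be sketched as follows. This is a minimal illustration, not the repository's code; the actual pipeline lives in the `to_steps.ipynb` notebooks.

```python
import re

def extract_reasoning_steps(completion: str) -> list[str]:
    """Pull the reasoning trace out of a raw completion and split it into steps.

    Sketch of the segmentation stage: take the text between <think> and
    </think>, then treat double newlines as step delimiters.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return []
    trace = match.group(1)
    # Double newlines delimit discrete reasoning steps; drop empty fragments.
    return [step.strip() for step in trace.split("\n\n") if step.strip()]

steps = extract_reasoning_steps(
    "<think>Read the problem.\n\nCompute 2 + 2 = 4.\n\nCheck the result.</think>The answer is 4."
)
# steps == ["Read the problem.", "Compute 2 + 2 = 4.", "Check the result."]
```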
A Label Extractor LLM analyzes a large random sample of reasoning steps (≈1000) to define a concise set of cognitive state labels (e.g., Interpretation, Calculation, Verification).
Prompt templates for this stage are available in cognitive_flow_utils/prompt_templates.py.
A Step Classifier LLM assigns one cognitive label to each reasoning step in a few-shot classification setup. The corresponding prompts are also defined in prompt_templates.py.
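As an illustration of the few-shot setup, a classifier prompt might be assembled like this. The wording below is assumed for illustration; the real system prompts are defined in `cognitive_flow_utils/prompt_templates.py`.

```python
def build_classification_prompt(labels, examples, step):
    """Assemble a few-shot prompt for a Step Classifier LLM.

    Illustrative only: the actual prompt wording used by the framework is
    in cognitive_flow_utils/prompt_templates.py and is assumed here.
    """
    lines = ["Classify the reasoning step into one of: " + ", ".join(labels) + "."]
    for example_step, example_label in examples:
        lines.append(f"Step: {example_step}\nLabel: {example_label}")
    lines.append(f"Step: {step}\nLabel:")  # The model completes the final label.
    return "\n\n".join(lines)

prompt = build_classification_prompt(
    ["Interpretation", "Calculation", "Verification"],
    [("Compute 3 * 4 = 12.", "Calculation")],
    "Check the sum again.",
)
```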
Labeled sequences are aggregated into an N×N state transition matrix, capturing the conditional probability of transitions between cognitive states — effectively a “fingerprint” of a model’s reasoning style.
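Aggregating labeled sequences into such a matrix can be sketched as follows (an assumed helper, not the repository's implementation), row-normalising raw transition counts:

```python
import numpy as np

def transition_matrix(label_sequences, labels):
    """Aggregate labeled step sequences into a row-normalised N x N matrix.

    Sketch of the aggregation stage: entry (i, j) holds the empirical
    probability P(next = labels[j] | current = labels[i]).
    """
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for sequence in label_sequences:
        for src, dst in zip(sequence, sequence[1:]):
            counts[index[src], index[dst]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states never seen as a source stay all-zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

labels = ["Interpretation", "Calculation", "Verification"]
M = transition_matrix(
    [["Interpretation", "Calculation", "Verification"],
     ["Interpretation", "Calculation", "Calculation"]],
    labels,
)
# M[0, 1] == 1.0: in this toy data, Interpretation always moves to Calculation.
```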
This matrix is used for quantitative comparisons across models through metrics such as Cosine Similarity (CS) and Kullback-Leibler Divergence (KLD).
```
├── cognitive_flow_utils/
│   ├── dataset_utils.py          # Helper functions for handling prompt datasets
│   ├── llm_methods.py            # Core functions for querying LLMs
│   ├── models_and_clients.py     # Definitions and clients for models (DeepSeek, Gemma, etc.)
│   ├── prompt_templates.py       # System prompts for Label Extractor and Step Annotator
│   ├── step_annotation.py        # StepAnnotator class for batch annotation
│   └── ...
│
├── mmlu-elementary-maths/
│   ├── elementary_labels.txt
│   ├── elementary_maths_prompts.csv
│   ├── ..._steps.csv
│   └── to_steps.ipynb
│
├── mmlu-high-school-maths/
│   ├── high_school_labels.txt
│   ├── hs_maths_prompts.csv
│   ├── ..._steps.csv
│   └── to_steps.ipynb
│
├── mmlu-college-maths/
│   ├── college_labels.txt
│   ├── college_maths_prompts.csv
│   ├── ..._steps.csv
│   └── to_steps.ipynb
│
├── get_completions_from_prompts.py     # Generate model reasoning completions
├── annotate_steps_dataset.py           # Label reasoning steps
├── flow_analysis.ipynb                 # Cognitive Flow (matrix/graph) analysis
├── state_distribution_analysis.ipynb   # Cognitive state frequency analysis
├── token_distribution_analysis.ipynb   # Token effort distribution analysis
└── README.md
```
If you use API-served models, set the relevant API keys (e.g., DeepSeek, Groq) as environment variables; they are read in `cognitive_flow_utils/models_and_clients.py`.
The experimental pipeline consists of four main stages:
Run the reasoning generation script:
```shell
python get_completions_from_prompts.py
```

This script produces raw reasoning outputs (CoTs) for a target model and dataset. Parameters such as the dataset path, model, and temperature are configured within the script.
Use the notebooks (`to_steps.ipynb`) in each dataset folder to:
- Extract `<think>` text segments
- Split reasoning into individual steps (`\n\n`)
- Save as `..._steps.csv`
Classify reasoning steps using:

```shell
python annotate_steps_dataset.py
```

This script assigns cognitive labels to each step and can optionally generate a new label set.
With annotated datasets, use the Jupyter notebooks to:
- Build state transition matrices
- Compute CS and KLD between models
- Generate visualizations
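The two comparison metrics can be sketched as follows: cosine similarity over the flattened matrices, and KL divergence after additive smoothing to avoid log(0). The exact smoothing scheme used in the paper is an assumption here.

```python
import numpy as np

def compare_matrices(P, Q, eps=1e-10):
    """Compare two transition matrices with cosine similarity (CS) and
    Kullback-Leibler divergence (KLD).

    Minimal sketch: both metrics operate on the flattened matrices; KLD uses
    additive smoothing and renormalisation (an assumed choice) so that both
    vectors are valid, strictly positive probability distributions.
    """
    p, q = P.ravel().astype(float), Q.ravel().astype(float)
    cs = float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
    # Smooth and renormalise before taking logs.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    kld = float(np.sum(p * np.log(p / q)))
    return cs, kld

P = np.array([[0.7, 0.3], [0.4, 0.6]])
cs, kld = compare_matrices(P, P)
# Identical matrices give CS = 1.0 and KLD = 0.0
```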
Experiments are based on three subsets of the MMLU benchmark, representing increasing task complexity:
- Elementary Maths
- High School Maths
- College Maths
Each subset includes:
- The original MMLU prompts
- Cognitive label sets generated via the Label Extractor LLM
- Annotated reasoning steps for each evaluated model
All data are available within their corresponding /mmlu-* directories.
The Cognitive Flow framework provides a quantitative lens on reasoning transfer.
Analysis of the DeepSeek-R1 model family shows that:
- High similarity is observed between teacher and student reasoning on medium-complexity tasks.
- Divergence increases significantly on both simple and highly complex tasks.
- Distilled models tend to underperform in “Verification”-related reasoning, neglecting cognitive self-checking.
- Independently trained RL-based models (e.g., Qwen QwQ-32B) display more balanced and adaptable reasoning flows.
These findings suggest that while knowledge distillation (KD) effectively transmits surface reasoning structure, it may not capture deeper, flexible cognitive strategies.