Mechanistic Interpretability of Information Evolution in Large Language Models
Layer-Probing is a research project for analyzing the internal representations of Large Language Models (LLMs). By training MLP probes on the residual streams of transformer layers, this project tracks how predictive information about the next token emerges and stabilizes across layers — evaluated on mathematical reasoning tasks from the GSM8K benchmark.
Traditional LLM evaluation relies heavily on final-layer outputs. Layer-Probing measures how well each layer's MLP probe matches the final layer's prediction (via accuracy and KL divergence), and how well it matches the ground truth token (via CE loss).
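The three per-layer metrics described above can be illustrated with a minimal, standard-library sketch. It is not the project's implementation: the function names, the toy logits, and the KL direction (final distribution relative to probe distribution) are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log of zero in q."""
    return sum(pi * math.log(pi / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(q, target_index, eps=1e-12):
    """Negative log-probability that q assigns to the ground-truth token."""
    return -math.log(q[target_index] + eps)

def layer_metrics(probe_logits, final_logits, target_index):
    """Compare one layer's probe output with the final layer and the ground truth."""
    probe = softmax(probe_logits)
    final = softmax(final_logits)
    return {
        # Accuracy term: does the probe pick the same top token as the final layer?
        "agrees_with_final": probe.index(max(probe)) == final.index(max(final)),
        # Assumed direction: KL(final || probe).
        "kl_to_final": kl_divergence(final, probe),
        # CE loss of the probe against the ground-truth token.
        "ce_to_truth": cross_entropy(probe, target_index),
    }

# Hypothetical 4-token vocabulary; token 0 is the ground truth.
final_logits = [2.0, 0.5, -1.0, 0.0]
early_probe = [0.1, 0.0, 0.2, 0.1]   # nearly uniform, disagrees with final layer
late_probe = [1.9, 0.6, -0.8, 0.1]   # close to the final layer

for name, logits in [("early", early_probe), ("late", late_probe)]:
    print(name, layer_metrics(logits, final_logits, target_index=0))
```

Averaging `agrees_with_final` over many tokens gives the accuracy curve, while the KL and CE values are averaged directly.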
This project provides tools to analyze and visualize:
- Information Bottlenecks: Identifying at which layer a model's internal representations converge toward its final prediction, using MLP probes trained on residual stream activations.
- Model Comparisons: Comparing both the raw text outputs and the internal residual stream dynamics between a base instruction-tuned model (Qwen2) and a reasoning-distilled model (DeepSeek-R1).
- Residual Stream Dynamics: Measuring how the entropy and variance of activations evolve across layers as information flows through the network.
When probing DeepSeek-R1-Distill-Qwen-7B on GSM8K data, both KL divergence (~1.6) and CE loss (~2.4) drop sharply after layer 0 and stabilize by layers 1–2, suggesting the model aligns with its final output very early. Probe accuracy peaks at ~0.77 around layers 21–23, and a notable CE/KL spike at the final layer suggests that the last layer actively reshapes the representation.
DeepSeek-R1 plateaus at a significantly higher entropy (~13) compared to Qwen2 (~10) across all middle layers. The contrast is sharpest in last-token dynamics — DeepSeek-R1 commits to a high-entropy representation by layer 12, while Qwen2 increases slowly and linearly throughout, never plateauing.
For the prompt "the dog the dog the dog...", DeepSeek-R1 is highly confident with a top-50 cumulative probability of 0.9736, while Qwen2 is far more uncertain at 0.5303, with probability spread across many candidate tokens.
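For readers unfamiliar with the statistic, the following is a minimal sketch of how a top-k cumulative probability can be computed from next-token logits. The function name and the toy logits are hypothetical; the project's scripts may compute it differently.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_cumulative_probability(logits, k=50):
    """Sum of the k largest next-token probabilities."""
    probs = softmax(logits)
    return sum(sorted(probs, reverse=True)[:k])

# Toy 1000-token vocabulary (illustrative values only).
peaked_logits = [10.0] + [0.0] * 999   # one dominant candidate
flat_logits = [0.0] * 1000             # probability spread evenly

print("peaked:", top_k_cumulative_probability(peaked_logits))
print("flat:  ", top_k_cumulative_probability(flat_logits))
```

A value near 1 means almost all probability mass sits on the top 50 candidates; a lower value means the mass is spread over a long tail of tokens.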
- Sequential Probe Training: Trains individual MLP probes for every transformer layer using memory-optimized sequential processing to prevent Out-Of-Memory (OOM) errors.
- KL Divergence & CE Loss Metrics: Measures how closely each layer's hidden state aligns with the final output distribution using KL divergence, and against ground truth using cross-entropy loss.
- Residual Stream Entropy Analysis: Calculates layer-wise log entropy via the Frobenius norm of the covariance matrix, enabling comparison of information dynamics across models.
- GSM8K Integration: Benchmarked on math reasoning tasks to track where next-token predictions converge across layers.
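As an illustration of the residual stream entropy feature, here is a small standard-library sketch that takes the log of the Frobenius norm of the feature covariance matrix for one layer's activations. The exact formula used by the project is not stated here, so the formula, the function names, and the toy activations are all assumptions.

```python
import math

def covariance_matrix(rows):
    """Sample covariance across features; rows are tokens, columns are hidden dims."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

def log_covariance_frobenius_norm(rows, eps=1e-12):
    """Assumed entropy proxy: log of the Frobenius norm of the covariance matrix."""
    cov = covariance_matrix(rows)
    frobenius = math.sqrt(sum(v * v for row in cov for v in row))
    return math.log(frobenius + eps)

# Toy activations: 3 tokens x 2 hidden dimensions per layer (illustrative only).
layer_activations = {
    0: [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0]],
    1: [[10.0, 20.0], [30.0, 10.0], [0.0, 40.0]],  # same pattern, 10x the scale
}

for layer, acts in layer_activations.items():
    print(layer, log_covariance_frobenius_norm(acts))
```

Under this assumed definition, scaling the activations by a factor s shifts the value by 2·ln(s), so the measure tracks the overall spread of the residual stream rather than a normalized entropy.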
- Python 3.10+
- CUDA-enabled GPU (highly recommended for probe training)
git clone https://github.com/nitesh-77/Layer-Probing.git
cd Layer-Probing
pip install -r requirements.txt
Train diagnostic MLP probes on the GSM8K dataset to measure when internal states begin to align with the final output:
python information_level_identifier.py --model "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" --batch-size 64 --gradient-accum 4 --logging-steps 100
Note: Use --max-layers N to limit training to the first N layers if you run into memory issues.
Observe differences in generated completions between Qwen2 and DeepSeek-R1:
python model_comparison.py --prompt "If 3x + 5 = 20, what is x?" --max_tokens 200 --temperature 0.7
Note: If --prompt is omitted, the script will ask for input interactively.
Generate plots for layer-wise accuracy, CE loss, and KL divergence:
python visualize_layers.py --probe-dir "./" --model "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" --num-examples 1000
Compare information complexity across model architectures:
python residual_stream_viz.py
Note: The prompt and models are configured directly in residual_stream_viz.py. Edit the main() function to change them.
MIT