An investigation into how Large Language Models (LLMs) process counting tasks, combining behavioral benchmarking with causal mediation analysis to understand the internal mechanisms of numerical reasoning.
- Browse Data - Complete experimental datasets and results
- Run Experiments - Original Google Colab notebooks
- Use Code - Python modules
"Is there a hidden state layer that contains a representation of the running count of matching words, while processing the list of words?"
This project addresses this question through experimentation using counterfactual activation patching.
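The mechanics of counterfactual activation patching can be sketched on a toy "model" of plain functions (the real experiments patch transformer hidden states; the layer functions and helper names here are illustrative only):

```python
# Toy sketch of counterfactual activation patching: run a chain of "layers"
# on a base input, cache activations from a counterfactual input, and splice
# one counterfactual activation into the base run to measure its influence.

def run_with_cache(layers, x):
    """Run the layer chain, recording the activation after each layer."""
    cache = []
    for layer in layers:
        x = layer(x)
        cache.append(x)
    return x, cache

def run_patched(layers, x, patch_layer, patched_activation):
    """Re-run the base input, but overwrite one layer's activation."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == patch_layer:
            x = patched_activation  # splice in the counterfactual activation
    return x

# Hypothetical 3-"layer" model: each layer is a simple transformation.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

base_out, _ = run_with_cache(layers, 1)       # ((1 + 1) * 2) - 3 = 1
cf_out, cf_cache = run_with_cache(layers, 5)  # ((5 + 1) * 2) - 3 = 9

# Patching layer 1's counterfactual activation into the base run shows how
# much of the output change that layer's activation carries by itself.
patched_out = run_patched(layers, 1, 1, cf_cache[1])  # 12 - 3 = 9
```

If patching a single layer's activation moves the base output toward the counterfactual output, that layer mediates the effect of the input change.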
Our benchmark reveals significant performance differences across models.
### Important Methodological Note
**Chat Template Usage Difference:** The benchmark evaluation and causal mediation analysis use different prompt formatting approaches, which explains accuracy differences for the same model:
- Benchmark: Uses raw model outputs without chat templates to ensure fair comparison across different model families (since each model has different chat template formats)
- Causal Analysis: Uses Phi-4 with proper chat templates for more controlled intervention analysis
This methodological difference means Phi-4's accuracy in benchmark vs. causal analysis may differ, as chat templates can significantly impact model performance. The benchmark provides cross-model comparison, while causal analysis provides mechanistic insights.
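The two prompt styles can be illustrated with a string sketch (the chat markup below is hypothetical; in practice each model's real template comes from its tokenizer, e.g. `tokenizer.apply_chat_template` in Hugging Face transformers):

```python
# Illustrative only: shows how the same counting question is formatted
# differently in the two experimental settings. The chat markers here are
# placeholders, not any specific model's actual template.

QUESTION = "How many fruits are in this list: apple, car, banana, dog?"

def raw_prompt(question: str) -> str:
    """Benchmark style: bare task text, identical across model families."""
    return question + "\nAnswer:"

def chat_prompt(question: str) -> str:
    """Causal-analysis style: the question wrapped in a chat format."""
    return f"<|user|>\n{question}\n<|assistant|>\n"
```

Because instruction-tuned models are trained on their chat format, the wrapped prompt can yield noticeably different accuracy than the raw one.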
```
llm-counting-mechanisms/
├── src/
│   ├── data_generation.py           # Dataset creation with 11 word categories
│   ├── model_benchmark.py           # Zero-shot evaluation framework
│   ├── causal_analysis.py           # Counterfactual activation patching
│   └── visualization.py             # Result plotting and analysis
├── notebooks/                       # Original Colab notebooks (main experiments)
│   ├── 01_data_generation.ipynb     # Dataset creation notebook
│   ├── 02_model_benchmark.ipynb     # Multi-model evaluation notebook
│   └── 03_causal_analysis.ipynb     # Causal mediation analysis notebook
├── data/
│   ├── benchmark_results/           # Model evaluation data
│   │   ├── counting_dataset_5000.csv # Main evaluation dataset
│   │   ├── word_banks.json          # Word categories
│   │   ├── llama3_1_8b_results.csv  # Llama 3.1 8B detailed results
│   │   ├── phi4_results.csv         # Phi-4 detailed results
│   │   ├── qwen3_8b_results.csv     # Qwen3 8B detailed results
│   │   ├── model_comparison.csv     # Cross-model comparison
│   │   └── benchmark_report.md      # Detailed analysis report
│   └── causal_results/              # Causal mediation data
│       ├── cma_intervention_pairs.json # Intervention test cases
│       ├── cma_effects_results.csv  # Complete effect calculations
│       ├── cma_layer_statistics.csv # Layer-wise statistics
│       └── cma_analysis_report.txt  # Causal analysis summary
├── results/
│   ├── figures/                     # All generated plots
│   ├── benchmark_results.csv        # Model performance data
│   └── causal_effects.csv           # Mediation analysis results
├── scripts/
│   ├── run_benchmark.py             # Complete benchmark pipeline
│   ├── run_causal_analysis.py       # Causal mediation pipeline
│   └── generate_plots.py            # Visualization generation
└── requirements.txt
```
Note on Implementation: The main experiments were conducted using Google Colab notebooks, which are available in the `notebooks/` directory. The `src/` directory contains refactored, production-ready Python modules for easy local reproduction and extension.
This repository includes complete experimental data from all conducted experiments:
- Dataset: 5,000 counting examples across 11 categories
- Model Results: Detailed predictions for Llama 3.1 8B, Phi-4, and Qwen3 8B
- Comparisons: Cross-model performance analysis
- Reports: Comprehensive benchmark analysis
- Intervention Pairs: 3,000+ counterfactual test cases
- Effect Calculations: Total Effect (TE) and Indirect Effect (IE) by layer
- Layer Statistics: Mediation strength across all transformer layers
- Analysis Reports: Detailed causal mediation findings
Ready for Analysis: All data files are included, so you can immediately reproduce plots and analysis without running the computationally expensive model evaluations.
The main experiments were conducted using Google Colab for GPU access. Use the notebooks in the `notebooks/` directory:
- Data Generation: `notebooks/01_data_generation.ipynb`
- Model Benchmark: `notebooks/02_model_benchmark.ipynb`
- Causal Analysis: `notebooks/03_causal_analysis.ipynb`
These notebooks include all the original experimental code and can be run directly in Google Colab.
For local reproduction and extension, use the modular Python code:
```bash
git clone https://github.com/your-username/llm-counting-mechanisms.git
cd llm-counting-mechanisms
pip install -r requirements.txt
```

```python
from src.data_generation import CountingDataGenerator

generator = CountingDataGenerator()
dataset = generator.create_dataset(size=5000)
generator.save_dataset(dataset, "data/counting_dataset_5000.csv")
```

```python
from src.model_benchmark import CountingBenchmark

# Note: Benchmark uses raw model outputs (no chat templates)
# for fair cross-model comparison
benchmark = CountingBenchmark("data/counting_dataset_5000.csv")
results = benchmark.evaluate_models([
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "microsoft/phi-4",
    "Qwen/Qwen3-8B",
])
```

```python
from src.causal_analysis import CausalMediationAnalyzer

# Note: Causal analysis uses chat templates for controlled interventions
analyzer = CausalMediationAnalyzer("microsoft/phi-4")
effects = analyzer.run_analysis("data/intervention_pairs.json")
```

- 11 semantic categories: fruit, animal, vehicle, color, body_part, tool, clothing, sport, building, weather, emotion
- Uniform distribution: Equal probability for each possible count (0 to list_length)
- Variable list lengths: 5-10 words per list
- 5,000 examples for robust statistical analysis
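A minimal sketch of how such an example might be generated (the miniature word banks and function name here are hypothetical; the actual generator lives in `src/data_generation.py`, with the real word banks in `word_banks.json`):

```python
import random

# Hypothetical miniature word banks; the real ones cover 11 categories.
WORD_BANKS = {
    "fruit": ["apple", "banana", "cherry", "mango", "plum",
              "peach", "grape", "kiwi", "lemon", "pear"],
    "animal": ["dog", "cat", "horse", "owl", "fox",
               "bear", "wolf", "deer", "seal", "crab"],
}

def make_example(category="fruit", distractor="animal", rng=random):
    """Build one counting example with a uniformly sampled target count."""
    list_length = rng.randint(5, 10)        # variable list lengths: 5-10 words
    count = rng.randint(0, list_length)     # uniform over 0..list_length
    words = (rng.sample(WORD_BANKS[category], count)
             + rng.sample(WORD_BANKS[distractor], list_length - count))
    rng.shuffle(words)                      # mix targets and distractors
    return {"words": words, "category": category, "answer": count}
```

Sampling the count uniformly over `0..list_length` (rather than sampling each word independently) is what gives every possible answer equal probability.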
- Zero-shot: No reasoning tokens or chain-of-thought
- Controlled formatting: Consistent prompt structure across models
- Extraction method: Regex-based numerical answer parsing
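Regex-based answer parsing can be as simple as pulling the first integer from the completion (the exact pattern used in `src/model_benchmark.py` may differ; this is a sketch):

```python
import re

def extract_answer(output: str):
    """Pull the first integer from a model's raw completion, or None."""
    match = re.search(r"-?\d+", output)
    return int(match.group()) if match else None

# e.g. extract_answer("The answer is 3.") -> 3
```

Taking the first match matters under zero-shot decoding, where models often restate the question's numbers before (or instead of) answering.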
- Intervention method: Counterfactual activation patching
- Target: Single word replacement (target → distractor)
- Measurements: Total Effect (TE) and Indirect Effect (IE)
- Layer coverage: All transformer layers analyzed
- Model: Phi-4 with chat template formatting for consistent intervention analysis
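With toy numbers, the two effect measurements reduce to simple differences between predicted answers (function names and the IE/TE normalization below are illustrative, not necessarily the exact definitions in `src/causal_analysis.py`):

```python
def total_effect(answer_base, answer_cf):
    """TE: change in the model's answer when the full input is swapped
    from the base prompt to the counterfactual (one word replaced)."""
    return answer_cf - answer_base

def indirect_effect(answer_base, answer_patched):
    """IE at a layer: change in the answer when only that layer's
    activation is patched from the counterfactual run into the base run."""
    return answer_patched - answer_base

# Toy numbers: base prompt yields 3, counterfactual yields 4, and patching
# layer k's activation alone shifts the base answer to 3.8.
te = total_effect(3.0, 4.0)        # 1.0
ie = indirect_effect(3.0, 3.8)     # ~0.8
mediation = ie / te                # fraction of the effect that layer carries
```

A layer whose IE approaches the TE mediates most of the counterfactual's influence, which is how layer-wise mediation strength is compared.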
- Primary Platform: Google Colab (for GPU access and reproducibility)
- Local Development: Python 3.8+ with pip package management
- Notebooks: Original experimental code available in the `notebooks/` directory
- Scripts: Refactored production code in the `src/` directory
```
transformers>=4.30.0
torch>=1.9.0
pandas>=1.5.0
matplotlib>=3.5.0
tqdm>=4.60.0
```
Currently supports HuggingFace transformer models:
- Llama family (requires access token)
- Phi family
- Qwen family
- Gemma family (with modifications)
- GPU: NVIDIA A100 (or comparable)
- CPU: Multi-core processor for data processing
- Memory: 32GB+ RAM for large model analysis
This project is licensed under the MIT License - see the LICENSE file for details.
Keywords: Large Language Models, Causal Mediation Analysis, Numerical Reasoning, Mechanistic Interpretability, Transformer Analysis



