This project is a Python framework to analyze the logical reasoning capabilities of Large Language Models (LLMs) on LSAT (Law School Admission Test) problems.
It feeds problems to an LLM (like Google's Gemini), parses the model's step-by-step reasoning, and generates "reasoning maps" to visualize the logical chain. Finally, it runs a batch analysis to find and categorize recurring patterns of error.
This framework operates in three main stages:
- Loads 50+ problems from the `tasksource/lsat-lr` dataset and queries the Gemini API for a step-by-step analysis of each.
- Parses the LLM's text response. It then builds a `networkx` graph to create a "reasoning map" (e.g., `Context -> Argument Breakdown -> Final Conclusion`) and saves it as a `.png` (see the sketch after this list).
- Reads the `results.csv` file, checks the LLM's answer against the ground truth, and categorizes each question by type (e.g., "Flaw", "Assumption") to find and chart recurring error patterns. The full list of categories can be seen in the `categorize_question` function.
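To make the mapping stage concrete, here is a minimal sketch of how a parsed response could be turned into a reasoning map with `re`, `networkx`, and `matplotlib`. The section headings, regex, and `build_reasoning_map` helper below are illustrative assumptions, not the project's actual implementation:

```python
# Hypothetical sketch: extract labeled sections from an LLM response with `re`
# and draw them as a linear reasoning map. Not the project's actual code.
import os
import re

import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import networkx as nx

def build_reasoning_map(response_text: str, out_path: str) -> None:
    # Assume the prompt asks the LLM to label its steps with these headings.
    sections = ["Context", "Argument Breakdown", "Final Conclusion"]
    found = [s for s in sections if re.search(rf"{s}\s*:", response_text)]

    graph = nx.DiGraph()
    graph.add_nodes_from(found)
    graph.add_edges_from(zip(found, found[1:]))  # chain the steps in order

    # Highlight the conclusion node, as in the saved maps.
    colors = ["red" if n == "Final Conclusion" else "lightblue" for n in graph]
    nx.draw(graph, pos=nx.spring_layout(graph, seed=42), with_labels=True,
            node_color=colors, node_size=2500, font_size=8)
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()

os.makedirs("reasoning_maps", exist_ok=True)
build_reasoning_map(
    "Context: ...\nArgument Breakdown: ...\nFinal Conclusion: the answer is (B).",
    "reasoning_maps/example.png",
)
```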
The project relies on the following key technologies and libraries:
- Language Model (LLM): Google Gemini API (`gemini-2.5-flash-preview-09-2025`)
- Core Language: Python 3
- Data Handling: pandas, HuggingFace datasets (for LSAT data)
- Networking: httpx (asynchronous API calls; see the sketch after this list)
- Graph/Visualization: networkx, matplotlib (for Reasoning Maps and Bar Charts)
- Utilities: python-dotenv, asyncio, re (for robust text parsing)
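As an illustration of the httpx-based querying, below is a minimal, hypothetical sketch of an asynchronous Gemini call. The endpoint, payload shape, pacing delay, and `query_llm` helper are assumptions made for this sketch; consult `main.py` for the real request logic:

```python
# Hypothetical sketch of an async Gemini request with httpx.
# Endpoint and payload shape are assumptions, not verified project code.
import asyncio
import os

import httpx
from dotenv import load_dotenv

load_dotenv()
API_KEY = os.getenv("LLM_KEY")
MODEL = "gemini-2.5-flash-preview-09-2025"
URL = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}:generateContent"

async def query_llm(client: httpx.AsyncClient, prompt: str) -> str:
    # Send one prompt and return the model's text reply.
    response = await client.post(
        URL,
        params={"key": API_KEY},
        json={"contents": [{"parts": [{"text": prompt}]}]},
        timeout=60,
    )
    response.raise_for_status()
    data = response.json()
    return data["candidates"][0]["content"]["parts"][0]["text"]

async def main() -> None:
    async with httpx.AsyncClient() as client:
        answer = await query_llm(client, "Analyze this LSAT problem step by step: ...")
        print(answer)
        await asyncio.sleep(6)  # simple pacing to stay under free-tier rate limits

asyncio.run(main())
```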
- Clone the repository and enter the project directory:

```bash
git clone [your-repo-url]
cd ReasoningMaps-LLM
```

- Make sure you have all the required Python libraries:

```bash
pip install -r requirements.txt
```

- This project requires a Google Gemini API key:
  - Create a file named `.env` in the main folder.
  - Add your API key to it like this:

```
LLM_KEY="YOUR_API_KEY"
```
The analysis is a two-step process:

- Run `main.py` to fetch the problems, query the LLM, and generate the maps. This will analyze 50 problems and may take 5-6 minutes to complete due to API rate limits.
```bash
python main.py
```

- After running `main.py`, the script will produce two outputs:
  - `results.csv`: A spreadsheet containing the detailed results for each problem.
  - `reasoning_maps/`: A folder containing 50 visualization maps, one for each problem.
- Run `analyze_results.py` to read the `results.csv` file and print a summary report of the LLM's performance and error patterns.
```bash
python analyze_results.py
```

Running `python analyze_results.py` will produce a report like this:
```
--- Analyzing Results from results.csv ---

--- API & Parsing Health ---
Total Problems Processed: 50
API Errors (e.g., 429 Limit): 1

--- Overall Performance (on successful requests) ---
Total Successful Requests: 49
Correct Answers: 48
Incorrect Answers: 1
LLM Accuracy: 97.96%

--- Recurring Patterns of Error ---
The LLM struggled most with the following question types:
- Method of Reasoning (Failed 1 time(s))
```

The report shows one failure. By checking `results.csv` for the failed problem, we can find its corresponding map in the `reasoning_maps/` folder. This map visualizes the exact logical chain where the LLM failed (note the red "Final Conclusion" node).
The script also generates a bar chart showing the error counts for all tested categories, providing a clear picture of the LLM's strengths and weaknesses.
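For illustration, here is a minimal sketch of the kind of pandas/matplotlib analysis `analyze_results.py` could perform. The column names (`question`, `is_correct`, `error`), the keyword rules, and the output filename are assumptions made for this sketch; the real `categorize_question` function contains the full category list:

```python
# Hypothetical sketch of the results analysis; column names and category
# keywords are assumptions, not the project's actual schema.
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def categorize_question(question: str) -> str:
    # Simplified keyword-based categorization (illustrative only).
    rules = {
        "flaw": "Flaw",
        "assumption": "Assumption",
        "method of reasoning": "Method of Reasoning",
    }
    q = question.lower()
    for keyword, label in rules.items():
        if keyword in q:
            return label
    return "Other"

df = pd.read_csv("results.csv")
ok = df[df["error"].isna()]                # drop rows with API errors
accuracy = ok["is_correct"].mean() * 100
print(f"LLM Accuracy: {accuracy:.2f}%")

# Count failures per question type and chart them.
failures = ok.loc[~ok["is_correct"], "question"].map(categorize_question)
failures.value_counts().plot(kind="bar", title="Errors by question type")
plt.tight_layout()
plt.savefig("error_patterns.png")
```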
- Data Source: The `tasksource/lsat-lr` dataset from Hugging Face, for providing a robust benchmark of logical reasoning problems.
- LLM API Service: The Google Gemini API, for providing a high-performance, accessible service for LLM querying.
This project is licensed under the MIT License.
See the LICENSE file for details.

