This project uses Natural Language Processing (NLP) and Machine Learning to identify student math misconceptions from open-ended responses. It is based on the Kaggle competition dataset.
The project implements a 3-stage modeling approach:
- Binary Classification: Predict correct vs. incorrect answers.
- 3-Class Classification: Categorize explanations (Correct, Misconception, Neither).
- Multiclass Classification: Identify specific misconception types (35+ categories).
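The three stages can be read as three targets derived from the same labeled response. The sketch below is only an illustration of how those targets nest. The `Response` class, its field names (`is_correct`, `explanation_label`, `misconception`), and the `stage_targets` helper are hypothetical and may not match the dataset's actual columns or the notebook's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    # Hypothetical fields; the real dataset columns may be named differently.
    is_correct: bool               # whether the student's answer is right
    explanation_label: str         # "Correct", "Misconception", or "Neither"
    misconception: Optional[str]   # specific misconception type, if any


def stage_targets(r: Response) -> dict:
    """Derive one target per modeling stage from a single labeled response."""
    is_misconception = r.explanation_label == "Misconception"
    return {
        "binary": int(r.is_correct),
        "three_class": r.explanation_label,
        # The fine-grained label only applies to rows tagged as a misconception.
        "multiclass": r.misconception if is_misconception else None,
    }
```

Under these assumptions, the multiclass stage would only see rows that the 3-class stage labels as a misconception.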
- Term_Project_geissinger_final.ipynb: Main analysis and modeling notebook.
- project_math/: Folder containing the dataset (train.csv, test.csv).
- requirements.txt: List of dependencies.
- Clone the repository.
- Install dependencies:
pip install -r requirements.txt
- Run the notebook:
Open Term_Project_geissinger_final.ipynb and run all cells. The notebook is configured to look for data in the project_math/ directory by default.
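Before running all cells, it can help to confirm that the data files are where the notebook expects them. The check below is a minimal sketch; `find_data_files` is a hypothetical helper and is not part of the notebook.

```python
from pathlib import Path


def find_data_files(data_dir="project_math"):
    """Return paths to train.csv and test.csv, or raise if either is missing."""
    base = Path(data_dir)
    paths = {name: base / f"{name}.csv" for name in ("train", "test")}
    missing = [str(p) for p in paths.values() if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing data files: {', '.join(missing)}")
    return paths
```

Calling `find_data_files()` from the repository root either returns both paths or names the files that are missing.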
- Text Representation: TF-IDF and Sentence Transformers (embeddings).
- Classifiers: Random Forest and Logistic Regression.
- Handling Imbalance: BorderlineSMOTE.
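As a rough illustration of how the TF-IDF representation and a Logistic Regression classifier fit together, here is a minimal scikit-learn sketch. The example sentences, labels, and hyperparameters are invented for illustration and are not taken from the dataset or the notebook. Oversampling with BorderlineSMOTE (from imbalanced-learn) is left out to keep the sketch to scikit-learn; it would normally be applied to the training features only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy explanations invented for illustration; not real dataset rows.
texts = [
    "I added the numerators and the denominators",
    "I found a common denominator and then added",
    "I added the tops and the bottoms together",
    "I converted both fractions to the same denominator first",
]
labels = [1, 0, 1, 0]  # hypothetical: 1 = misconception, 0 = no misconception

# TF-IDF turns each explanation into a sparse vector of weighted word
# and bigram counts; Logistic Regression is then fit on those vectors.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

predictions = model.predict(["I added the numerators and denominators"])
```

Wrapping both steps in one pipeline means the vectorizer vocabulary is learned only from the data passed to `fit`, which avoids leaking test text into the features.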