Skip to content

CarolineLyu/Characterize_GenAIModel

Repository files navigation

Characterize_GenAIModel

Multi-Class Text Classification: Identifying Generative AI Models from User Survey Responses Classifying which generative AI tool (Claude, Gemini, or ChatGPT) a user interacts with, based on 800+ natural language survey responses combining free-text, Likert-scale, and multi-select features. Compares classical ML and deep learning approaches with a two-stage hyperparameter optimization pipeline. Approach Data: Mixed-type survey responses, including 3 free-text columns, 4 ordinal (Likert/frequency) ratings, and 2 multi-select categorical fields. Preprocessing pipeline (scikit-learn):

Free-text features: Separate TF-IDF vectorizers per text column with independently tuned parameters (max_features, ngram_range, min_df, max_df), combined via FeatureUnion Ordinal features: OrdinalEncoder with explicit category ordering for Likert and frequency scales Categorical features: Custom MultiSelectBinarizerPerColumn transformer handling comma-separated multi-select responses with parenthesis-aware splitting and unicode normalization

Models:

XGBoost with bagging to reduce prediction variance Feedforward neural network (PyTorch) with BatchNorm, dropout, and early stopping

Two-stage hyperparameter tuning:

TF-IDF parameters tuned via GridSearchCV with GroupKFold Model architecture (hidden units, dropout, learning rate, batch size) tuned via Bayesian optimization with Weights & Biases, evaluated on 5-fold GroupKFold mean validation accuracy

Evaluation: GroupKFold cross-validation splitting by user ID to prevent data leakage from shared response patterns. Neural network outperformed XGBoost by 6.5%. Repository Structure

  • preprocess2.ipynb # Full pipeline: preprocessing, model training, hyperparameter tuning
  • test_all_models.ipynb # Model comparison and evaluation across architectures
  • pred.py # Prediction script using trained pipeline artifacts
  • training_data_clean.csv # Preprocessed survey response dataset
  • report.pdf # Detailed methodology and results report
  • requirements.txt # Python dependencies

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors