
Go Player Identification – Assignment 2 (Q5)

D13922024, Chung-Yao Ma, NTU CSIE (Machine Learning, Fall 2025)

This repository contains the trimmed-down implementation I used to solve the Go player identification task in assignment 2, question 5. The approach extracts style features from SGF records, trains gradient-boosted matching models, and provides a lightweight cosine-similarity baseline for inference.

Repository Layout

  • training.py – trains LightGBM and/or CatBoost pairwise matchers and serialises the feature extractor.
  • Q5.py – cosine-similarity inference script that reuses the saved feature extractor to create a submission file.
  • src/ – minimal package with the SGF parser, feature extractor, and model wrappers used by the scripts above.
  • dataset/ – expected dataset root (train_set, test_set/query_set, test_set/cand_set of SGF files); see the layout sketch after this list.
  • models/ – output directory for checkpoints, features, and the cached feature_extractor.pkl.
  • requirements.txt – Python dependencies (install with pip install -r requirements.txt).
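
For reference, the expected dataset/ layout looks like this (directory names taken from the list above; the per-player grouping under train_set is an assumption, and individual SGF filenames are illustrative):

dataset/
  train_set/        # training SGF records, assumed grouped per player
  test_set/
    query_set/      # query SGF files
    cand_set/       # candidate SGF files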

Setup

  1. Create and activate a Python 3.9+ environment.
  2. Install dependencies:
    pip install -r requirements.txt
  3. Place the official SGF data under dataset/ using the structure described above.

Training Workflow

Run training.py to fit the boosted models and persist the feature extractor.

python training.py --model lightgbm --dataset-dir dataset

Key arguments:

  • --model {lightgbm, catboost, both} – choose which matcher(s) to train.
  • --max-players – optionally limit the number of training players for quick experiments.
  • --negative-ratio – ratio of negative to positive pairs when generating contrastive samples (illustrated in the sketch after this list).
  • --seed – random seed used for feature pairing and data shuffling.
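
To make the pairing knobs concrete, here is a minimal sketch of how --negative-ratio and --seed could shape the contrastive training set; build_pairs and the absolute-difference pair encoding are illustrative stand-ins, not the exact code in src/:

import numpy as np

def build_pairs(features, player_ids, negative_ratio=3, seed=42):
    """Pair same-player games as positives, sample cross-player negatives."""
    rng = np.random.default_rng(seed)
    pairs, labels = [], []
    n = len(player_ids)
    for i in range(n):
        for j in range(i + 1, n):
            if player_ids[i] == player_ids[j]:
                pairs.append((i, j))
                labels.append(1)
    # Draw `negative_ratio` mismatched pairs per positive pair.
    remaining = negative_ratio * len(pairs)
    while remaining > 0:
        i, j = rng.integers(0, n, size=2)
        if player_ids[i] != player_ids[j]:
            pairs.append((i, j))
            labels.append(0)
            remaining -= 1
    lhs = features[[i for i, _ in pairs]]
    rhs = features[[j for _, j in pairs]]
    return np.abs(lhs - rhs), np.array(labels)  # pair encoding is an assumption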

Outputs produced in models/:

  • feature_extractor.pkl – pickled GoFeatureExtractor reused during inference.
  • train_features*.npy / train_ids*.npy – cached feature matrices and player IDs.
  • lightgbm_model.txt or catboost_model.cbm – saved boosters.

Re-running training will detect and reuse an existing feature_extractor.pkl when possible.
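
The caching behaviour is a plain pickle round-trip; a minimal sketch of the idea (the import path for GoFeatureExtractor is an assumption):

import os
import pickle

from src import GoFeatureExtractor  # import path is an assumption

EXTRACTOR_PATH = "models/feature_extractor.pkl"

if os.path.exists(EXTRACTOR_PATH):
    # Reuse the extractor fitted on a previous run.
    with open(EXTRACTOR_PATH, "rb") as f:
        extractor = pickle.load(f)
else:
    # First run: build a fresh extractor and cache it for inference.
    extractor = GoFeatureExtractor()
    with open(EXTRACTOR_PATH, "wb") as f:
        pickle.dump(extractor, f)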

Inference (Q5 Cosine Matcher)

Q5.py loads the cached feature extractor, recomputes query/candidate embeddings, and performs cosine matching with configurable normalisation.

python Q5.py \
  --dataset-dir dataset \
  --feature-transform pca_whiten \
  --transform-components 200 \
  --normalize robust \
  --normalization-scope train \
  --output submission.csv

Normalisation choices: standard, robust, minmax, l2, or none. Feature transforms include none, pca, and pca_whiten (with an optional component budget), and the scaler can be fitted on the query set, all test data, or the cached train features, or chosen automatically. Feature-level LightGBM importances (--feature-weights) remain available when no latent transform is applied. The script prints similarity statistics and writes a Kaggle-ready CSV sorted by query ID.
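The matching stage reduces to a handful of standard operations. A minimal sketch of the pca_whiten + robust + cosine path with scikit-learn, using random placeholder matrices and fitting on train features as --normalization-scope train would:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 512))  # placeholder feature matrices
query_feats = rng.normal(size=(40, 512))
cand_feats = rng.normal(size=(60, 512))

# Latent transform (--feature-transform pca_whiten, --transform-components 200).
pca = PCA(n_components=200, whiten=True).fit(train_feats)
# Scaling after the transform (--normalize robust, fitted on train features).
scaler = RobustScaler().fit(pca.transform(train_feats))

def embed(x):
    z = scaler.transform(pca.transform(x))
    return z / np.linalg.norm(z, axis=1, keepdims=True)  # unit rows: dot = cosine

sims = embed(query_feats) @ embed(cand_feats).T  # (n_query, n_cand) similarities
best = sims.argmax(axis=1)                       # best candidate per query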

Inference Parameters

  • --dataset-dir: root folder that holds train_set, test_set/query_set, and test_set/cand_set; point it at a custom data location if your data live outside the repo default.
  • --feature-transform: latent transform to reduce or decorrelate features before similarity (none, pca, pca_whiten). The whitened variant scales principal components to unit variance before cosine scoring.
  • --transform-components: number of PCA components to keep; any value <=0 keeps the full dimensionality so you can sweep 128/200/256 without retraining.
  • --normalize: row-wise feature scaling applied after the optional transform. robust scales by median and interquartile range (resistant to outliers), standard standardizes to zero mean and unit variance, minmax forces [0, 1], l2 normalizes vector length, and none skips the step.
  • --normalization-scope: decides which data are used to fit the scaler (query, all, train, or auto). auto prefers cached train features when available and otherwise falls back to all test data.
  • --feature-weights: enables LightGBM-derived importance weights (auto, gbm_gain, gbm_split) when staying in the original feature space; use none to skip weighting or when combining with PCA.
  • --rerank-topk: activates the LightGBM pairwise booster on the top-k cosine candidates per query (see the sketch after this list); set to 0 to rely on cosine only.
  • --rerank-model: path to the saved LightGBM booster (default models/lightgbm_model.txt) used during the reranking pass.
  • --output: filename for the submission CSV. The script writes a Kaggle-ready table sorted by query ID and will overwrite the target file if it already exists.
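
For the optional rerank pass, the flow is: take the cosine shortlist, score each (query, candidate) pair with the saved booster, and keep the booster's favourite. A minimal sketch; the absolute-difference pair encoding is an assumption and must match whatever training.py actually used:

import numpy as np
import lightgbm as lgb

booster = lgb.Booster(model_file="models/lightgbm_model.txt")

def rerank(query_vec, cand_feats, sims_row, topk=10):
    """Rescore the top-k cosine candidates with the pairwise booster."""
    shortlist = np.argsort(sims_row)[::-1][:topk]            # best cosine first
    pair_feats = np.abs(cand_feats[shortlist] - query_vec)   # encoding assumed
    scores = booster.predict(pair_feats)
    return shortlist[int(np.argmax(scores))]                 # reranked best match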

Notes

  • models/feature_extractor.pkl must exist before running Q5.py; generate it via training.py if missing.
  • The remaining files in src/ are the only modules required by the current training and inference code; unused experimental utilities have been removed for clarity.
