D13922024, Chung-Yao Ma, NTU CSIE
This repository contains the trimmed-down implementation I used to solve the Go player identification task in assignment 2, question 5. The approach extracts style features from SGF records, trains gradient-boosted matching models, and provides a lightweight cosine-similarity baseline for inference.
- `training.py` – trains LightGBM and/or CatBoost pairwise matchers and serialises the feature extractor.
- `Q5.py` – cosine-similarity inference script that reuses the saved feature extractor to create a submission file.
- `src/` – minimal package with the SGF parser, feature extractor, and model wrappers used by the scripts above.
- `dataset/` – expected dataset root (`train_set`, `test_set/query_set`, `test_set/cand_set` of SGF files).
- `models/` – output directory for checkpoints, features, and the cached `feature_extractor.pkl`.
- `requirements.txt` – Python dependencies (install with `pip install -r requirements.txt`).
- Create and activate a Python 3.9+ environment.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Place the official SGF data under `dataset/` using the structure described above.
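For reference, the dataset layout described above (directory names taken from the repository description) looks like:

```
dataset/
├── train_set/          # SGF records grouped by training player
└── test_set/
    ├── query_set/      # query SGF files to identify
    └── cand_set/       # candidate SGF files to match against
```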
Run `training.py` to fit the boosted models and persist the feature extractor:

```bash
python training.py --model lightgbm --dataset-dir dataset
```

Key arguments:

- `--model {lightgbm, catboost, both}` – choose which matcher(s) to train.
- `--max-players` – optionally limit the number of training players for quick experiments.
- `--negative-ratio` – ratio of negative to positive pairs when generating contrastive samples.
- `--seed` – random seed used for feature pairing and data shuffling.
Outputs produced in `models/`:

- `feature_extractor.pkl` – pickled `GoFeatureExtractor` reused during inference.
- `train_features*.npy` / `train_ids*.npy` – cached feature matrices and player IDs.
- `lightgbm_model.txt` or `catboost_model.cbm` – saved boosters.

Re-running training will detect and reuse an existing `feature_extractor.pkl` when possible.
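The contrastive pair generation controlled by `--negative-ratio` and `--seed` could look roughly like the sketch below. This is a hypothetical illustration, not the repository's actual code: `make_pairs` and its exact sampling scheme are assumptions, but it shows the idea of pairing every same-player combination as a positive and sampling cross-player negatives at the requested ratio.

```python
import numpy as np

def make_pairs(player_ids, negative_ratio=1.0, seed=0):
    """Sketch of contrastive pair generation (hypothetical helper,
    mirroring the --negative-ratio and --seed options)."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(player_ids)
    pairs = []
    # every same-player combination becomes a positive pair (label 1)
    for pid in np.unique(ids):
        idx = np.flatnonzero(ids == pid)
        for i in range(len(idx)):
            for j in range(i + 1, len(idx)):
                pairs.append((int(idx[i]), int(idx[j]), 1))
    # sample cross-player negatives (label 0) at the requested ratio
    n_neg = int(len(pairs) * negative_ratio)
    while n_neg > 0:
        a, b = rng.integers(0, len(ids), size=2)
        if ids[a] != ids[b]:
            pairs.append((int(a), int(b), 0))
            n_neg -= 1
    # shuffle so positives and negatives are interleaved for training
    order = rng.permutation(len(pairs))
    return [pairs[k] for k in order]
```

In a pairwise matcher like the one described above, each `(a, b, label)` tuple would typically be turned into a feature vector (e.g. the element-wise difference of the two players' style features) before being fed to the LightGBM or CatBoost booster.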
`Q5.py` loads the cached feature extractor, recomputes query/candidate embeddings, and performs cosine matching with configurable normalisation:

```bash
python Q5.py \
    --dataset-dir dataset \
    --feature-transform pca_whiten \
    --transform-components 200 \
    --normalize robust \
    --normalization-scope train \
    --output submission.csv
```

Normalisation choices: `standard`, `robust`, `minmax`, `l2`, or `none`. Feature transforms include `none`, `pca`, and `pca_whiten` with an optional component budget, and the scaler can be fitted on the query set, all test data, cached train features, or chosen automatically. Feature-level LightGBM importances (`--feature-weights`) remain available when no latent transform is applied. The script prints similarity statistics and writes a Kaggle-ready CSV sorted by query ID.
- `--dataset-dir`: root folder that holds `train_set`, `test_set/query_set`, and `test_set/cand_set`; point it to custom data drops if you run outside the repo default.
- `--feature-transform`: latent transform to reduce or decorrelate features before similarity (`none`, `pca`, `pca_whiten`). The whitened variant normalises principal components for cosine scoring.
- `--transform-components`: number of PCA components to keep; any value <= 0 keeps the full dimensionality, so you can sweep 128/200/256 without retraining.
- `--normalize`: row-wise feature scaling applied after the optional transform. `robust` trims outliers, `standard` standardises to zero mean, `minmax` forces [0, 1], `l2` normalises vector length, and `none` skips the step.
- `--normalization-scope`: decides which data are used to fit the scaler (`query`, `all`, `train`, or `auto`). `auto` prefers cached train features when available and otherwise falls back to all test data.
- `--feature-weights`: enables LightGBM-derived importance weights (`auto`, `gbm_gain`, `gbm_split`) when staying in the original feature space; use `none` to skip weighting or when combining with PCA.
- `--rerank-topk`: activates the LightGBM pairwise booster on the top-k cosine candidates per query; set to 0 to rely on cosine only.
- `--rerank-model`: path to the saved LightGBM booster (default `models/lightgbm_model.txt`) used during the reranking pass.
- `--output`: filename for the submission CSV. The script writes a Kaggle-ready table sorted by query ID and will overwrite the target file if it already exists.
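The `pca_whiten` + `l2` + cosine pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed internals (the real `Q5.py` fits its scaler according to `--normalization-scope` and supports more normalisers); `cosine_match` and its signature are inventions for this sketch.

```python
import numpy as np

def cosine_match(query, cand, n_components=0, whiten=True):
    """Sketch of PCA(-whitening) + row-wise l2 + cosine matching
    (hypothetical helper, not the repository's actual code)."""
    X = np.vstack([query, cand]).astype(float)
    X -= X.mean(axis=0)                                   # centre before PCA
    U, S, Vt = np.linalg.svd(X, full_matrices=False)      # PCA via SVD
    k = n_components if n_components > 0 else Vt.shape[0] # <= 0 keeps all dims
    Z = X @ Vt[:k].T                                      # project onto principal axes
    if whiten:
        # rescale each component to unit variance for cosine scoring
        Z /= S[:k] / np.sqrt(len(X) - 1) + 1e-12
    Z /= np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12 # row-wise l2 normalisation
    q, c = Z[: len(query)], Z[len(query):]
    sims = q @ c.T                                        # cosine similarity matrix
    return sims.argmax(axis=1)                            # best candidate per query
```

Fitting the PCA on queries and candidates jointly corresponds to the `all` normalisation scope; a `train`-scoped variant would instead fit the transform on cached train features and only apply it here.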
- `models/feature_extractor.pkl` must exist before running `Q5.py`; generate it via `training.py` if missing.
- The remaining files in `src/` are the only modules required by the current training and inference code; unused experimental utilities have been removed for clarity.