D13922024, Chung-Yao Ma, NTU CSIE
This repository contains the trimmed-down implementation I used to solve the Go player identification task in assignment 2, question 5. The approach extracts style features from SGF records, trains gradient-boosted matching models, and provides a lightweight cosine-similarity baseline for inference.
- `training.py` – trains LightGBM and/or CatBoost pairwise matchers and serialises the feature extractor.
- `Q5.py` – cosine-similarity inference script that reuses the saved feature extractor to create a submission file.
- `src/` – minimal package with the SGF parser, feature extractor, and model wrappers used by the scripts above.
- `dataset/` – expected dataset root (`train_set`, `test_set/query_set`, `test_set/cand_set` of SGF files).
- `models/` – output directory for checkpoints, features, and the cached `feature_extractor.pkl`.
- `requirements.txt` – Python dependencies (install with `pip install -r requirements.txt`).
- Create and activate a Python 3.9+ environment.
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Place the official SGF data under `dataset/` using the structure described above.
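For reference, the dataset layout described above (directory names taken from the repository description) looks like:

```
dataset/
├── train_set/          # SGF records grouped by training player
└── test_set/
    ├── query_set/      # query SGF files to identify
    └── cand_set/       # candidate SGF files to match against
```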
Run `training.py` to fit the boosted models and persist the feature extractor:

```bash
python training.py --model lightgbm --dataset-dir dataset
```

Key arguments:

- `--model {lightgbm, catboost, both}` – choose which matcher(s) to train.
- `--max-players` – optionally limit the number of training players for quick experiments.
- `--negative-ratio` – ratio of negative to positive pairs when generating contrastive samples.
- `--seed` – random seed used for feature pairing and data shuffling.
Outputs produced in `models/`:

- `feature_extractor.pkl` – pickled `GoFeatureExtractor` reused during inference.
- `train_features*.npy` / `train_ids*.npy` – cached feature matrices and player IDs.
- `lightgbm_model.txt` or `catboost_model.cbm` – saved boosters.

Re-running training will detect and reuse an existing `feature_extractor.pkl` when possible.
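The contrastive pair generation controlled by `--negative-ratio` and `--seed` could look roughly like the sketch below. This is a hypothetical illustration, not the repository's actual code: `make_pairs` and its exact sampling scheme are assumptions, but it shows the idea of pairing every same-player combination as a positive and sampling cross-player negatives at the requested ratio.

```python
import numpy as np

def make_pairs(player_ids, negative_ratio=1.0, seed=0):
    """Sketch of contrastive pair generation (hypothetical helper,
    mirroring the --negative-ratio and --seed options)."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(player_ids)
    pairs = []
    # every same-player combination becomes a positive pair (label 1)
    for pid in np.unique(ids):
        idx = np.flatnonzero(ids == pid)
        for i in range(len(idx)):
            for j in range(i + 1, len(idx)):
                pairs.append((int(idx[i]), int(idx[j]), 1))
    # sample cross-player negatives (label 0) at the requested ratio
    n_neg = int(len(pairs) * negative_ratio)
    while n_neg > 0:
        a, b = rng.integers(0, len(ids), size=2)
        if ids[a] != ids[b]:
            pairs.append((int(a), int(b), 0))
            n_neg -= 1
    # shuffle so positives and negatives are interleaved for training
    order = rng.permutation(len(pairs))
    return [pairs[k] for k in order]
```

In a pairwise matcher like the one described above, each `(a, b, label)` tuple would typically be turned into a feature vector (e.g. the element-wise difference of the two players' style features) before being fed to the LightGBM or CatBoost booster.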
`Q5.py` loads the cached feature extractor, recomputes query/candidate embeddings, and performs cosine matching with configurable normalisation:

```bash
python Q5.py \
    --dataset-dir dataset \
    --feature-transform pca_whiten \
    --transform-components 200 \
    --normalize robust \
    --normalization-scope train \
    --output submission.csv
```

Normalisation choices: `standard`, `robust`, `minmax`, `l2`, or `none`. Feature transforms include `none`, `pca`, and `pca_whiten` with an optional component budget, and the scaler can be fitted on the query set, all test data, cached train features, or chosen automatically. Feature-level LightGBM importances (`--feature-weights`) remain available when no latent transform is applied. The script prints similarity statistics and writes a Kaggle-ready CSV sorted by query ID.
- `--dataset-dir`: root folder that holds `train_set`, `test_set/query_set`, and `test_set/cand_set`; point it to custom data drops if you run outside the repo default.
- `--feature-transform`: latent transform to reduce or decorrelate features before similarity (`none`, `pca`, `pca_whiten`). The whitened variant normalises principal components for cosine scoring.
- `--transform-components`: number of PCA components to keep; any value <= 0 keeps the full dimensionality, so you can sweep 128/200/256 without retraining.
- `--normalize`: row-wise feature scaling applied after the optional transform. `robust` trims outliers, `standard` standardises to zero mean, `minmax` forces [0, 1], `l2` normalises vector length, and `none` skips the step.
- `--normalization-scope`: decides which data are used to fit the scaler (`query`, `all`, `train`, or `auto`). `auto` prefers cached train features when available and otherwise falls back to all test data.
- `--feature-weights`: enables LightGBM-derived importance weights (`auto`, `gbm_gain`, `gbm_split`) when staying in the original feature space; use `none` to skip weighting or when combining with PCA.
- `--rerank-topk`: activates the LightGBM pairwise booster on the top-k cosine candidates per query; set to 0 to rely on cosine only.
- `--rerank-model`: path to the saved LightGBM booster (default `models/lightgbm_model.txt`) used during the reranking pass.
- `--output`: filename for the submission CSV. The script writes a Kaggle-ready table sorted by query ID and will overwrite the target file if it already exists.
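The `pca_whiten` + `l2` + cosine pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed internals (the real `Q5.py` fits its scaler according to `--normalization-scope` and supports more normalisers); `cosine_match` and its signature are inventions for this sketch.

```python
import numpy as np

def cosine_match(query, cand, n_components=0, whiten=True):
    """Sketch of PCA(-whitening) + row-wise l2 + cosine matching
    (hypothetical helper, not the repository's actual code)."""
    X = np.vstack([query, cand]).astype(float)
    X -= X.mean(axis=0)                                   # centre before PCA
    U, S, Vt = np.linalg.svd(X, full_matrices=False)      # PCA via SVD
    k = n_components if n_components > 0 else Vt.shape[0] # <= 0 keeps all dims
    Z = X @ Vt[:k].T                                      # project onto principal axes
    if whiten:
        # rescale each component to unit variance for cosine scoring
        Z /= S[:k] / np.sqrt(len(X) - 1) + 1e-12
    Z /= np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12 # row-wise l2 normalisation
    q, c = Z[: len(query)], Z[len(query):]
    sims = q @ c.T                                        # cosine similarity matrix
    return sims.argmax(axis=1)                            # best candidate per query
```

Fitting the PCA on queries and candidates jointly corresponds to the `all` normalisation scope; a `train`-scoped variant would instead fit the transform on cached train features and only apply it here.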
- `models/feature_extractor.pkl` must exist before running `Q5.py`; generate it via `training.py` if missing.
- The remaining files in `src/` are the only modules required by the current training and inference code; unused experimental utilities have been removed for clarity.