NYU DS-GA 1012 final project; idea adapted from:
A. Iscen, G. Tolias, Y. Avrithis, O. Chum. "Label Propagation for Deep Semi-supervised Learning", CVPR 2019
Here is their implementation for image classification tasks, written following the Mean Teacher PyTorch implementation. Part of our label propagation implementation is derived from theirs.
Requirements: python, torch, sacremoses, transformers, scipy, pandas, numpy, scikit-learn
We use the Large Movie Review Dataset v1.0 for training and evaluation, which contains 50k labeled examples. Here is a csv version of the same dataset.
Many supervised learning methods require a large amount of labeled data to achieve good accuracy, and in many tasks labeled data are expensive to obtain (requiring costly human labor or domain knowledge), while unlabeled data are available at low cost. It is therefore of practical interest to leverage unlabeled data together with labeled data to reach performance comparable to that of fully supervised learning. Such methods, which we are interested in investigating, belong to semi-supervised learning.
In particular, we are interested in applying label propagation, a graph-based semi-supervised learning technique, to NLP/NLU tasks with deep learning models. In graph-based methods, all data points, labeled or not, are treated as vertices of a graph in a d-dimensional feature space. Label propagation regards the labeled data as "sources" and assigns pseudo-labels to unlabeled data based on the cluster assumption that vertices that are close on the graph should have similar labels. Since the "unlabeled" data are now given labels inferred from the labeled data, we can use them for further supervised training. Label propagation has performed well in other areas of deep learning, and we are interested in its performance on NLP/NLU tasks.
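As a concrete illustration of the propagation step, here is a minimal sketch in the spirit of Iscen et al.: build a kNN affinity graph over the feature vectors and diffuse the known labels over it. This is a hedged sketch rather than the repo's actual code; the function name, the cosine-similarity graph, the conjugate-gradient solver, and the default k=100 (mirroring the --knn flag used below) are all assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, eye
from scipy.sparse.linalg import cg
from sklearn.neighbors import NearestNeighbors

def propagate_labels(features, labels, labeled_mask, k=100, alpha=0.99):
    """Hedged sketch: diffuse known labels over a kNN graph of the features.

    features      -- (n, d) array of hidden representations V
    labels        -- (n,) int array; entries for unlabeled points are ignored
    labeled_mask  -- (n,) bool array, True for points in the labeled set L
    Returns an (n,) array of pseudo-labels for every point.
    """
    n = features.shape[0]
    num_classes = int(labels[labeled_mask].max()) + 1

    # Build a symmetric kNN affinity matrix W from cosine similarities.
    nn_index = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(features)
    dist, idx = nn_index.kneighbors(features)
    sims = np.maximum(1.0 - dist[:, 1:], 0.0)        # drop the self-neighbour column
    rows = np.repeat(np.arange(n), k)
    W = csr_matrix((sims.ravel(), (rows, idx[:, 1:].ravel())), shape=(n, n))
    W = W.maximum(W.T)                                # symmetrise

    # Symmetrically normalised adjacency S = D^{-1/2} W D^{-1/2}.
    deg = np.asarray(W.sum(axis=1)).ravel()
    D_inv_sqrt = diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt

    # Labeled points act as "sources": one-hot rows of Y, zeros elsewhere.
    Y = np.zeros((n, num_classes))
    Y[labeled_mask, labels[labeled_mask]] = 1.0

    # Solve (I - alpha * S) Z = Y per class, then take the argmax as pseudo-label.
    A = eye(n, format="csr") - alpha * S
    Z = np.zeros_like(Y)
    for c in range(num_classes):
        Z[:, c], _ = cg(A, Y[:, c], maxiter=50)
    return Z.argmax(axis=1)
```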
Model components:
- Embedding: nn.Embedding with vocab_size=10002
- Encoder: bi-directional GRU with pre-trained fasttext word embeddings, or BERT
- Classifier: nn.Linear
- Loss: nn.CrossEntropyLoss
- Optimizer: torch.optim.Adam(params)
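Put together, the GRU variant of these components might look roughly like the sketch below. This is an illustrative outline, not the repo's exact model definition; the pooling of the final hidden states and the default sizes (hidden_dim=64, num_layers=2, matching the flags used later) are assumptions.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Illustrative baseline: embedding -> bi-directional GRU -> linear classifier."""

    def __init__(self, vocab_size=10002, emb_dim=300, hidden_dim=64,
                 num_layers=2, num_classes=2, pretrained_vectors=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        if pretrained_vectors is not None:
            # Optionally initialise from fasttext vectors (vocab_size x emb_dim tensor).
            self.embedding.weight.data.copy_(pretrained_vectors)
        self.encoder = nn.GRU(emb_dim, hidden_dim, num_layers=num_layers,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def extract_features(self, token_ids):
        # Hidden representation later fed to label propagation in phase 2.
        _, h_n = self.encoder(self.embedding(token_ids))
        # Concatenate the final forward and backward states of the last layer.
        return torch.cat([h_n[-2], h_n[-1]], dim=-1)

    def forward(self, token_ids):
        return self.classifier(self.extract_features(token_ids))

# Training would use the loss and optimizer listed above, e.g.:
# model = GRUClassifier()
# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters())
```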
1. Assign a small portion (5–10%) of the training data T as the labeled dataset, L = (x_1, x_2, ..., x_l). Remove the labels from the rest and call them the unlabeled dataset, U = (x_{l+1}, x_{l+2}, ..., x_{l+u}).
2. Train a baseline model (e.g. a 2-layer GRU with an FC layer) on L only for M epochs; its performance acts as a lower bound. Train a fully supervised model on T for M epochs; its performance acts as an upper bound.
3. Remove the FC layer from the baseline model to turn it into a feature extractor. Feed both L and U forward to get hidden representations V = (v_1, v_2, ..., v_{l+u}). Run label propagation on V and assign/update the inferred labels of U.
4. Train the model, initialized with the previous weights, on both L and U for one epoch.
5. Repeat steps 3 and 4 for N epochs (a toy sketch of this loop follows below).
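The alternation in steps 3–5 can be summarised with the toy loop below. It is purely illustrative: a plain linear layer stands in for the encoder plus classifier, argmax predictions stand in for the graph-based propagation sketched earlier, and all sizes are made up; in the actual pipeline these steps are handled by train_phase2.py.

```python
import torch
import torch.nn as nn

# Toy sketch of the phase-2 alternation (steps 3-5) on random tensors.
torch.manual_seed(0)
n_labeled, n_unlabeled, feat_dim, num_classes = 20, 80, 8, 2

model = nn.Linear(feat_dim, num_classes)              # stands in for encoder + classifier
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

x = torch.randn(n_labeled + n_unlabeled, feat_dim)    # all examples: L first, then U
y_true = torch.randint(num_classes, (n_labeled,))     # ground-truth labels for L only

for epoch in range(5):                                 # N outer epochs
    # Step 3: infer pseudo-labels for U from the current model (stand-in for propagation).
    with torch.no_grad():
        y_pseudo = model(x[n_labeled:]).argmax(dim=1)
    targets = torch.cat([y_true, y_pseudo])

    # Step 4: one epoch of supervised training on L (true) + U (pseudo) labels.
    optimizer.zero_grad()
    loss = criterion(model(x), targets)
    loss.backward()
    optimizer.step()
```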
Download the fasttext pre-trained word vectors:

```bash
wget -P data_local https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
```

and unzip them:

```python
import zipfile
with zipfile.ZipFile("data_local/wiki-news-300d-1M.vec.zip", 'r') as zip_ref:
    zip_ref.extractall("data_local/")
```
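For reference, the unzipped .vec file is a plain-text format: a header line followed by one token and 300 values per line. Below is a hedged sketch of how such vectors could be read into an embedding matrix; the `vocab` mapping and the function name are hypothetical, and the repo's actual loading code may differ.

```python
import numpy as np
import torch

def load_fasttext_vectors(path, vocab, emb_dim=300):
    """Hypothetical helper: build an embedding matrix for `vocab` from a .vec file.

    `vocab` maps token -> row index (roughly 10002 entries in this project).
    Tokens missing from the file keep a small random initialisation.
    """
    weights = np.random.normal(scale=0.1, size=(len(vocab), emb_dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        next(f)                                  # header line: "<num_words> <dim>"
        for line in f:
            pieces = line.rstrip().split(" ")
            token, values = pieces[0], pieces[1:]
            if token in vocab and len(values) == emb_dim:
                weights[vocab[token]] = np.asarray(values, dtype="float32")
    return torch.from_numpy(weights)

# e.g. pretrained = load_fasttext_vectors("data_local/wiki-news-300d-1M.vec", vocab)
```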
and then finally run

```bash
python make_data.py --num_labeled 4250 --model_type gru
```

or

```bash
python make_data.py --num_labeled 4250 --model_type bert
```

4250 is 10% of the 42500 available training examples (the remaining 7500 examples are held out for validation, for 50k in total).
```bash
python train_baseline.py \
    --hidden_dim 64 \
    --num_epochs 10 \
    --name baseline \
    --num_layers 2 \
    --num_labeled 4250 \
    --model_type gru
```

or
```bash
python train_baseline.py \
    --hidden_dim 32 \
    --num_epochs 10 \
    --name baseline \
    --num_labeled 4250 \
    --model_type bert
```

Train the fully supervised model (the upper bound) with

```bash
python train_fully_supervised.py \
    --hidden_dim 64 \
    --num_epochs 10 \
    --name fully_supervised \
    --num_layers 2 \
    --model_type gru
```

or
```bash
python train_fully_supervised.py \
    --hidden_dim 32 \
    --num_epochs 10 \
    --name fully_supervised \
    --model_type bert
```

Then run phase 2 (label propagation plus training on the pseudo-labels) with

```bash
python train_phase2.py \
    --total_epochs 99 \
    --name phase2 \
    --num_labeled 4250 \
    --knn 100 \
    --hidden_dim 32 \
    --phase1_model_name baseline_bert \
    --model_type bert
```

or
```bash
python train_phase2.py \
    --total_epochs 99 \
    --name phase2 \
    --num_labeled 4250 \
    --knn 100 \
    --hidden_dim 64 \
    --num_layers 2 \
    --phase1_model_name baseline_gru \
    --model_type gru
```

If successful, the performance of this model should lie between that of the phase 1 baseline and the fully supervised model. We can also test how phase 2 performance improves with more labeled data; an illustrative sweep is sketched below.
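For example, one could rerun the GRU pipeline with 20% of the training set labeled (8500 of 42500) and compare the phase 2 accuracy against the 10% run above. The commands below are illustrative and simply reuse the flags already documented; run names are kept the same for brevity.

```bash
# Illustrative sweep with 20% labeled data (8500 of 42500 training examples).
python make_data.py --num_labeled 8500 --model_type gru
python train_baseline.py \
    --hidden_dim 64 \
    --num_epochs 10 \
    --name baseline \
    --num_layers 2 \
    --num_labeled 8500 \
    --model_type gru
python train_phase2.py \
    --total_epochs 99 \
    --name phase2 \
    --num_labeled 8500 \
    --knn 100 \
    --hidden_dim 64 \
    --num_layers 2 \
    --phase1_model_name baseline_gru \
    --model_type gru
```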