NYU DS-GA 1012 final project; idea adapted from:
A. Iscen, G. Tolias, Y. Avrithis, O. Chum. "Label Propagation for Deep Semi-supervised Learning", CVPR 2019
Here is their implementation for image classification tasks, written following the Mean Teacher PyTorch implementation. Part of our label propagation implementation is derived from theirs.
Requirements: python, torch, sacremoses, transformers, scipy, pandas, numpy, scikit-learn
We use the Large Movie Review Dataset v1.0 for training and evaluation, which contains 50k labeled examples. Here is a csv version of the same dataset.
Many supervised learning methods require a large amount of labeled data to achieve good accuracy, and in many tasks labeled data are expensive to obtain (requiring costly human labor or domain knowledge), while unlabeled data are available at low cost. It is therefore of practical interest to leverage unlabeled data together with labeled data to reach performance comparable to that of fully supervised learning. Such methods, which we are interested in investigating, belong to semi-supervised learning.
In particular, we are interested in applying label propagation, a graph-based semi-supervised learning technique, to NLP/NLU tasks with deep learning models. In graph-based methods, all data points, labeled or not, are treated as vertices of a graph in a d-dimensional feature space. Label propagation regards the labeled data as "sources" and assigns pseudo-labels to unlabeled data based on the cluster assumption that vertices that are close on the graph should have similar labels. Since the "unlabeled" data are now given labels inferred from the labeled data, we can use them for further supervised training. Label propagation has performed well in other areas of deep learning, and we are interested in its performance on NLP/NLU tasks.
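As a concrete illustration of the propagation step, here is a minimal sketch in the spirit of Iscen et al.: build a kNN affinity graph over the feature vectors and diffuse the known labels over it. This is a hedged sketch rather than the repo's actual code; the function name, the cosine-similarity graph, the conjugate-gradient solver, and the default k=100 (mirroring the --knn flag used below) are all assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, eye
from scipy.sparse.linalg import cg
from sklearn.neighbors import NearestNeighbors

def propagate_labels(features, labels, labeled_mask, k=100, alpha=0.99):
    """Hedged sketch: diffuse known labels over a kNN graph of the features.

    features      -- (n, d) array of hidden representations V
    labels        -- (n,) int array; entries for unlabeled points are ignored
    labeled_mask  -- (n,) bool array, True for points in the labeled set L
    Returns an (n,) array of pseudo-labels for every point.
    """
    n = features.shape[0]
    num_classes = int(labels[labeled_mask].max()) + 1

    # Build a symmetric kNN affinity matrix W from cosine similarities.
    nn_index = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(features)
    dist, idx = nn_index.kneighbors(features)
    sims = np.maximum(1.0 - dist[:, 1:], 0.0)        # drop the self-neighbour column
    rows = np.repeat(np.arange(n), k)
    W = csr_matrix((sims.ravel(), (rows, idx[:, 1:].ravel())), shape=(n, n))
    W = W.maximum(W.T)                                # symmetrise

    # Symmetrically normalised adjacency S = D^{-1/2} W D^{-1/2}.
    deg = np.asarray(W.sum(axis=1)).ravel()
    D_inv_sqrt = diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt

    # Labeled points act as "sources": one-hot rows of Y, zeros elsewhere.
    Y = np.zeros((n, num_classes))
    Y[labeled_mask, labels[labeled_mask]] = 1.0

    # Solve (I - alpha * S) Z = Y per class, then take the argmax as pseudo-label.
    A = eye(n, format="csr") - alpha * S
    Z = np.zeros_like(Y)
    for c in range(num_classes):
        Z[:, c], _ = cg(A, Y[:, c], maxiter=50)
    return Z.argmax(axis=1)
```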
Model components:
- Embedding: nn.Embedding with vocab_size=10002
- Encoder: bi-directional GRU with pre-trained fasttext word embeddings, or BERT
- Classifier: nn.Linear
- Loss: nn.CrossEntropyLoss
- Optimizer: torch.optim.Adam(params)
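Put together, the GRU variant of these components might look roughly like the sketch below. This is an illustrative outline, not the repo's exact model definition; the pooling of the final hidden states and the default sizes (hidden_dim=64, num_layers=2, matching the flags used later) are assumptions.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Illustrative baseline: embedding -> bi-directional GRU -> linear classifier."""

    def __init__(self, vocab_size=10002, emb_dim=300, hidden_dim=64,
                 num_layers=2, num_classes=2, pretrained_vectors=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        if pretrained_vectors is not None:
            # Optionally initialise from fasttext vectors (vocab_size x emb_dim tensor).
            self.embedding.weight.data.copy_(pretrained_vectors)
        self.encoder = nn.GRU(emb_dim, hidden_dim, num_layers=num_layers,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def extract_features(self, token_ids):
        # Hidden representation later fed to label propagation in phase 2.
        _, h_n = self.encoder(self.embedding(token_ids))
        # Concatenate the final forward and backward states of the last layer.
        return torch.cat([h_n[-2], h_n[-1]], dim=-1)

    def forward(self, token_ids):
        return self.classifier(self.extract_features(token_ids))

# Training would use the loss and optimizer listed above, e.g.:
# model = GRUClassifier()
# criterion = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters())
```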
1. Assign a small portion (5–10%) of the training data T as the labeled dataset, L = (x_1, x_2, ..., x_l). Remove the labels from the rest and call them the unlabeled dataset, U = (x_{l+1}, x_{l+2}, ..., x_{l+u}).
2. Train a baseline model (e.g. a 2-layer GRU with an FC layer) on L only for M epochs; its performance acts as a lower bound. Train a fully supervised model on T for M epochs; its performance acts as an upper bound.
3. Remove the FC layer from the baseline model to turn it into a feature extractor. Feed both L and U forward to get hidden representations V = (v_1, v_2, ..., v_{l+u}). Run label propagation on V and assign/update the inferred labels of U.
4. Train the model, initialized with the previous weights, on both L and U for one epoch.
5. Repeat steps 3 and 4 for N epochs (a toy sketch of this loop follows below).
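The alternation in steps 3–5 can be summarised with the toy loop below. It is purely illustrative: a plain linear layer stands in for the encoder plus classifier, argmax predictions stand in for the graph-based propagation sketched earlier, and all sizes are made up; in the actual pipeline these steps are handled by train_phase2.py.

```python
import torch
import torch.nn as nn

# Toy sketch of the phase-2 alternation (steps 3-5) on random tensors.
torch.manual_seed(0)
n_labeled, n_unlabeled, feat_dim, num_classes = 20, 80, 8, 2

model = nn.Linear(feat_dim, num_classes)              # stands in for encoder + classifier
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

x = torch.randn(n_labeled + n_unlabeled, feat_dim)    # all examples: L first, then U
y_true = torch.randint(num_classes, (n_labeled,))     # ground-truth labels for L only

for epoch in range(5):                                 # N outer epochs
    # Step 3: infer pseudo-labels for U from the current model (stand-in for propagation).
    with torch.no_grad():
        y_pseudo = model(x[n_labeled:]).argmax(dim=1)
    targets = torch.cat([y_true, y_pseudo])

    # Step 4: one epoch of supervised training on L (true) + U (pseudo) labels.
    optimizer.zero_grad()
    loss = criterion(model(x), targets)
    loss.backward()
    optimizer.step()
```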
Download the fasttext pre-trained word vectors:

```bash
wget -P data_local https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
```

and unzip them:

```python
import zipfile
with zipfile.ZipFile("data_local/wiki-news-300d-1M.vec.zip", 'r') as zip_ref:
    zip_ref.extractall("data_local/")
```
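For reference, the unzipped .vec file is a plain-text format: a header line followed by one token and 300 values per line. Below is a hedged sketch of how such vectors could be read into an embedding matrix; the `vocab` mapping and the function name are hypothetical, and the repo's actual loading code may differ.

```python
import numpy as np
import torch

def load_fasttext_vectors(path, vocab, emb_dim=300):
    """Hypothetical helper: build an embedding matrix for `vocab` from a .vec file.

    `vocab` maps token -> row index (roughly 10002 entries in this project).
    Tokens missing from the file keep a small random initialisation.
    """
    weights = np.random.normal(scale=0.1, size=(len(vocab), emb_dim)).astype("float32")
    with open(path, encoding="utf-8") as f:
        next(f)                                  # header line: "<num_words> <dim>"
        for line in f:
            pieces = line.rstrip().split(" ")
            token, values = pieces[0], pieces[1:]
            if token in vocab and len(values) == emb_dim:
                weights[vocab[token]] = np.asarray(values, dtype="float32")
    return torch.from_numpy(weights)

# e.g. pretrained = load_fasttext_vectors("data_local/wiki-news-300d-1M.vec", vocab)
```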
and then finally run

```bash
python make_data.py --num_labeled 4250 --model_type gru
```

or

```bash
python make_data.py --num_labeled 4250 --model_type bert
```

4250 is 10% of the 42500 available training examples (the remaining 7500 examples are held out for validation, for 50k in total).
```bash
python train_baseline.py \
    --hidden_dim 64 \
    --num_epochs 10 \
    --name baseline \
    --num_layers 2 \
    --num_labeled 4250 \
    --model_type gru
```

or
```bash
python train_baseline.py \
    --hidden_dim 32 \
    --num_epochs 10 \
    --name baseline \
    --num_labeled 4250 \
    --model_type bert
```

Train the fully supervised model (the upper bound) with

```bash
python train_fully_supervised.py \
    --hidden_dim 64 \
    --num_epochs 10 \
    --name fully_supervised \
    --num_layers 2 \
    --model_type gru
```

or
```bash
python train_fully_supervised.py \
    --hidden_dim 32 \
    --num_epochs 10 \
    --name fully_supervised \
    --model_type bert
```

Then run phase 2 (label propagation plus training on the pseudo-labels) with

```bash
python train_phase2.py \
    --total_epochs 99 \
    --name phase2 \
    --num_labeled 4250 \
    --knn 100 \
    --hidden_dim 32 \
    --phase1_model_name baseline_bert \
    --model_type bert
```

or
```bash
python train_phase2.py \
    --total_epochs 99 \
    --name phase2 \
    --num_labeled 4250 \
    --knn 100 \
    --hidden_dim 64 \
    --num_layers 2 \
    --phase1_model_name baseline_gru \
    --model_type gru
```

If successful, the performance of this model should lie between that of the phase 1 baseline and the fully supervised model. We can also test how phase 2 performance improves with more labeled data; an illustrative sweep is sketched below.
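For example, one could rerun the GRU pipeline with 20% of the training set labeled (8500 of 42500) and compare the phase 2 accuracy against the 10% run above. The commands below are illustrative and simply reuse the flags already documented; run names are kept the same for brevity.

```bash
# Illustrative sweep with 20% labeled data (8500 of 42500 training examples).
python make_data.py --num_labeled 8500 --model_type gru
python train_baseline.py \
    --hidden_dim 64 \
    --num_epochs 10 \
    --name baseline \
    --num_layers 2 \
    --num_labeled 8500 \
    --model_type gru
python train_phase2.py \
    --total_epochs 99 \
    --name phase2 \
    --num_labeled 8500 \
    --knn 100 \
    --hidden_dim 64 \
    --num_layers 2 \
    --phase1_model_name baseline_gru \
    --model_type gru
```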