A PyTorch implementation of a word-level recurrent neural network for sentence completion. The code is based on Word-level language modeling RNN, and the importance sampling module is from PyTorch Large-Scale Language Model.
- torchvision >= 0.2.0
- torch >= 0.3.0.post4
- numpy >= 1.13.3
- pandas >= 0.21.0
- nltk >= 3.2.5
- tqdm >= 4.19.5
- Cython >= 0.27.3
pip3 install -r requirements.txt
- Build the Log_Uniform Sampler according to Link.
- Download the `punkt` package in `nltk`.
- Microsoft Research Sentence Completion Challenge
  - The training and test datasets can be downloaded from Link. Store the downloaded test data in `./data/completion/`.
- Scholastic Aptitude Test sentence completion questions
  - The collected questions are provided in `./data/completion/SAT_set_filled.csv`.
- Nineteenth century novels (19C novels)
  - Extract the preprocessed files from `./data/prepro/guten.tgz`.
- One Billion Word Benchmark (1B word)
  - Link
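The 19C novels archive can also be unpacked from Python rather than with a shell tool. The following is a minimal sketch using only the standard library. The helper name `extract_archive` is hypothetical, and extracting next to the archive (into `./data/prepro/`) is an assumption, since the target directory is not stated here.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir=None):
    """Extract a gzip-compressed tar archive and return the destination directory.

    If dest_dir is omitted, files are extracted next to the archive
    (an assumption; adjust if the training script expects another layout).
    """
    archive = Path(archive_path)
    dest = Path(dest_dir) if dest_dir is not None else archive.parent
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return dest


if __name__ == "__main__":
    guten = Path("./data/prepro/guten.tgz")
    # Only extract when the archive is actually present.
    if guten.exists():
        print("Extracted to", extract_archive(guten))
```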
python3 train.py --cuda --save_dir mynet
Default arguments are set for training with 19C novels. Argument settings for training with the 1B word benchmark are presented in the following table.
| Argument | 19C novels | 1B word |
|---|---|---|
| corpus | guten | gbw |
| emsize | 200 | 500 |
| nhid | 600 | 2000 |
| outsize | 400 | 500 |
| lr | 0.5 | 1.0 |
| decay_after | 5 | 1 |
| decay_rate | 0.5 | 0.8 |
| batch_size | 20 | 100 |
| nsampled | -1 | 8192 |
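The settings in the table above can be turned into a full training command. The sketch below assumes that each argument is passed as a `--<name> <value>` flag, in the same style as `--save_dir`; that spelling is an assumption and should be checked against the argument parser in `train.py`. The helper `build_train_command` is hypothetical and only assembles the string; it does not start training.

```python
import shlex

# Values copied from the table above.
# The "--<name>" flag spelling is an assumption, not confirmed by this README.
SETTINGS = {
    "19C novels": {
        "corpus": "guten", "emsize": 200, "nhid": 600, "outsize": 400,
        "lr": 0.5, "decay_after": 5, "decay_rate": 0.5,
        "batch_size": 20, "nsampled": -1,
    },
    "1B word": {
        "corpus": "gbw", "emsize": 500, "nhid": 2000, "outsize": 500,
        "lr": 1.0, "decay_after": 1, "decay_rate": 0.8,
        "batch_size": 100, "nsampled": 8192,
    },
}


def build_train_command(dataset, save_dir="mynet", cuda=True):
    """Return a shell command string for the chosen dataset's settings."""
    args = ["python3", "train.py"]
    if cuda:
        args.append("--cuda")
    args += ["--save_dir", save_dir]
    for name, value in SETTINGS[dataset].items():
        args += [f"--{name}", str(value)]
    return shlex.join(args)


print(build_train_command("1B word"))
```

Because the default arguments already match the 19C novels column, only the 1B word command should need the explicit flags.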
python3 sent_cmplt.py --cuda --save_dir mynet
| corpus | bidirec | MSR accuracy (%) | SAT accuracy (%) |
|---|---|---|---|
| guten | False | 69.4 (0.8)* | 29.6 (1.5)* |
| guten | True | 72.3 (1.1)* | 33.3 (2.0)* |
| gbw | False | 63.2 | 66.5 |
| gbw | True | 64.1 | 69.1 |
*The mean accuracy of five networks trained with different random initializations is shown with the standard deviation in parentheses.
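An entry in the "mean (standard deviation)" format can be computed from per-run accuracies as sketched below. The five accuracies are hypothetical placeholders, not results from this repository, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the footnote does not say which estimator was used.

```python
import statistics


def summarize(accuracies):
    """Format accuracies as 'mean (std)' with one decimal place.

    Uses the sample standard deviation (n - 1 denominator); this choice
    is an assumption, not something stated in the README.
    """
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)
    return f"{mean:.1f} ({std:.1f})"


# Hypothetical accuracies from five runs with different random seeds.
runs = [70.0, 71.0, 72.0, 73.0, 74.0]
print(summarize(runs))  # -> 72.0 (1.6)
```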