From 4873ee6aaadbc77601cc80219d77473f918e6be6 Mon Sep 17 00:00:00 2001
From: Manuel Romero
Date: Wed, 13 Sep 2023 00:40:48 +0200
Subject: [PATCH] Fix typos

---
 experiments/language_model/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/experiments/language_model/README.md b/experiments/language_model/README.md
index 3644ca8..b3ca2ae 100644
--- a/experiments/language_model/README.md
+++ b/experiments/language_model/README.md
@@ -2,7 +2,7 @@
 
 ## Data
 
-We use [wiki103](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip) data as example, which is publicly available. It contains three text files, `train.txt`, `valid.txt` and `text.txt`. We use `train.txt` to train the model and `valid.txt` to evalute the intermeidate checkpoints. We first need to run `prepara_data.py` to tokenize these text files. We concatenate all documents into a single text and split it into lines of tokens, while each line has at most 510 token (2 tokens are left to special tokens `[CLS]` and `[SEP]`).
+We use [wiki103](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip) data as an example, which is publicly available. It contains three text files, `train.txt`, `valid.txt` and `test.txt`. We use `train.txt` to train the model and `valid.txt` to evaluate the intermediate checkpoints. We first need to run `prepare_data.py` to tokenize these text files. We concatenate all documents into a single text and split it into lines of tokens, where each line has at most 510 tokens (2 tokens are reserved for the special tokens `[CLS]` and `[SEP]`).
 
 ## Pre-training with Masked Language Modeling task
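For context, the chunking step described in the patched paragraph (concatenate all documents into one token stream, then split it into lines of at most 510 tokens, leaving room for `[CLS]` and `[SEP]`) can be sketched as follows. This is a minimal illustration, not the repository's actual `prepare_data.py`; the function name and the list-of-token-lists input format are assumptions:

```python
# Sketch of the chunking described in the README paragraph above.
# Not the actual prepare_data.py from the repository.

MAX_TOKENS_PER_LINE = 510  # 512 minus the two special tokens [CLS] and [SEP]

def chunk_tokens(documents):
    """documents: a list of token lists, one per document.
    Returns lines of at most MAX_TOKENS_PER_LINE tokens each."""
    # Concatenate all documents into a single token stream.
    stream = [tok for doc in documents for tok in doc]
    # Split the stream into fixed-size chunks; only the last may be shorter.
    return [
        stream[i:i + MAX_TOKENS_PER_LINE]
        for i in range(0, len(stream), MAX_TOKENS_PER_LINE)
    ]

# Toy usage: 1002 tokens total -> one full 510-token line plus a 492-token remainder.
docs = [["hello", "world"], ["foo"] * 1000]
lines = chunk_tokens(docs)
assert all(len(line) <= MAX_TOKENS_PER_LINE for line in lines)
print(len(lines))  # number of lines produced
```

Because documents are concatenated before splitting, a chunk may span a document boundary; that matches the README's description, which splits the single concatenated text rather than each document separately.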