diff --git a/benchmark-tasks.rst b/benchmark-tasks.rst
index 72e8e43..4e502fb 100644
--- a/benchmark-tasks.rst
+++ b/benchmark-tasks.rst
@@ -95,7 +95,7 @@ Image classification is one of the most important problems in computer vision an
 Besides, in each round workers access a disjoint set of datapoints.

-Implementation details:
+**Implementation details:**

 #. **Data Preprocessing**
    We followed the same approach described in :cite:`DBLP:journals/corr/HeZRS15`.
@@ -180,7 +180,7 @@ Task 3: Language Modelling
 """""""""""""""""""""""
 #. **Model**
-   We benchmark the `AWD-LSTM `_ model.
+   We benchmark the ASGD Weight-Dropped LSTM (`AWD-LSTM `_) model.
 #. **Dataset**
    The `Wikitext2 `_ dataset is used.
@@ -225,6 +225,46 @@ Task 3: Language Modelling
    The bandwidth between two nodes is around 7.5 Gbit/s.
    ``MPI``, ``GLOO`` or ``NCCL`` are used for communication.

+.. _benchmark-task-3b:
+
+3b. BERT, Wikidump-20200101
+"""""""""""""""""""""""""""
+#. **Model**
+   TODO
+#. **Dataset**
+   TODO
+
+#. **Training Algorithm**
+   TODO
+
+**Implementation details:**
+
+#. **Data Preprocessing**
+   The data needs to be downloaded and pre-processed using the pre-processing script
+   ``mlbench_core/dataset/nlp/pytorch/wikidump/preprocess/download_dataset.sh`` before training.
+   The raw dataset is available in our S3 bucket `here `_,
+   and the pre-processed data at https://storage.googleapis.com/mlbench-datasets/wikidump/processed/part-00XXX-of-00500,
+   where ``XXX`` ranges from ``000`` to ``499``.
+
+   After pre-processing, the training data needs to be created using the script
+   ``mlbench_core/dataset/nlp/pytorch/wikidump/preprocess/create_pretraining_data.py``.
+   Please run it using the following command (once for each of the 500 files):
+
+   .. code-block:: bash
+
+      $ cd mlbench_core/dataset/nlp/pytorch/wikidump/preprocess/
+
+      $ python3 create_pretraining_data.py \
+         --input_file=/part-XX-of-00500 \
+         --output_file=/part-XX-of-00500 \
+         --vocab_file=vocab.txt \
+         --do_lower_case=True \
+         --max_seq_length=512 \
+         --max_predictions_per_seq=76 \
+         --masked_lm_prob=0.15 \
+         --random_seed=12345 \
+         --dupe_factor=10
+
 Task 4: Machine Translation
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -278,7 +318,7 @@ Task 4: Machine Translation
+      ``scale_window = 128`` (steps after upscale if no overflow/underflow)

-Implementation details:
+**Implementation details:**

 #. **Data Preprocessing**
    The data needs to be downloaded, pre-processed and tokenized using the pre-processing script
@@ -371,7 +411,7 @@ Implementation details:
+      ``scale_window = 2000`` (steps after upscale if no overflow/underflow)

-Implementation details:
+**Implementation details:**

 #. **Data Preprocessing**
    The data needs to be downloaded, pre-processed and tokenized using the pre-processing script
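
Two small helper sketches follow for the BERT preprocessing step added above. First, since the pre-processed data is published as 500 individual shards, a download loop may be convenient. This is a minimal sketch, not part of the benchmark scripts: the GCS URL pattern is quoted from the docs, while the use of ``wget`` and the local ``processed/`` directory are assumptions.

.. code-block:: bash

   # Fetch all 500 pre-processed Wikidump shards from GCS.
   # URL pattern is taken from the docs; "processed/" is an arbitrary
   # local target directory, and wget is one of several possible tools.
   mkdir -p processed
   for i in $(seq -f "%05g" 0 499); do
      wget -c -P processed \
         "https://storage.googleapis.com/mlbench-datasets/wikidump/processed/part-${i}-of-00500"
   done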
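
Second, since ``create_pretraining_data.py`` must be run once per shard, the documented command naturally wraps in a loop. All flags below are copied verbatim from the command in the diff; the ``processed/`` and ``pretraining/`` directory names are hypothetical placeholders for the elided input and output paths.

.. code-block:: bash

   # Create the pre-training data for every shard; the flags mirror the
   # single-file command in the docs. "processed/" (inputs) and
   # "pretraining/" (outputs) are assumed directory names.
   cd mlbench_core/dataset/nlp/pytorch/wikidump/preprocess/
   mkdir -p pretraining
   for i in $(seq -f "%05g" 0 499); do
      python3 create_pretraining_data.py \
         --input_file=processed/part-${i}-of-00500 \
         --output_file=pretraining/part-${i}-of-00500 \
         --vocab_file=vocab.txt \
         --do_lower_case=True \
         --max_seq_length=512 \
         --max_predictions_per_seq=76 \
         --masked_lm_prob=0.15 \
         --random_seed=12345 \
         --dupe_factor=10
   done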