This is the coursework for COMP4132 Advanced Topics in Machine Learning (2024-2025).
In the project, we use 'shortjokes.csv', 'dev.csv' and their related versions. Due to file size limitations, 'shortjokes.csv' is not included in the final compressed package; you can get it here (see link).
This repository trains and tests the joke generator and detector using the GPT2 and BERT models (see the section "The inspiration for the project").
Another repository trains and tests the joke generator and detector with a Transformer-based model and pre-trained models, using the Trainer provided by Hugging Face.
- Open Spyder (You can open it through Anaconda).
- Open the code file.
- Install required libraries.
- Click the green play button to run the code.
- Wait and see the results.
In this repository, each code file runs independently. Here is the repository structure, generated with `tree /f > tree.txt` and manually adjusted for style.
.
├─GPT2_generator_BERT_detector.py  # the final version of the joke generator (GPT2) and detector (BERT)
├─README.md
├─tree.txt  # the tree structure of this repository
│
├─dataset # dataset used in this repository
│ ├─dev-middle.csv**
│ ├─dev-small.csv**
│ ├─dev.csv*
│ └─shortjokes.csv***
│
└─tmp_train_process # used during the project, not used in the final version
├─BERT_no_trainer.py****  # an initial attempt to use BERT as the joke generator, without the detector
└─Generator_trainer.py****  # an attempt to use the Trainer to train GPT2 (used as the joke generator), without the detector
Tip*: This dataset comes from the paper "The rJokes dataset: a large scale humor collection":
@inproceedings{weller2020rjokes,
  title={The rJokes dataset: a large scale humor collection},
  author={Weller, Orion and Seppi, Kevin},
  booktitle={Proceedings of the Twelfth Language Resources and Evaluation Conference},
  pages={6136--6141},
  year={2020}
}
Tip**: These two datasets were preprocessed by group member Jiayu Zhang.
Tip***: It is provided by this module.
Tip****: (Important if you want to run these two scripts.)
Because these two scripts predate the final version and require a BERT model downloaded to the local environment, I uploaded the compressed package to Baidu Netdisk; please download it through this link, or, preferably, download the bert-base-cased model from Hugging Face.
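If you take the Hugging Face route, a minimal sketch like the one below can download bert-base-cased and save it to a local folder for the two scripts to load. The local path `./bert-base-cased` is only an example, not a path hard-coded in the scripts.

```python
# Sketch: download bert-base-cased from Hugging Face and save it locally.
# The target directory "./bert-base-cased" is an example path, not the one
# used by the coursework scripts.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

# Save both to a local folder; point the scripts at this path afterwards.
tokenizer.save_pretrained("./bert-base-cased")
model.save_pretrained("./bert-base-cased")
```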
Here I list the required libraries for each code file. You can install these libraries through conda or pip.
- transformers==4.24.0
- cudatoolkit==11.3.1 *
- pytorch==1.12.1 *
- numpy==1.21.5
- pandas==1.3.5
- collections **
- tqdm==4.64.1 ***
- datasets==2.6.1 ****
- matplotlib==3.5.3 *****
Tip*: To use the available GPU resources, according to the tutorial, it is important to install the cudatoolkit library first (it can be thought of as the conda version of CUDA). After that, go to the official PyTorch website to install PyTorch and its related libraries.
To be specific, I installed PyTorch using the following command with pinned versions:
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
Tip**: It is Python's built-in standard library.
Tip***: It is used to show the progress bar while training and evaluating the models.
Tip****: Use this library to create the datasets.
Tip*****: Used to plot the loss curves for the generator and detector in pyplot style.
- Besides, Spyder's version is 5.3.3.
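As a quick sanity check (my own addition, not part of the coursework scripts), the following sketch prints the installed versions and confirms that PyTorch can see the GPU after cudatoolkit and pytorch are installed:

```python
# Sketch: verify that the installed versions match the list above and that
# the GPU is visible to PyTorch.
import torch
import transformers
import numpy
import pandas
import tqdm
import datasets
import matplotlib

print("transformers  :", transformers.__version__)  # expected 4.24.0
print("pytorch       :", torch.__version__)         # expected 1.12.1
print("numpy         :", numpy.__version__)         # expected 1.21.5
print("pandas        :", pandas.__version__)        # expected 1.3.5
print("tqdm          :", tqdm.__version__)          # expected 4.64.1
print("datasets      :", datasets.__version__)      # expected 2.6.1
print("matplotlib    :", matplotlib.__version__)    # expected 3.5.3
print("CUDA available:", torch.cuda.is_available())
```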
In this repository, I followed the model-building framework from the PyTorch tutorial in the lab.
At the beginning of the project, I tried to use this code style because, during the whole module, this style gave me a clear understanding of the training and validation procedure, making the code more readable and structured.
However, using the Trainer provided by Hugging Face may also be a good choice, because it encapsulates training, validation and evaluation, and only the parameters it needs have to be supplied.
For me, this coding style is very good at letting users understand how the model is trained and how to adjust hyperparameters to achieve better performance, so I kept it.
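To make the contrast with the Trainer concrete, here is a minimal sketch of that lab-style loop; all names (`model`, `train_loader`, `val_loader`, the hyperparameters) are placeholders for illustration, not the exact code in GPT2_generator_BERT_detector.py.

```python
# Sketch of the lab-style PyTorch training/validation loop used in this
# repository. Names and hyperparameters are placeholders.
import torch
from tqdm import tqdm

def train_and_validate(model, train_loader, val_loader, num_epochs=3, lr=5e-5):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    for epoch in range(num_epochs):
        # ---- training ----
        model.train()
        train_loss = 0.0
        for batch in tqdm(train_loader, desc=f"Epoch {epoch + 1} [train]"):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)   # Hugging Face models return the loss
            loss = outputs.loss        # when labels are included in the batch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        # ---- validation ----
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Epoch {epoch + 1} [val]"):
                batch = {k: v.to(device) for k, v in batch.items()}
                val_loss += model(**batch).loss.item()

        print(f"Epoch {epoch + 1}: "
              f"train loss {train_loss / len(train_loader):.4f}, "
              f"val loss {val_loss / len(val_loader):.4f}")
```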
- The coding style
From this module's materials (labs and lectures)
Official GitHub repository of `BERT`.
Interpretation of the `BERT` model (TWO pre-training tasks: Masked Language Model and Next Sentence Prediction).
How to download the pre-trained BERT model from Hugging Face. Understanding the info (like parameters) of the `BertTokenizer` class.
- https://huggingface.co/docs/transformers/model_doc/bert#resources, https://huggingface.co/docs/transformers/model_doc/gpt2
Hugging Face documentation about the `BERT` and `GPT2` models. There is no official PyTorch implementation from Google, but Hugging Face reimplemented it.
How to use `BERT` from the Hugging Face Transformers library.
`BERT` uses WordPiece to tokenise.
The function of each of the pre-trained `BERT` models.
Text generation strategies from the Transformers library, including the parameters related to different strategies.
Fine-tuning the model using the `Trainer`.
- Kenton, J. D. M. W. C., & Toutanova, L. K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT (Vol. 1, p. 2).
The paper of the `BERT` model.
How to train the model for the Generator and Detector.
Randomly split the datasets.
The optimizer used in this project.
The difference between tokenizer() and tokenizer.encode()
- https://blog.csdn.net/qq_16555103/article/details/136805147, https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/text_generation#transformers.GenerationConfig
About model.generate().
Use the progress bar in PyTorch.
See the batch size and overall number of samples of the DataLoader.
Perplexity.
Convert `bool` to `int`.
Draw loss graphs.
Solve the problem: TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu().
Solve the problem: RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
Using `tree` to generate the tree structure of the GitHub repository.
Python comment specification.
The paper of the `GPT-2` model.