All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Increased minimum versions of dependencies (`transformers` to 4.3.0, `pytorch-lightning` to 1.2.0)
- Remove dependency on `tokenizers` as `transformers` pins it.
- Made Fast tokenizers the default (as it is the default in `transformers` 4.0.0)
- Made serialized tokenizers the default for custom tokenizers, and added support for loading them for both `aitextgen` and `TokenDataset`s
- Added gradient checkpointing for GPT-2, and set it to the default for training 355M and 774M.
- Added layer freezing to freeze the first `n` layers of GPT-2 while training. This allows 1.5B GPT-2 to be trained with a high `n`.
- Added schema-based generation for specified `schema_tokens` (which can be encoded in the Transformers config). This can be used with an appropriate dataset for schema-based generation.
- Switched TensorFlow weight download URL from GCP (as OpenAI removed it from there) to Azure
- Fixed issue where the prompt's character length, instead of its token length, was used in the too-long assertion (#90)
- Worked around a breaking issue in Transformers 4.3.0 by moving special token stripping into aitextgen instead of the tokenizer (#90)
- Added an `lstrip` param to generation, which strips all whitespace at the beginning of generated text (related to the point above)
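The layer-freezing entry above can be illustrated with a toy sketch. In practice, freezing the first `n` transformer blocks means disabling gradient updates for their parameters (in PyTorch, setting `requires_grad = False`); the dict-based "layers" below are only stand-ins for GPT-2 blocks, not the library's internals:

```python
def freeze_first_n(layers, n):
    """Mark the first n layers as frozen (non-trainable).

    `layers` is any sequence of dicts with a 'trainable' flag -- a toy
    stand-in for transformer blocks whose parameters would get
    requires_grad = False.
    """
    for i, layer in enumerate(layers):
        layer["trainable"] = i >= n
    return layers

# 4 toy layers, freeze the first 2.
model = [{"name": f"block_{i}", "trainable": True} for i in range(4)]
freeze_first_n(model, 2)
print([layer["trainable"] for layer in model])  # -> [False, False, True, True]
```

With a large `n`, only the last few blocks accumulate gradients, which is what makes fine-tuning the 1.5B model fit in memory.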
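The character-vs-token fix (#90) boils down to validating the prompt by how many tokens it encodes to, not how many characters it contains. A minimal sketch, with a whitespace tokenizer standing in for the model's real tokenizer (`prompt_fits` is a hypothetical helper, not the library's API):

```python
def prompt_fits(prompt: str, tokenize, max_length: int = 1024) -> bool:
    # The bug: checking len(prompt), the character count.
    # The fix: check the number of *tokens* the prompt encodes to.
    return len(tokenize(prompt)) <= max_length

# Whitespace tokenizer stand-in: 21 characters, but only 4 tokens.
print(prompt_fits("a fairly short prompt", str.split, max_length=4))  # -> True
```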
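The `lstrip` option amounts to dropping leading whitespace from each generated text before it is returned, roughly as follows (a sketch of the behavior, not the library's code):

```python
def apply_lstrip(texts, lstrip=True):
    # With lstrip=True, remove the spaces/newlines the model emits before
    # the first real character of each generated text.
    return [t.lstrip() if lstrip else t for t in texts]

print(apply_lstrip(["\n  Hello world.", " Second text."]))
# -> ['Hello world.', 'Second text.']
```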
- Increased minimum versions of dependencies (`transformers` to 4.0.0, `pytorch-lightning` to 1.0.8, PyTorch to 1.6)
- Fixed imports to account for the new Transformers file architecture
- Fixed training to account for new `transformers`/`pytorch-lightning` minimums
- Fully removed TorchScript code (the ONNX implementation will supersede it)
- Made prompt specification for generation more canonical with Transformers
- Set default `vocab_size` for new tokenizers to 1000
- Began work on serializing tokenizers in accordance with the new `tokenizers` approach
- Added CHANGELOG.md
- Added a progress bar for loading a dataset.
- Added a `progress_bar_refresh_rate` parameter for `train()` and `TokenDataset()`.
- Set numpy data store for `TokenDataset`.
- Set single-text files to be loaded delimited by newlines.
- Added `shuffle` and `seed` parameters for `TokenDataset`.
- Set `generate()` defaults to `max_length=256` and `temperature=0.7`.
- Added notes to the docs about GPT-2's maximum length of 1024.
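The newline-delimited loading noted above can be sketched as splitting one file into one text per line. This is an illustrative sketch only (`load_texts` is a hypothetical helper, and skipping blank lines is an assumption); the real `TokenDataset` also tokenizes each text:

```python
import os
import tempfile

def load_texts(path):
    # One training text per line of the file; skip blank lines (assumption).
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demo with a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("first text\nsecond text\n\nthird text\n")
    tmp = f.name
print(load_texts(tmp))  # -> ['first text', 'second text', 'third text']
os.remove(tmp)
```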
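Because GPT-2's context window is 1024 tokens, prompt tokens eat into the room available for new tokens. The arithmetic behind the docs note above looks like this (`usable_max_length` is a hypothetical helper for illustration, not a library function):

```python
GPT2_MAX_CONTEXT = 1024  # GPT-2's maximum sequence length in tokens

def usable_max_length(prompt_tokens, requested_max_length=256):
    # Prompt tokens occupy part of the 1024-token window, so generation
    # can only produce whatever room remains (never less than zero).
    room = GPT2_MAX_CONTEXT - prompt_tokens
    return max(0, min(requested_max_length, room))

print(usable_max_length(prompt_tokens=900))  # -> 124
print(usable_max_length(prompt_tokens=100))  # -> 256
```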
- Everything!