An experimental GPT-style language model built from scratch in PyTorch to generate novel, heat-stable thermophilic protein sequences.
- Decoder-only Transformer architecture.
- Custom Byte-Pair Encoding (BPE) tokenizer trained on a curated dataset of thermophilic proteins.
- Simple, educational codebase for training and generation.
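The core of a decoder-only Transformer is causal self-attention, where each token attends only to itself and earlier positions. As an educational illustration (this is a dependency-free sketch with made-up function names, not the code in this repo), a single attention head with a causal mask can be written in plain Python:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def causal_self_attention(x, wq, wk, wv):
    """One attention head over a sequence.

    x: list of d-dimensional token embeddings (lists of floats)
    wq, wk, wv: d x d projection matrices (lists of rows)
    """
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]

    q = [matvec(wq, t) for t in x]
    k = [matvec(wk, t) for t in x]
    v = [matvec(wv, t) for t in x]
    d = len(x[0])
    out = []
    for i in range(len(x)):
        # Causal mask: position i only sees positions 0..i,
        # so generation can proceed left to right.
        scores = [sum(qi * kj for qi, kj in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(d)])
    return out
```

A real implementation (e.g. in PyTorch) batches this with matrix multiplies, adds multiple heads, residual connections, and feed-forward layers, but the masking logic is the same.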
- Prepare the dataset and place it in the root folder.
- Train the BPE tokenizer: `python bpe_vocab.py`
- Train the generative model: `python main.py`. The script will print generated sequences upon completion.
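To show what BPE training does with protein sequences, here is a minimal sketch of the algorithm: repeatedly find the most frequent adjacent pair of tokens and merge it into a new vocabulary entry. The function names and the toy two-sequence corpus are illustrative assumptions, not the actual contents of `bpe_vocab.py`:

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent token pairs across all tokenized sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def train_bpe(sequences, num_merges):
    """Learn `num_merges` merge rules from raw amino-acid strings."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs
```

For example, on the toy corpus `["MKKL", "MKAV"]` the pair `("M", "K")` occurs twice, so the first learned merge turns each sequence's leading `M K` into a single `MK` token. Production BPE implementations add vocabulary-size limits, special tokens, and much faster pair counting, but the merge loop is the same idea.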
This project is heavily inspired by Andrej Karpathy's educational work on nanoGPT.
