Skip to content

Akash671/OpenAI-GPT2-Model-Fine-Tuning

Repository files navigation

Fine-tuned GPT-2 Model for News Article Generation This repository contains a GPT-2 model that has been fine-tuned for generating short news articles based on headlines. This model is designed to produce text in a style similar to concise news reports, making it suitable for quick content generation or as a starting point for more specialized generative AI tasks.

📝 Description This model is a version of GPT-2 (gpt2), a popular open-source language model developed by OpenAI, further trained on a custom dataset. The original GPT-2 model is excellent at generating coherent text, and this fine-tuned version specializes in expanding news headlines into brief, coherent articles.

📊 Dataset The model was fine-tuned using a synthetic dataset of 1000 examples. Each example consists of a prompt (a news headline) and a completion (a short news article related to the headline). The dataset was created to demonstrate the fine-tuning process for text generation on a specific domain.

Example Data Format:

{"prompt": "Tech Innovation Boosts Economy (Generated Sample 1)", "completion": "A groundbreaking new technology in renewable energy has been announced, promising to significantly boost the national economy and create thousands of jobs. Experts predict a rapid adoption rate across various industries. This update is part of the daily news brief. (Generated Sample 1)"}

(This data is saved in fake_text_gen_dataset.jsonl.)

⚙️ Fine-tuning Process The fine-tuning was performed using the Hugging Face transformers library in Python.

Base Model: gpt2

Tokenization: The AutoTokenizer from Hugging Face was used. A crucial step was setting the tokenizer's pad_token to the eos_token (End-Of-Sequence token), as GPT-2 doesn't have a default padding token. This helps in batch processing during training.

Data Preprocessing: For each training example, the prompt and completion were concatenated into a single sequence, followed by the eos_token (e.g., "{prompt}{completion}{eos_token}"). The model then learns to predict the next token in this combined sequence.

Training: The Trainer API from Hugging Face was utilized. This handles the training loop, including:

Loss Function: The model optimizes for Causal Language Modeling (CLM) loss, which is essentially predicting the next token in a sequence.

Optimizer: AdamW.

Epochs: The model was trained for 3 epochs.

Batch Size: Typically 4-8 samples per batch, depending on available memory.

Evaluation: The model's performance was evaluated on a validation set at the end of each epoch using eval_loss.

🚀 How to Use the Model (Inference) To use this fine-tuned model for generating text, you'll need Python and the Hugging Face transformers and torch libraries.

Prerequisites pip install transformers torch datasets

Loading and Generating Text from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM import torch import os

Define the path where your fine-tuned model and tokenizer are saved

model_path = "./fine_tuned_gpt2_textgen" # Ensure this path is correct

Load the tokenizer and model

tokenizer = AutoTokenizer.from_pretrained(model_path) model = AutoModelForCausalLM.from_pretrained(model_path)

Ensure the tokenizer has a padding token defined

if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token model.config.pad_token_id = tokenizer.eos_token_id

Create a text generation pipeline

generator = pipeline( 'text-generation', model=model, tokenizer=tokenizer, device=0 if torch.cuda.is_available() else -1 # Use GPU if available, else CPU )

Example prompt

prompt = "Tech Innovation Boosts Economy (Generated Sample 1)"

Generate text

generated_text = generator( prompt, max_length=100, # Max total length of the output (prompt + generated) num_return_sequences=1, pad_token_id=tokenizer.eos_token_id, # Crucial for generation do_sample=True, # Enable sampling for more diverse output top_k=50, # Consider top 50 tokens for sampling temperature=0.7 # Controls randomness (lower is more predictable) )[0]['generated_text']

print(f"Prompt: {prompt}") print(f"Generated: {generated_text}")

Tip: You might need to post-process the generated text

to remove the input prompt or any trailing special tokens.

Model Files The zipped model (fine_tuned_gpt2_model.zip) contains the following key files:

pytorch_model.bin: The learned weights of the fine-tuned GPT-2 model.

config.json: The model's configuration.

vocab.json, merges.txt, tokenizer.json: Files necessary for the tokenizer to work.

generation_config.json: Configuration for text generation.

✨ Future Improvements Larger, Real-World Dataset: Fine-tuning on a more extensive and diverse dataset of actual news articles would significantly improve the model's quality and ability to generalize.

Evaluation Metrics: Implementing metrics like BLEU or ROUGE scores to quantitatively evaluate the quality of generated text.

Advanced Prompt Engineering: Exploring more sophisticated prompt structures during training to guide generation better.

Longer Text Generation: Adjusting max_length and exploring techniques for generating multi-paragraph articles.

📄 License This project is open-source and available under the MIT License. You are free to use, modify, and distribute the code for personal and commercial purposes.

About

OpenAI GPT2 Model Fine Tuning

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors