LyricGen is a machine learning lyric generation tool built on a custom Transformer neural network that generates and completes song lyrics. The model is trained on multilingual datasets from Genius and produces contextually coherent lyrical content in three languages: English, French, and Arabic.
- Multilingual Support: Generates lyrics in English, French, and Arabic with language-specific tokenizers
- Advanced Data Preprocessing: Comprehensive text cleaning, filtering, and normalization
- Transformer Architecture: Custom implementation with multi-head attention, positional encoding, and layer normalization
- Interactive Generation: User-friendly prediction interface with customizable parameters
- Temperature Control: Adjustable creativity levels (0.3-1.2) for diverse output styles
- Performance Evaluation: BLEU score assessment for model quality
- Efficient Training: Optimized for Kaggle environment with ~27K training samples
Install the required Python libraries:
pip install pandas numpy scikit-learn nltk tensorflow matplotlib

The model uses the Genius Song Lyrics with Language Information dataset from Kaggle, containing song lyrics with metadata. The dataset undergoes extensive preprocessing including:
- Filtering for English, French, and Arabic lyrics only
- Removal of special characters, HTML tags, and structural markers
- Deduplication of redundant lyrics
- Language-specific text normalization
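The cleaning steps above can be sketched as a single pass over each lyric. This is a minimal illustration, not the project's actual code; the function name and exact regular expressions are assumptions:

```python
import html
import re

def clean_lyrics(text: str, language: str = "en") -> str:
    """Illustrative cleaning pass: strip HTML, structural markers,
    and (for English/French) case and punctuation."""
    text = html.unescape(text)                      # decode HTML entities
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML tags
    # Drop structural markers like [Verse 1] or [Chorus]
    text = re.sub(r"\[(verse|chorus|bridge)[^\]]*\]", " ", text, flags=re.I)
    if language in ("en", "fr"):
        text = text.lower()
        # Keep word characters (Unicode-aware, so accents survive),
        # whitespace, apostrophes, and hyphens
        text = re.sub(r"[^\w\s'-]", " ", text)
    # Arabic ('ar'): preserve Unicode characters and original case as-is
    return re.sub(r"\s+", " ", text).strip()
```

Note that Python's `\w` is Unicode-aware, so French accented characters pass through the punctuation filter untouched, while Arabic text skips that filter entirely.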
The model implements a custom Transformer-based architecture:
- Vocabulary Size: 15,000 words per language for comprehensive coverage
- Sequence Length: 50 tokens for optimal context window
- Embedding Dimension: 256
- Attention Heads: 8 multi-head attention mechanisms
- Feed-Forward Dimension: 512
- Total Parameters: Approximately 10-12M
- Key Components:
- Positional encoding for sequence awareness
- Multi-head self-attention layers
- Layer normalization and dropout for regularization
- Language-specific tokenization
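Of the components above, positional encoding is the one with a closed-form definition; a minimal sketch of the standard sinusoidal variant, sized to the configuration listed above (50 tokens, 256 dimensions), is shown below. The function name is illustrative:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encoding: sine on even
    dimensions, cosine on odd dimensions."""
    positions = np.arange(seq_len)[:, None]   # shape (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dims
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dims
    return pe

pe = positional_encoding(50, 256)  # matches the model configuration above
```

The resulting matrix is added to the token embeddings so that the self-attention layers, which are otherwise order-agnostic, can distinguish token positions.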
- Load and preprocess the Genius dataset
- Configure language-specific tokenizers (English, French, Arabic)
- Train the Transformer model with the prepared sequences
- Evaluate performance using BLEU scores
Use the interactive prediction function:
predict_next_lyrics(
    seed_text="your starting lyrics here",
    language='en',    # 'en', 'fr', or 'ar'
    num_words=8,
    temperature=0.7   # 0.3-1.2 for creativity control
)

Temperature Guidelines:
- 0.3-0.5: Conservative, predictable outputs
- 0.6-0.8: Balanced mode (recommended)
- 0.9-1.2: Creative, experimental outputs
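Under the hood, temperature sampling typically scales the model's logits before the softmax: lower temperatures sharpen the distribution toward the most likely token, higher ones flatten it. A minimal sketch (the function names here are illustrative, not the project's API):

```python
import numpy as np

def temperature_probs(logits, temperature: float) -> np.ndarray:
    """Softmax over logits scaled by 1/temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

def sample_with_temperature(logits, temperature: float, rng=None) -> int:
    """Draw one token id from the temperature-adjusted distribution."""
    rng = rng or np.random.default_rng()
    probs = temperature_probs(logits, temperature)
    return int(rng.choice(len(probs), p=probs))
```

At temperature 0.3 the distribution is close to greedy decoding; at 1.2 low-probability tokens are sampled noticeably more often, which is what produces the "experimental" outputs described above.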
- Language Filtering: Select English, French, and Arabic lyrics
- Text Cleaning:
- English & French: Lowercase conversion, punctuation removal
- Arabic: Preserve Unicode characters and original case
- Special Token Addition: Add <sos> (start) and <eos> (end) markers
- Tokenization: Language-specific vocabulary building with <OOV> handling
- Sequence Padding: Normalize to 50-token length
- Dataset Splitting: 70% training, 15% validation, 15% test
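The special-token, tokenization, and padding steps above can be sketched in plain Python. This is an assumed, simplified version of the pipeline (the project itself would typically use a Keras tokenizer); token and function names are illustrative:

```python
from collections import Counter

SOS, EOS, OOV, PAD = "<sos>", "<eos>", "<OOV>", "<pad>"

def build_vocab(corpus, max_size: int = 15000) -> dict:
    """Frequency-ranked vocabulary with reserved special-token ids."""
    counts = Counter(tok for line in corpus for tok in line.split())
    vocab = {PAD: 0, OOV: 1, SOS: 2, EOS: 3}
    for tok, _ in counts.most_common(max_size - len(vocab)):
        vocab[tok] = len(vocab)
    return vocab

def encode(line: str, vocab: dict, seq_len: int = 50) -> list:
    """Wrap in <sos>/<eos>, map unknown words to <OOV>,
    then pad/truncate to seq_len."""
    ids = [vocab[SOS]] + [vocab.get(t, vocab[OOV]) for t in line.split()] + [vocab[EOS]]
    ids = ids[:seq_len]
    return ids + [vocab[PAD]] * (seq_len - len(ids))
```

Each language gets its own vocabulary of this kind, so the same word in English and French receives independent ids.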
- Optimizer: Adam with learning rate scheduling
- Loss Function: Sparse categorical crossentropy
- Training Strategy: Autoregressive next-token prediction
- Data Augmentation: Language-aware processing
- Regularization: Dropout and layer normalization
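The learning-rate scheduling mentioned above is often implemented as the Transformer-style warmup-then-decay schedule; whether this project uses exactly this formula is an assumption, but it illustrates the idea:

```python
def lr_schedule(step: int, d_model: int = 256, warmup_steps: int = 4000) -> float:
    """Warmup-then-decay schedule: learning rate rises linearly for
    warmup_steps, then decays proportionally to 1/sqrt(step)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The schedule peaks at `warmup_steps`, which keeps early Adam updates small while embeddings and attention weights are still unstable.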
The model's performance is evaluated using:
- BLEU Scores: Measure similarity between generated and reference lyrics
- Perplexity: Assess model confidence
- Language-Specific Metrics: Per-language performance analysis
- Handles right-to-left text (Arabic) and left-to-right languages (English, French)
- Maintains accented characters for French
- Optimized for computational efficiency on Kaggle
- Interactive interface for creative lyric exploration
- Suitable for songwriting assistance, creative writing, and language learning
- Songwriting Assistance: Generate continuation ideas for lyrics in progress
- Creative Writing: Explore different stylistic directions
- Language Learning: Study natural language patterns across languages
- Educational Demonstrations: Showcase modern NLP text generation capabilities
Eman Sarah Afi
Fall 2024