Deep Maximum Entropy Markov Model for NLP
An interactive web application showcasing real-time sentiment analysis using Deep Learning and NLP.
Try it now: Run ./run_demo.sh or python demo.py for a quick demo!
- Interactive Web Interface: Beautiful, modern UI with real-time sentiment analysis
- Word-by-Word Analysis: See sentiment classification for each word with confidence scores
- Visual Analytics: Animated charts and color-coded sentiment flow
- REST API: Clean API for integration into other projects
- Multiple Neural Architectures: MLP, BiLSTM, Word2Vec embeddings
# Install dependencies
pip install -r requirements.txt
# Run the server
./run_demo.sh

Then open http://localhost:5000 in your browser!
python demo.py

A comprehensive implementation of Maximum Entropy Markov Models (MEMM) using deep learning approaches for sentiment analysis and sequence tagging tasks. This project explores three different neural architectures for capturing contextual information in text sequences.
- Overview
- Background: Maximum Entropy Markov Models
- Model Architectures
- Data Flow & Preprocessing
- Training Process
- Results & Model Comparison
- Installation & Usage
- Project Structure
- Key Insights
This project implements three variants of deep learning models for sequence labeling, specifically targeting sentiment analysis with the following tags:
- T-POS: Positive sentiment
- T-NEG: Negative sentiment
- T-NEU: Neutral sentiment
- O: No sentiment (other)
The models combine traditional MEMM approaches with modern deep learning techniques to capture sequential dependencies and context in text data.
MEMMs are discriminative sequence models that predict each label conditioned on:
- Observations (words/features)
- Previous state (previous tag)
Unlike HMMs which model joint probability P(words, tags), MEMMs directly model conditional probability:
P(tag_i | word_i, tag_{i-1}, context)
This allows MEMMs to incorporate rich, overlapping features and avoid independence assumptions.
┌─────────────────────────────────────────────────────────────┐
│ MEMM Sequence Tagging │
└─────────────────────────────────────────────────────────────┘
Input Sentence: ["love", "this", "movie", "but", "hate", "ending"]
Step 1: Step 2: Step 3:
┌─────────┐ ┌─────────┐ ┌─────────┐
│ "love" │ │ "this" │ │ "movie" │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
│ ┌─────┐ │ ┌─────┐ │ ┌─────┐
└──→│START│ └──→│T-POS│ └──→│T-POS│
└──┬──┘ └──┬──┘ └──┬──┘
│ │ │
▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐
│ T-POS │ │ T-POS │ │ T-NEU │
└───────┘ └───────┘ └───────┘
(Predicted) (Predicted) (Predicted)
Each prediction uses:
• Current word embedding
• Previous predicted tag
• Context words (n-gram or LSTM)
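As a rough sketch of this greedy left-to-right decoding, the toy scorer below stands in for whichever neural model produces the tag scores (it is not the repository's actual API):

# Greedy MEMM-style decoding: each tag is predicted from the current word
# and the previously predicted tag. `tag_scores` is a stand-in for an MLP or BiLSTM.

TAGS = ["T-POS", "T-NEG", "T-NEU", "O"]

def tag_scores(word, prev_tag):
    """Toy scorer: the real models replace this with a neural network."""
    lexicon = {"love": "T-POS", "hate": "T-NEG", "movie": "T-NEU"}
    guess = lexicon.get(word, "O")
    return [1.0 if t == guess else 0.0 for t in TAGS]

def greedy_decode(words):
    prev_tag, predicted = "<START>", []
    for word in words:
        scores = tag_scores(word, prev_tag)          # uses word + previous tag
        prev_tag = TAGS[max(range(len(TAGS)), key=scores.__getitem__)]
        predicted.append(prev_tag)
    return predicted

print(greedy_decode(["love", "this", "movie", "but", "hate", "ending"]))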
File: dmemm/mlp.py
This approach learns word embeddings from scratch during training.
┌──────────────────────────────────────────────────────────────────┐
│ MLP with Random Initialized Embeddings │
└──────────────────────────────────────────────────────────────────┘
Input: Bigram Context with Previous Tag
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
word_{i-1} word_i tag_{i-1}
│ │ │
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌──────────┐
│Embedding│ │Embedding│ │ One-Hot │
│ Layer │ │ Layer │ │ Encoding │
│ (15-dim)│ │ (15-dim)│ │ (5-dim) │
└────┬────┘ └────┬────┘ └─────┬────┘
│ │ │
└──────┬───────┘ │
│ │
▼ │
┌────────────┐ │
│ Concatenate│◄───────────────┘
│ (30 + 5) │
└─────┬──────┘
│ 35-dimensional vector
▼
┌───────────┐
│ Linear │
│ (35→128) │
└─────┬─────┘
│
▼
┌───────────┐
│ ReLU │
└─────┬─────┘
│
▼
┌───────────┐
│ Linear │
│ (128→4) │
└─────┬─────┘
│
▼
┌───────────┐
│ LogSoftmax│
└─────┬─────┘
│
▼
[T-POS, T-NEG, T-NEU, O]
(Tag probabilities)
- Embedding Dimension: 15
- Context Size: Bigram (previous word + current word)
- Hidden Layer: 128 units
- Learns embeddings: Embeddings are randomly initialized and trained end-to-end
- Domain-specific vocabulary not in pre-trained models
- Twitter text, medical terminology, or specialized jargon
- When you have sufficient training data to learn good embeddings
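A minimal PyTorch sketch of the Option 1 architecture described above (class and variable names are illustrative, not the repository's actual code):

import torch
import torch.nn as nn

class BigramMLPTagger(nn.Module):
    """Bigram MLP tagger: two 15-dim learned embeddings + 5-dim one-hot
    previous tag -> 35-dim input -> 128 hidden units -> 4 tag log-probs."""

    def __init__(self, vocab_size, embedding_dim=15, tag_dim=5, hidden=128, num_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.fc1 = nn.Linear(2 * embedding_dim + tag_dim, hidden)
        self.fc2 = nn.Linear(hidden, num_tags)

    def forward(self, word_ids, prev_tag_onehot):
        # word_ids: (batch, 2) indices of [word_{i-1}, word_i]
        # prev_tag_onehot: (batch, 5) one-hot encoding of tag_{i-1}
        emb = self.embed(word_ids).flatten(start_dim=1)        # (batch, 30)
        x = torch.cat([emb, prev_tag_onehot], dim=1)           # (batch, 35)
        return torch.log_softmax(self.fc2(torch.relu(self.fc1(x))), dim=1)

# Example forward pass with a toy vocabulary of 100 words
model = BigramMLPTagger(vocab_size=100)
log_probs = model(torch.tensor([[3, 7]]), torch.eye(5)[[4]])   # previous tag = <START>
print(log_probs.shape)  # torch.Size([1, 4])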
File: dmemm/mlp-word2vec.py
This approach uses pre-trained Google News Word2Vec embeddings (300-dimensional).
┌──────────────────────────────────────────────────────────────────┐
│ MLP with Pre-trained Word2Vec │
└──────────────────────────────────────────────────────────────────┘
Pre-trained Word2Vec Model (GoogleNews-vectors-negative300.bin)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
↓ (Frozen weights)
word_{i-1} word_i tag_{i-1}
│ │ │
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌──────────┐
│ Word2Vec│ │ Word2Vec│ │ One-Hot │
│ Lookup │ │ Lookup │ │ Encoding │
│(300-dim)│ │(300-dim)│ │ (5-dim) │
└────┬────┘ └────┬────┘ └─────┬────┘
│ │ │
│ If word not in vocab: │
│ use zero vector │
│ │ │
└──────┬───────┘ │
│ │
▼ │
┌────────────┐ │
│ Concatenate│◄───────────────┘
│(600 + 5) │
└─────┬──────┘
│ 605-dimensional vector
▼
┌───────────┐
│ Linear │
│ (605→300) │
└─────┬─────┘
│
▼
┌───────────┐
│ ReLU │
└─────┬─────┘
│
▼
┌───────────┐
│ Linear │
│ (300→300) │
└─────┬─────┘
│
▼
┌───────────┐
│ ReLU │
└─────┬─────┘
│
▼
┌───────────┐
│ Linear │
│ (300→4) │
└─────┬─────┘
│
▼
┌───────────┐
│ LogSoftmax│
└─────┬─────┘
│
▼
[T-POS, T-NEG, T-NEU, O]
(Tag probabilities)
Word Vocabulary Handling
━━━━━━━━━━━━━━━━━━━━━━━
Input Word
│
▼
┌─────────────────┐
│ Word in W2V? │
└────┬───────┬────┘
│Yes │No
│ │
▼ ▼
┌─────┐ ┌──────────┐
│ W2V │ │ Zero Vec │
│ Vec │ │ (300-dim)│
└─────┘ └──────────┘
│ │
└────┬────┘
│
▼
300-dim Vector
- Embedding Dimension: 300 (pre-trained)
- Word2Vec Model: GoogleNews-vectors-negative300 (first 50,000 words)
- Frozen Embeddings: Pre-trained vectors are not updated during training
- Out-of-vocabulary: Words not in Word2Vec get zero vectors
- Network Depth: 3 fully connected layers (605 → 300 → 300 → 4)
- Small datasets: Leverage knowledge from large corpora
- General vocabulary: Standard English words
- Best performance: This option achieved the best results in the experiments
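A sketch of the embedding lookup described above using gensim's KeyedVectors; the file name and limit=50000 follow the description, and the helper names are illustrative:

import numpy as np
from gensim.models import KeyedVectors

# Load only the first 50,000 vectors to keep memory modest (as described above)
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, limit=50000
)

def word_vector(word, dim=300):
    """Return the pre-trained 300-dim vector, or a zero vector for OOV words."""
    if word in w2v:
        return np.asarray(w2v[word])
    return np.zeros(dim, dtype=np.float32)

# Input to the MLP: [vec(word_{i-1}); vec(word_i); one-hot(tag_{i-1})] -> 605 dims
prev_tag_onehot = np.eye(5)[0]                      # e.g. previous tag = T-POS
x = np.concatenate([word_vector("love"), word_vector("movie"), prev_tag_onehot])
print(x.shape)  # (605,)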
File: dmemm/bilstm.py
This approach uses a Bidirectional LSTM to capture context from the entire sentence.
┌──────────────────────────────────────────────────────────────────┐
│ BiLSTM-MEMM Architecture │
└──────────────────────────────────────────────────────────────────┘
Input Sentence: [word_1, word_2, ..., word_n]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Step 1: Sentence Encoding
─────────────────────────
word_1 word_2 word_3 ... word_n
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Embed │ │Embed │ │Embed │ │Embed │
│15-dim│ │15-dim│ │15-dim│ ... │15-dim│
└──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘
│ │ │ │
└────┬────┴────┬────┴──────────────┘
│ │
▼ ▼
┌─────────────────────────────────┐
│ Bidirectional LSTM Layer │
│ │
│ Forward → → → → → → → │
│ │
│ ← ← ← ← ← ← ← Backward │
└────┬───┬───┬───────────┬────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌────┐┌────┐ ┌────┐
│ h1 ││ h2 │ ... │ hn │ (hidden states, 10-dim)
└─┬──┘└─┬──┘ └─┬──┘
│ │ │
▼ ▼ ▼
┌──────┐┌──────┐ ┌──────┐
│Linear││Linear│ │Linear│
│10→6 ││10→6 │ │10→6 │
└──┬───┘└──┬───┘ └──┬───┘
│ │ │
▼ ▼ ▼
[feat1][feat2]...[featn] (features for each word)
Step 2: MEMM Scoring with Viterbi Decoding
──────────────────────────────────────────
For each position i, compute transition scores:
P(tag_i | features_i, tag_{i-1})
tag_{i-1} features_i
│ │
▼ ▼
┌─────────────────────┐
│ Transition Matrix │
│ + Feature Score │
└──────────┬──────────┘
│
▼
Score(tag_i)
Viterbi Algorithm:
─────────────────
Time: t=0 t=1 t=2 t=3
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
Tags: │START│ │ │ │ │ │STOP │
└──┬──┘ └──┬──┘ └──┬──┘ └─────┘
│ │ │
│ ┌───┼───┐ ┌───┼───┐
│ │ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ ▼
┌────┐┌────┐┌────┐┌────┐┌────┐
│T-POS│T-NEG│T-NEU│ O │T-POS│ ...
└─┬──┘└─┬──┘└─┬──┘└─┬──┘└─┬──┘
│ │ │ │ │
Score│ Score │ Score │ Score │ ...
│ │ │ │ │
└─────┴─────┴─────┴─────┘
│
▼
Backtrack for best path
│
▼
[T-POS, T-POS, T-NEU, O, T-NEG]
(Final prediction)
Bidirectional LSTM Cell Processing
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
At each timestep t:
Forward Direction (→):
─────────────────────
h_{t-1} x_t
│ │
└──┬───┘
│
┌─────▼─────┐
│ LSTM Cell │
│ (forget, │
│ input, │
│ output │
│ gates) │
└─────┬─────┘
│
▼
h_t →
Backward Direction (←):
──────────────────────
h_{t+1} x_t
│ │
└──┬───┘
│
┌─────▼─────┐
│ LSTM Cell │
│ (forget, │
│ input, │
│ output │
│ gates) │
└─────┬─────┘
│
▼
← h_t
Combined:
────────
h_t → ⊕ ← h_t
│
▼
[Concatenated
bidirectional
hidden state]
│
▼
Feature vector
for position t
- Embedding Dimension: 15 (randomly initialized)
- Hidden Dimension: 10 (5 per direction)
- Bidirectional: Captures context from both left and right
- Viterbi Decoding: Finds optimal tag sequence using dynamic programming
- Transition Matrix: Learned conditional probabilities P(tag_i | tag_{i-1})
# For each word position, compute:
score = feature_score(word_i) + transition_score(tag_{i-1} → tag_i)
# The model learns:
# 1. Feature scores from BiLSTM
# 2. Transition probabilities between tags

- Long-range dependencies: Captures context from entire sentence
- Better than n-grams: Not limited to fixed window size
- Structured prediction: Viterbi ensures globally consistent tag sequences
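A minimal sketch of the Viterbi decoding step, assuming per-position feature scores from the BiLSTM projection and a learned transition matrix (this is an illustration, not the repository's exact implementation):

import torch

def viterbi_decode(feats, transitions, start_idx, stop_idx):
    """feats: (seq_len, num_tags) feature scores from the BiLSTM projection.
    transitions[j, i]: score of moving from tag i to tag j.
    Returns the highest-scoring tag sequence as a list of tag indices."""
    num_tags = transitions.size(0)

    # Initialization: only <START> carries probability mass
    score = torch.full((num_tags,), -1e4)
    score[start_idx] = 0.0
    backpointers = []

    for feat in feats:
        # score[i] + transitions[j, i] + feat[j], maximized over previous tag i
        total = score.unsqueeze(0) + transitions + feat.unsqueeze(1)
        score, best_prev = total.max(dim=1)
        backpointers.append(best_prev)

    # Transition into <STOP>, then backtrack along the saved pointers
    score = score + transitions[stop_idx]
    best_tag = int(score.argmax())
    path = [best_tag]
    for best_prev in reversed(backpointers):
        best_tag = int(best_prev[best_tag])
        path.append(best_tag)
    path.pop()          # drop the <START> placeholder
    return list(reversed(path))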
┌──────────────────────────────────────────────────────────────────┐
│ Data Processing Pipeline │
└──────────────────────────────────────────────────────────────────┘
1. Load Raw Data
━━━━━━━━━━━━━━━
train_set.pkl / test_set.pkl
│
▼
┌────────────────┐
│ List of dicts: │
│ { │
│ 'words': [...],│
│ 'ts_raw_tags':│
│ [....] │
│ } │
└───────┬────────┘
│
▼
2. Create Word-Tag Tuples
━━━━━━━━━━━━━━━━━━━━━━━━
([words], [tags])
│
▼
Example:
(['love', 'this', 'movie'], ['T-POS', 'T-POS', 'O'])
│
▼
3. Build N-gram Contexts (Options 1 & 2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bigram with Previous Tag:
([word_{i-1}, word_i, tag_{i-1}], tag_i)
│
▼
Example:
(['love', 'this', 'T-POS'], 'T-POS')
(['this', 'movie', 'T-POS'], 'O')
│
▼
4. Flatten & Split
━━━━━━━━━━━━━━━━━
All bigrams from all sentences
│
├──→ 80% Train
└──→ 20% Validation
│
▼
5. Convert to Tensors
━━━━━━━━━━━━━━━━━━━━
Option 1:
word → index → embedding
Option 2:
word → Word2Vec vector (300-dim)
Option 3:
sentence → indices → embeddings → BiLSTM
│
▼
6. Training
━━━━━━━━━━━
Batch processing → Forward pass → Loss → Backprop
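A small sketch of step 3, building bigram-with-previous-tag examples from one (words, tags) pair (the helper name is illustrative):

def bigram_contexts(words, tags, start_token="<START>"):
    """Yield ([word_{i-1}, word_i, tag_{i-1}], tag_i) training examples.
    The first position uses <START> as both previous word and previous tag."""
    examples = []
    prev_word, prev_tag = start_token, start_token
    for word, tag in zip(words, tags):
        examples.append(([prev_word, word, prev_tag], tag))
        prev_word, prev_tag = word, tag
    return examples

for x, y in bigram_contexts(["love", "this", "movie"], ["T-POS", "T-POS", "O"]):
    print(x, "->", y)
# ['<START>', 'love', '<START>'] -> T-POS
# ['love', 'this', 'T-POS'] -> T-POS
# ['this', 'movie', 'T-POS'] -> O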
Tag Encoding Scheme
━━━━━━━━━━━━━━━━━━━
Tag Name │ One-Hot Encoding │ Index
──────────┼──────────────────────┼────────
T-POS │ [1, 0, 0, 0, 0] │ 0
T-NEG │ [0, 1, 0, 0, 0] │ 1
T-NEU │ [0, 0, 1, 0, 0] │ 2
O │ [0, 0, 0, 1, 0] │ 3
<START> │ [0, 0, 0, 0, 1] │ 4
<STOP> │ N/A │ 5
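The same encoding expressed as Python constants (an illustrative sketch):

import torch

TAG_TO_IX = {"T-POS": 0, "T-NEG": 1, "T-NEU": 2, "O": 3, "<START>": 4, "<STOP>": 5}

def one_hot_prev_tag(tag, dim=5):
    """One-hot encode a previous tag (only the first five tags can precede a word)."""
    vec = torch.zeros(dim)
    vec[TAG_TO_IX[tag]] = 1.0
    return vec

print(one_hot_prev_tag("T-NEU"))   # tensor([0., 0., 1., 0., 0.])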
Dataset Splits (80-20 split)
━━━━━━━━━━━━━━━━━━━━━━━━━━━
Original Sentences
│
▼
┌───────────────────┐
│ Generate n-grams │
│ (or full sent.) │
└────────┬──────────┘
│
▼
┌────────────────────┐
│ Shuffle & Split │
└─────┬──────────┬───┘
│ │
▼ ▼
Train Validation
(80%) (20%)
┌──────────────────────────────────────────────────────────────────┐
│ Training Loop (Per Epoch) │
└──────────────────────────────────────────────────────────────────┘
FOR each epoch:
│
├─→ FOR each training sample:
│ │
│ ├─→ 1. Prepare Input
│ │ ├─ Convert words to embeddings
│ │ ├─ Encode previous tag
│ │ └─ Create input tensor
│ │
│ ├─→ 2. Forward Pass
│ │ ├─ Pass through network
│ │ └─ Get log probabilities
│ │
│ ├─→ 3. Compute Loss
│ │ └─ NLL Loss between prediction and true tag
│ │
│ ├─→ 4. Backward Pass
│ │ ├─ Compute gradients
│ │ └─ Update parameters
│ │
│ └─→ 5. Track Loss
│
└─→ Return average epoch loss
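A condensed PyTorch sketch of this loop, using stand-in tensors and a toy MLP in place of the real data pipeline and models:

import torch
import torch.nn as nn

# Toy stand-ins: 35-dim inputs (30-dim bigram embedding + 5-dim prev tag) and 4 tag labels
inputs = torch.randn(256, 35)
targets = torch.randint(0, 4, (256,))

model = nn.Sequential(nn.Linear(35, 128), nn.ReLU(), nn.Linear(128, 4), nn.LogSoftmax(dim=1))
loss_fn = nn.NLLLoss()                                   # expects log-probabilities
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(15):
    epoch_loss = 0.0
    for x, y in zip(inputs, targets):
        optimizer.zero_grad()                            # 1. reset gradients
        log_probs = model(x.unsqueeze(0))                # 2. forward pass
        loss = loss_fn(log_probs, y.unsqueeze(0))        # 3. NLL loss vs. true tag
        loss.backward()                                  # 4. backward pass
        optimizer.step()                                 #    parameter update
        epoch_loss += loss.item()                        # 5. track loss
    print(f"Epoch {epoch + 1}: avg loss {epoch_loss / len(inputs):.3f}")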
All three options use Negative Log-Likelihood (NLL) Loss:
NLL Loss
━━━━━━━━
Given:
- Predicted log probabilities: [log P(T-POS), log P(T-NEG), log P(T-NEU), log P(O)]
- True tag: T-POS (index 0)
Loss = -log P(T-POS)
= -predicted_log_probs[0]
Goal: Minimize this loss
→ Maximize probability of correct tag
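A tiny worked example of this arithmetic (probabilities are made up):

import math

log_probs = [math.log(p) for p in [0.7, 0.1, 0.1, 0.1]]  # [T-POS, T-NEG, T-NEU, O]
true_index = 0                                           # true tag: T-POS
loss = -log_probs[true_index]
print(round(loss, 3))  # 0.357 (= -log 0.7; pushing P(T-POS) toward 1 drives this to 0)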
Optimizer Configurations
━━━━━━━━━━━━━━━━━━━━━━
Option 1 & 2:
┌────────────────────┐
│ Adam Optimizer │
│ lr = 0.001 │
│ β₁ = 0.9 │
│ β₂ = 0.999 │
└────────────────────┘
Option 3:
┌────────────────────┐
│ Adam Optimizer │
│ lr = 0.01 │
│ (higher for LSTM) │
└────────────────────┘
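In PyTorch terms, these configurations roughly correspond to the following (stand-in modules are used so the snippet runs on its own):

import torch
import torch.nn as nn

mlp_model = nn.Linear(35, 4)                              # stand-in for the Option 1/2 MLP
bilstm_model = nn.LSTM(15, 5, bidirectional=True)         # stand-in for the Option 3 BiLSTM

# Options 1 & 2: Adam with lr = 0.001 (betas shown explicitly; these are Adam's defaults)
mlp_optimizer = torch.optim.Adam(mlp_model.parameters(), lr=0.001, betas=(0.9, 0.999))

# Option 3: Adam with a higher lr = 0.01 for the BiLSTM
bilstm_optimizer = torch.optim.Adam(bilstm_model.parameters(), lr=0.01)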
Learning Rate Comparison (Option 2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Learning Rate │ Loss Behavior (30 epochs)           │ Verdict
──────────────┼─────────────────────────────────────┼───────────────
0.001         │ Smooth convergence, stable learning │ OPTIMAL ✓
0.01          │ Oscillating loss, some instability  │ Acceptable
0.05          │ Loss explosion, then stabilizes low │ Too high ✗
┌──────────────────────────────────────────────────────────────┐
│ Model Performance Comparison │
└──────────────────────────────────────────────────────────────┘
Model │ Embedding │ Context │ Performance
─────────────────────────┼──────────────┼────────────┼─────────────
Option 1: MLP Random │ 15-dim │ Bigram │ Moderate
│ (learned) │ (2 words) │
─────────────────────────┼──────────────┼────────────┼─────────────
Option 2: MLP + Word2Vec │ 300-dim │ Bigram │ ⭐ BEST
│ (pre-trained)│ (2 words) │
─────────────────────────┼──────────────┼────────────┼─────────────
Option 3: BiLSTM-MEMM │ 15-dim │ Full │ Lower
│ (learned) │ sentence │ (tuning issues)
Advantages of Pre-trained Embeddings (Option 2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Small Dataset + Pre-trained = Better Generalization
Embeddings
┌──────────────────┐
│ Word2Vec Model │ Trained on billions of words
│ (Google News) │ from Google News corpus
└────────┬─────────┘
│
│ Rich semantic representations
│ ("love" ≈ "enjoy", "great")
▼
┌──────────────────┐
│ Small Training │
│ Dataset │ Only thousands of sentences
└────────┬─────────┘
│
│ Fine-tune classifier, not embeddings
▼
┌──────────────────┐
│ Better │ Leverage world knowledge
│ Performance │ Less overfitting
└──────────────────┘
Random Embeddings (Option 1):
• Must learn word meanings from scratch
• Limited training data
• May overfit or underfit
BiLSTM (Option 3):
• More parameters to train
• Requires more data for optimal performance
• Complex architecture needs careful tuning
Evaluation Metrics for Sequence Tagging
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Prediction vs Ground Truth:
Predicted: [T-POS, T-POS, O, T-NEG, O ]
True: [T-POS, O, T-POS, T-NEG, O ]
──────────────────────────────────
✓ ✗ ✗ ✓ ✓
Metrics Computed:
─────────────────
True Positives (TP):
Predicted sentiment tag (not O) AND correct
False Positives (FP):
Predicted sentiment tag but was O, or wrong sentiment
False Negatives (FN):
Predicted O but should be sentiment tag
Precision = TP / (TP + FP)
→ Of predicted sentiments, how many were correct?
Recall = TP / (TP + FN)
→ Of actual sentiments, how many did we find?
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
→ Harmonic mean, balances precision and recall
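A sketch of computing these counts and metrics from aligned predicted and true tag sequences, following the definitions above (the function name is illustrative):

def sentiment_prf(predicted, true):
    """Token-level precision / recall / F1, treating 'O' as the non-sentiment class."""
    tp = fp = fn = 0
    for p, t in zip(predicted, true):
        if p != "O" and p == t:
            tp += 1                       # correct sentiment tag
        elif p != "O":
            fp += 1                       # predicted sentiment, but wrong or should be O
        elif t != "O":
            fn += 1                       # missed a sentiment tag (predicted O)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred = ["T-POS", "T-POS", "O", "T-NEG", "O"]
gold = ["T-POS", "O", "T-POS", "T-NEG", "O"]
print(sentiment_prf(pred, gold))   # precision = recall = F1 = 2/3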
# Python 3.7+
pip install torch
pip install numpy
pip install gensim
pip install scikit-learn
pip install tqdm
pip install matplotlib

Expected Data Files:
━━━━━━━━━━━━━━━━━━
dmemm/
├── train_set.pkl # Training data (pickled)
├── test_set.pkl # Test data (pickled)
└── GoogleNews-vectors-negative300.bin # Word2Vec (for Option 2)
Data Format:
Each pickle file contains a list of dictionaries:
[
{
'words': ['word1', 'word2', ...],
'ts_raw_tags': ['T-POS', 'O', ...]
},
...
]
cd dmemm
python mlp.py

Key Parameters (edit in file):
- EMBEDDING_DIM = 15 - Dimension of learned embeddings
- CONTEXT_SIZE = 3 - Size of n-gram context
- num_epochs = 15 - Number of training epochs
- learning_rate = 0.001 - Adam optimizer learning rate
cd dmemm
# Download Word2Vec model first:
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# Place GoogleNews-vectors-negative300.bin in dmemm/
python mlp-word2vec.py

Key Parameters:
- Uses 300-dim Word2Vec embeddings (fixed)
- Loads first 50,000 words from Word2Vec
- Out-of-vocabulary words → zero vectors
cd dmemm
# Training mode:
python bilstm.py --load_model 0
# Evaluation mode (load saved model):
python bilstm.py --load_model 1

Key Parameters:
- EMBEDDING_DIM = 15 - Dimension of learned embeddings
- HIDDEN_DIM = 10 - LSTM hidden size (5 per direction)
- --load_model - 0 for training, 1 to load a saved model
Training Output:
━━━━━━━━━━━━━━━
• Loss curves per epoch
• Training progress with tqdm
• Final evaluation metrics (TP, FP, FN)
• Precision, Recall, F1 Score
Example:
Epoch 1/15: 100%|███████████| 45678/45678 [02:34<00:00]
Loss: 85.42
...
Epoch 15/15: 100%|██████████| 45678/45678 [02:31<00:00]
Loss: 28.15
Evaluation:
tp, fp, fn: 1234, 567, 234
Precision: 0.685
Recall: 0.841
F1: 0.755
dmemm/
├── app/ # Portfolio Web Application
│ ├── backend/
│ │ ├── app.py # Flask REST API
│ │ └── sentiment_analyzer.py # Inference module
│ └── frontend/
│ └── index.html # Interactive UI
├── dmemm/ # Original Research Implementations
│ ├── mlp.py # Option 1: Random embeddings
│ ├── mlp-word2vec.py # Option 2: Word2Vec embeddings
│ ├── bilstm.py # Option 3: BiLSTM-MEMM
│ ├── report.pdf # Detailed experimental results
│ │
│ ├── train_set.pkl # Training data (required)
│ ├── test_set.pkl # Test data (required)
│ └── GoogleNews-vectors-negative300.bin # Word2Vec model (required for Option 2)
├── demo.py # Quick CLI demo
├── run_demo.sh # One-click launcher
├── requirements.txt
└── saved_models/ # (created during training)
└── hw2-bilstm.pt # Saved BiLSTM model
This project implements sentiment analysis using:
- Maximum Entropy Markov Models (MEMM): Conditional probabilistic sequence model
- Neural Network Features: Deep learning for feature extraction
- Context Modeling: Considers word context and previous predictions
- Sentiment Classes: Positive (T-POS), Negative (T-NEG), Neutral (T-NEU, O)
- MLP with Random Init (dmemm/mlp.py): 15-dim embeddings, 128 hidden units
- Bi-LSTM MEMM (dmemm/bilstm.py): Bidirectional LSTM with Viterbi decoding
- MLP with Word2Vec (dmemm/mlp-word2vec.py): 300-dim pre-trained embeddings
# POST /api/analyze
{
"text": "I love this amazing movie!"
}
# Response
{
"success": true,
"overall": {
"sentiment": "Positive",
"confidence": 0.92
},
"words": [...]
}

- Real-time text analysis
- Color-coded word tags
- Sentiment probability bars
- Overall sentiment with confidence
- Individual word sentiments
- Confidence scores per word
- Emoji indicators
- Animated results
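A hedged example of calling the endpoint above with Python requests, assuming the Flask backend is running locally on port 5000 (the path and response fields follow the example above and are not independently verified):

import requests

resp = requests.post(
    "http://localhost:5000/api/analyze",
    json={"text": "I love this amazing movie!"},
    timeout=10,
)
result = resp.json()

if result.get("success"):
    overall = result["overall"]
    print(f"{overall['sentiment']} (confidence {overall['confidence']:.2f})")
    for word in result.get("words", []):
        print(word)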
- Product review sentiment analysis
- Social media monitoring
- Customer feedback analysis
- Content moderation
- Market research
- Backend: Python 3.7+, Flask, PyTorch
- Frontend: Vanilla JavaScript, HTML5, CSS3
- ML: Neural Networks, Word Embeddings, Sequence Modeling
- NLP: Tokenization, Sentiment Classification, MEMM
- Inference: ~10-20ms per sentence
- Throughput: Hundreds of requests/second
- Memory: ~50MB model footprint
This project demonstrates:
- Deep Learning & NLP expertise
- Full-stack development (Flask + Frontend)
- REST API design
- Interactive data visualization
- Model deployment and inference optimization
- Clean, documented code
- Model comparison interface (MLP vs BiLSTM vs Word2Vec)
- Fine-tuning on custom datasets
- Batch processing for multiple texts
- Export results to CSV/JSON
- Docker containerization
- Cloud deployment (AWS, Heroku)
- Mobile-responsive improvements
See app/README.md for detailed development documentation.
Educational and portfolio project.
Based on Deep Maximum Entropy Markov Models for sequence labeling in NLP.
Built with PyTorch, Flask, and passion for NLP
Random vs Pre-trained Embeddings
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Small Dataset Scenario:
┌──────────────────────────────────────────────┐
│ │
│ Random Embeddings: │
│ • Must learn "love" means positive │
│ • Needs many examples │
│ • May not generalize well │
│ │
│ Pre-trained Embeddings: │
│ • Already knows "love" ≈ "enjoy" ≈ "great" │
│ • Semantic knowledge from billions of words │
│ • Better generalization ✓ │
│ │
└──────────────────────────────────────────────┘
Bigram (Options 1 & 2) vs Full Sentence (Option 3)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Bigram Context:
"I [love this] movie"
└─┬─┘
2-word window
Pros: Simple, fewer parameters, faster
Cons: Limited context
BiLSTM Context:
"[I love this movie]"
└──────┬──────┘
Full sentence
Pros: Long-range dependencies, full context
Cons: More parameters, needs more data, slower
Why Model Previous Tags?
━━━━━━━━━━━━━━━━━━━━━━
Sentence: "love this movie but hate ending"
Without Previous Tag:
love → T-POS
this → ?
movie → ?
but → ?
hate → T-NEG
With Previous Tag (MEMM):
love → T-POS
this → T-POS (likely continues positive)
movie → T-POS (still in positive phrase)
but → O (transition word)
hate → T-NEG (negative)
MEMMs capture:
• Sentiment tends to span multiple words
• Transition patterns (POS → POS more likely than POS → NEG → POS)
• Sequential structure of language
Greedy vs Viterbi
━━━━━━━━━━━━━━━━━
Greedy (Options 1 & 2):
Predict each tag independently
→ May produce inconsistent sequences
Viterbi (Option 3):
Find globally optimal sequence
→ Consistent, respects transition probabilities
Example:
Greedy: [T-POS, O, T-POS, T-NEG, O, T-POS]
(inconsistent, jumpy)
Viterbi: [T-POS, T-POS, T-POS, O, T-NEG, T-NEG]
(smooth transitions, more realistic)
From the experimental results:
Learning Rate Impact
━━━━━━━━━━━━━━━━━━━
Too Low (< 0.001):
⊙ Slow convergence
⊙ May not reach optimal
Optimal (0.001):
⊙ Smooth convergence ✓
⊙ Stable training ✓
⊙ Best performance ✓
Too High (> 0.01):
⊙ Oscillating loss
⊙ May miss optimal
⊙ Can explode
Always tune:
• Learning rate
• Batch size
• Network architecture
• Embedding dimensions
• Number of epochs
Potential Enhancements
━━━━━━━━━━━━━━━━━━━━━
1. Contextualized Embeddings
├─ Replace Word2Vec with BERT/RoBERTa
└─ Dynamic representations per context
2. CRF Layer (instead of MEMM)
├─ Conditional Random Fields
└─ Model global sequence dependencies
3. Attention Mechanisms
├─ Weighted context aggregation
└─ Interpretable focus on important words
4. Data Augmentation
├─ Synonym replacement
├─ Back-translation
└─ Increase training data size
5. Ensemble Methods
├─ Combine all three options
└─ Voting or stacking
6. Multi-task Learning
├─ Joint training on related tasks
└─ Transfer learning from larger datasets
- Maximum Entropy Markov Models: McCallum et al. (2000)
- Word2Vec: Mikolov et al. (2013) - "Efficient Estimation of Word Representations"
- BiLSTM for Sequence Tagging: Graves & Schmidhuber (2005)
- Viterbi Algorithm: Viterbi (1967)
- PyTorch: https://pytorch.org/
Academic project for CS 577 - Natural Language Processing
Joshua Yeung
For questions or issues, please refer to the code documentation or the detailed report.pdf.
P(tag_sequence | word_sequence) = ∏_{i=1}^{n} P(tag_i | tag_{i-1}, word_i, context_i)

where each local probability is modeled by a neural network:

P(tag_i | features) = exp(NN(features)_i) / Σ_j exp(NN(features)_j)
                    = softmax(NN(features))_i
Forward LSTM:
→h_t = LSTM_forward(embedding_t, →h_{t-1})
Backward LSTM:
←h_t = LSTM_backward(embedding_t, ←h_{t+1})
Combined:
h_t = [→h_t ; ←h_t] (concatenation)
Features:
f_t = W × h_t + b (linear projection to tag space)
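A minimal PyTorch sketch of these three steps with the Option 3 sizes (15-dim embeddings, 10-dim bidirectional hidden state, projection to 6 tag scores); names are illustrative:

import torch
import torch.nn as nn

EMBEDDING_DIM, HIDDEN_DIM, NUM_TAGS = 15, 10, 6   # 6 = 4 sentiment tags + <START>/<STOP>

embed = nn.Embedding(100, EMBEDDING_DIM)                      # toy vocab of 100 words
bilstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM // 2, bidirectional=True)
project = nn.Linear(HIDDEN_DIM, NUM_TAGS)                     # h_t -> per-tag feature scores

sentence = torch.tensor([3, 17, 42, 8])                       # word indices, seq_len = 4
x = embed(sentence).unsqueeze(1)                              # (seq_len, batch=1, 15)
h, _ = bilstm(x)                                              # (seq_len, 1, 10): [→h_t ; ←h_t]
feats = project(h.squeeze(1))                                 # (seq_len, 6) feature scores f_t
print(feats.shape)  # torch.Size([4, 6])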
Initialization:
π_0(START) = 0
π_0(tag) = -∞ for tag ≠ START
Recursion:
π_t(tag) = max over prev_tag [ π_{t-1}(prev_tag) + score(prev_tag → tag) + feature(word_t, tag) ]
Backtracking:
best_tag_T = argmax over tag of π_T(tag)
Trace back through the saved pointers to recover the optimal sequence.
Happy Training! 🚀