A comparative study of Word2Vec word embeddings using Skip-gram and CBOW architectures, trained on the Brown corpus and evaluated against Google News pre-trained embeddings.
This project explores word embeddings by:
- Training Word2Vec models using both Skip-gram and CBOW methods
- Visualizing embeddings in 2D space using t-SNE dimensionality reduction
- Evaluating embedding quality using word similarity tasks with Pearson correlation
- Comparing custom-trained embeddings against Google News pre-trained vectors
- Python 3.8+
- nltk
- gensim
- numpy
- matplotlib
- scikit-learn
```
pip install -r requirements.txt
```
```
├── train_embeddings.py   # Main script for training and evaluation
├── visualization.py      # Helper functions for t-SNE and plotting
├── word_pairs.txt        # Custom word similarity dataset
├── requirements.txt      # Python dependencies
└── README.md
```
```
python train_embeddings.py
```
This will:
- Download and preprocess the Brown corpus
- Train Skip-gram and CBOW models
- Save embeddings to `skipgram_embeddings.txt` and `cbow_embeddings.txt`
- Visualize embeddings using t-SNE
- Evaluate models using Pearson correlation on word similarity task
- Find most similar words for sample queries
The `word_pairs.txt` file contains word pairs with human-judged similarity scores (0-1):
```
man woman 0.8
king queen 0.85
happy sad 0.3
...
```
The project compares three embedding sources:
- Skip-gram: Trained on Brown corpus (~1M words)
- CBOW: Trained on Brown corpus (~1M words)
- Google News: Pre-trained on the Google News dataset (~100B words, 3M-word vocabulary, 300 dimensions)
Due to corpus size differences, Google News embeddings significantly outperform Brown corpus models:
| Model | Pearson Correlation | Notes |
|---|---|---|
| Google News | ~0.3 (positive) | Best performance |
| Skip-gram | ~-0.2 | Limited by corpus size |
| CBOW | ~0 | Limited by corpus size |
Despite low correlation scores, the Brown corpus models learn meaningful patterns:
- Semantically related words (man/woman/boy/girl) cluster together in t-SNE
- Some similar word queries return reasonable results
This demonstrates that Word2Vec requires large corpora (millions to billions of words) for robust semantic learning.
Word embeddings are reduced to 2D using t-SNE and plotted to visualize semantic relationships between words like:
- labour, chief, race, vow, learning
- sun, moon, book, music, coding
- friend, joy, flower, garden, rain
- travel, ocean, peace, love
Biswajeet Sahoo
MIT License