
Word2Vec Embeddings: Skip-gram vs CBOW

A comparative study of Word2Vec word embeddings using Skip-gram and CBOW architectures, trained on the Brown corpus and evaluated against Google News pre-trained embeddings.

Overview

This project explores word embeddings by:

  • Training Word2Vec models using both Skip-gram and CBOW methods
  • Visualizing embeddings in 2D space using t-SNE dimensionality reduction
  • Evaluating embedding quality using word similarity tasks with Pearson correlation
  • Comparing custom-trained embeddings against Google News pre-trained vectors

Requirements

  • Python 3.8+
  • nltk
  • gensim
  • numpy
  • matplotlib
  • scikit-learn

Installation

pip install -r requirements.txt

Project Structure

.
├── train_embeddings.py      # Main script for training and evaluation
├── visualization.py         # Helper functions for t-SNE and plotting
├── word_pairs.txt           # Custom word similarity dataset
├── requirements.txt         # Python dependencies
└── README.md

Usage

Train Word2Vec Models

python train_embeddings.py

This will:

  1. Download and preprocess the Brown corpus
  2. Train Skip-gram and CBOW models
  3. Save embeddings to skipgram_embeddings.txt and cbow_embeddings.txt
  4. Visualize embeddings using t-SNE
  5. Evaluate models using Pearson correlation on word similarity task
  6. Find most similar words for sample queries

Word Similarity Dataset

The word_pairs.txt file contains word pairs with human-judged similarity scores (0-1):

man     woman    0.8
king    queen    0.85
happy   sad      0.3
...

Results

The project compares three embedding sources:

  • Skip-gram: Trained on Brown corpus (~1M words)
  • CBOW: Trained on Brown corpus (~1M words)
  • Google News: Pre-trained on the Google News dataset (~100 billion words, 3 million word vocabulary, 300 dimensions)

Expected Performance

Because the Brown corpus (~1M words) is tiny compared with the Google News training data, the pre-trained embeddings substantially outperform the locally trained models:

Model         Pearson Correlation   Notes
-----------   -------------------   ----------------------
Google News   ~0.3 (positive)       Best performance
Skip-gram     ~-0.2                 Limited by corpus size
CBOW          ~0                    Limited by corpus size

Despite low correlation scores, the Brown corpus models learn meaningful patterns:

  • Semantically related words (man/woman/boy/girl) cluster together in t-SNE
  • Some similar word queries return reasonable results

This demonstrates that Word2Vec requires large corpora (millions to billions of words) for robust semantic learning.

Visualization

Word embeddings are reduced to 2D using t-SNE and plotted to visualize semantic relationships between words like:

  • labour, chief, race, vow, learning
  • sun, moon, book, music, coding
  • friend, joy, flower, garden, rain
  • travel, ocean, peace, love

Author

Biswajeet Sahoo

License

MIT License
