CBOW Model for Word Embeddings

This repository contains a Python implementation of the Continuous Bag of Words (CBOW) model for learning word embeddings. This implementation is intended for a college homework assignment.

Table of Contents

  • Introduction
  • Implementation Details
  • Usage
  • Training Results
  • References

Introduction

The Continuous Bag of Words (CBOW) model is a neural network-based approach for learning word embeddings. In CBOW, the context words are used to predict the target word. This implementation allows you to train a CBOW model on a given text corpus and learn embeddings for each word in the vocabulary.
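To make the context-predicts-target framing concrete, here is a small illustrative sketch (the variable names and window size are examples, not the repository's actual code) of how a sentence is turned into (context, target) pairs:

```python
# Illustrative only: build (context, target) pairs for CBOW with window_size = 2.
sentence = ["the", "cat", "sat", "on", "the", "mat"]
window_size = 2

pairs = []
for i, target in enumerate(sentence):
    # Context = up to window_size words on each side of the target.
    context = (sentence[max(0, i - window_size):i]
               + sentence[i + 1:i + 1 + window_size])
    pairs.append((context, target))

print(pairs[2])  # (['the', 'cat', 'on', 'the'], 'sat')
```

Each pair asks the model: given these surrounding words, which word sits in the middle?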

Implementation Details

The CBOW model in this repository is implemented in Python using NumPy. The key components of the implementation include:

  • Sentence Separation: The input text is split into sentences based on punctuation.
  • Vocabulary Creation: A vocabulary is created from the training data, mapping each word to a unique index.
  • One-Hot Encoding: Words are converted into one-hot encoded vectors.
  • Training Data Encoding: The training data is encoded into pairs of target words and their context words.
  • Weight Initialization: The weight matrices (embeddings and output weights) are initialized with random values.
  • Forward Pass: The average context vector is computed, and softmax is applied to obtain predicted probabilities.
  • Loss Calculation: The cross-entropy loss is computed.
  • Backward Pass and Weight Update: The gradients are calculated and the weights are updated using gradient descent.
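The forward pass, loss, and backward pass above can be sketched as a single training step in NumPy. This is a minimal illustration under assumed shapes and names (W_in, W_out, etc. are hypothetical, not the repository's identifiers):

```python
import numpy as np

# One CBOW training step (sketch). V = vocabulary size, D = embedding dim.
rng = np.random.default_rng(0)
V, D = 6, 4
W_in = rng.normal(0.0, 0.1, (V, D))    # embedding matrix (one row per word)
W_out = rng.normal(0.0, 0.1, (D, V))   # output weight matrix
lr = 0.01

context_ids = [0, 1, 3, 4]             # indices of the context words
target_id = 2                          # index of the target word

# Forward pass: average the context embeddings, score every word, softmax.
h = W_in[context_ids].mean(axis=0)     # (D,) average context vector
scores = h @ W_out                     # (V,)
probs = np.exp(scores - scores.max())  # shift for numerical stability
probs /= probs.sum()

# Cross-entropy loss for the true target word.
loss = -np.log(probs[target_id])

# Backward pass: softmax + cross-entropy gives (probs - one_hot) directly.
d_scores = probs.copy()
d_scores[target_id] -= 1.0             # dL/dscores
dW_out = np.outer(h, d_scores)         # (D, V)
dh = W_out @ d_scores                  # (D,) gradient w.r.t. the averaged vector

# Gradient descent update; the averaging splits dh across the context rows.
W_out -= lr * dW_out
for idx in context_ids:
    W_in[idx] -= lr * dh / len(context_ids)
```

Repeating this step over all (context, target) pairs for several epochs is what drives the loss curve shown in the Training Results section.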

Usage

To use the CBOW model, follow these steps:

  1. Initialize the Model:

    cbow = CBOW(embedding_dim=10, window_size=2, epochs=100, learning_rate=0.01)
  2. Train the Model:

    text = "This is an example text for training. This text will be used to train the CBOW model."
    cbow.train(text)
  3. Print the Learned Embeddings:

    cbow.print_embeddings()

Training Results

After training the model, you can visualize the training loss over epochs and the learned word embeddings in a 3D space. Below are placeholders for these visualizations.

Loss Over Epochs

[Training loss plot]

3D Visualization of Word Embeddings (static picture; the interactive 3D view is available only on Colab)

[Word embeddings plot]

References

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv preprint arXiv:1310.4546.
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
  • Goldberg, Y., & Levy, O. (2014). word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method. arXiv preprint arXiv:1402.3722.
