This repository contains a Python implementation of the Continuous Bag of Words (CBOW) model for learning word embeddings. This implementation is intended for a college homework assignment.
The Continuous Bag of Words (CBOW) model is a neural network-based approach for learning word embeddings. In CBOW, the context words are used to predict the target word. This implementation allows you to train a CBOW model on a given text corpus and learn embeddings for each word in the vocabulary.
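To make the context/target relationship concrete, here is a minimal sketch (not code from this repository) of how `(context, target)` pairs are formed with a window size of 2; the function name is illustrative:

```python
# Illustration only: build (context, target) pairs for CBOW.
# For each position, the context is up to `window_size` words on each side.
def context_target_pairs(tokens, window_size=2):
    pairs = []
    for i, target in enumerate(tokens):
        context = (tokens[max(0, i - window_size):i]
                   + tokens[i + 1:i + 1 + window_size])
        pairs.append((context, target))
    return pairs

tokens = "the quick brown fox".split()
for context, target in context_target_pairs(tokens):
    print(context, "->", target)
# e.g. ['quick', 'brown'] -> the
```

The model then learns to predict each target word from the average of its context words' embeddings.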
The CBOW model in this repository is implemented in Python using NumPy. The key components of the implementation include:
- Sentence Separation: The input text is split into sentences based on punctuation.
- Vocabulary Creation: A vocabulary is created from the training data, mapping each word to a unique index.
- One-Hot Encoding: Words are converted into one-hot encoded vectors.
- Training Data Encoding: The training data is encoded into pairs of target words and their context words.
- Weight Initialization: The weight matrices (embeddings and output weights) are initialized with random values.
- Forward Pass: The average context vector is computed, and softmax is applied to obtain predicted probabilities.
- Loss Calculation: The cross-entropy loss is computed.
- Backward Pass and Weight Update: The gradients are calculated and the weights are updated using gradient descent.
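The forward pass, loss, and backward pass above can be sketched in NumPy as follows. This is a simplified single-example illustration, not the repository's actual code; the names `W_in`, `W_out`, and the tiny vocabulary are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 3
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))   # embedding matrix
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))  # output weights
learning_rate = 0.01

context_ids = [0, 2]  # indices of the context words
target_id = 1         # index of the target word

# Forward pass: average the context embeddings, then apply softmax.
h = W_in[context_ids].mean(axis=0)        # (embedding_dim,)
scores = h @ W_out                        # (vocab_size,)
exp = np.exp(scores - scores.max())       # shift for numerical stability
probs = exp / exp.sum()

# Cross-entropy loss for the true target word.
loss = -np.log(probs[target_id])

# Backward pass: for softmax + cross-entropy, d(loss)/d(scores) = probs - one_hot.
grad_scores = probs.copy()
grad_scores[target_id] -= 1.0
grad_W_out = np.outer(h, grad_scores)     # gradient for the output weights
grad_h = W_out @ grad_scores              # gradient flowing back to the hidden layer

# Gradient descent update; the averaged gradient is shared among context rows.
W_out -= learning_rate * grad_W_out
for idx in context_ids:
    W_in[idx] -= learning_rate * grad_h / len(context_ids)
```

In the real implementation this loop runs over every encoded training pair for the configured number of epochs.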
To use the CBOW model, follow these steps:

- Initialize the model:

  ```python
  cbow = CBOW(embedding_dim=10, window_size=2, epochs=100, learning_rate=0.01)
  ```

- Train the model:

  ```python
  text = "This is an example text for training. This text will be used to train the CBOW model."
  cbow.train(text)
  ```

- Print the learned embeddings:

  ```python
  cbow.print_embeddings()
  ```
After training the model, you can visualize the training loss over epochs and the learned word embeddings in a 3D space. Below are placeholders for these visualizations.
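Since the learned embeddings typically have more than three dimensions, plotting them in 3D requires a projection first. One common approach, sketched below with plain NumPy, is PCA via the singular value decomposition; the `embeddings` matrix here is random stand-in data, not actual trained output:

```python
import numpy as np

# Sketch: reduce a (vocab_size, embedding_dim) embedding matrix to 3
# dimensions with PCA (computed via SVD) for 3D visualization.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(8, 10))  # stand-in for trained embeddings

# Center the data, then project onto the top 3 principal components.
centered = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_3d = centered @ Vt[:3].T        # (vocab_size, 3) plotting coordinates
```

The resulting `coords_3d` rows can then be passed to a 3D scatter plot (e.g. matplotlib's `Axes3D`), one point per vocabulary word.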

