This repository contains a Python implementation of the Continuous Bag of Words (CBOW) model for learning word embeddings. This implementation is intended for a college homework assignment.
The Continuous Bag of Words (CBOW) model is a neural network-based approach for learning word embeddings. In CBOW, the context words are used to predict the target word. This implementation allows you to train a CBOW model on a given text corpus and learn embeddings for each word in the vocabulary.
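To make the context/target relationship concrete, here is a minimal sketch (not code from this repository) of how `(context, target)` pairs are formed with a window size of 2; the function name is illustrative:

```python
# Illustration only: build (context, target) pairs for CBOW.
# For each position, the context is up to `window_size` words on each side.
def context_target_pairs(tokens, window_size=2):
    pairs = []
    for i, target in enumerate(tokens):
        context = (tokens[max(0, i - window_size):i]
                   + tokens[i + 1:i + 1 + window_size])
        pairs.append((context, target))
    return pairs

tokens = "the quick brown fox".split()
for context, target in context_target_pairs(tokens):
    print(context, "->", target)
# e.g. ['quick', 'brown'] -> the
```

The model then learns to predict each target word from the average of its context words' embeddings.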
The CBOW model in this repository is implemented in Python using NumPy. The key components of the implementation include:
- Sentence Separation: The input text is split into sentences based on punctuation.
- Vocabulary Creation: A vocabulary is created from the training data, mapping each word to a unique index.
- One-Hot Encoding: Words are converted into one-hot encoded vectors.
- Training Data Encoding: The training data is encoded into pairs of target words and their context words.
- Weight Initialization: The weight matrices (embeddings and output weights) are initialized with random values.
- Forward Pass: The average context vector is computed, and softmax is applied to obtain predicted probabilities.
- Loss Calculation: The cross-entropy loss is computed.
- Backward Pass and Weight Update: The gradients are calculated and the weights are updated using gradient descent.
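The forward pass, loss, and backward pass above can be sketched in NumPy as follows. This is a simplified single-example illustration, not the repository's actual code; the names `W_in`, `W_out`, and the tiny vocabulary are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 3
W_in = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))   # embedding matrix
W_out = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))  # output weights
learning_rate = 0.01

context_ids = [0, 2]  # indices of the context words
target_id = 1         # index of the target word

# Forward pass: average the context embeddings, then apply softmax.
h = W_in[context_ids].mean(axis=0)        # (embedding_dim,)
scores = h @ W_out                        # (vocab_size,)
exp = np.exp(scores - scores.max())       # shift for numerical stability
probs = exp / exp.sum()

# Cross-entropy loss for the true target word.
loss = -np.log(probs[target_id])

# Backward pass: for softmax + cross-entropy, d(loss)/d(scores) = probs - one_hot.
grad_scores = probs.copy()
grad_scores[target_id] -= 1.0
grad_W_out = np.outer(h, grad_scores)     # gradient for the output weights
grad_h = W_out @ grad_scores              # gradient flowing back to the hidden layer

# Gradient descent update; the averaged gradient is shared among context rows.
W_out -= learning_rate * grad_W_out
for idx in context_ids:
    W_in[idx] -= learning_rate * grad_h / len(context_ids)
```

In the real implementation this loop runs over every encoded training pair for the configured number of epochs.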
To use the CBOW model, follow these steps:

- Initialize the model:

  ```python
  cbow = CBOW(embedding_dim=10, window_size=2, epochs=100, learning_rate=0.01)
  ```

- Train the model:

  ```python
  text = "This is an example text for training. This text will be used to train the CBOW model."
  cbow.train(text)
  ```

- Print the learned embeddings:

  ```python
  cbow.print_embeddings()
  ```
After training the model, you can visualize the training loss over epochs and the learned word embeddings in a 3D space. Below are placeholders for these visualizations.
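Since the learned embeddings typically have more than three dimensions, plotting them in 3D requires a projection first. One common approach, sketched below with plain NumPy, is PCA via the singular value decomposition; the `embeddings` matrix here is random stand-in data, not actual trained output:

```python
import numpy as np

# Sketch: reduce a (vocab_size, embedding_dim) embedding matrix to 3
# dimensions with PCA (computed via SVD) for 3D visualization.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(8, 10))  # stand-in for trained embeddings

# Center the data, then project onto the top 3 principal components.
centered = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_3d = centered @ Vt[:3].T        # (vocab_size, 3) plotting coordinates
```

The resulting `coords_3d` rows can then be passed to a 3D scatter plot (e.g. matplotlib's `Axes3D`), one point per vocabulary word.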

