This repo contains the experiments in the paper Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models, appearing at NeurIPS 2025.
Here is a blog post explaining the main results.
Recommended: 32GB CPU RAM, 16GB GPU RAM.
- Install `uv` (if not already on your system) and run `uv sync`.
- Modify the `*PATH` variables in `data/env.sh`, then run the script to set the environment variables. Optional: paste these into your `.bashrc`.
- Run `uv run data/prepare_text8.py` (one-time setup) and `uv run data/prepare_analogies.py`.
- Run `uv run expts/example.py` to train a model.
- Run `uv run compute_cooccurrence.py text8 10000` to explicitly construct the co-occurrence statistics. Use these to construct M* and factorize it in closed form, circumventing the need for gradient descent.
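For intuition about the closed-form route above, recall the classic observation (Levy & Goldberg, 2014) that SGNS-style objectives implicitly factorize a shifted PMI matrix built from co-occurrence counts. The sketch below is illustrative only: the toy counts `C`, the shift `k`, and the rank `r` are made-up values, not the repo's actual M* construction.

```python
import numpy as np

# Toy co-occurrence counts C[i, j] for a 4-word vocabulary
# (illustrative values, not the repo's actual statistics).
C = np.array([[10., 2., 1., 0.],
              [ 2., 8., 3., 1.],
              [ 1., 3., 6., 2.],
              [ 0., 1., 2., 4.]])
C += 1e-8  # smoothing so log(0) never occurs

total = C.sum()
p_ij = C / total               # joint probabilities
p_i = C.sum(axis=1) / total    # marginals (rows)
p_j = C.sum(axis=0) / total    # marginals (columns)

k = 5  # hypothetical negative-sampling rate
# Shifted PMI matrix: M*[i, j] = log(p_ij / (p_i * p_j)) - log k
M_star = np.log(p_ij / np.outer(p_i, p_j)) - np.log(k)

# Factorize in closed form with a truncated SVD instead of gradient descent.
r = 2
U, S, Vt = np.linalg.svd(M_star)
W = U[:, :r] * np.sqrt(S[:r])  # rank-r word embeddings, shape (vocab, r)
```

The square-root split of the singular values is one common symmetric choice of factorization; other splits give the same rank-r approximation of M*.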
Code flow:
- The scripts in `data/*` are for one-time dataset download and setup.
- The files in `expts/*.py` contain the hyperparameters and bird's-eye structure of each experiment.
- The scripts in `launch/*.sh` launch the figure-generating experiments on a GPU node.
- The notebooks in `notebooks/*.ipynb` render the figures after the experiments are run and the results are saved.
- `qwem.py` contains the logic for the training loop.
- `compute_cooccurrence.py` explicitly constructs the co-occurrence matrix for a corpus.
- `utils.py` defines helper classes for handling hyperparameters, vocabulary, model evaluation, etc.
- `ExptTrace.py` and `FileManager.py` define additional helper classes.
- The directory `word2vec_tied/` contains the original SGNS implementation with tied weights. See https://github.com/tmikolov/word2vec/