This repo contains the experiments in the paper Closed-Form Training Dynamics Reveal Learned Features and Linear Structure in Word2Vec-like Models, appearing at NeurIPS 2025.
Here is a blog post explaining the main results.
Recommended: 32GB CPU RAM, 16GB GPU RAM.
- Install `uv` (if not already on your system) and run `uv sync`.
- Modify the `*PATH` variables in `data/env.sh`, then run the script to set the environment variables. Optional: paste these into your `.bashrc`.
- Run `uv run data/prepare_text8.py` (one-time setup) and `uv run data/prepare_analogies.py`.
- Run `uv run expts/example.py` to train a model.
- Run `uv run compute_cooccurrence.py text8 10000` to explicitly construct the co-occurrence statistics. Use these to construct M* and factorize it in closed form, circumventing the need for gradient descent.
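For intuition about the closed-form route above, recall the classic observation (Levy & Goldberg, 2014) that SGNS-style objectives implicitly factorize a shifted PMI matrix built from co-occurrence counts. The sketch below is illustrative only: the toy counts `C`, the shift `k`, and the rank `r` are made-up values, not the repo's actual M* construction.

```python
import numpy as np

# Toy co-occurrence counts C[i, j] for a 4-word vocabulary
# (illustrative values, not the repo's actual statistics).
C = np.array([[10., 2., 1., 0.],
              [ 2., 8., 3., 1.],
              [ 1., 3., 6., 2.],
              [ 0., 1., 2., 4.]])
C += 1e-8  # smoothing so log(0) never occurs

total = C.sum()
p_ij = C / total               # joint probabilities
p_i = C.sum(axis=1) / total    # marginals (rows)
p_j = C.sum(axis=0) / total    # marginals (columns)

k = 5  # hypothetical negative-sampling rate
# Shifted PMI matrix: M*[i, j] = log(p_ij / (p_i * p_j)) - log k
M_star = np.log(p_ij / np.outer(p_i, p_j)) - np.log(k)

# Factorize in closed form with a truncated SVD instead of gradient descent.
r = 2
U, S, Vt = np.linalg.svd(M_star)
W = U[:, :r] * np.sqrt(S[:r])  # rank-r word embeddings, shape (vocab, r)
```

The square-root split of the singular values is one common symmetric choice of factorization; other splits give the same rank-r approximation of M*.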
Code flow:
- The scripts in `data/*` are for one-time dataset download and setup.
- The files in `expts/*.py` contain the hyperparameters and bird's-eye structure of each experiment.
- The scripts in `launch/*.sh` launch the figure-generating experiments on a GPU node.
- The notebooks in `notebooks/*.ipynb` render the figures after the experiments are run and the results are saved.
- `qwem.py` contains the logic for the training loop.
- `compute_cooccurrence.py` explicitly constructs the co-occurrence matrix for a corpus.
- `utils.py` defines helper classes for handling hyperparameters, vocabulary, model evaluation, etc.
- `ExptTrace.py` and `FileManager.py` define additional helper classes.
- The directory `word2vec_tied/` contains the original SGNS implementation with tied weights. See https://github.com/tmikolov/word2vec/