
feat: Add Word2Vec/word embeddings support for better semantic similarity #106

@cardmagic

Description


Problem

LSI classification fails when test documents have no vocabulary overlap with training data. This is a fundamental limitation of bag-of-words approaches.

Example: Classifying poetry by genre fails because:

  • "Deep into that darkness peering, fearing" → stems to: dark, deep, fear, peer
  • "Once upon a midnight dreary, weary" → stems to: dreari, midnight, weari
  • Zero overlap → similarity score = 0.0 → random classification

Proposed Solution

Add optional Word2Vec/word embeddings support that can find semantic similarity even without exact word matches:

  • "fear" and "dread" → similar vectors → high similarity
  • "darkness" and "midnight" → related concepts → some similarity

Implementation Options

Option 1: word2vec-rb gem (recommended)

  • Ruby gem with C extensions wrapping Google's word2vec
  • Supports loading pre-trained models AND training custom ones
  • Last updated May 2022, MIT licensed
  • https://github.com/madcato/word2vec-rb

Option 2: Pre-trained embeddings only

  • Download GloVe or Word2Vec pre-trained vectors
  • Load on demand, cache in memory
  • Simpler, but requires large file downloads (~1GB for GloVe); see the loading sketch below
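
For Option 2, loading the plain-text GloVe format needs no native extension: each line holds a word followed by its space-separated float components. The sketch below is only illustrative; the EmbeddingStore class name and the example file path are assumptions, not part of any existing gem.

# A minimal sketch of Option 2, assuming the plain-text GloVe format
# (one word followed by its space-separated float components per line).
# EmbeddingStore is a hypothetical name, not an existing Classifier API.
class EmbeddingStore
  attr_reader :dimensions

  def initialize
    @vectors = {}
    @dimensions = nil
  end

  # Read vectors from a GloVe-style text file and cache them in memory.
  def load(path)
    File.foreach(path) do |line|
      parts = line.split(' ')
      word = parts.shift
      vector = parts.map(&:to_f)
      @dimensions ||= vector.length
      @vectors[word] = vector
    end
    self
  end

  def [](word)
    @vectors[word.downcase]
  end

  def include?(word)
    @vectors.key?(word.downcase)
  end
end

# store = EmbeddingStore.new.load("glove.6B.100d.txt")
# store["midnight"]  # => [0.123, -0.456, ...] (100 floats)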

Proposed API

# Option A: New LSI backend
lsi = Classifier::LSI.new(similarity: :word2vec)
lsi.load_embeddings("path/to/vectors.bin")  # or download pre-trained

# Option B: New classifier type
w2v = Classifier::Word2Vec.new
w2v.load_embeddings("glove.6B.100d.txt")
w2v.add("gothic" => ["darkness", "midnight", "fear"])
w2v.classify("shadows and dread")  # Works even without exact matches

Document Similarity with Word Embeddings

Convert documents to vectors by averaging word embeddings (a sketch follows the steps below):

  1. Tokenize document → words
  2. Look up each word's embedding vector
  3. Average all vectors → document vector
  4. Compare document vectors with cosine similarity
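
A rough rendering of these four steps in plain Ruby, reusing the EmbeddingStore sketched earlier; document_vector and cosine_similarity are hypothetical helper names, not existing Classifier APIs.

# 1. Tokenize; 2. look up each word's embedding, skipping unknown words;
# 3. average the vectors element-wise into a single document vector.
def document_vector(text, store)
  vectors = text.downcase.scan(/[a-z']+/).filter_map { |word| store[word] }
  return nil if vectors.empty?

  vectors.transpose.map { |column| column.sum / vectors.length }
end

# 4. Compare two document vectors with cosine similarity.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# cosine_similarity(document_vector("shadows and dread", store),
#                   document_vector("darkness, midnight, fear", store))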

Considerations

  • Optional dependency: word2vec-rb should be optional, not required
  • Memory usage: Word embeddings are large (~100-300 dimensions × vocabulary size)
  • Pre-trained models: Provide a helper to download common models (GloVe, Word2Vec)
  • Fallback: Gracefully fall back to bag-of-words if embeddings are unavailable (see the sketch below)
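
One way the optional dependency and fallback could be wired up; the require name for word2vec-rb, the EMBEDDINGS_AVAILABLE constant, and similarity_backend are all assumptions for illustration, not existing Classifier APIs.

# Optional dependency: attempt to load word2vec-rb, but keep working without it.
begin
  require 'word2vec'           # require name assumed for the word2vec-rb gem
  EMBEDDINGS_AVAILABLE = true
rescue LoadError
  EMBEDDINGS_AVAILABLE = false
end

# Pick the similarity backend, falling back to bag-of-words when embeddings
# are unavailable.
def similarity_backend(options = {})
  if options[:similarity] == :word2vec && !EMBEDDINGS_AVAILABLE
    warn "word2vec-rb not installed; falling back to bag-of-words similarity"
    return :bag_of_words
  end
  options.fetch(:similarity, :bag_of_words)
end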
