Open
Description
Problem
LSI classification fails when test documents have no vocabulary overlap with training data. This is a fundamental limitation of bag-of-words approaches.
Example: Classifying poetry by genre fails because:
- "Deep into that darkness peering, fearing" → stems to: dark, deep, fear, peer
- "Once upon a midnight dreary, weary" → stems to: dreari, midnight, weari
- Zero overlap → similarity score = 0.0 → random classification
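A minimal sketch of why the score collapses to zero, using plain bag-of-words cosine similarity over the stems listed above. This is not the gem's LSI code; `term_counts` and `cosine` are hypothetical helpers written only for this illustration.

```ruby
# Term-frequency vector as a hash: stem => count.
def term_counts(stems)
  stems.tally
end

# Cosine similarity between two sparse count vectors.
def cosine(a, b)
  dot = a.sum { |term, count| count * b.fetch(term, 0) }
  norm_a = Math.sqrt(a.values.sum { |c| c * c })
  norm_b = Math.sqrt(b.values.sum { |c| c * c })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

line_a = term_counts(%w[dark deep fear peer])
line_b = term_counts(%w[dreari midnight weari])

# No shared stems, so every term of the dot product is zero.
puts cosine(line_a, line_b)
```

With no shared terms the dot product is zero no matter how the counts are weighted, so every category ends up tied.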
Proposed Solution
Add optional Word2Vec/word embeddings support that can find semantic similarity even without exact word matches:
- "fear" and "dread" → similar vectors → high similarity
- "darkness" and "midnight" → related concepts → some similarity
Implementation Options
Option 1: word2vec-rb gem (recommended)
- Ruby gem with C extensions wrapping Google's word2vec
- Supports loading pre-trained models AND training custom ones
- Last updated May 2022, MIT licensed
- https://github.com/madcato/word2vec-rb
Option 2: Pre-trained embeddings only
- Download GloVe or Word2Vec pre-trained vectors
- Load on demand, cache in memory
- Simpler but requires large file downloads (~1GB for GloVe)
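For Option 2, loading could look roughly like the sketch below. It assumes the GloVe text layout (one word per line, followed by its space-separated vector components). `parse_embeddings` is a hypothetical helper, not an existing API, and the sample vectors are made up.

```ruby
require "stringio"

# Parse GloVe-style text: each line is a word followed by its vector components.
# Blank or malformed lines (no components) are skipped.
def parse_embeddings(io)
  io.each_line.with_object({}) do |line, table|
    word, *values = line.split(" ")
    next if word.nil? || values.empty?

    table[word] = values.map { |v| Float(v) }
  end
end

# Tiny made-up sample; a real file would be read with File.open(path).
sample = StringIO.new(<<~VECTORS)
  fear 0.9 0.1 0.0
  dread 0.8 0.2 0.1
VECTORS

embeddings = parse_embeddings(sample)
puts embeddings["dread"].inspect
```

The resulting hash can be kept in an instance variable after the first load, which covers the "cache in memory" part.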
Proposed API
# Option A: New LSI backend
lsi = Classifier::LSI.new(similarity: :word2vec)
lsi.load_embeddings("path/to/vectors.bin") # or download pre-trained
# Option B: New classifier type
w2v = Classifier::Word2Vec.new
w2v.load_embeddings("glove.6B.100d.txt")
w2v.add("gothic" => ["darkness", "midnight", "fear"])
w2v.classify("shadows and dread") # Works even without exact matches
Document Similarity with Word Embeddings
Convert documents to vectors by averaging word embeddings:
- Tokenize document → words
- Look up each word's embedding vector
- Average all vectors → document vector
- Compare document vectors with cosine similarity
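The four steps above can be sketched as follows. The tokenizer is a deliberately simple regex, words without an embedding are skipped (an assumption, since the issue does not say how to treat them), and the 2-dimensional vectors are invented for illustration. `document_vector` and `cosine_similarity` are hypothetical names.

```ruby
# Steps 1-3: tokenize, look up each word, average the vectors found.
# Returns nil when no word in the text has an embedding.
def document_vector(text, embeddings)
  vectors = text.downcase.scan(/[a-z']+/).filter_map { |word| embeddings[word] }
  return nil if vectors.empty?

  vectors.transpose.map { |column| column.sum / vectors.size }
end

# Step 4: cosine similarity between two dense vectors of equal length.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Made-up 2-dimensional vectors, not taken from any real model.
toy = {
  "darkness" => [1.0, 0.0],
  "midnight" => [0.8, 0.2],
  "fear"     => [0.6, 0.4],
  "dread"    => [0.7, 0.3],
  "shadows"  => [0.9, 0.1],
  "sunshine" => [0.0, 1.0]
}

gothic = document_vector("darkness midnight fear", toy)
query  = document_vector("shadows and dread", toy)
puts cosine_similarity(gothic, query)
```

The query shares no words with the "gothic" training text, yet the two averaged vectors point in nearly the same direction, which is the behaviour the proposal is after.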
Considerations
- Optional dependency: word2vec-rb should be optional, not required
- Memory usage: Word embeddings are large (vocabulary size × 100-300 dimensions, one float per dimension; e.g. 400,000 words × 100 dimensions = 40 million floats for glove.6B.100d)
- Pre-trained models: Provide helper to download common models (GloVe, Word2Vec)
- Fallback: Gracefully fall back to bag-of-words if embeddings unavailable
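One possible shape for the optional dependency plus fallback, as a sketch only. The require name `"word2vec"` is an assumption about word2vec-rb and should be checked against the gem's README. `optional_require` and `similarity_backend` are hypothetical helpers.

```ruby
# Try to load an optional library and report whether it is available.
def optional_require(name)
  require name
  true
rescue LoadError
  false
end

# Pick the embeddings backend when the library loads, otherwise bag-of-words.
# The default "word2vec" is an assumed require name, not a confirmed one.
def similarity_backend(feature = "word2vec")
  optional_require(feature) ? :word2vec : :bag_of_words
end

puts similarity_backend("no_such_embedding_library")
```

Rescuing only `LoadError` keeps the gem out of the hard dependencies while still surfacing any other error raised during loading.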