A Ruby library for text classification using Bayesian, LSI (Latent Semantic Indexing), k-Nearest Neighbors (kNN), and TF-IDF algorithms.
Documentation · Tutorials · Guides
- Installation
- Bayesian Classifier
- LSI (Latent Semantic Indexing)
- k-Nearest Neighbors (kNN)
- TF-IDF Vectorizer
- Persistence
- Performance
- Development
- Contributing
- License
Add to your Gemfile:
gem 'classifier'
Then run:
bundle install
Or install directly:
gem install classifier
The gem includes a native C extension for fast LSI operations. It compiles automatically during gem installation. No external dependencies are required.
To verify the native extension is active:
require 'classifier'
puts Classifier::LSI.backend # => :native
To force pure Ruby mode (for debugging):
NATIVE_VECTOR=true ruby your_script.rb
To suppress the warning when the native extension isn't available:
SUPPRESS_LSI_WARNING=true ruby your_script.rb
| Ruby Version | Status |
|---|---|
| 4.0 | Supported |
| 3.4 | Supported |
| 3.3 | Supported |
| 3.2 | Supported |
| 3.1 | EOL (unsupported) |
Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.
require 'classifier'
classifier = Classifier::Bayes.new(:spam, :ham)
# Train with keyword arguments
classifier.train(spam: "Buy cheap viagra now! Limited offer!")
classifier.train(ham: "Meeting scheduled for tomorrow at 10am")
# Train multiple items at once
classifier.train(
spam: ["You've won a million dollars!", "Free money!!!"],
ham: ["Please review the document", "Lunch tomorrow?"]
)
# Classify new text
classifier.classify "Congratulations! You've won a prize!"
# => "Spam"- Bayes Basics Guide - In-depth documentation
- Build a Spam Filter Tutorial - Step-by-step guide
- Paul Graham: A Plan for Spam
Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.
require 'classifier'
lsi = Classifier::LSI.new
# Add documents with hash-style syntax (category => item(s))
lsi.add("Pets" => "Dogs are loyal pets that love to play fetch")
lsi.add("Pets" => "Cats are independent and love to nap")
lsi.add("Programming" => "Ruby is a dynamic programming language")
# Add multiple items with the same category
lsi.add("Programming" => ["Python is great for data science", "JavaScript runs in browsers"])
# Batch operations with multiple categories
lsi.add(
"Pets" => ["Hamsters are small furry pets", "Birds can be great companions"],
"Programming" => "Go is fast and concurrent"
)
# Classify new text
lsi.classify "My puppy loves to run around"
# => "Pets"
# Get classification with confidence score
lsi.classify_with_confidence "Learning to code in Ruby"
# => ["Programming", 0.89]# Find similar documents
lsi.find_related "Dogs are great companions", 2
# => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]
# Search by keyword
lsi.search "programming", 3
# => ["Ruby is a dynamic programming language", "Python is great for..."]LSI can extract key sentences from text:
text = "First sentence about dogs. Second about cats. Third about birds."
text.summary(2) # Extract 2 most relevant sentences
For better sentence boundary detection (handles abbreviations like "Dr.", decimals, etc.), install the optional pragmatic_segmenter gem:
gem 'pragmatic_segmenter'
- LSI Basics Guide - In-depth documentation
- Wikipedia: Latent Semantic Analysis
Instance-based classification that stores examples and classifies by finding the most similar ones. No training phase required—just add examples and classify.
- No Training Required: Uses instance-based learning—store examples and classify by similarity
- Interpretable Results: Returns neighbors that contributed to the decision
- Incremental Updates: Easy to add or remove examples without retraining
- Distance-Weighted Voting: Optional weighting by similarity score
- Built on LSI: Leverages LSI's semantic similarity for better matching
require 'classifier'
knn = Classifier::KNN.new(k: 3)
# Add labeled examples
knn.add(spam: ["Buy now! Limited offer!", "You've won a million dollars!"])
knn.add(ham: ["Meeting at 3pm tomorrow", "Please review the document"])
# Classify new text
knn.classify "Congratulations! Claim your prize!"
# => "spam"Get neighbor information for interpretable results:
result = knn.classify_with_neighbors "Free money offer"
result[:category] # => "spam"
result[:confidence] # => 0.85
result[:neighbors] # => [{item: "Buy now!...", category: "spam", similarity: 0.92}, ...]
result[:votes] # => {"spam" => 2.0, "ham" => 1.0}
Weight votes by similarity score for more accurate classification:
knn = Classifier::KNN.new(k: 5, weighted: true)
knn.add(
positive: ["Great product!", "Loved it!", "Excellent service"],
negative: ["Terrible experience", "Would not recommend"]
)
# Closer neighbors have more influence on the result
knn.classify "This was amazing!"
# => "positive"# Add more examples anytime
knn.add(neutral: "It was okay, nothing special")
# Remove examples
knn.remove_item "Buy now! Limited offer!"
# Change k value
knn.k = 7
# List all categories
knn.categories
# => ["spam", "ham", "neutral"]| Classifier | Best For |
|---|---|
| Bayes | Fast classification, any training size (stores only word counts) |
| LSI | Semantic similarity, document clustering, search |
| kNN | <1000 examples, interpretable results, incremental updates |
Why the size difference? Bayes stores aggregate statistics—adding 10,000 documents just increments counters. kNN stores every example and compares against all of them during classification, so performance degrades with size.
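A rough sketch of that trade-off, using only calls shown elsewhere in this README (the corpus here is hypothetical):
require 'classifier'

bayes = Classifier::Bayes.new(:spam, :ham)
knn   = Classifier::KNN.new(k: 3)

# Hypothetical corpus, just to illustrate scale.
corpus = Array.new(10_000) { |i| "sample newsletter number #{i}" }

# Bayes only increments per-category word counts, so memory stays roughly
# flat however many documents you train on.
corpus.each { |doc| bayes.train(ham: doc) }

# kNN stores every example and compares new text against all of them at
# classification time, so it is best kept to smaller example sets.
knn.add(ham: corpus.first(500))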
Transform text documents into TF-IDF (Term Frequency-Inverse Document Frequency) weighted feature vectors. TF-IDF downweights common words and upweights discriminative terms—the foundation for most classic text classification approaches.
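As a rough illustration of the weighting itself (illustrative only, not part of the gem's API): the idf values shown in the inspection example further down are consistent with a smoothed inverse document frequency of 1 + ln(N / (1 + df)).
# Hand computation, assuming the smoothed IDF above. In the three-document
# corpus used in the examples below, "dog" appears in exactly one document.
n_docs = 3
df_dog = 1
idf_dog = 1 + Math.log(n_docs.to_f / (1 + df_dog))
# => 1.4054651081081644, matching the 1.405 shown for :dog
The quick start below uses the gem's own API: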
require 'classifier'
tfidf = Classifier::TFIDF.new
tfidf.fit(["Dogs are great pets", "Cats are independent", "Birds can fly"])
# Transform text to TF-IDF vector (L2 normalized)
vector = tfidf.transform("Dogs are loyal")
# => {:dog=>0.7071..., :loyal=>0.7071...}
# Fit and transform in one step
vectors = tfidf.fit_transform(documents) # documents is an Array of strings
tfidf = Classifier::TFIDF.new(
min_df: 2, # Minimum document frequency (Integer or Float 0.0-1.0)
max_df: 0.95, # Maximum document frequency (filters very common terms)
ngram_range: [1, 2], # Extract unigrams and bigrams
sublinear_tf: true # Use 1 + log(tf) instead of raw term frequency
)
tfidf.fit(documents)
tfidf.vocabulary # => {:dog=>0, :cat=>1, :bird=>2, ...}
tfidf.idf # => {:dog=>1.405, :cat=>1.405, ...}
tfidf.feature_names # => [:dog, :cat, :bird, ...]
tfidf.num_documents # => 3
tfidf.fitted? # => true
# Extract bigrams only
tfidf = Classifier::TFIDF.new(ngram_range: [2, 2])
tfidf.fit(["quick brown fox", "lazy brown dog"])
tfidf.vocabulary.keys
# => [:quick_brown, :brown_fox, :lazi_brown, :brown_dog]
# Unigrams through trigrams
tfidf = Classifier::TFIDF.new(ngram_range: [1, 3])
# Save to JSON
json = tfidf.to_json
File.write("tfidf.json", json)
# Load from JSON
loaded = Classifier::TFIDF.from_json(File.read("tfidf.json"))
# Or use Marshal
data = Marshal.dump(tfidf)
loaded = Marshal.load(data)
Save and load classifiers with pluggable storage backends. Works with Bayes, LSI, and kNN classifiers.
require 'classifier'
classifier = Classifier::Bayes.new(:spam, :ham)
classifier.train(spam: "Buy now! Limited offer!")
classifier.train(ham: "Meeting tomorrow at 3pm")
# Configure storage and save
classifier.storage = Classifier::Storage::File.new(path: "spam_filter.json")
classifier.save
# Load later
loaded = Classifier::Bayes.load(storage: classifier.storage)
loaded.classify "Claim your prize now!"
# => "Spam"Create backends for Redis, PostgreSQL, S3, or any storage system:
class RedisStorage < Classifier::Storage::Base
def initialize(redis:, key:)
super()
@redis, @key = redis, key
end
def write(data) = @redis.set(@key, data)
def read = @redis.get(@key)
def delete = @redis.del(@key)
def exists? = @redis.exists?(@key)
end
# Use it
classifier.storage = RedisStorage.new(redis: Redis.new, key: "classifier:spam")
classifier.save
- Persistence Guide - Full documentation with examples
The native C extension provides dramatic speedups for LSI operations, especially build_index (SVD computation):
| Documents | build_index | Overall |
|---|---|---|
| 5 | 7x faster | 2.6x |
| 10 | 25x faster | 4.6x |
| 15 | 112x faster | 14.5x |
| 20 | 385x faster | 48.7x |
Detailed benchmark (20 documents)
Operation Pure Ruby Native C Speedup
----------------------------------------------------------
build_index 0.5540 0.0014 384.5x
classify 0.0190 0.0060 3.2x
search 0.0145 0.0037 3.9x
find_related 0.0098 0.0011 8.6x
----------------------------------------------------------
TOTAL 0.5973 0.0123 48.7x
rake benchmark # Run with current configuration
rake benchmark:compare # Compare native C vs pure Ruby
git clone https://github.com/cardmagic/classifier.git
cd classifier
bundle install
rake compile # Compile native C extension
rake test # Run all tests (compiles first)
ruby -Ilib test/bayes/bayesian_test.rb # Run specific test file
# Test with pure Ruby (no native extension)
NATIVE_VECTOR=true rake test
rake console
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -am 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Lucas Carlson - Original author - lucas@rufy.com
- David Fayram II - LSI implementation - dfayram@gmail.com
- Cameron McBride - cameron.mcbride@gmail.com
- Ivan Acosta-Rubio - ivan@softwarecriollo.com
This library is released under the GNU Lesser General Public License (LGPL) 2.1.