
Classifier

A Ruby library for text classification using Bayesian, LSI (Latent Semantic Indexing), k-Nearest Neighbors (kNN), and TF-IDF algorithms.

Installation

Add to your Gemfile:

gem 'classifier'

Then run:

bundle install

Or install directly:

gem install classifier

Native C Extension

The gem includes a native C extension for fast LSI operations. It compiles automatically during gem installation. No external dependencies are required.

To verify the native extension is active:

require 'classifier'
puts Classifier::LSI.backend  # => :native

To force pure Ruby mode (for debugging):

NATIVE_VECTOR=true ruby your_script.rb

To suppress the warning shown when the native extension isn't available:

SUPPRESS_LSI_WARNING=true ruby your_script.rb

Compatibility

Ruby Version    Status
------------    -----------------
4.0             Supported
3.4             Supported
3.3             Supported
3.2             Supported
3.1             EOL (unsupported)

Bayesian Classifier

Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.

Quick Start

require 'classifier'

classifier = Classifier::Bayes.new(:spam, :ham)

# Train with keyword arguments
classifier.train(spam: "Buy cheap viagra now! Limited offer!")
classifier.train(ham: "Meeting scheduled for tomorrow at 10am")

# Train multiple items at once
classifier.train(
  spam: ["You've won a million dollars!", "Free money!!!"],
  ham: ["Please review the document", "Lunch tomorrow?"]
)

# Classify new text
classifier.classify "Congratulations! You've won a prize!"
# => "Spam"

LSI (Latent Semantic Indexing)

Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.

Quick Start

require 'classifier'

lsi = Classifier::LSI.new

# Add documents with hash-style syntax (category => item(s))
lsi.add("Pets" => "Dogs are loyal pets that love to play fetch")
lsi.add("Pets" => "Cats are independent and love to nap")
lsi.add("Programming" => "Ruby is a dynamic programming language")

# Add multiple items with the same category
lsi.add("Programming" => ["Python is great for data science", "JavaScript runs in browsers"])

# Batch operations with multiple categories
lsi.add(
  "Pets" => ["Hamsters are small furry pets", "Birds can be great companions"],
  "Programming" => "Go is fast and concurrent"
)

# Classify new text
lsi.classify "My puppy loves to run around"
# => "Pets"

# Get classification with confidence score
lsi.classify_with_confidence "Learning to code in Ruby"
# => ["Programming", 0.89]

Search and Discovery

# Find similar documents
lsi.find_related "Dogs are great companions", 2
# => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]

# Search by keyword
lsi.search "programming", 3
# => ["Ruby is a dynamic programming language", "Python is great for..."]

Text Summarization

LSI can extract key sentences from text:

text = "First sentence about dogs. Second about cats. Third about birds."
text.summary(2)  # Extract 2 most relevant sentences

For better sentence boundary detection (handles abbreviations like "Dr.", decimals, etc.), install the optional pragmatic_segmenter gem:

gem 'pragmatic_segmenter'
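
Once installed, the segmenter should be picked up automatically (an assumption based on the gem being described as optional). For example:

text = "Dr. Smith paid $4.50 for coffee. The meeting ran long. It ended well."
text.summary(1)  # "Dr." and "4.50" are no longer treated as sentence boundaries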

k-Nearest Neighbors (kNN)

Instance-based classification that stores examples and classifies by finding the most similar ones. No training phase required—just add examples and classify.

Key Features

  • No Training Required: Uses instance-based learning—store examples and classify by similarity
  • Interpretable Results: Returns neighbors that contributed to the decision
  • Incremental Updates: Easy to add or remove examples without retraining
  • Distance-Weighted Voting: Optional weighting by similarity score
  • Built on LSI: Leverages LSI's semantic similarity for better matching

Quick Start

require 'classifier'

knn = Classifier::KNN.new(k: 3)

# Add labeled examples
knn.add(spam: ["Buy now! Limited offer!", "You've won a million dollars!"])
knn.add(ham: ["Meeting at 3pm tomorrow", "Please review the document"])

# Classify new text
knn.classify "Congratulations! Claim your prize!"
# => "spam"

Detailed Classification

Get neighbor information for interpretable results:

result = knn.classify_with_neighbors "Free money offer"

result[:category]    # => "spam"
result[:confidence]  # => 0.85
result[:neighbors]   # => [{item: "Buy now!...", category: "spam", similarity: 0.92}, ...]
result[:votes]       # => {"spam" => 2.0, "ham" => 1.0}

Distance-Weighted Voting

Weight votes by similarity score for more accurate classification:

knn = Classifier::KNN.new(k: 5, weighted: true)

knn.add(
  positive: ["Great product!", "Loved it!", "Excellent service"],
  negative: ["Terrible experience", "Would not recommend"]
)

# Closer neighbors have more influence on the result
knn.classify "This was amazing!"
# => "positive"

Updating the Classifier

# Add more examples anytime
knn.add(neutral: "It was okay, nothing special")

# Remove examples
knn.remove_item "Buy now! Limited offer!"

# Change k value
knn.k = 7

# List all categories
knn.categories
# => ["spam", "ham", "neutral"]

When to Use kNN vs Bayes vs LSI

Classifier    Best For
----------    ----------------------------------------------------------------
Bayes         Fast classification, any training size (stores only word counts)
LSI           Semantic similarity, document clustering, search
kNN           <1000 examples, interpretable results, incremental updates

Why the size difference? Bayes stores aggregate statistics—adding 10,000 documents just increments counters. kNN stores every example and compares against all of them during classification, so performance degrades with size.
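
To see the trade-off concretely, here is a rough sketch using only the APIs shown above (absolute timings depend on your machine):

require 'classifier'
require 'benchmark'

bayes = Classifier::Bayes.new(:spam, :ham)
knn   = Classifier::KNN.new(k: 3)

100.times do |i|
  bayes.train(spam: "limited offer #{i} buy now", ham: "meeting notes #{i} agenda")
  knn.add(spam: "limited offer #{i} buy now", ham: "meeting notes #{i} agenda")
end

# Bayes looks up word counts, so classification time stays flat;
# kNN compares the query against all 200 stored examples.
puts Benchmark.realtime { bayes.classify "exclusive offer today" }
puts Benchmark.realtime { knn.classify "exclusive offer today" }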

TF-IDF Vectorizer

Transform text documents into TF-IDF (Term Frequency-Inverse Document Frequency) weighted feature vectors. TF-IDF downweights common words and upweights discriminative terms—the foundation for most classic text classification approaches.
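
As a worked example, one common smoothed formulation (an assumption here, chosen because it reproduces the idf values shown under Vocabulary Inspection below) is:

tfidf(t, d) = tf(t, d) · idf(t), where idf(t) = 1 + ln(N / (1 + df(t)))

With the three-document corpus from Quick Start, N = 3 and df(dog) = 1, so idf(dog) = 1 + ln(3/2) ≈ 1.405.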

Quick Start

require 'classifier'

tfidf = Classifier::TFIDF.new
tfidf.fit(["Dogs are great pets", "Cats are independent", "Birds can fly"])

# Transform text to TF-IDF vector (L2 normalized)
vector = tfidf.transform("Dogs are loyal")
# => {:dog=>0.7071..., :loyal=>0.7071...}

# Fit and transform in one step
vectors = tfidf.fit_transform(documents)
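
Why 0.7071? "are" is dropped as a stopword and "dogs" is stemmed to :dog, leaving two terms with equal weight; L2 normalization divides by the vector's Euclidean length, so each component becomes 1/√2 ≈ 0.7071.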

Options

tfidf = Classifier::TFIDF.new(
  min_df: 2,           # Minimum document frequency (Integer or Float 0.0-1.0)
  max_df: 0.95,        # Maximum document frequency (filters very common terms)
  ngram_range: [1, 2], # Extract unigrams and bigrams
  sublinear_tf: true   # Use 1 + log(tf) instead of raw term frequency
)
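
For example, min_df: 2 keeps only terms that appear in at least two documents (a small sketch; the stemmed-symbol output follows the conventions of the examples above):

tfidf = Classifier::TFIDF.new(min_df: 2)
tfidf.fit(["brown dog", "brown cat", "brown fox"])
tfidf.vocabulary.keys
# => [:brown]  # :dog, :cat, and :fox each appear in only one document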

Vocabulary Inspection

tfidf.fit(documents)

tfidf.vocabulary      # => {:dog=>0, :cat=>1, :bird=>2, ...}
tfidf.idf             # => {:dog=>1.405, :cat=>1.405, ...}
tfidf.feature_names   # => [:dog, :cat, :bird, ...]
tfidf.num_documents   # => 3
tfidf.fitted?         # => true

N-gram Support

# Extract bigrams only
tfidf = Classifier::TFIDF.new(ngram_range: [2, 2])
tfidf.fit(["quick brown fox", "lazy brown dog"])
tfidf.vocabulary.keys
# => [:quick_brown, :brown_fox, :lazi_brown, :brown_dog]

# Unigrams through trigrams
tfidf = Classifier::TFIDF.new(ngram_range: [1, 3])

Serialization

# Save to JSON
json = tfidf.to_json
File.write("tfidf.json", json)

# Load from JSON
loaded = Classifier::TFIDF.from_json(File.read("tfidf.json"))

# Or use Marshal
data = Marshal.dump(tfidf)
loaded = Marshal.load(data)

Persistence

Save and load classifiers with pluggable storage backends. Works with Bayes, LSI, and kNN classifiers.

File Storage

require 'classifier'

classifier = Classifier::Bayes.new(:spam, :ham)
classifier.train(spam: "Buy now! Limited offer!")
classifier.train(ham: "Meeting tomorrow at 3pm")

# Configure storage and save
classifier.storage = Classifier::Storage::File.new(path: "spam_filter.json")
classifier.save

# Load later
loaded = Classifier::Bayes.load(storage: classifier.storage)
loaded.classify "Claim your prize now!"
# => "Spam"

Custom Storage Backends

Create backends for Redis, PostgreSQL, S3, or any storage system:

class RedisStorage < Classifier::Storage::Base
  def initialize(redis:, key:)
    super()
    @redis, @key = redis, key
  end

  def write(data) = @redis.set(@key, data)
  def read = @redis.get(@key)
  def delete = @redis.del(@key)
  def exists? = @redis.exists?(@key)
end

# Use it
classifier.storage = RedisStorage.new(redis: Redis.new, key: "classifier:spam")
classifier.save
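
Loading mirrors the file-storage example above:

loaded = Classifier::Bayes.load(storage: RedisStorage.new(redis: Redis.new, key: "classifier:spam"))
loaded.classify "Claim your prize now!"
# => "Spam"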

Performance

Native C Extension vs Pure Ruby

The native C extension provides dramatic speedups for LSI operations, especially build_index (SVD computation):

Documents    build_index    Overall
---------    -----------    -------
5            7x faster      2.6x
10           25x faster     4.6x
15           112x faster    14.5x
20           385x faster    48.7x
Detailed benchmark (20 documents)
Operation            Pure Ruby     Native C      Speedup
----------------------------------------------------------
build_index            0.5540       0.0014       384.5x
classify               0.0190       0.0060         3.2x
search                 0.0145       0.0037         3.9x
find_related           0.0098       0.0011         8.6x
----------------------------------------------------------
TOTAL                  0.5973       0.0123        48.7x

Running Benchmarks

rake benchmark              # Run with current configuration
rake benchmark:compare      # Compare native C vs pure Ruby

Development

Setup

git clone https://github.com/cardmagic/classifier.git
cd classifier
bundle install
rake compile  # Compile native C extension

Running Tests

rake test                        # Run all tests (compiles first)
ruby -Ilib test/bayes/bayesian_test.rb  # Run specific test file

# Test with pure Ruby (no native extension)
NATIVE_VECTOR=true rake test

Console

rake console

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -am 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Authors

Lucas Carlson (lucas@rufy.com)

License

This library is released under the GNU Lesser General Public License (LGPL) 2.1.