
Classifier

A Ruby library for text classification using Bayesian, LSI (Latent Semantic Indexing), k-Nearest Neighbors (kNN), and TF-IDF algorithms.

Installation

Add to your Gemfile:

gem 'classifier'

Then run:

bundle install

Or install directly:

gem install classifier

Native C Extension

The gem includes a native C extension for fast LSI operations. It compiles automatically during gem installation. No external dependencies are required.

To verify the native extension is active:

require 'classifier'
puts Classifier::LSI.backend  # => :native

To force pure Ruby mode (for debugging):

NATIVE_VECTOR=true ruby your_script.rb

To suppress the warning shown when the native extension isn't available:

SUPPRESS_LSI_WARNING=true ruby your_script.rb

Compatibility

Ruby Version    Status
------------    -----------------
4.0             Supported
3.4             Supported
3.3             Supported
3.2             Supported
3.1             EOL (unsupported)

Bayesian Classifier

Fast, accurate classification with modest memory requirements. Ideal for spam filtering, sentiment analysis, and content categorization.

Quick Start

require 'classifier'

classifier = Classifier::Bayes.new(:spam, :ham)

# Train with keyword arguments
classifier.train(spam: "Buy cheap viagra now! Limited offer!")
classifier.train(ham: "Meeting scheduled for tomorrow at 10am")

# Train multiple items at once
classifier.train(
  spam: ["You've won a million dollars!", "Free money!!!"],
  ham: ["Please review the document", "Lunch tomorrow?"]
)

# Classify new text
classifier.classify "Congratulations! You've won a prize!"
# => "Spam"

LSI (Latent Semantic Indexing)

Semantic analysis using Singular Value Decomposition (SVD). More flexible than Bayesian classifiers, providing search, clustering, and classification based on meaning rather than just keywords.

Quick Start

require 'classifier'

lsi = Classifier::LSI.new

# Add documents with hash-style syntax (category => item(s))
lsi.add("Pets" => "Dogs are loyal pets that love to play fetch")
lsi.add("Pets" => "Cats are independent and love to nap")
lsi.add("Programming" => "Ruby is a dynamic programming language")

# Add multiple items with the same category
lsi.add("Programming" => ["Python is great for data science", "JavaScript runs in browsers"])

# Batch operations with multiple categories
lsi.add(
  "Pets" => ["Hamsters are small furry pets", "Birds can be great companions"],
  "Programming" => "Go is fast and concurrent"
)

# Classify new text
lsi.classify "My puppy loves to run around"
# => "Pets"

# Get classification with confidence score
lsi.classify_with_confidence "Learning to code in Ruby"
# => ["Programming", 0.89]

Search and Discovery

# Find similar documents
lsi.find_related "Dogs are great companions", 2
# => ["Dogs are loyal pets that love to play fetch", "Cats are independent..."]

# Search by keyword
lsi.search "programming", 3
# => ["Ruby is a dynamic programming language", "Python is great for..."]

Text Summarization

LSI can extract key sentences from text:

text = "First sentence about dogs. Second about cats. Third about birds."
text.summary(2)  # Extract 2 most relevant sentences

For better sentence boundary detection (handles abbreviations like "Dr.", decimals, etc.), install the optional pragmatic_segmenter gem:

gem 'pragmatic_segmenter'
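
Once installed, the segmenter should be picked up automatically (an assumption based on the gem being described as optional). For example:

text = "Dr. Smith paid $4.50 for coffee. The meeting ran long. It ended well."
text.summary(1)  # "Dr." and "4.50" are no longer treated as sentence boundaries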

k-Nearest Neighbors (kNN)

Instance-based classification that stores examples and classifies by finding the most similar ones. No training phase required—just add examples and classify.

Key Features

  • No Training Required: Uses instance-based learning—store examples and classify by similarity
  • Interpretable Results: Returns neighbors that contributed to the decision
  • Incremental Updates: Easy to add or remove examples without retraining
  • Distance-Weighted Voting: Optional weighting by similarity score
  • Built on LSI: Leverages LSI's semantic similarity for better matching

Quick Start

require 'classifier'

knn = Classifier::KNN.new(k: 3)

# Add labeled examples
knn.add(spam: ["Buy now! Limited offer!", "You've won a million dollars!"])
knn.add(ham: ["Meeting at 3pm tomorrow", "Please review the document"])

# Classify new text
knn.classify "Congratulations! Claim your prize!"
# => "spam"

Detailed Classification

Get neighbor information for interpretable results:

result = knn.classify_with_neighbors "Free money offer"

result[:category]    # => "spam"
result[:confidence]  # => 0.85
result[:neighbors]   # => [{item: "Buy now!...", category: "spam", similarity: 0.92}, ...]
result[:votes]       # => {"spam" => 2.0, "ham" => 1.0}

Distance-Weighted Voting

Weight votes by similarity score for more accurate classification:

knn = Classifier::KNN.new(k: 5, weighted: true)

knn.add(
  positive: ["Great product!", "Loved it!", "Excellent service"],
  negative: ["Terrible experience", "Would not recommend"]
)

# Closer neighbors have more influence on the result
knn.classify "This was amazing!"
# => "positive"

Updating the Classifier

# Add more examples anytime
knn.add(neutral: "It was okay, nothing special")

# Remove examples
knn.remove_item "Buy now! Limited offer!"

# Change k value
knn.k = 7

# List all categories
knn.categories
# => ["spam", "ham", "neutral"]

When to Use kNN vs Bayes vs LSI

Classifier    Best For
----------    ----------------------------------------------------------------
Bayes         Fast classification, any training size (stores only word counts)
LSI           Semantic similarity, document clustering, search
kNN           <1000 examples, interpretable results, incremental updates

Why the size difference? Bayes stores aggregate statistics—adding 10,000 documents just increments counters. kNN stores every example and compares against all of them during classification, so performance degrades with size.
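
To see the trade-off concretely, here is a rough sketch using only the APIs shown above (absolute timings depend on your machine):

require 'classifier'
require 'benchmark'

bayes = Classifier::Bayes.new(:spam, :ham)
knn   = Classifier::KNN.new(k: 3)

100.times do |i|
  bayes.train(spam: "limited offer #{i} buy now", ham: "meeting notes #{i} agenda")
  knn.add(spam: "limited offer #{i} buy now", ham: "meeting notes #{i} agenda")
end

# Bayes looks up word counts, so classification time stays flat;
# kNN compares the query against all 200 stored examples.
puts Benchmark.realtime { bayes.classify "exclusive offer today" }
puts Benchmark.realtime { knn.classify "exclusive offer today" }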

TF-IDF Vectorizer

Transform text documents into TF-IDF (Term Frequency-Inverse Document Frequency) weighted feature vectors. TF-IDF downweights common words and upweights discriminative terms—the foundation for most classic text classification approaches.
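
As a worked example, one common smoothed formulation (an assumption here, chosen because it reproduces the idf values shown under Vocabulary Inspection below) is:

tfidf(t, d) = tf(t, d) · idf(t), where idf(t) = 1 + ln(N / (1 + df(t)))

With the three-document corpus from Quick Start, N = 3 and df(dog) = 1, so idf(dog) = 1 + ln(3/2) ≈ 1.405.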

Quick Start

require 'classifier'

tfidf = Classifier::TFIDF.new
tfidf.fit(["Dogs are great pets", "Cats are independent", "Birds can fly"])

# Transform text to TF-IDF vector (L2 normalized)
vector = tfidf.transform("Dogs are loyal")
# => {:dog=>0.7071..., :loyal=>0.7071...}

# Fit and transform in one step
vectors = tfidf.fit_transform(documents)
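
Why 0.7071? "are" is dropped as a stopword and "dogs" is stemmed to :dog, leaving two terms with equal weight; L2 normalization divides by the vector's Euclidean length, so each component becomes 1/√2 ≈ 0.7071.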

Options

tfidf = Classifier::TFIDF.new(
  min_df: 2,           # Minimum document frequency (Integer or Float 0.0-1.0)
  max_df: 0.95,        # Maximum document frequency (filters very common terms)
  ngram_range: [1, 2], # Extract unigrams and bigrams
  sublinear_tf: true   # Use 1 + log(tf) instead of raw term frequency
)
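
For example, min_df: 2 keeps only terms that appear in at least two documents (a small sketch; the stemmed-symbol output follows the conventions of the examples above):

tfidf = Classifier::TFIDF.new(min_df: 2)
tfidf.fit(["brown dog", "brown cat", "brown fox"])
tfidf.vocabulary.keys
# => [:brown]  # :dog, :cat, and :fox each appear in only one document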

Vocabulary Inspection

tfidf.fit(documents)

tfidf.vocabulary      # => {:dog=>0, :cat=>1, :bird=>2, ...}
tfidf.idf             # => {:dog=>1.405, :cat=>1.405, ...}
tfidf.feature_names   # => [:dog, :cat, :bird, ...]
tfidf.num_documents   # => 3
tfidf.fitted?         # => true

N-gram Support

# Extract bigrams only
tfidf = Classifier::TFIDF.new(ngram_range: [2, 2])
tfidf.fit(["quick brown fox", "lazy brown dog"])
tfidf.vocabulary.keys
# => [:quick_brown, :brown_fox, :lazi_brown, :brown_dog]

# Unigrams through trigrams
tfidf = Classifier::TFIDF.new(ngram_range: [1, 3])

Serialization

# Save to JSON
json = tfidf.to_json
File.write("tfidf.json", json)

# Load from JSON
loaded = Classifier::TFIDF.from_json(File.read("tfidf.json"))

# Or use Marshal
data = Marshal.dump(tfidf)
loaded = Marshal.load(data)

Persistence

Save and load classifiers with pluggable storage backends. Works with Bayes, LSI, and kNN classifiers.

File Storage

require 'classifier'

classifier = Classifier::Bayes.new(:spam, :ham)
classifier.train(spam: "Buy now! Limited offer!")
classifier.train(ham: "Meeting tomorrow at 3pm")

# Configure storage and save
classifier.storage = Classifier::Storage::File.new(path: "spam_filter.json")
classifier.save

# Load later
loaded = Classifier::Bayes.load(storage: classifier.storage)
loaded.classify "Claim your prize now!"
# => "Spam"

Custom Storage Backends

Create backends for Redis, PostgreSQL, S3, or any storage system:

class RedisStorage < Classifier::Storage::Base
  def initialize(redis:, key:)
    super()
    @redis, @key = redis, key
  end

  def write(data) = @redis.set(@key, data)
  def read = @redis.get(@key)
  def delete = @redis.del(@key)
  def exists? = @redis.exists?(@key)
end

# Use it
classifier.storage = RedisStorage.new(redis: Redis.new, key: "classifier:spam")
classifier.save
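
Loading mirrors the file-storage example above:

loaded = Classifier::Bayes.load(storage: RedisStorage.new(redis: Redis.new, key: "classifier:spam"))
loaded.classify "Claim your prize now!"
# => "Spam"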

Performance

Native C Extension vs Pure Ruby

The native C extension provides dramatic speedups for LSI operations, especially build_index (SVD computation):

Documents    build_index    Overall
---------    -----------    -------
5            7x faster      2.6x
10           25x faster     4.6x
15           112x faster    14.5x
20           385x faster    48.7x
Detailed benchmark (20 documents)
Operation            Pure Ruby     Native C      Speedup
----------------------------------------------------------
build_index            0.5540       0.0014       384.5x
classify               0.0190       0.0060         3.2x
search                 0.0145       0.0037         3.9x
find_related           0.0098       0.0011         8.6x
----------------------------------------------------------
TOTAL                  0.5973       0.0123        48.7x

Running Benchmarks

rake benchmark              # Run with current configuration
rake benchmark:compare      # Compare native C vs pure Ruby

Development

Setup

git clone https://github.com/cardmagic/classifier.git
cd classifier
bundle install
rake compile  # Compile native C extension

Running Tests

rake test                        # Run all tests (compiles first)
ruby -Ilib test/bayes/bayesian_test.rb  # Run specific test file

# Test with pure Ruby (no native extension)
NATIVE_VECTOR=true rake test

Console

rake console

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -am 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Authors

Lucas Carlson (lucas@rufy.com)

License

This library is released under the GNU Lesser General Public License (LGPL) 2.1.