Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 73 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
[![CI](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml/badge.svg)](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
[![License: LGPL](https://img.shields.io/badge/License-LGPL_2.1-blue.svg)](https://opensource.org/licenses/LGPL-2.1)

A Ruby library for text classification using Bayesian, LSI (Latent Semantic Indexing), and k-Nearest Neighbors (kNN) algorithms.
A Ruby library for text classification using Bayesian, LSI (Latent Semantic Indexing), k-Nearest Neighbors (kNN), and TF-IDF algorithms.

**[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[Guides](https://rubyclassifier.com/docs/guides)**

Expand All @@ -14,6 +14,7 @@ A Ruby library for text classification using Bayesian, LSI (Latent Semantic Inde
- [Bayesian Classifier](#bayesian-classifier)
- [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing)
- [k-Nearest Neighbors (kNN)](#k-nearest-neighbors-knn)
- [TF-IDF Vectorizer](#tf-idf-vectorizer)
- [Persistence](#persistence)
- [Performance](#performance)
- [Development](#development)
Expand Down Expand Up @@ -256,6 +257,77 @@ knn.categories

**Why the size difference?** Bayes stores aggregate statistics—adding 10,000 documents just increments counters. kNN stores every example and compares against all of them during classification, so performance degrades with size.

## TF-IDF Vectorizer

Transform text documents into TF-IDF (Term Frequency-Inverse Document Frequency) weighted feature vectors. TF-IDF downweights common words and upweights discriminative terms—the foundation for most classic text classification approaches.

### Quick Start

```ruby
require 'classifier'

tfidf = Classifier::TFIDF.new
tfidf.fit(["Dogs are great pets", "Cats are independent", "Birds can fly"])

# Transform text to TF-IDF vector (L2 normalized)
vector = tfidf.transform("Dogs are loyal")
# => {:dog=>0.7071..., :loyal=>0.7071...}

# Fit and transform in one step
vectors = tfidf.fit_transform(documents)
```

### Options

```ruby
tfidf = Classifier::TFIDF.new(
min_df: 2, # Minimum document frequency (Integer or Float 0.0-1.0)
max_df: 0.95, # Maximum document frequency (filters very common terms)
ngram_range: [1, 2], # Extract unigrams and bigrams
sublinear_tf: true # Use 1 + log(tf) instead of raw term frequency
)
```

### Vocabulary Inspection

```ruby
tfidf.fit(documents)

tfidf.vocabulary # => {:dog=>0, :cat=>1, :bird=>2, ...}
tfidf.idf # => {:dog=>1.405, :cat=>1.405, ...}
tfidf.feature_names # => [:dog, :cat, :bird, ...]
tfidf.num_documents # => 3
tfidf.fitted? # => true
```

### N-gram Support

```ruby
# Extract bigrams only
tfidf = Classifier::TFIDF.new(ngram_range: [2, 2])
tfidf.fit(["quick brown fox", "lazy brown dog"])
tfidf.vocabulary.keys
# => [:quick_brown, :brown_fox, :lazi_brown, :brown_dog]

# Unigrams through trigrams
tfidf = Classifier::TFIDF.new(ngram_range: [1, 3])
```

### Serialization

```ruby
# Save to JSON
json = tfidf.to_json
File.write("tfidf.json", json)

# Load from JSON
loaded = Classifier::TFIDF.from_json(File.read("tfidf.json"))

# Or use Marshal
data = Marshal.dump(tfidf)
loaded = Marshal.load(data)
```

## Persistence

Save and load classifiers with pluggable storage backends. Works with Bayes, LSI, and kNN classifiers.
Expand Down
1 change: 1 addition & 0 deletions lib/classifier.rb
Original file line number Diff line number Diff line change
Expand Up @@ -32,3 +32,4 @@
require 'classifier/bayes'
require 'classifier/lsi'
require 'classifier/knn'
require 'classifier/tfidf'
3 changes: 3 additions & 0 deletions lib/classifier/errors.rb
Original file line number Diff line number Diff line change
Expand Up @@ -13,4 +13,7 @@ class UnsavedChangesError < Error; end

# Raised when a storage operation fails
class StorageError < Error; end

# Raised when using an unfitted model
class NotFittedError < Error; end
end
252 changes: 252 additions & 0 deletions lib/classifier/tfidf.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,252 @@
# rbs_inline: enabled

# Author:: Lucas Carlson (mailto:lucas@rufy.com)
# Copyright:: Copyright (c) 2024 Lucas Carlson
# License:: LGPL

require 'json'

module Classifier
# TF-IDF vectorizer: transforms text to weighted feature vectors.
# Downweights common words, upweights discriminative terms.
#
# Example:
# tfidf = Classifier::TFIDF.new
# tfidf.fit(["Dogs are great pets", "Cats are independent"])
# tfidf.transform("Dogs are loyal") # => {:dog=>0.7071..., :loyal=>0.7071...}
#
class TFIDF
# @rbs @min_df: Integer | Float
# @rbs @max_df: Integer | Float
# @rbs @ngram_range: Array[Integer]
# @rbs @sublinear_tf: bool
# @rbs @vocabulary: Hash[Symbol, Integer]
# @rbs @idf: Hash[Symbol, Float]
# @rbs @num_documents: Integer
# @rbs @fitted: bool

attr_reader :vocabulary, :idf, :num_documents

# Creates a new TF-IDF vectorizer.
# - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
# - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
# - sublinear_tf: use 1 + log(tf) instead of raw term frequency
Comment on lines +30 to +33
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comments repeat the parameter names and what's obvious from code. Per style guide: remove comments that restate what code shows.

Suggested change
# Creates a new TF-IDF vectorizer.
# - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
# - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
# - sublinear_tf: use 1 + log(tf) instead of raw term frequency
# @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
# ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 30:33

Comment:
**style:** Comments repeat the parameter names and what's obvious from code. Per style guide: remove comments that restate what code shows.

```suggestion
    # @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
    #       ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
    def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

#
Comment on lines +30 to +34
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comments restate what's obvious from parameter names and code. Per style guide: remove.

Suggested change
# Creates a new TF-IDF vectorizer.
# - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
# - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
# - sublinear_tf: use 1 + log(tf) instead of raw term frequency
#
# @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
# ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 30:34

Comment:
**style:** Comments restate what's obvious from parameter names and code. Per style guide: remove.

```suggestion
    # @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
    #       ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
    def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

# @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
# ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
validate_df!(min_df, 'min_df')
validate_df!(max_df, 'max_df')
validate_ngram_range!(ngram_range)

@min_df = min_df
@max_df = max_df
@ngram_range = ngram_range
@sublinear_tf = sublinear_tf
@vocabulary = {}
@idf = {}
@num_documents = 0
@fitted = false
end

# Learns vocabulary and IDF weights from the corpus.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates the function name.

Suggested change
# Learns vocabulary and IDF weights from the corpus.
# @rbs (Array[String]) -> self

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 52:52

Comment:
**style:** Comment restates the function name.

```suggestion
    # @rbs (Array[String]) -> self
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

# @rbs (Array[String]) -> self
Comment on lines +52 to +53
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates the method signature.

Suggested change
# Learns vocabulary and IDF weights from the corpus.
# @rbs (Array[String]) -> self
# @rbs (Array[String]) -> self
def fit(documents)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 52:53

Comment:
**style:** Comment restates the method signature.

```suggestion
    # @rbs (Array[String]) -> self
    def fit(documents)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

def fit(documents)
raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Defensive type check unnecessary. Ruby is duck-typed - if caller passes non-iterable, it'll fail naturally on .each. Remove unless documents actually need array-specific methods beyond iteration.

Suggested change
raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
raise ArgumentError, 'documents cannot be empty' if documents.empty?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 55:55

Comment:
**style:** Defensive type check unnecessary. Ruby is duck-typed - if caller passes non-iterable, it'll fail naturally on `.each`. Remove unless documents actually need array-specific methods beyond iteration.

```suggestion
      raise ArgumentError, 'documents cannot be empty' if documents.empty?
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

raise ArgumentError, 'documents cannot be empty' if documents.empty?

@num_documents = documents.size
document_frequencies = Hash.new(0)

documents.each do |doc|
terms = extract_terms(doc)
terms.each_key { |term| document_frequencies[term] += 1 }
end

@vocabulary = {}
@idf = {}
vocab_index = 0

document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)

@vocabulary[term] = vocab_index
vocab_index += 1
Comment on lines +68 to +74
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Manual counter increment. Use each.with_index:

Suggested change
vocab_index = 0
document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
vocab_index += 1
document_frequencies.each.with_index do |term_df, vocab_index|
term, df = term_df
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 68:74

Comment:
**style:** Manual counter increment. Use `each.with_index`:

```suggestion
      document_frequencies.each.with_index do |term_df, vocab_index|
        term, df = term_df
        next unless within_df_bounds?(df, @num_documents)

        @vocabulary[term] = vocab_index
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +66 to +74
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Resetting @vocabulary and @idf to empty hashes just to reassign them. These were already initialized empty in initialize. Remove redundant assignments:

Suggested change
@vocabulary = {}
@idf = {}
vocab_index = 0
document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
vocab_index += 1
vocab_index = 0
document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
vocab_index += 1

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 66:74

Comment:
**style:** Resetting `@vocabulary` and `@idf` to empty hashes just to reassign them. These were already initialized empty in `initialize`. Remove redundant assignments:

```suggestion
      vocab_index = 0

      document_frequencies.each do |term, df|
        next unless within_df_bounds?(df, @num_documents)

        @vocabulary[term] = vocab_index
        vocab_index += 1
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.


# IDF: log((N + 1) / (df + 1)) + 1
@idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
Comment on lines +76 to +77
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: The formula is already in the code. "with smoothing" is obvious from the + 1 terms. Make concise or remove.

Suggested change
# IDF: log((N + 1) / (df + 1)) + 1
@idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
# IDF with smoothing
@idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 76:77

Comment:
**style:** The formula is already in the code. "with smoothing" is obvious from the `+ 1` terms. Make concise or remove.

```suggestion
        # IDF with smoothing
        @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

end
Comment on lines +66 to +78
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Intermediate variable vocab_index only increments. Consider using each.with_index:

Suggested change
@vocabulary = {}
@idf = {}
vocab_index = 0
document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
vocab_index += 1
# IDF: log((N + 1) / (df + 1)) + 1
@idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
end
selected_terms = document_frequencies.select { |term, df| within_df_bounds?(df, @num_documents) }
selected_terms.each.with_index do |(term, df), vocab_index|
@vocabulary[term] = vocab_index
# IDF with smoothing
@idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
end

Context Used: Context from dashboard - CLAUDE.md (source). Is the vocabulary index ordering requirement flexible enough for this refactor?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 66:78

Comment:
**style:** Intermediate variable `vocab_index` only increments. Consider using `each.with_index`:

```suggestion
      selected_terms = document_frequencies.select { |term, df| within_df_bounds?(df, @num_documents) }
      
      selected_terms.each.with_index do |(term, df), vocab_index|
        @vocabulary[term] = vocab_index

        # IDF with smoothing
        @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
      end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917)). Is the vocabulary index ordering requirement flexible enough for this refactor?

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.


@fitted = true
self
end

# Transforms a document into a normalized TF-IDF vector.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment just repeats what the code does.

Suggested change
# Transforms a document into a normalized TF-IDF vector.
# @rbs (String) -> Hash[Symbol, Float]

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 84:84

Comment:
**style:** Comment just repeats what the code does.

```suggestion
    # @rbs (String) -> Hash[Symbol, Float]
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

# @rbs (String) -> Hash[Symbol, Float]
Comment on lines +84 to +85
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates the method name.

Suggested change
# Transforms a document into a normalized TF-IDF vector.
# @rbs (String) -> Hash[Symbol, Float]
# @rbs (String) -> Hash[Symbol, Float]
def transform(document)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 84:85

Comment:
**style:** Comment restates the method name.

```suggestion
    # @rbs (String) -> Hash[Symbol, Float]
    def transform(document)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

def transform(document)
raise NotFittedError, 'TFIDF has not been fitted. Call fit first.' unless @fitted

terms = extract_terms(document)
result = {} #: Hash[Symbol, Float]

terms.each do |term, tf|
next unless @vocabulary.key?(term)

tf_value = @sublinear_tf && tf.positive? ? 1 + Math.log(tf) : tf.to_f
result[term] = (tf_value * @idf[term]).to_f
end

normalize_vector(result)
end

# Fits and transforms in one step.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates function signature.

Suggested change
# Fits and transforms in one step.
# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 102:102

Comment:
**style:** Comment restates function signature.

```suggestion
    # @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
Comment on lines +102 to +103
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates what the method name clearly indicates.

Suggested change
# Fits and transforms in one step.
# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
def fit_transform(documents)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 102:103

Comment:
**style:** Comment restates what the method name clearly indicates.

```suggestion
    # @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
    def fit_transform(documents)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

def fit_transform(documents)
fit(documents)
documents.map { |doc| transform(doc) }
end

# Returns vocabulary terms in index order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates what's obvious from the method name.

Suggested change
# Returns vocabulary terms in index order.
# @rbs () -> Array[Symbol]

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 109:109

Comment:
**style:** Comment restates what's obvious from the method name.

```suggestion
    # @rbs () -> Array[Symbol]
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

# @rbs () -> Array[Symbol]
Comment on lines +109 to +110
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates the method signature.

Suggested change
# Returns vocabulary terms in index order.
# @rbs () -> Array[Symbol]
# @rbs () -> Array[Symbol]
def feature_names

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 109:110

Comment:
**style:** Comment restates the method signature.

```suggestion
    # @rbs () -> Array[Symbol]
    def feature_names
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

def feature_names
@vocabulary.keys.sort_by { |term| @vocabulary[term] }
end

# @rbs () -> bool
def fitted?
@fitted
end

# @rbs (?untyped) -> Hash[Symbol, untyped]
def as_json(_options = nil)
{
version: 1,
type: 'tfidf',
min_df: @min_df,
max_df: @max_df,
ngram_range: @ngram_range,
sublinear_tf: @sublinear_tf,
vocabulary: @vocabulary,
idf: @idf,
num_documents: @num_documents,
fitted: @fitted
}
end

# @rbs (?untyped) -> String
def to_json(_options = nil)
JSON.generate(as_json)
end

# Loads a vectorizer from JSON.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates function signature.

Suggested change
# Loads a vectorizer from JSON.
# @rbs (String | Hash[String, untyped]) -> TFIDF

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 141:141

Comment:
**style:** Comment restates function signature.

```suggestion
    # @rbs (String | Hash[String, untyped]) -> TFIDF
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

# @rbs (String | Hash[String, untyped]) -> TFIDF
Comment on lines +141 to +142
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates method signature.

Suggested change
# Loads a vectorizer from JSON.
# @rbs (String | Hash[String, untyped]) -> TFIDF
# @rbs (String | Hash[String, untyped]) -> TFIDF
def self.from_json(json)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 141:142

Comment:
**style:** Comment restates method signature.

```suggestion
    # @rbs (String | Hash[String, untyped]) -> TFIDF
    def self.from_json(json)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

def self.from_json(json)
data = json.is_a?(String) ? JSON.parse(json) : json
raise ArgumentError, "Invalid vectorizer type: #{data['type']}" unless data['type'] == 'tfidf'

instance = new(
min_df: data['min_df'],
max_df: data['max_df'],
ngram_range: data['ngram_range'],
sublinear_tf: data['sublinear_tf']
)

instance.instance_variable_set(:@vocabulary, symbolize_keys(data['vocabulary']))
instance.instance_variable_set(:@idf, symbolize_keys(data['idf']))
instance.instance_variable_set(:@num_documents, data['num_documents'])
instance.instance_variable_set(:@fitted, data['fitted'])

instance
end

# @rbs () -> Array[untyped]
def marshal_dump
[@min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted]
end

# @rbs (Array[untyped]) -> void
def marshal_load(data)
@min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted = data
end

private

# @rbs (String) -> Hash[Symbol, Integer]
def extract_terms(document)
result = Hash.new(0)

if @ngram_range[0] <= 1
word_hash = document.clean_word_hash
word_hash.each { |term, count| result[term] += count }
end

return result if @ngram_range[1] <= 1

tokens = tokenize_for_ngrams(document)
(2..@ngram_range[1]).each do |n|
next if n < @ngram_range[0]

generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
end

result
end

# @rbs (String) -> Array[String]
def tokenize_for_ngrams(document)
document
.gsub(/[^\w\s]/, '')
.split
.map(&:downcase)
.reject { |w| w.length <= 2 || String::CORPUS_SKIP_WORDS.include?(w) }
.map(&:stem)
end

# @rbs (Array[String], Integer) -> Array[Symbol]
def generate_ngrams(tokens, n) # rubocop:disable Naming/MethodParameterName
return [] if tokens.size < n

tokens.each_cons(n).map { |gram| gram.join('_').intern }
end

# @rbs (Integer, Integer) -> bool
def within_df_bounds?(doc_freq, num_docs)
doc_freq.between?(
@min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df,
@max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df
)
end

# @rbs (Hash[Symbol, Float]) -> Hash[Symbol, Float]
def normalize_vector(vector)
return vector if vector.empty?

magnitude = Math.sqrt(vector.values.sum { |v| v * v })
return vector if magnitude.zero?

vector.transform_values { |v| v / magnitude }
end

# @rbs (Integer | Float, String) -> void
def validate_df!(value, name)
raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0)
raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative?
end
Comment on lines +230 to +235
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Nested conditionals - use early returns:

Suggested change
# @rbs (Integer | Float, String) -> void
def validate_df!(value, name)
raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0)
raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative?
end
def validate_df!(value, name)
raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
return if value.is_a?(Integer) && !value.negative?
return if value.is_a?(Float) && value.between?(0.0, 1.0)
raise ArgumentError, value.is_a?(Float) ? "#{name} must be between 0.0 and 1.0" : "#{name} must be non-negative"
end

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 230:235

Comment:
**style:** Nested conditionals - use early returns:

```suggestion
    def validate_df!(value, name)
      raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
      return if value.is_a?(Integer) && !value.negative?
      return if value.is_a?(Float) && value.between?(0.0, 1.0)

      raise ArgumentError, value.is_a?(Float) ? "#{name} must be between 0.0 and 1.0" : "#{name} must be non-negative"
    end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.


# @rbs (Array[Integer]) -> void
def validate_ngram_range!(range)
raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
unless range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range values must be positive integers'
end
Comment on lines +240 to +242
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Double negative with unless. Use early return for clearer logic:

Suggested change
unless range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range values must be positive integers'
end
return if range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range values must be positive integers'

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 240:242

Comment:
**style:** Double negative with `unless`. Use early return for clearer logic:

```suggestion
      return if range.all?(Integer) && range.all?(&:positive?)

      raise ArgumentError, 'ngram_range values must be positive integers'
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end

# @rbs (Hash[String, untyped]) -> Hash[Symbol, untyped]
def self.symbolize_keys(hash)
hash.transform_keys(&:to_sym)
end
private_class_method :symbolize_keys
end
end
Loading