Skip to content

Conversation

@cardmagic
Copy link
Owner

Summary

Adds a standalone TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer for text feature extraction.

  • fit/transform/fit_transform API (scikit-learn style)
  • Vocabulary filtering via min_df/max_df thresholds
  • N-gram support (unigrams, bigrams, trigrams)
  • Sublinear TF scaling (1 + log(tf))
  • L2 normalized output vectors
  • JSON and Marshal serialization

Leverages existing word_hash infrastructure for term frequency extraction with stemming and stopword removal.

Example

tfidf = Classifier::TFIDF.new(sublinear_tf: true)
tfidf.fit(["Dogs are great pets", "Cats are independent"])
vector = tfidf.transform("Dogs are loyal")
# => {:dog=>0.7071..., :loyal=>0.7071...}

Test plan

  • 37 new tests covering all features
  • All 327 tests pass
  • Rubocop clean
  • README updated with TF-IDF section

Closes #104

Adds standalone TF-IDF vectorizer for text feature extraction.
Foundation for classic text classification approaches like logistic
regression and improved LSI quality.

Features:
- fit/transform/fit_transform API (scikit-learn style)
- Vocabulary filtering via min_df/max_df thresholds
- N-gram support (unigrams, bigrams, trigrams)
- Sublinear TF scaling (1 + log(tf))
- L2 normalized output vectors
- JSON and Marshal serialization

Leverages existing word_hash infrastructure for term frequency
extraction with stemming and stopword removal.

Closes #104
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 29, 2025

Greptile Summary

Adds a well-designed TF-IDF vectorizer with scikit-learn style API (fit/transform/fit_transform). The implementation is solid with comprehensive test coverage (37 tests), proper serialization support (JSON and Marshal), and integration with existing word_hash infrastructure.

Key additions:

  • Vocabulary filtering via min_df/max_df thresholds
  • N-gram support (unigrams through trigrams)
  • Sublinear TF scaling option
  • L2 normalized output vectors
  • Complete documentation in README

Code quality observations:
The PR has already been through several review rounds addressing most style issues. A few minor opportunities remain to simplify the code further by removing defensive type checks and redundant assignments.

Confidence Score: 4/5

  • This PR is safe to merge with minor style refinements recommended
  • The implementation is functionally solid with comprehensive test coverage and proper integration. The core TF-IDF algorithm is correct, and serialization works properly. Score reduced from 5 to 4 only due to minor style issues (defensive type checks, redundant assignments, some double negatives) that don't affect correctness but could be simplified per the style guide.
  • lib/classifier/tfidf.rb has a few minor style improvements pending from previous review threads, but no files have critical issues

Important Files Changed

Filename Overview
lib/classifier/tfidf.rb New TF-IDF vectorizer implementation with fit/transform API - contains excessive comments and some nested conditionals that were flagged in previous reviews
README.md Added comprehensive TF-IDF documentation with examples - well-structured and clear
test/tfidf/tfidf_test.rb Comprehensive test suite with 37 tests covering all TF-IDF features - thorough coverage

Sequence Diagram

sequenceDiagram
    participant User
    participant TFIDF
    participant WordHash as String#clean_word_hash
    participant Math

    Note over User,TFIDF: Fit Phase - Learn Vocabulary & IDF
    User->>TFIDF: fit(documents)
    TFIDF->>TFIDF: validate documents array
    loop For each document
        TFIDF->>TFIDF: extract_terms(doc)
        TFIDF->>WordHash: clean_word_hash (unigrams)
        WordHash-->>TFIDF: term frequencies
        alt ngram_range[1] > 1
            TFIDF->>TFIDF: tokenize_for_ngrams(doc)
            TFIDF->>TFIDF: generate_ngrams(tokens, n)
        end
        TFIDF->>TFIDF: count document frequencies
    end
    loop For each term
        TFIDF->>TFIDF: within_df_bounds?(df, num_docs)
        alt term passes min_df/max_df filters
            TFIDF->>Math: log((N+1)/(df+1)) + 1
            Math-->>TFIDF: IDF value
            TFIDF->>TFIDF: add to vocabulary & idf
        end
    end
    TFIDF->>TFIDF: set fitted = true
    TFIDF-->>User: self

    Note over User,TFIDF: Transform Phase - Convert Text to TF-IDF Vector
    User->>TFIDF: transform(document)
    TFIDF->>TFIDF: check fitted?
    TFIDF->>TFIDF: extract_terms(document)
    TFIDF->>WordHash: clean_word_hash
    WordHash-->>TFIDF: term frequencies
    loop For each term in vocabulary
        alt sublinear_tf enabled
            TFIDF->>Math: 1 + log(tf)
            Math-->>TFIDF: sublinear TF
        end
        TFIDF->>TFIDF: tf_value * idf[term]
    end
    TFIDF->>TFIDF: normalize_vector (L2 norm)
    TFIDF->>Math: sqrt(sum of squares)
    Math-->>TFIDF: magnitude
    TFIDF->>TFIDF: divide by magnitude
    TFIDF-->>User: normalized TF-IDF vector
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 10 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +30 to +33
# Creates a new TF-IDF vectorizer.
# - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
# - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
# - sublinear_tf: use 1 + log(tf) instead of raw term frequency
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comments repeat the parameter names and what's obvious from code. Per style guide: remove comments that restate what code shows.

Suggested change
# Creates a new TF-IDF vectorizer.
# - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
# - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
# - sublinear_tf: use 1 + log(tf) instead of raw term frequency
# @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
# ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 30:33

Comment:
**style:** Comments repeat the parameter names and what's obvious from code. Per style guide: remove comments that restate what code shows.

```suggestion
    # @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
    #       ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
    def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@fitted = false
end

# Learns vocabulary and IDF weights from the corpus.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates the function name.

Suggested change
# Learns vocabulary and IDF weights from the corpus.
# @rbs (Array[String]) -> self

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 52:52

Comment:
**style:** Comment restates the function name.

```suggestion
    # @rbs (Array[String]) -> self
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@vocabulary[term] = vocab_index
vocab_index += 1

# IDF: log((N + 1) / (df + 1)) + 1 with smoothing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment explains the formula but "with smoothing" is already obvious from the + 1 terms. Make more concise.

Suggested change
# IDF: log((N + 1) / (df + 1)) + 1 with smoothing
# IDF: log((N + 1) / (df + 1)) + 1

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 76:76

Comment:
**style:** Comment explains the formula but "with smoothing" is already obvious from the `+ 1` terms. Make more concise.

```suggestion
        # IDF: log((N + 1) / (df + 1)) + 1
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

self
end

# Transforms a document into a normalized TF-IDF vector.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment just repeats what the code does.

Suggested change
# Transforms a document into a normalized TF-IDF vector.
# @rbs (String) -> Hash[Symbol, Float]

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 84:84

Comment:
**style:** Comment just repeats what the code does.

```suggestion
    # @rbs (String) -> Hash[Symbol, Float]
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

normalize_vector(result)
end

# Fits and transforms in one step.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates function signature.

Suggested change
# Fits and transforms in one step.
# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 102:102

Comment:
**style:** Comment restates function signature.

```suggestion
    # @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

documents.map { |doc| transform(doc) }
end

# Returns vocabulary terms in index order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates what's obvious from the method name.

Suggested change
# Returns vocabulary terms in index order.
# @rbs () -> Array[Symbol]

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 109:109

Comment:
**style:** Comment restates what's obvious from the method name.

```suggestion
    # @rbs () -> Array[Symbol]
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

as_json.to_json
end

# Loads a vectorizer from JSON.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates function signature.

Suggested change
# Loads a vectorizer from JSON.
# @rbs (String | Hash[String, untyped]) -> TFIDF

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 141:141

Comment:
**style:** Comment restates function signature.

```suggestion
    # @rbs (String | Hash[String, untyped]) -> TFIDF
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 175 to 192
def extract_terms(document)
result = Hash.new(0)

if @ngram_range[0] <= 1
word_hash = document.clean_word_hash
word_hash.each { |term, count| result[term] += count }
end

if @ngram_range[1] > 1
tokens = tokenize_for_ngrams(document)
(2..@ngram_range[1]).each do |n|
next if n < @ngram_range[0]

generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
end
end

result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Nested conditionals - refactor with early returns per style guide.

Suggested change
def extract_terms(document)
result = Hash.new(0)
if @ngram_range[0] <= 1
word_hash = document.clean_word_hash
word_hash.each { |term, count| result[term] += count }
end
if @ngram_range[1] > 1
tokens = tokenize_for_ngrams(document)
(2..@ngram_range[1]).each do |n|
next if n < @ngram_range[0]
generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
end
end
result
# @rbs (String) -> Hash[Symbol, Integer]
def extract_terms(document)
result = Hash.new(0)
return result if @ngram_range[1] == 1 && @ngram_range[0] > 1
if @ngram_range[0] <= 1
word_hash = document.clean_word_hash
word_hash.each { |term, count| result[term] += count }
end
return result if @ngram_range[1] <= 1
tokens = tokenize_for_ngrams(document)
(2..@ngram_range[1]).each do |n|
next if n < @ngram_range[0]
generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
end
result
end

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 175:192

Comment:
**style:** Nested conditionals - refactor with early returns per style guide.

```suggestion
    # @rbs (String) -> Hash[Symbol, Integer]
    def extract_terms(document)
      result = Hash.new(0)

      return result if @ngram_range[1] == 1 && @ngram_range[0] > 1
      
      if @ngram_range[0] <= 1
        word_hash = document.clean_word_hash
        word_hash.each { |term, count| result[term] += count }
      end

      return result if @ngram_range[1] <= 1

      tokens = tokenize_for_ngrams(document)
      (2..@ngram_range[1]).each do |n|
        next if n < @ngram_range[0]
        generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
      end

      result
    end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 230 to 239
# @rbs (Integer | Float, String) -> void
def validate_df!(value, name)
if value.is_a?(Float)
raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0)
elsif value.is_a?(Integer)
raise ArgumentError, "#{name} must be non-negative" if value.negative?
else
raise ArgumentError, "#{name} must be an Integer or Float"
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Nested conditionals - use early returns instead.

Suggested change
# @rbs (Integer | Float, String) -> void
def validate_df!(value, name)
if value.is_a?(Float)
raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0)
elsif value.is_a?(Integer)
raise ArgumentError, "#{name} must be non-negative" if value.negative?
else
raise ArgumentError, "#{name} must be an Integer or Float"
end
end
# @rbs (Integer | Float, String) -> void
def validate_df!(value, name)
raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
if value.is_a?(Float)
raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0)
return
end
raise ArgumentError, "#{name} must be non-negative" if value.negative?
end

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 230:239

Comment:
**style:** Nested conditionals - use early returns instead.

```suggestion
    # @rbs (Integer | Float, String) -> void
    def validate_df!(value, name)
      raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
      
      if value.is_a?(Float)
        raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0)
        return
      end
      
      raise ArgumentError, "#{name} must be non-negative" if value.negative?
    end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 242 to 250
def validate_ngram_range!(range)
valid_structure = range.is_a?(Array) && range.size == 2
raise ArgumentError, 'ngram_range must be an array of two integers' unless valid_structure

valid_values = range.all? { |v| v.is_a?(Integer) && v.positive? }
raise ArgumentError, 'ngram_range values must be positive integers' unless valid_values

raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Extract intermediate variables that are only used once - simplify by inlining.

Suggested change
def validate_ngram_range!(range)
valid_structure = range.is_a?(Array) && range.size == 2
raise ArgumentError, 'ngram_range must be an array of two integers' unless valid_structure
valid_values = range.all? { |v| v.is_a?(Integer) && v.positive? }
raise ArgumentError, 'ngram_range values must be positive integers' unless valid_values
raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end
# @rbs (Array[Integer]) -> void
def validate_ngram_range!(range)
raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? { |v| v.is_a?(Integer) && v.positive? }
raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 242:250

Comment:
**style:** Extract intermediate variables that are only used once - simplify by inlining.

```suggestion
    # @rbs (Array[Integer]) -> void
    def validate_ngram_range!(range)
      raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
      raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? { |v| v.is_a?(Integer) && v.positive? }
      raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
    end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Replace nested if/elsif/else conditionals in validate_df! and
validate_ngram_range! with guard clauses for better readability.
Inline single-use intermediate variables. Remove redundant "with
smoothing" from IDF formula comment.

Addresses style feedback from PR #107 review.
@cardmagic
Copy link
Owner Author

@greptile-apps Thanks for the review! I've addressed the structural issues:

  • ✅ Removed "with smoothing" from the IDF comment (line 76)
  • ✅ Refactored validate_df! to use early returns instead of nested conditionals
  • ✅ Inlined intermediate variables in validate_ngram_range!

However, I've intentionally kept the method documentation comments (lines 52, 84, 102, 109, 141). These serve as RDoc documentation for the public API and are valuable for users of the gem who rely on generated documentation. The style guide's guidance against redundant comments applies to inline implementation comments, not API documentation.

Could you please re-review when you get a chance?

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 12 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 175 to 192
def extract_terms(document)
result = Hash.new(0)

if @ngram_range[0] <= 1
word_hash = document.clean_word_hash
word_hash.each { |term, count| result[term] += count }
end

if @ngram_range[1] > 1
tokens = tokenize_for_ngrams(document)
(2..@ngram_range[1]).each do |n|
next if n < @ngram_range[0]

generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
end
end

result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Nested conditionals - flatten with early returns:

Suggested change
def extract_terms(document)
result = Hash.new(0)
if @ngram_range[0] <= 1
word_hash = document.clean_word_hash
word_hash.each { |term, count| result[term] += count }
end
if @ngram_range[1] > 1
tokens = tokenize_for_ngrams(document)
(2..@ngram_range[1]).each do |n|
next if n < @ngram_range[0]
generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
end
end
result
def extract_terms(document)
result = Hash.new(0)
if @ngram_range[0] <= 1
word_hash = document.clean_word_hash
word_hash.each { |term, count| result[term] += count }
end
return result if @ngram_range[1] <= 1
tokens = tokenize_for_ngrams(document)
(2..@ngram_range[1]).each do |n|
next if n < @ngram_range[0]
generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
end
result
end

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 175:192

Comment:
**style:** Nested conditionals - flatten with early returns:

```suggestion
    def extract_terms(document)
      result = Hash.new(0)

      if @ngram_range[0] <= 1
        word_hash = document.clean_word_hash
        word_hash.each { |term, count| result[term] += count }
      end

      return result if @ngram_range[1] <= 1

      tokens = tokenize_for_ngrams(document)
      (2..@ngram_range[1]).each do |n|
        next if n < @ngram_range[0]

        generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
      end

      result
    end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +230 to +235
# @rbs (Integer | Float, String) -> void
def validate_df!(value, name)
raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0)
raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative?
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Nested conditionals - use early returns:

Suggested change
# @rbs (Integer | Float, String) -> void
def validate_df!(value, name)
raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0)
raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative?
end
def validate_df!(value, name)
raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
return if value.is_a?(Integer) && !value.negative?
return if value.is_a?(Float) && value.between?(0.0, 1.0)
raise ArgumentError, value.is_a?(Float) ? "#{name} must be between 0.0 and 1.0" : "#{name} must be non-negative"
end

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 230:235

Comment:
**style:** Nested conditionals - use early returns:

```suggestion
    def validate_df!(value, name)
      raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
      return if value.is_a?(Integer) && !value.negative?
      return if value.is_a?(Float) && value.between?(0.0, 1.0)

      raise ArgumentError, value.is_a?(Float) ? "#{name} must be between 0.0 and 1.0" : "#{name} must be non-negative"
    end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 238 to 244
def validate_ngram_range!(range)
raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? do |v|
v.is_a?(Integer) && v.positive?
end
raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Nested conditionals - flatten validation:

Suggested change
def validate_ngram_range!(range)
raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? do |v|
v.is_a?(Integer) && v.positive?
end
raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end
def validate_ngram_range!(range)
raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
raise ArgumentError, 'ngram_range values must be positive integers' unless range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 238:244

Comment:
**style:** Nested conditionals - flatten validation:

```suggestion
    def validate_ngram_range!(range)
      raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
      raise ArgumentError, 'ngram_range values must be positive integers' unless range.all?(Integer) && range.all?(&:positive?)
      raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
    end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 214 to 217
min_count = @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df
max_count = @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df

doc_freq.between?(min_count, max_count)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Extract intermediate variables that are used only once - inline directly:

Suggested change
min_count = @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df
max_count = @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df
doc_freq.between?(min_count, max_count)
def within_df_bounds?(doc_freq, num_docs)
doc_freq.between?(
@min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df,
@max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df
)
end

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 214:217

Comment:
**style:** Extract intermediate variables that are used only once - inline directly:

```suggestion
    def within_df_bounds?(doc_freq, num_docs)
      doc_freq.between?(
        @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df,
        @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df
      )
    end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +30 to +34
# Creates a new TF-IDF vectorizer.
# - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
# - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
# - sublinear_tf: use 1 + log(tf) instead of raw term frequency
#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comments restate what's obvious from parameter names and code. Per style guide: remove.

Suggested change
# Creates a new TF-IDF vectorizer.
# - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
# - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
# - sublinear_tf: use 1 + log(tf) instead of raw term frequency
#
# @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
# ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 30:34

Comment:
**style:** Comments restate what's obvious from parameter names and code. Per style guide: remove.

```suggestion
    # @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
    #       ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
    def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +84 to +85
# Transforms a document into a normalized TF-IDF vector.
# @rbs (String) -> Hash[Symbol, Float]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates the method name.

Suggested change
# Transforms a document into a normalized TF-IDF vector.
# @rbs (String) -> Hash[Symbol, Float]
# @rbs (String) -> Hash[Symbol, Float]
def transform(document)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 84:85

Comment:
**style:** Comment restates the method name.

```suggestion
    # @rbs (String) -> Hash[Symbol, Float]
    def transform(document)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +102 to +103
# Fits and transforms in one step.
# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates what the method name clearly indicates.

Suggested change
# Fits and transforms in one step.
# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
def fit_transform(documents)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 102:103

Comment:
**style:** Comment restates what the method name clearly indicates.

```suggestion
    # @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
    def fit_transform(documents)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +109 to +110
# Returns vocabulary terms in index order.
# @rbs () -> Array[Symbol]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates the method signature.

Suggested change
# Returns vocabulary terms in index order.
# @rbs () -> Array[Symbol]
# @rbs () -> Array[Symbol]
def feature_names

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 109:110

Comment:
**style:** Comment restates the method signature.

```suggestion
    # @rbs () -> Array[Symbol]
    def feature_names
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +141 to +142
# Loads a vectorizer from JSON.
# @rbs (String | Hash[String, untyped]) -> TFIDF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Comment restates method signature.

Suggested change
# Loads a vectorizer from JSON.
# @rbs (String | Hash[String, untyped]) -> TFIDF
# @rbs (String | Hash[String, untyped]) -> TFIDF
def self.from_json(json)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 141:142

Comment:
**style:** Comment restates method signature.

```suggestion
    # @rbs (String | Hash[String, untyped]) -> TFIDF
    def self.from_json(json)
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +66 to +78
@vocabulary = {}
@idf = {}
vocab_index = 0

document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)

@vocabulary[term] = vocab_index
vocab_index += 1

# IDF: log((N + 1) / (df + 1)) + 1
@idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Intermediate variable vocab_index only increments. Consider using each.with_index:

Suggested change
@vocabulary = {}
@idf = {}
vocab_index = 0
document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
vocab_index += 1
# IDF: log((N + 1) / (df + 1)) + 1
@idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
end
selected_terms = document_frequencies.select { |term, df| within_df_bounds?(df, @num_documents) }
selected_terms.each.with_index do |(term, df), vocab_index|
@vocabulary[term] = vocab_index
# IDF with smoothing
@idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
end

Context Used: Context from dashboard - CLAUDE.md (source). Is the vocabulary index ordering requirement flexible enough for this refactor?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 66:78

Comment:
**style:** Intermediate variable `vocab_index` only increments. Consider using `each.with_index`:

```suggestion
      selected_terms = document_frequencies.select { |term, df| within_df_bounds?(df, @num_documents) }
      
      selected_terms.each.with_index do |(term, df), vocab_index|
        @vocabulary[term] = vocab_index

        # IDF with smoothing
        @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
      end
```

**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917)). Is the vocabulary index ordering requirement flexible enough for this refactor?

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

- Add type annotation to empty hash in transform method
- Use JSON.generate instead of Hash#to_json for type safety
- Cast multiplication result to Float for type checker
- Auto-fix Minitest assertion style in tests
@cardmagic
Copy link
Owner Author

@greptile-apps Thanks for the follow-up review! Here's what was addressed:

Applied:

  • extract_terms: Early return when ngram_range[1] <= 1
  • validate_ngram_range!: Simplified with range.all?(Integer) && range.all?(&:positive?)
  • within_df_bounds?: Inlined single-use variables

Not applied (intentionally kept):

  1. validate_df! (lines 230-235): The current code already uses guard clauses - each line raises immediately on invalid input. The suggestion adds early returns for valid cases then a conditional raise, which is more complex, not simpler.

  2. initialize comments (lines 30-34): These explain parameter semantics, not just names:

    • min_df/max_df: Integer vs Float behavior differs (count vs proportion)
    • ngram_range: Example values like [1,2] help users
    • sublinear_tf: Explains the 1 + log(tf) formula

    This is useful API documentation for gem users, not redundant commentary.

Could you please re-review?

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +68 to +74
vocab_index = 0

document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)

@vocabulary[term] = vocab_index
vocab_index += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Manual counter increment. Use each.with_index:

Suggested change
vocab_index = 0
document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
vocab_index += 1
document_frequencies.each.with_index do |term_df, vocab_index|
term, df = term_df
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 68:74

Comment:
**style:** Manual counter increment. Use `each.with_index`:

```suggestion
      document_frequencies.each.with_index do |term_df, vocab_index|
        term, df = term_df
        next unless within_df_bounds?(df, @num_documents)

        @vocabulary[term] = vocab_index
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@cardmagic
Copy link
Owner Author

@greptile-apps One more note on the with_index suggestion for the vocabulary indexing (lines 68-74):

This change would introduce a bug. The current manual counter only increments when a term passes the within_df_bounds? check. Using with_index would assign indices for all terms including skipped ones:

Current (correct):

Term A passes → index 0
Term B skipped → no index  
Term C passes → index 1

With with_index (broken):

Term A passes → index 0
Term B skipped → no index
Term C passes → index 2 ❌ (gap in indices)

The vocabulary indices need to be contiguous for the vectorizer to work correctly.

Could you please re-review when you get a chance?

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

# Learns vocabulary and IDF weights from the corpus.
# @rbs (Array[String]) -> self
def fit(documents)
raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Defensive type check unnecessary. Ruby is duck-typed - if caller passes non-iterable, it'll fail naturally on .each. Remove unless documents actually need array-specific methods beyond iteration.

Suggested change
raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
raise ArgumentError, 'documents cannot be empty' if documents.empty?

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 55:55

Comment:
**style:** Defensive type check unnecessary. Ruby is duck-typed - if caller passes non-iterable, it'll fail naturally on `.each`. Remove unless documents actually need array-specific methods beyond iteration.

```suggestion
      raise ArgumentError, 'documents cannot be empty' if documents.empty?
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +240 to +242
unless range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range values must be positive integers'
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Double negative with unless. Use early return for clearer logic:

Suggested change
unless range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range values must be positive integers'
end
return if range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range values must be positive integers'

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 240:242

Comment:
**style:** Double negative with `unless`. Use early return for clearer logic:

```suggestion
      return if range.all?(Integer) && range.all?(&:positive?)

      raise ArgumentError, 'ngram_range values must be positive integers'
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +66 to +74
@vocabulary = {}
@idf = {}
vocab_index = 0

document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)

@vocabulary[term] = vocab_index
vocab_index += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: Resetting @vocabulary and @idf to empty hashes just to reassign them. These were already initialized empty in initialize. Remove redundant assignments:

Suggested change
@vocabulary = {}
@idf = {}
vocab_index = 0
document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
vocab_index += 1
vocab_index = 0
document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
vocab_index += 1

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 66:74

Comment:
**style:** Resetting `@vocabulary` and `@idf` to empty hashes just to reassign them. These were already initialized empty in `initialize`. Remove redundant assignments:

```suggestion
      vocab_index = 0

      document_frequencies.each do |term, df|
        next unless within_df_bounds?(df, @num_documents)

        @vocabulary[term] = vocab_index
        vocab_index += 1
```

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@cardmagic cardmagic merged commit 04324e6 into master Dec 29, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Add TF-IDF vectorizer

2 participants