feat(tfidf): add TF-IDF vectorizer #107

cardmagic · 2025-12-29T03:06:03Z

Summary

Adds a standalone TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer for text feature extraction.

fit/transform/fit_transform API (scikit-learn style)
Vocabulary filtering via min_df/max_df thresholds
N-gram support (unigrams, bigrams, trigrams)
Sublinear TF scaling (1 + log(tf))
L2 normalized output vectors
JSON and Marshal serialization

Leverages existing word_hash infrastructure for term frequency extraction with stemming and stopword removal.
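
A minimal sketch of how n-gram terms could be built from an already tokenized document. The helper name `ngrams` and the convention of joining tokens with underscores into symbols are assumptions made for illustration; the PR's own `generate_ngrams` may differ.

```ruby
# Hypothetical helper (not the PR's generate_ngrams): slide a window of
# size n over the tokens and join each window into a single symbol.
def ngrams(tokens, n)
  tokens.each_cons(n).map { |gram| gram.join('_').to_sym }
end

tokens = %w[dog great pet] # assumed to be stemmed, stopword-free tokens

bigrams  = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Count occurrences the same way a term-frequency hash would.
counts = (bigrams + trigrams).tally
```

A document with fewer than `n` tokens simply yields no n-grams, so short inputs need no special casing.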

Example

tfidf = Classifier::TFIDF.new(sublinear_tf: true)
tfidf.fit(["Dogs are great pets", "Cats are independent"])
vector = tfidf.transform("Dogs are loyal")
# => {:dog=>0.7071..., :loyal=>0.7071...}

Test plan

37 new tests covering all features
All 327 tests pass
Rubocop clean
README updated with TF-IDF section

Closes #104

Adds standalone TF-IDF vectorizer for text feature extraction. Foundation for classic text classification approaches like logistic regression and improved LSI quality.

Features:
- fit/transform/fit_transform API (scikit-learn style)
- Vocabulary filtering via min_df/max_df thresholds
- N-gram support (unigrams, bigrams, trigrams)
- Sublinear TF scaling (1 + log(tf))
- L2 normalized output vectors
- JSON and Marshal serialization

Leverages existing word_hash infrastructure for term frequency extraction with stemming and stopword removal.

Closes #104

greptile-apps · 2025-12-29T03:10:06Z

Greptile Summary

Adds a well-designed TF-IDF vectorizer with scikit-learn style API (fit/transform/fit_transform). The implementation is solid with comprehensive test coverage (37 tests), proper serialization support (JSON and Marshal), and integration with existing word_hash infrastructure.

Key additions:

Vocabulary filtering via min_df/max_df thresholds
N-gram support (unigrams through trigrams)
Sublinear TF scaling option
L2 normalized output vectors
Complete documentation in README

Code quality observations:
The PR has already been through several review rounds addressing most style issues. A few minor opportunities remain to simplify the code further by removing defensive type checks and redundant assignments.

Confidence Score: 4/5

This PR is safe to merge with minor style refinements recommended
The implementation is functionally solid with comprehensive test coverage and proper integration. The core TF-IDF algorithm is correct, and serialization works properly. Score reduced from 5 to 4 only due to minor style issues (defensive type checks, redundant assignments, some double negatives) that don't affect correctness but could be simplified per the style guide.
lib/classifier/tfidf.rb has a few minor style improvements pending from previous review threads, but no files have critical issues

Important Files Changed

Filename	Overview
lib/classifier/tfidf.rb	New TF-IDF vectorizer implementation with fit/transform API - contains excessive comments and some nested conditionals that were flagged in previous reviews
README.md	Added comprehensive TF-IDF documentation with examples - well-structured and clear
test/tfidf/tfidf_test.rb	Comprehensive test suite with 37 tests covering all TF-IDF features - thorough coverage

Sequence Diagram

sequenceDiagram
    participant User
    participant TFIDF
    participant WordHash as String#clean_word_hash
    participant Math

    Note over User,TFIDF: Fit Phase - Learn Vocabulary & IDF
    User->>TFIDF: fit(documents)
    TFIDF->>TFIDF: validate documents array
    loop For each document
        TFIDF->>TFIDF: extract_terms(doc)
        TFIDF->>WordHash: clean_word_hash (unigrams)
        WordHash-->>TFIDF: term frequencies
        alt ngram_range[1] > 1
            TFIDF->>TFIDF: tokenize_for_ngrams(doc)
            TFIDF->>TFIDF: generate_ngrams(tokens, n)
        end
        TFIDF->>TFIDF: count document frequencies
    end
    loop For each term
        TFIDF->>TFIDF: within_df_bounds?(df, num_docs)
        alt term passes min_df/max_df filters
            TFIDF->>Math: log((N+1)/(df+1)) + 1
            Math-->>TFIDF: IDF value
            TFIDF->>TFIDF: add to vocabulary & idf
        end
    end
    TFIDF->>TFIDF: set fitted = true
    TFIDF-->>User: self

    Note over User,TFIDF: Transform Phase - Convert Text to TF-IDF Vector
    User->>TFIDF: transform(document)
    TFIDF->>TFIDF: check fitted?
    TFIDF->>TFIDF: extract_terms(document)
    TFIDF->>WordHash: clean_word_hash
    WordHash-->>TFIDF: term frequencies
    loop For each term in vocabulary
        alt sublinear_tf enabled
            TFIDF->>Math: 1 + log(tf)
            Math-->>TFIDF: sublinear TF
        end
        TFIDF->>TFIDF: tf_value * idf[term]
    end
    TFIDF->>TFIDF: normalize_vector (L2 norm)
    TFIDF->>Math: sqrt(sum of squares)
    Math-->>TFIDF: magnitude
    TFIDF->>TFIDF: divide by magnitude
    TFIDF-->>User: normalized TF-IDF vector
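
The weighting steps in the diagram can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the PR's code: the method names are hypothetical, the natural logarithm is assumed (Ruby's `Math.log` default), and returning a zero vector unchanged is an assumption about edge-case handling.

```ruby
# Smoothed IDF as written in the diagram: log((N + 1) / (df + 1)) + 1.
def smoothed_idf(num_docs, doc_freq)
  Math.log((num_docs + 1).to_f / (doc_freq + 1)) + 1
end

# Raw or sublinear (1 + log(tf)) term frequency, multiplied by the IDF.
def term_weight(tf, idf, sublinear: false)
  tf_value = sublinear ? 1 + Math.log(tf) : tf.to_f
  tf_value * idf
end

# Divide every weight by the vector's Euclidean length (L2 norm).
def l2_normalize(vector)
  magnitude = Math.sqrt(vector.values.sum { |v| v * v })
  return vector if magnitude.zero? # assumed handling of empty/zero vectors

  vector.transform_values { |v| v / magnitude }
end

# Hypothetical corpus statistics: 2 documents, :dog in one, :pet in both.
idf    = { dog: smoothed_idf(2, 1), pet: smoothed_idf(2, 2) }
counts = { dog: 2, pet: 1 }

raw    = counts.to_h { |term, tf| [term, term_weight(tf, idf[term], sublinear: true)] }
vector = l2_normalize(raw)
```

A term that appears in every document gets an IDF of exactly 1 under this formula, so it is down-weighted relative to rarer terms but never zeroed out.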

greptile-apps

5 files reviewed, 10 comments

greptile-apps · 2025-12-29T03:09:59Z

lib/classifier/tfidf.rb

+    # Creates a new TF-IDF vectorizer.
+    # - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
+    # - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
+    # - sublinear_tf: use 1 + log(tf) instead of raw term frequency


style: Comments repeat the parameter names and what's obvious from code. Per style guide: remove comments that restate what code shows.

Suggested change

-    # Creates a new TF-IDF vectorizer.
-    # - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
-    # - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
-    # - sublinear_tf: use 1 + log(tf) instead of raw term frequency
+    # @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
+    # ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
+    def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)

Context Used: Context from dashboard - CLAUDE.md (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2025-12-29T03:10:00Z

lib/classifier/tfidf.rb

+      @fitted = false
+    end
+
+    # Learns vocabulary and IDF weights from the corpus.


style: Comment restates the function name.

Suggested change

-    # Learns vocabulary and IDF weights from the corpus.
+    # @rbs (Array[String]) -> self

Context Used: Context from dashboard - CLAUDE.md (source)

greptile-apps · 2025-12-29T03:10:01Z

lib/classifier/tfidf.rb

+        @vocabulary[term] = vocab_index
+        vocab_index += 1
+
+        # IDF: log((N + 1) / (df + 1)) + 1 with smoothing


style: Comment explains the formula but "with smoothing" is already obvious from the + 1 terms. Make more concise.

Suggested change

-        # IDF: log((N + 1) / (df + 1)) + 1 with smoothing
+        # IDF: log((N + 1) / (df + 1)) + 1

Context Used: Context from dashboard - CLAUDE.md (source)

greptile-apps · 2025-12-29T03:10:02Z

lib/classifier/tfidf.rb

+      self
+    end
+
+    # Transforms a document into a normalized TF-IDF vector.


style: Comment just repeats what the code does.

Suggested change

-    # Transforms a document into a normalized TF-IDF vector.
+    # @rbs (String) -> Hash[Symbol, Float]

Context Used: Context from dashboard - CLAUDE.md (source)

greptile-apps · 2025-12-29T03:10:02Z

lib/classifier/tfidf.rb

+      normalize_vector(result)
+    end
+
+    # Fits and transforms in one step.


style: Comment restates function signature.

Suggested change

-    # Fits and transforms in one step.
+    # @rbs (Array[String]) -> Array[Hash[Symbol, Float]]

Context Used: Context from dashboard - CLAUDE.md (source)

greptile-apps · 2025-12-29T03:10:03Z

lib/classifier/tfidf.rb

+      documents.map { |doc| transform(doc) }
+    end
+
+    # Returns vocabulary terms in index order.


style: Comment restates what's obvious from the method name.

Suggested change

-    # Returns vocabulary terms in index order.
+    # @rbs () -> Array[Symbol]

Context Used: Context from dashboard - CLAUDE.md (source)

greptile-apps · 2025-12-29T03:10:04Z

lib/classifier/tfidf.rb

+      as_json.to_json
+    end
+
+    # Loads a vectorizer from JSON.


style: Comment restates function signature.

Suggested change

-    # Loads a vectorizer from JSON.
+    # @rbs (String | Hash[String, untyped]) -> TFIDF

Context Used: Context from dashboard - CLAUDE.md (source)

greptile-apps · 2025-12-29T03:10:04Z

lib/classifier/tfidf.rb

+    def extract_terms(document)
+      result = Hash.new(0)
+
+      if @ngram_range[0] <= 1
+        word_hash = document.clean_word_hash
+        word_hash.each { |term, count| result[term] += count }
+      end
+
+      if @ngram_range[1] > 1
+        tokens = tokenize_for_ngrams(document)
+        (2..@ngram_range[1]).each do |n|
+          next if n < @ngram_range[0]
+
+          generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
+        end
+      end
+
+      result


style: Nested conditionals - refactor with early returns per style guide.

Suggested change

-    def extract_terms(document)
-      result = Hash.new(0)
-
-      if @ngram_range[0] <= 1
-        word_hash = document.clean_word_hash
-        word_hash.each { |term, count| result[term] += count }
-      end
-
-      if @ngram_range[1] > 1
-        tokens = tokenize_for_ngrams(document)
-        (2..@ngram_range[1]).each do |n|
-          next if n < @ngram_range[0]
-
-          generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
-        end
-      end
-
-      result
+    # @rbs (String) -> Hash[Symbol, Integer]
+    def extract_terms(document)
+      result = Hash.new(0)
+      return result if @ngram_range[1] == 1 && @ngram_range[0] > 1
+
+      if @ngram_range[0] <= 1
+        word_hash = document.clean_word_hash
+        word_hash.each { |term, count| result[term] += count }
+      end
+
+      return result if @ngram_range[1] <= 1
+
+      tokens = tokenize_for_ngrams(document)
+      (2..@ngram_range[1]).each do |n|
+        next if n < @ngram_range[0]
+
+        generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
+      end
+
+      result
+    end

Context Used: Context from dashboard - CLAUDE.md (source)

greptile-apps · 2025-12-29T03:10:05Z

lib/classifier/tfidf.rb

+    # @rbs (Integer | Float, String) -> void
+    def validate_df!(value, name)
+      if value.is_a?(Float)
+        raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0)
+      elsif value.is_a?(Integer)
+        raise ArgumentError, "#{name} must be non-negative" if value.negative?
+      else
+        raise ArgumentError, "#{name} must be an Integer or Float"
+      end
+    end


style: Nested conditionals - use early returns instead.

Suggested change

-    # @rbs (Integer | Float, String) -> void
-    def validate_df!(value, name)
-      if value.is_a?(Float)
-        raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0)
-      elsif value.is_a?(Integer)
-        raise ArgumentError, "#{name} must be non-negative" if value.negative?
-      else
-        raise ArgumentError, "#{name} must be an Integer or Float"
-      end
-    end
+    # @rbs (Integer | Float, String) -> void
+    def validate_df!(value, name)
+      raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
+
+      if value.is_a?(Float)
+        raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0)
+
+        return
+      end
+
+      raise ArgumentError, "#{name} must be non-negative" if value.negative?
+    end

Context Used: Context from dashboard - CLAUDE.md (source)

greptile-apps · 2025-12-29T03:10:05Z

lib/classifier/tfidf.rb

+    def validate_ngram_range!(range)
+      valid_structure = range.is_a?(Array) && range.size == 2
+      raise ArgumentError, 'ngram_range must be an array of two integers' unless valid_structure
+
+      valid_values = range.all? { |v| v.is_a?(Integer) && v.positive? }
+      raise ArgumentError, 'ngram_range values must be positive integers' unless valid_values
+
+      raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
+    end


style: Extract intermediate variables that are only used once - simplify by inlining.

Suggested change

-    def validate_ngram_range!(range)
-      valid_structure = range.is_a?(Array) && range.size == 2
-      raise ArgumentError, 'ngram_range must be an array of two integers' unless valid_structure
-
-      valid_values = range.all? { |v| v.is_a?(Integer) && v.positive? }
-      raise ArgumentError, 'ngram_range values must be positive integers' unless valid_values
-
-      raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
-    end
+    # @rbs (Array[Integer]) -> void
+    def validate_ngram_range!(range)
+      raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
+      raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? { |v| v.is_a?(Integer) && v.positive? }
+      raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
+    end

Context Used: Context from dashboard - CLAUDE.md (source)

Replace nested if/elsif/else conditionals in validate_df! and validate_ngram_range! with guard clauses for better readability. Inline single-use intermediate variables. Remove redundant "with smoothing" from IDF formula comment. Addresses style feedback from PR #107 review.

cardmagic · 2025-12-29T03:22:42Z

@greptile-apps Thanks for the review! I've addressed the structural issues:

✅ Removed "with smoothing" from the IDF comment (line 76)
✅ Refactored validate_df! to use early returns instead of nested conditionals
✅ Inlined intermediate variables in validate_ngram_range!

However, I've intentionally kept the method documentation comments (lines 52, 84, 102, 109, 141). These serve as RDoc documentation for the public API and are valuable for users of the gem who rely on generated documentation. The style guide's guidance against redundant comments applies to inline implementation comments, not API documentation.

Could you please re-review when you get a chance?

greptile-apps

5 files reviewed, 12 comments

greptile-apps · 2025-12-29T03:26:31Z

lib/classifier/tfidf.rb

+    def extract_terms(document)
+      result = Hash.new(0)
+
+      if @ngram_range[0] <= 1
+        word_hash = document.clean_word_hash
+        word_hash.each { |term, count| result[term] += count }
+      end
+
+      if @ngram_range[1] > 1
+        tokens = tokenize_for_ngrams(document)
+        (2..@ngram_range[1]).each do |n|
+          next if n < @ngram_range[0]
+
+          generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
+        end
+      end
+
+      result


style: Nested conditionals - flatten with early returns:

Suggested change

-    def extract_terms(document)
-      result = Hash.new(0)
-
-      if @ngram_range[0] <= 1
-        word_hash = document.clean_word_hash
-        word_hash.each { |term, count| result[term] += count }
-      end
-
-      if @ngram_range[1] > 1
-        tokens = tokenize_for_ngrams(document)
-        (2..@ngram_range[1]).each do |n|
-          next if n < @ngram_range[0]
-
-          generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
-        end
-      end
-
-      result
+    def extract_terms(document)
+      result = Hash.new(0)
+
+      if @ngram_range[0] <= 1
+        word_hash = document.clean_word_hash
+        word_hash.each { |term, count| result[term] += count }
+      end
+
+      return result if @ngram_range[1] <= 1
+
+      tokens = tokenize_for_ngrams(document)
+      (2..@ngram_range[1]).each do |n|
+        next if n < @ngram_range[0]
+
+        generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
+      end
+
+      result
+    end

Context Used: Context from dashboard - CLAUDE.md (source)

greptile-apps · 2025-12-29T03:26:32Z

lib/classifier/tfidf.rb

+    # @rbs (Integer | Float, String) -> void
+    def validate_df!(value, name)
+      raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
+      raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0)
+      raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative?
+    end


style: Nested conditionals - use early returns:

Suggested change

-    # @rbs (Integer | Float, String) -> void
-    def validate_df!(value, name)
-      raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
-      raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0)
-      raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative?
-    end
+    def validate_df!(value, name)
+      raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
+      return if value.is_a?(Integer) && !value.negative?
+      return if value.is_a?(Float) && value.between?(0.0, 1.0)
+
+      raise ArgumentError, value.is_a?(Float) ? "#{name} must be between 0.0 and 1.0" : "#{name} must be non-negative"
+    end

Context Used: Context from dashboard - CLAUDE.md (source)
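
To check that the two `validate_df!` variants quoted in this comment behave the same, the sketch below copies both under hypothetical names (`validate_df_guards!` and `validate_df_returns!`) and compares their outcomes on a handful of sample inputs. The `outcome` helper and the sample list are assumptions added for illustration.

```ruby
# Variant currently in the PR: three independent guard clauses.
def validate_df_guards!(value, name)
  raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
  raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0)
  raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative?
end

# Variant suggested in the review: early returns, then a single raise.
def validate_df_returns!(value, name)
  raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
  return if value.is_a?(Integer) && !value.negative?
  return if value.is_a?(Float) && value.between?(0.0, 1.0)

  raise ArgumentError, value.is_a?(Float) ? "#{name} must be between 0.0 and 1.0" : "#{name} must be non-negative"
end

# Hypothetical helper: run a validator and report :ok or the error message.
def outcome(validator, value)
  send(validator, value, 'min_df')
  :ok
rescue ArgumentError => e
  e.message
end

samples = [0, 1, -1, 0.0, 0.5, 1.0, 1.5, -0.1, 'x', nil]
results = samples.map do |value|
  [outcome(:validate_df_guards!, value), outcome(:validate_df_returns!, value)]
end
```

On these samples both variants accept and reject the same values with the same messages, so the choice between them is purely stylistic.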

greptile-apps · 2025-12-29T03:26:32Z

lib/classifier/tfidf.rb

+    def validate_ngram_range!(range)
+      raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
+      raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? do |v|
+        v.is_a?(Integer) && v.positive?
+      end
+      raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
+    end


style: Nested conditionals - flatten validation:

Suggested change

-    def validate_ngram_range!(range)
-      raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
-      raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? do |v|
-        v.is_a?(Integer) && v.positive?
-      end
-      raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
-    end
+    def validate_ngram_range!(range)
+      raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
+      raise ArgumentError, 'ngram_range values must be positive integers' unless range.all?(Integer) && range.all?(&:positive?)
+      raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
+    end

Context Used: Context from dashboard - CLAUDE.md (source)
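
The suggested `range.all?(Integer)` relies on `Array#all?` accepting a pattern that is matched with `===` (available since Ruby 2.5). The sketch below, using hypothetical method names, compares the block form with the pattern form on a few sample ranges. Because `&&` short-circuits, `positive?` is never sent to a non-Integer element.

```ruby
# Block form, as in the current PR code.
def valid_values_block?(range)
  range.all? { |v| v.is_a?(Integer) && v.positive? }
end

# Pattern form, as in the suggestion: Integer === v for every element,
# and only then a positivity check.
def valid_values_pattern?(range)
  range.all?(Integer) && range.all?(&:positive?)
end

samples = [[1, 1], [1, 3], [0, 2], [-1, 1], [1, 2.0], ['1', 2], [nil, 1]]
agree = samples.all? { |r| valid_values_block?(r) == valid_values_pattern?(r) }
```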

greptile-apps · 2025-12-29T03:26:33Z

lib/classifier/tfidf.rb

+      min_count = @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df
+      max_count = @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df
+
+      doc_freq.between?(min_count, max_count)


style: Extract intermediate variables that are used only once - inline directly:

Suggested change

-      min_count = @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df
-      max_count = @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df
-
-      doc_freq.between?(min_count, max_count)
+    def within_df_bounds?(doc_freq, num_docs)
+      doc_freq.between?(
+        @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df,
+        @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df
+      )
+    end

Context Used: Context from dashboard - CLAUDE.md (source)
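
A standalone sketch of the document-frequency bounds logic quoted above, adapted to take `min_df` and `max_df` as keyword arguments instead of instance variables (that adaptation and the sample numbers are assumptions for illustration). A Float threshold is treated as a proportion of the corpus, rounded up for the minimum and down for the maximum; an Integer threshold is an absolute document count.

```ruby
# Adapted from the snippet under review; keyword arguments replace @min_df/@max_df.
def within_df_bounds?(doc_freq, num_docs, min_df: 1, max_df: 1.0)
  min_count = min_df.is_a?(Float) ? (min_df * num_docs).ceil : min_df
  max_count = max_df.is_a?(Float) ? (max_df * num_docs).floor : max_df

  doc_freq.between?(min_count, max_count)
end

# With 10 documents, min_df: 0.25 becomes 3 documents and max_df: 0.85 becomes 8.
kept = (1..10).select { |df| within_df_bounds?(df, 10, min_df: 0.25, max_df: 0.85) }
```

With the defaults (`min_df: 1`, `max_df: 1.0`) every term that occurs in at least one document passes the filter.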

greptile-apps · 2025-12-29T03:26:33Z

lib/classifier/tfidf.rb

+    # Creates a new TF-IDF vectorizer.
+    # - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
+    # - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
+    # - sublinear_tf: use 1 + log(tf) instead of raw term frequency
+    #


style: Comments restate what's obvious from parameter names and code. Per style guide: remove.

Suggested change

-    # Creates a new TF-IDF vectorizer.
-    # - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
-    # - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
-    # - sublinear_tf: use 1 + log(tf) instead of raw term frequency
-    #
+    # @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
+    #       ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
+    def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)

Context Used: Context from dashboard - CLAUDE.md (source)


greptile-apps · 2025-12-29T03:26:35Z

lib/classifier/tfidf.rb

+    # Transforms a document into a normalized TF-IDF vector.
+    # @rbs (String) -> Hash[Symbol, Float]


style: Comment restates the method name.

Suggested change

-    # Transforms a document into a normalized TF-IDF vector.
-    # @rbs (String) -> Hash[Symbol, Float]
+    # @rbs (String) -> Hash[Symbol, Float]
+    def transform(document)

Context Used: Context from dashboard - CLAUDE.md (source)


greptile-apps · 2025-12-29T03:26:35Z

lib/classifier/tfidf.rb

+    # Fits and transforms in one step.
+    # @rbs (Array[String]) -> Array[Hash[Symbol, Float]]


style: Comment restates what the method name clearly indicates.

Suggested change

-    # Fits and transforms in one step.
-    # @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
+    # @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
+    def fit_transform(documents)

Context Used: Context from dashboard - CLAUDE.md (source)


greptile-apps · 2025-12-29T03:26:36Z

lib/classifier/tfidf.rb

+    # Returns vocabulary terms in index order.
+    # @rbs () -> Array[Symbol]


style: Comment restates the method signature.

Suggested change

-    # Returns vocabulary terms in index order.
-    # @rbs () -> Array[Symbol]
+    # @rbs () -> Array[Symbol]
+    def feature_names

Context Used: Context from dashboard - CLAUDE.md (source)


greptile-apps · 2025-12-29T03:26:36Z

lib/classifier/tfidf.rb

+    # Loads a vectorizer from JSON.
+    # @rbs (String | Hash[String, untyped]) -> TFIDF


style: Comment restates method signature.

Suggested change

-    # Loads a vectorizer from JSON.
-    # @rbs (String | Hash[String, untyped]) -> TFIDF
+    # @rbs (String | Hash[String, untyped]) -> TFIDF
+    def self.from_json(json)

Context Used: Context from dashboard - CLAUDE.md (source)


greptile-apps · 2025-12-29T03:26:37Z

lib/classifier/tfidf.rb

+      @vocabulary = {}
+      @idf = {}
+      vocab_index = 0
+
+      document_frequencies.each do |term, df|
+        next unless within_df_bounds?(df, @num_documents)
+
+        @vocabulary[term] = vocab_index
+        vocab_index += 1
+
+        # IDF: log((N + 1) / (df + 1)) + 1
+        @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
+      end


style: Intermediate variable vocab_index only increments. Consider using each.with_index:

Suggested change

-      @vocabulary = {}
-      @idf = {}
-      vocab_index = 0
-
-      document_frequencies.each do |term, df|
-        next unless within_df_bounds?(df, @num_documents)
-
-        @vocabulary[term] = vocab_index
-        vocab_index += 1
-
-        # IDF: log((N + 1) / (df + 1)) + 1
-        @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
-      end
+      selected_terms = document_frequencies.select { |term, df| within_df_bounds?(df, @num_documents) }
+      selected_terms.each.with_index do |(term, df), vocab_index|
+        @vocabulary[term] = vocab_index
+        # IDF with smoothing
+        @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
+      end

Context Used: Context from dashboard - CLAUDE.md (source). Is the vocabulary index ordering requirement flexible enough for this refactor?
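
The smoothed IDF expression in this hunk is easy to check by hand. A small sketch follows; the method name `smoothed_idf` is hypothetical, and the formula is copied from the diff above.

```ruby
# Smoothed IDF as written in the diff: log((N + 1) / (df + 1)) + 1.
# `smoothed_idf` is an illustrative helper, not part of the gem's API.
def smoothed_idf(num_documents, doc_freq)
  Math.log((num_documents + 1).to_f / (doc_freq + 1)) + 1
end

# A term found in every document of a 2-document corpus: log(3 / 3) + 1 = 1.0,
# so the trailing + 1 keeps ubiquitous terms from being zeroed out.
idf_everywhere = smoothed_idf(2, 2)

# A term found in one of two documents: log(3 / 2) + 1, roughly 1.405.
idf_half = smoothed_idf(2, 1)

# Rarer terms get larger weights.
idf_rare = smoothed_idf(100, 1)
```

The `+ 1` inside the logarithm keeps the ratio finite for any non-negative document frequency, which is why the second suggestion labels it "IDF with smoothing".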



cardmagic · 2025-12-29T03:33:08Z

@greptile-apps Thanks for the follow-up review! Here's what was addressed:

Applied:

✅ extract_terms: Early return when ngram_range[1] <= 1
✅ validate_ngram_range!: Simplified with range.all?(Integer) && range.all?(&:positive?)
✅ within_df_bounds?: Inlined single-use variables

Not applied (intentionally kept):

validate_df! (lines 230-235): The current code already uses guard clauses - each line raises immediately on invalid input. The suggestion adds early returns for valid cases then a conditional raise, which is more complex, not simpler.
initialize comments (lines 30-34): These explain parameter semantics, not just names:
- min_df/max_df: Integer vs Float behavior differs (count vs proportion)
- ngram_range: Example values like [1,2] help users
- sublinear_tf: Explains the 1 + log(tf) formula
This is useful API documentation for gem users, not redundant commentary.
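
As an illustration of the `sublinear_tf` point, here is a small sketch of the two weightings. The helper `term_weight` is hypothetical and assumes a positive term count; only the `1 + log(tf)` formula comes from the PR.

```ruby
# Hypothetical helper contrasting raw and sublinear term-frequency weights.
# Assumes tf is a positive count, since log(0) is not finite.
def term_weight(term_count, sublinear_tf: false)
  sublinear_tf ? 1 + Math.log(term_count) : term_count.to_f
end

raw_once      = term_weight(1)                        # 1.0
raw_hundred   = term_weight(100)                      # 100.0
damped_once   = term_weight(1, sublinear_tf: true)    # 1 + log(1) = 1.0
damped_hundred = term_weight(100, sublinear_tf: true) # 1 + log(100), roughly 5.6
```

With sublinear scaling a term repeated 100 times weighs about 5.6 times as much as a term seen once, rather than 100 times as much.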

Could you please re-review?

greptile-apps

5 files reviewed, 1 comment

greptile-apps · 2025-12-29T03:36:24Z

lib/classifier/tfidf.rb

+      vocab_index = 0
+
+      document_frequencies.each do |term, df|
+        next unless within_df_bounds?(df, @num_documents)
+
+        @vocabulary[term] = vocab_index
+        vocab_index += 1


style: Manual counter increment. Use each.with_index:

Suggested change

-      vocab_index = 0
-
-      document_frequencies.each do |term, df|
-        next unless within_df_bounds?(df, @num_documents)
-
-        @vocabulary[term] = vocab_index
-        vocab_index += 1
+      document_frequencies.each.with_index do |term_df, vocab_index|
+        term, df = term_df
+        next unless within_df_bounds?(df, @num_documents)
+
+        @vocabulary[term] = vocab_index


cardmagic · 2025-12-29T03:38:16Z

@greptile-apps One more note on the with_index suggestion for the vocabulary indexing (lines 68-74):

This change would introduce a bug. The current manual counter only increments when a term passes the within_df_bounds? check. Using with_index would assign indices for all terms including skipped ones:

Current (correct):

Term A passes → index 0
Term B skipped → no index  
Term C passes → index 1

With with_index (broken):

Term A passes → index 0
Term B skipped → no index
Term C passes → index 2 ❌ (gap in indices)

The vocabulary indices need to be contiguous for the vectorizer to work correctly.
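
The gap can be reproduced in a few lines. In this sketch the sample frequencies and the `keep` lambda are made up for illustration (the lambda stands in for `within_df_bounds?`); the three loops mirror the manual counter, the `with_index` suggestion, and the filter-first variant from the earlier suggestion.

```ruby
# Sample data: term b is too frequent and should be filtered out.
document_frequencies = { a: 2, b: 9, c: 3 }
keep = ->(df) { df.between?(1, 5) } # hypothetical stand-in for within_df_bounds?

# 1. Manual counter: only advances when a term is kept.
manual = {}
next_index = 0
document_frequencies.each do |term, df|
  next unless keep.call(df)

  manual[term] = next_index
  next_index += 1
end

# 2. with_index over the unfiltered hash: the index advances for skipped terms too.
gapped = {}
document_frequencies.each.with_index do |(term, df), index|
  next unless keep.call(df)

  gapped[term] = index
end

# 3. Filter first, then with_index: indices are contiguous again.
filtered = {}
document_frequencies.select { |_term, df| keep.call(df) }.each.with_index do |(term, _df), index|
  filtered[term] = index
end
```

Here `manual` and `filtered` both map `a` to 0 and `c` to 1, while `gapped` maps `c` to 2. So `with_index` is only safe when it runs over the already-filtered collection.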

Could you please re-review when you get a chance?

greptile-apps

5 files reviewed, 3 comments

greptile-apps · 2025-12-29T03:41:02Z

lib/classifier/tfidf.rb

+    # Learns vocabulary and IDF weights from the corpus.
+    # @rbs (Array[String]) -> self
+    def fit(documents)
+      raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)


style: Defensive type check unnecessary. Ruby is duck-typed - if caller passes non-iterable, it'll fail naturally on .each. Remove unless documents actually need array-specific methods beyond iteration.

Suggested change

-      raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
+      raise ArgumentError, 'documents cannot be empty' if documents.empty?
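
What the guard changes in practice is mostly the error a caller sees. The sketch below uses two hypothetical, heavily reduced versions of `fit` (they only iterate; no vocabulary is learned) to compare the failure modes.

```ruby
# Hypothetical reductions of fit, for comparing error behavior only.
def fit_with_guard(documents)
  raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
  raise ArgumentError, 'documents cannot be empty' if documents.empty?

  documents.each(&:to_s)
  :fitted
end

def fit_duck_typed(documents)
  raise ArgumentError, 'documents cannot be empty' if documents.empty?

  documents.each(&:to_s)
  :fitted
end

# Returns the class of the error raised by the block, or nil if none.
def error_class
  yield
  nil
rescue StandardError => e
  e.class
end

guarded_string = error_class { fit_with_guard('one document') } # ArgumentError
duck_string    = error_class { fit_duck_typed('one document') } # NoMethodError: String has no #each
duck_nil       = error_class { fit_duck_typed(nil) }            # NoMethodError on nil.empty?
duck_array     = error_class { fit_duck_typed(['one document']) } # nil, no error
```

Both versions reject a bare String; the guarded one does so with an `ArgumentError` that names the problem, while the duck-typed one surfaces a `NoMethodError` from inside the method.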


greptile-apps · 2025-12-29T03:41:02Z

lib/classifier/tfidf.rb

+      unless range.all?(Integer) && range.all?(&:positive?)
+        raise ArgumentError, 'ngram_range values must be positive integers'
+      end


style: Negated compound condition with unless. Use an early return for clearer logic:

Suggested change

-      unless range.all?(Integer) && range.all?(&:positive?)
-        raise ArgumentError, 'ngram_range values must be positive integers'
-      end
+      return if range.all?(Integer) && range.all?(&:positive?)
+
+      raise ArgumentError, 'ngram_range values must be positive integers'
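
The two forms are behaviorally equivalent, which a quick sketch can confirm. The method names below are hypothetical; the bodies are the two variants from this suggestion.

```ruby
# Variant from the PR: unless wrapping the raise.
def validate_with_unless!(range)
  unless range.all?(Integer) && range.all?(&:positive?)
    raise ArgumentError, 'ngram_range values must be positive integers'
  end
end

# Variant from the suggestion: early return, then raise.
def validate_with_early_return!(range)
  return if range.all?(Integer) && range.all?(&:positive?)

  raise ArgumentError, 'ngram_range values must be positive integers'
end

# True when the block raises ArgumentError.
def rejects?
  yield
  false
rescue ArgumentError
  true
end

# all?(Integer) is checked first, so positive? is never sent to a non-Integer.
samples = [[1, 2], [0, 2], [1, 2.0], ['1', 2]]
unless_results = samples.map { |r| rejects? { validate_with_unless!(r) } }
return_results = samples.map { |r| rejects? { validate_with_early_return!(r) } }
```

Both variants accept `[1, 2]` and reject the other three samples, so the choice between them is purely one of style.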


greptile-apps · 2025-12-29T03:41:03Z

lib/classifier/tfidf.rb

+      @vocabulary = {}
+      @idf = {}
+      vocab_index = 0
+
+      document_frequencies.each do |term, df|
+        next unless within_df_bounds?(df, @num_documents)
+
+        @vocabulary[term] = vocab_index
+        vocab_index += 1


style: Resetting @vocabulary and @idf to empty hashes just to reassign them. These were already initialized empty in initialize. Remove redundant assignments:

Suggested change

-      @vocabulary = {}
-      @idf = {}
-      vocab_index = 0
-
-      document_frequencies.each do |term, df|
-        next unless within_df_bounds?(df, @num_documents)
-
-        @vocabulary[term] = vocab_index
-        vocab_index += 1
+      vocab_index = 0
+
+      document_frequencies.each do |term, df|
+        next unless within_df_bounds?(df, @num_documents)
+
+        @vocabulary[term] = vocab_index
+        vocab_index += 1


greptile-apps bot reviewed Dec 29, 2025


cardmagic added 4 commits December 28, 2025 19:27

fix(tfidf): resolve lint and typecheck errors

748a209

- Add type annotation to empty hash in transform method - Use JSON.generate instead of Hash#to_json for type safety - Cast multiplication result to Float for type checker - Auto-fix Minitest assertion style in tests

refactor(tfidf): use early return in extract_terms

71d086c

refactor(tfidf): simplify ngram validation with all?(Integer)

7c05edf

refactor(tfidf): inline single-use variables in within_df_bounds?

38a65f8

greptile-apps bot reviewed Dec 29, 2025


fix(tfidf): fix line length in validate_ngram_range!

47c8f50

greptile-apps bot reviewed Dec 29, 2025


cardmagic merged commit 04324e6 into master Dec 29, 2025
6 checks passed


Greptile Summary

Confidence Score: 4/5
