-
Notifications
You must be signed in to change notification settings - Fork 123
feat(tfidf): add TF-IDF vectorizer #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Adds standalone TF-IDF vectorizer for text feature extraction. Foundation for classic text classification approaches like logistic regression and improved LSI quality. Features: - fit/transform/fit_transform API (scikit-learn style) - Vocabulary filtering via min_df/max_df thresholds - N-gram support (unigrams, bigrams, trigrams) - Sublinear TF scaling (1 + log(tf)) - L2 normalized output vectors - JSON and Marshal serialization Leverages existing word_hash infrastructure for term frequency extraction with stemming and stopword removal. Closes #104
Greptile SummaryAdds a well-designed TF-IDF vectorizer with scikit-learn style API ( Key additions:
Code quality observations: Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant TFIDF
participant WordHash as String#clean_word_hash
participant Math
Note over User,TFIDF: Fit Phase - Learn Vocabulary & IDF
User->>TFIDF: fit(documents)
TFIDF->>TFIDF: validate documents array
loop For each document
TFIDF->>TFIDF: extract_terms(doc)
TFIDF->>WordHash: clean_word_hash (unigrams)
WordHash-->>TFIDF: term frequencies
alt ngram_range[1] > 1
TFIDF->>TFIDF: tokenize_for_ngrams(doc)
TFIDF->>TFIDF: generate_ngrams(tokens, n)
end
TFIDF->>TFIDF: count document frequencies
end
loop For each term
TFIDF->>TFIDF: within_df_bounds?(df, num_docs)
alt term passes min_df/max_df filters
TFIDF->>Math: log((N+1)/(df+1)) + 1
Math-->>TFIDF: IDF value
TFIDF->>TFIDF: add to vocabulary & idf
end
end
TFIDF->>TFIDF: set fitted = true
TFIDF-->>User: self
Note over User,TFIDF: Transform Phase - Convert Text to TF-IDF Vector
User->>TFIDF: transform(document)
TFIDF->>TFIDF: check fitted?
TFIDF->>TFIDF: extract_terms(document)
TFIDF->>WordHash: clean_word_hash
WordHash-->>TFIDF: term frequencies
loop For each term in vocabulary
alt sublinear_tf enabled
TFIDF->>Math: 1 + log(tf)
Math-->>TFIDF: sublinear TF
end
TFIDF->>TFIDF: tf_value * idf[term]
end
TFIDF->>TFIDF: normalize_vector (L2 norm)
TFIDF->>Math: sqrt(sum of squares)
Math-->>TFIDF: magnitude
TFIDF->>TFIDF: divide by magnitude
TFIDF-->>User: normalized TF-IDF vector
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, 10 comments
| # Creates a new TF-IDF vectorizer. | ||
| # - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion) | ||
| # - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams | ||
| # - sublinear_tf: use 1 + log(tf) instead of raw term frequency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comments repeat the parameter names and what's obvious from code. Per style guide: remove comments that restate what code shows.
| # Creates a new TF-IDF vectorizer. | |
| # - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion) | |
| # - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams | |
| # - sublinear_tf: use 1 + log(tf) instead of raw term frequency | |
| # @rbs (?min_df: Integer | Float, ?max_df: Integer | Float, | |
| # ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void | |
| def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false) |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 30:33
Comment:
**style:** Comments repeat the parameter names and what's obvious from code. Per style guide: remove comments that restate what code shows.
```suggestion
# @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
# ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| @fitted = false | ||
| end | ||
|
|
||
| # Learns vocabulary and IDF weights from the corpus. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment restates the function name.
| # Learns vocabulary and IDF weights from the corpus. | |
| # @rbs (Array[String]) -> self |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 52:52
Comment:
**style:** Comment restates the function name.
```suggestion
# @rbs (Array[String]) -> self
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/tfidf.rb
Outdated
| @vocabulary[term] = vocab_index | ||
| vocab_index += 1 | ||
|
|
||
| # IDF: log((N + 1) / (df + 1)) + 1 with smoothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment explains the formula but "with smoothing" is already obvious from the + 1 terms. Make more concise.
| # IDF: log((N + 1) / (df + 1)) + 1 with smoothing | |
| # IDF: log((N + 1) / (df + 1)) + 1 |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 76:76
Comment:
**style:** Comment explains the formula but "with smoothing" is already obvious from the `+ 1` terms. Make more concise.
```suggestion
# IDF: log((N + 1) / (df + 1)) + 1
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| self | ||
| end | ||
|
|
||
| # Transforms a document into a normalized TF-IDF vector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment just repeats what the code does.
| # Transforms a document into a normalized TF-IDF vector. | |
| # @rbs (String) -> Hash[Symbol, Float] |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 84:84
Comment:
**style:** Comment just repeats what the code does.
```suggestion
# @rbs (String) -> Hash[Symbol, Float]
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| normalize_vector(result) | ||
| end | ||
|
|
||
| # Fits and transforms in one step. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment restates function signature.
| # Fits and transforms in one step. | |
| # @rbs (Array[String]) -> Array[Hash[Symbol, Float]] |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 102:102
Comment:
**style:** Comment restates function signature.
```suggestion
# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| documents.map { |doc| transform(doc) } | ||
| end | ||
|
|
||
| # Returns vocabulary terms in index order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment restates what's obvious from the method name.
| # Returns vocabulary terms in index order. | |
| # @rbs () -> Array[Symbol] |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 109:109
Comment:
**style:** Comment restates what's obvious from the method name.
```suggestion
# @rbs () -> Array[Symbol]
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| as_json.to_json | ||
| end | ||
|
|
||
| # Loads a vectorizer from JSON. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment restates function signature.
| # Loads a vectorizer from JSON. | |
| # @rbs (String | Hash[String, untyped]) -> TFIDF |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 141:141
Comment:
**style:** Comment restates function signature.
```suggestion
# @rbs (String | Hash[String, untyped]) -> TFIDF
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| def extract_terms(document) | ||
| result = Hash.new(0) | ||
|
|
||
| if @ngram_range[0] <= 1 | ||
| word_hash = document.clean_word_hash | ||
| word_hash.each { |term, count| result[term] += count } | ||
| end | ||
|
|
||
| if @ngram_range[1] > 1 | ||
| tokens = tokenize_for_ngrams(document) | ||
| (2..@ngram_range[1]).each do |n| | ||
| next if n < @ngram_range[0] | ||
|
|
||
| generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 } | ||
| end | ||
| end | ||
|
|
||
| result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Nested conditionals - refactor with early returns per style guide.
| def extract_terms(document) | |
| result = Hash.new(0) | |
| if @ngram_range[0] <= 1 | |
| word_hash = document.clean_word_hash | |
| word_hash.each { |term, count| result[term] += count } | |
| end | |
| if @ngram_range[1] > 1 | |
| tokens = tokenize_for_ngrams(document) | |
| (2..@ngram_range[1]).each do |n| | |
| next if n < @ngram_range[0] | |
| generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 } | |
| end | |
| end | |
| result | |
| # @rbs (String) -> Hash[Symbol, Integer] | |
| def extract_terms(document) | |
| result = Hash.new(0) | |
| return result if @ngram_range[1] == 1 && @ngram_range[0] > 1 | |
| if @ngram_range[0] <= 1 | |
| word_hash = document.clean_word_hash | |
| word_hash.each { |term, count| result[term] += count } | |
| end | |
| return result if @ngram_range[1] <= 1 | |
| tokens = tokenize_for_ngrams(document) | |
| (2..@ngram_range[1]).each do |n| | |
| next if n < @ngram_range[0] | |
| generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 } | |
| end | |
| result | |
| end |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 175:192
Comment:
**style:** Nested conditionals - refactor with early returns per style guide.
```suggestion
# @rbs (String) -> Hash[Symbol, Integer]
def extract_terms(document)
result = Hash.new(0)
return result if @ngram_range[1] == 1 && @ngram_range[0] > 1
if @ngram_range[0] <= 1
word_hash = document.clean_word_hash
word_hash.each { |term, count| result[term] += count }
end
return result if @ngram_range[1] <= 1
tokens = tokenize_for_ngrams(document)
(2..@ngram_range[1]).each do |n|
next if n < @ngram_range[0]
generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
end
result
end
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # @rbs (Integer | Float, String) -> void | ||
| def validate_df!(value, name) | ||
| if value.is_a?(Float) | ||
| raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0) | ||
| elsif value.is_a?(Integer) | ||
| raise ArgumentError, "#{name} must be non-negative" if value.negative? | ||
| else | ||
| raise ArgumentError, "#{name} must be an Integer or Float" | ||
| end | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Nested conditionals - use early returns instead.
| # @rbs (Integer | Float, String) -> void | |
| def validate_df!(value, name) | |
| if value.is_a?(Float) | |
| raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0) | |
| elsif value.is_a?(Integer) | |
| raise ArgumentError, "#{name} must be non-negative" if value.negative? | |
| else | |
| raise ArgumentError, "#{name} must be an Integer or Float" | |
| end | |
| end | |
| # @rbs (Integer | Float, String) -> void | |
| def validate_df!(value, name) | |
| raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer) | |
| if value.is_a?(Float) | |
| raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0) | |
| return | |
| end | |
| raise ArgumentError, "#{name} must be non-negative" if value.negative? | |
| end |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 230:239
Comment:
**style:** Nested conditionals - use early returns instead.
```suggestion
# @rbs (Integer | Float, String) -> void
def validate_df!(value, name)
raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
if value.is_a?(Float)
raise ArgumentError, "#{name} must be between 0.0 and 1.0" unless value.between?(0.0, 1.0)
return
end
raise ArgumentError, "#{name} must be non-negative" if value.negative?
end
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| def validate_ngram_range!(range) | ||
| valid_structure = range.is_a?(Array) && range.size == 2 | ||
| raise ArgumentError, 'ngram_range must be an array of two integers' unless valid_structure | ||
|
|
||
| valid_values = range.all? { |v| v.is_a?(Integer) && v.positive? } | ||
| raise ArgumentError, 'ngram_range values must be positive integers' unless valid_values | ||
|
|
||
| raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1] | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Extract intermediate variables that are only used once - simplify by inlining.
| def validate_ngram_range!(range) | |
| valid_structure = range.is_a?(Array) && range.size == 2 | |
| raise ArgumentError, 'ngram_range must be an array of two integers' unless valid_structure | |
| valid_values = range.all? { |v| v.is_a?(Integer) && v.positive? } | |
| raise ArgumentError, 'ngram_range values must be positive integers' unless valid_values | |
| raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1] | |
| end | |
| # @rbs (Array[Integer]) -> void | |
| def validate_ngram_range!(range) | |
| raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2 | |
| raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? { |v| v.is_a?(Integer) && v.positive? } | |
| raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1] | |
| end |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 242:250
Comment:
**style:** Extract intermediate variables that are only used once - simplify by inlining.
```suggestion
# @rbs (Array[Integer]) -> void
def validate_ngram_range!(range)
raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? { |v| v.is_a?(Integer) && v.positive? }
raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.Replace nested if/elsif/else conditionals in validate_df! and validate_ngram_range! with guard clauses for better readability. Inline single-use intermediate variables. Remove redundant "with smoothing" from IDF formula comment. Addresses style feedback from PR #107 review.
|
@greptile-apps Thanks for the review! I've addressed the structural issues:
However, I've intentionally kept the method documentation comments (lines 52, 84, 102, 109, 141). These serve as RDoc documentation for the public API and are valuable for users of the gem who rely on generated documentation. The style guide's guidance against redundant comments applies to inline implementation comments, not API documentation. Could you please re-review when you get a chance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, 12 comments
| def extract_terms(document) | ||
| result = Hash.new(0) | ||
|
|
||
| if @ngram_range[0] <= 1 | ||
| word_hash = document.clean_word_hash | ||
| word_hash.each { |term, count| result[term] += count } | ||
| end | ||
|
|
||
| if @ngram_range[1] > 1 | ||
| tokens = tokenize_for_ngrams(document) | ||
| (2..@ngram_range[1]).each do |n| | ||
| next if n < @ngram_range[0] | ||
|
|
||
| generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 } | ||
| end | ||
| end | ||
|
|
||
| result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Nested conditionals - flatten with early returns:
| def extract_terms(document) | |
| result = Hash.new(0) | |
| if @ngram_range[0] <= 1 | |
| word_hash = document.clean_word_hash | |
| word_hash.each { |term, count| result[term] += count } | |
| end | |
| if @ngram_range[1] > 1 | |
| tokens = tokenize_for_ngrams(document) | |
| (2..@ngram_range[1]).each do |n| | |
| next if n < @ngram_range[0] | |
| generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 } | |
| end | |
| end | |
| result | |
| def extract_terms(document) | |
| result = Hash.new(0) | |
| if @ngram_range[0] <= 1 | |
| word_hash = document.clean_word_hash | |
| word_hash.each { |term, count| result[term] += count } | |
| end | |
| return result if @ngram_range[1] <= 1 | |
| tokens = tokenize_for_ngrams(document) | |
| (2..@ngram_range[1]).each do |n| | |
| next if n < @ngram_range[0] | |
| generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 } | |
| end | |
| result | |
| end |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 175:192
Comment:
**style:** Nested conditionals - flatten with early returns:
```suggestion
def extract_terms(document)
result = Hash.new(0)
if @ngram_range[0] <= 1
word_hash = document.clean_word_hash
word_hash.each { |term, count| result[term] += count }
end
return result if @ngram_range[1] <= 1
tokens = tokenize_for_ngrams(document)
(2..@ngram_range[1]).each do |n|
next if n < @ngram_range[0]
generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
end
result
end
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # @rbs (Integer | Float, String) -> void | ||
| def validate_df!(value, name) | ||
| raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer) | ||
| raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0) | ||
| raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative? | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Nested conditionals - use early returns:
| # @rbs (Integer | Float, String) -> void | |
| def validate_df!(value, name) | |
| raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer) | |
| raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0) | |
| raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative? | |
| end | |
| def validate_df!(value, name) | |
| raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer) | |
| return if value.is_a?(Integer) && !value.negative? | |
| return if value.is_a?(Float) && value.between?(0.0, 1.0) | |
| raise ArgumentError, value.is_a?(Float) ? "#{name} must be between 0.0 and 1.0" : "#{name} must be non-negative" | |
| end |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 230:235
Comment:
**style:** Nested conditionals - use early returns:
```suggestion
def validate_df!(value, name)
raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
return if value.is_a?(Integer) && !value.negative?
return if value.is_a?(Float) && value.between?(0.0, 1.0)
raise ArgumentError, value.is_a?(Float) ? "#{name} must be between 0.0 and 1.0" : "#{name} must be non-negative"
end
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| def validate_ngram_range!(range) | ||
| raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2 | ||
| raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? do |v| | ||
| v.is_a?(Integer) && v.positive? | ||
| end | ||
| raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1] | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Nested conditionals - flatten validation:
| def validate_ngram_range!(range) | |
| raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2 | |
| raise ArgumentError, 'ngram_range values must be positive integers' unless range.all? do |v| | |
| v.is_a?(Integer) && v.positive? | |
| end | |
| raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1] | |
| end | |
| def validate_ngram_range!(range) | |
| raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2 | |
| raise ArgumentError, 'ngram_range values must be positive integers' unless range.all?(Integer) && range.all?(&:positive?) | |
| raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1] | |
| end |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 238:244
Comment:
**style:** Nested conditionals - flatten validation:
```suggestion
def validate_ngram_range!(range)
raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
raise ArgumentError, 'ngram_range values must be positive integers' unless range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
end
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
lib/classifier/tfidf.rb
Outdated
| min_count = @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df | ||
| max_count = @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df | ||
|
|
||
| doc_freq.between?(min_count, max_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Extract intermediate variables that are used only once - inline directly:
| min_count = @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df | |
| max_count = @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df | |
| doc_freq.between?(min_count, max_count) | |
| def within_df_bounds?(doc_freq, num_docs) | |
| doc_freq.between?( | |
| @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df, | |
| @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df | |
| ) | |
| end |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 214:217
Comment:
**style:** Extract intermediate variables that are used only once - inline directly:
```suggestion
def within_df_bounds?(doc_freq, num_docs)
doc_freq.between?(
@min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df,
@max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df
)
end
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # Creates a new TF-IDF vectorizer. | ||
| # - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion) | ||
| # - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams | ||
| # - sublinear_tf: use 1 + log(tf) instead of raw term frequency | ||
| # |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comments restate what's obvious from parameter names and code. Per style guide: remove.
| # Creates a new TF-IDF vectorizer. | |
| # - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion) | |
| # - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams | |
| # - sublinear_tf: use 1 + log(tf) instead of raw term frequency | |
| # | |
| # @rbs (?min_df: Integer | Float, ?max_df: Integer | Float, | |
| # ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void | |
| def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false) |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 30:34
Comment:
**style:** Comments restate what's obvious from parameter names and code. Per style guide: remove.
```suggestion
# @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
# ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # Transforms a document into a normalized TF-IDF vector. | ||
| # @rbs (String) -> Hash[Symbol, Float] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment restates the method name.
| # Transforms a document into a normalized TF-IDF vector. | |
| # @rbs (String) -> Hash[Symbol, Float] | |
| # @rbs (String) -> Hash[Symbol, Float] | |
| def transform(document) |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 84:85
Comment:
**style:** Comment restates the method name.
```suggestion
# @rbs (String) -> Hash[Symbol, Float]
def transform(document)
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # Fits and transforms in one step. | ||
| # @rbs (Array[String]) -> Array[Hash[Symbol, Float]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment restates what the method name clearly indicates.
| # Fits and transforms in one step. | |
| # @rbs (Array[String]) -> Array[Hash[Symbol, Float]] | |
| # @rbs (Array[String]) -> Array[Hash[Symbol, Float]] | |
| def fit_transform(documents) |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 102:103
Comment:
**style:** Comment restates what the method name clearly indicates.
```suggestion
# @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
def fit_transform(documents)
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # Returns vocabulary terms in index order. | ||
| # @rbs () -> Array[Symbol] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment restates the method signature.
| # Returns vocabulary terms in index order. | |
| # @rbs () -> Array[Symbol] | |
| # @rbs () -> Array[Symbol] | |
| def feature_names |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 109:110
Comment:
**style:** Comment restates the method signature.
```suggestion
# @rbs () -> Array[Symbol]
def feature_names
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| # Loads a vectorizer from JSON. | ||
| # @rbs (String | Hash[String, untyped]) -> TFIDF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Comment restates method signature.
| # Loads a vectorizer from JSON. | |
| # @rbs (String | Hash[String, untyped]) -> TFIDF | |
| # @rbs (String | Hash[String, untyped]) -> TFIDF | |
| def self.from_json(json) |
Context Used: Context from dashboard - CLAUDE.md (source)
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 141:142
Comment:
**style:** Comment restates method signature.
```suggestion
# @rbs (String | Hash[String, untyped]) -> TFIDF
def self.from_json(json)
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917))
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| @vocabulary = {} | ||
| @idf = {} | ||
| vocab_index = 0 | ||
|
|
||
| document_frequencies.each do |term, df| | ||
| next unless within_df_bounds?(df, @num_documents) | ||
|
|
||
| @vocabulary[term] = vocab_index | ||
| vocab_index += 1 | ||
|
|
||
| # IDF: log((N + 1) / (df + 1)) + 1 | ||
| @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1 | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Intermediate variable vocab_index only increments. Consider using each.with_index:
| @vocabulary = {} | |
| @idf = {} | |
| vocab_index = 0 | |
| document_frequencies.each do |term, df| | |
| next unless within_df_bounds?(df, @num_documents) | |
| @vocabulary[term] = vocab_index | |
| vocab_index += 1 | |
| # IDF: log((N + 1) / (df + 1)) + 1 | |
| @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1 | |
| end | |
| selected_terms = document_frequencies.select { |term, df| within_df_bounds?(df, @num_documents) } | |
| selected_terms.each.with_index do |(term, df), vocab_index| | |
| @vocabulary[term] = vocab_index | |
| # IDF with smoothing | |
| @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1 | |
| end |
Context Used: Context from dashboard - CLAUDE.md (source). Is the vocabulary index ordering requirement flexible enough for this refactor?
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 66:78
Comment:
**style:** Intermediate variable `vocab_index` only increments. Consider using `each.with_index`:
```suggestion
selected_terms = document_frequencies.select { |term, df| within_df_bounds?(df, @num_documents) }
selected_terms.each.with_index do |(term, df), vocab_index|
@vocabulary[term] = vocab_index
# IDF with smoothing
@idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
end
```
**Context Used:** Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=da491e84-75dc-41f4-bb96-ab9502d43917)). Is the vocabulary index ordering requirement flexible enough for this refactor?
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.- Add type annotation to empty hash in transform method - Use JSON.generate instead of Hash#to_json for type safety - Cast multiplication result to Float for type checker - Auto-fix Minitest assertion style in tests
|
@greptile-apps Thanks for the follow-up review! Here's what was addressed: Applied:
Not applied (intentionally kept):
Could you please re-review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, 1 comment
| vocab_index = 0 | ||
|
|
||
| document_frequencies.each do |term, df| | ||
| next unless within_df_bounds?(df, @num_documents) | ||
|
|
||
| @vocabulary[term] = vocab_index | ||
| vocab_index += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Manual counter increment. Use each.with_index:
| vocab_index = 0 | |
| document_frequencies.each do |term, df| | |
| next unless within_df_bounds?(df, @num_documents) | |
| @vocabulary[term] = vocab_index | |
| vocab_index += 1 | |
| document_frequencies.each.with_index do |term_df, vocab_index| | |
| term, df = term_df | |
| next unless within_df_bounds?(df, @num_documents) | |
| @vocabulary[term] = vocab_index |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 68:74
Comment:
**style:** Manual counter increment. Use `each.with_index`:
```suggestion
document_frequencies.each.with_index do |term_df, vocab_index|
term, df = term_df
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.|
@greptile-apps One more note on the This change would introduce a bug. The current manual counter only increments when a term passes the Current (correct): With The vocabulary indices need to be contiguous for the vectorizer to work correctly. Could you please re-review when you get a chance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, 3 comments
| # Learns vocabulary and IDF weights from the corpus. | ||
| # @rbs (Array[String]) -> self | ||
| def fit(documents) | ||
| raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Defensive type check unnecessary. Ruby is duck-typed - if caller passes non-iterable, it'll fail naturally on .each. Remove unless documents actually need array-specific methods beyond iteration.
| raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array) | |
| raise ArgumentError, 'documents cannot be empty' if documents.empty? |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 55:55
Comment:
**style:** Defensive type check unnecessary. Ruby is duck-typed - if caller passes non-iterable, it'll fail naturally on `.each`. Remove unless documents actually need array-specific methods beyond iteration.
```suggestion
raise ArgumentError, 'documents cannot be empty' if documents.empty?
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| unless range.all?(Integer) && range.all?(&:positive?) | ||
| raise ArgumentError, 'ngram_range values must be positive integers' | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Double negative with unless. Use early return for clearer logic:
| unless range.all?(Integer) && range.all?(&:positive?) | |
| raise ArgumentError, 'ngram_range values must be positive integers' | |
| end | |
| return if range.all?(Integer) && range.all?(&:positive?) | |
| raise ArgumentError, 'ngram_range values must be positive integers' |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 240:242
Comment:
**style:** Double negative with `unless`. Use early return for clearer logic:
```suggestion
return if range.all?(Integer) && range.all?(&:positive?)
raise ArgumentError, 'ngram_range values must be positive integers'
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| @vocabulary = {} | ||
| @idf = {} | ||
| vocab_index = 0 | ||
|
|
||
| document_frequencies.each do |term, df| | ||
| next unless within_df_bounds?(df, @num_documents) | ||
|
|
||
| @vocabulary[term] = vocab_index | ||
| vocab_index += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Resetting @vocabulary and @idf to empty hashes just to reassign them. These were already initialized empty in initialize. Remove redundant assignments:
| @vocabulary = {} | |
| @idf = {} | |
| vocab_index = 0 | |
| document_frequencies.each do |term, df| | |
| next unless within_df_bounds?(df, @num_documents) | |
| @vocabulary[term] = vocab_index | |
| vocab_index += 1 | |
| vocab_index = 0 | |
| document_frequencies.each do |term, df| | |
| next unless within_df_bounds?(df, @num_documents) | |
| @vocabulary[term] = vocab_index | |
| vocab_index += 1 |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/classifier/tfidf.rb
Line: 66:74
Comment:
**style:** Resetting `@vocabulary` and `@idf` to empty hashes just to reassign them. These were already initialized empty in `initialize`. Remove redundant assignments:
```suggestion
vocab_index = 0
document_frequencies.each do |term, df|
next unless within_df_bounds?(df, @num_documents)
@vocabulary[term] = vocab_index
vocab_index += 1
```
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.
Summary
Adds a standalone TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer for text feature extraction.
Leverages existing
word_hashinfrastructure for term frequency extraction with stemming and stopword removal.Example
Test plan
Closes #104