cardmagic · Dec 29, 2025
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 [![CI](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml/badge.svg)](https://github.com/cardmagic/classifier/actions/workflows/ruby.yml)
 [![License: LGPL](https://img.shields.io/badge/License-LGPL_2.1-blue.svg)](https://opensource.org/licenses/LGPL-2.1)
 
-A Ruby library for text classification using Bayesian, LSI (Latent Semantic Indexing), and k-Nearest Neighbors (kNN) algorithms.
+A Ruby library for text classification using Bayesian, LSI (Latent Semantic Indexing), k-Nearest Neighbors (kNN), and TF-IDF algorithms.
 
 **[Documentation](https://rubyclassifier.com/docs)** · **[Tutorials](https://rubyclassifier.com/docs/tutorials)** · **[Guides](https://rubyclassifier.com/docs/guides)**
 
@@ -14,6 +14,7 @@ A Ruby library for text classification using Bayesian, LSI (Latent Semantic Inde
 - [Bayesian Classifier](#bayesian-classifier)
 - [LSI (Latent Semantic Indexing)](#lsi-latent-semantic-indexing)
 - [k-Nearest Neighbors (kNN)](#k-nearest-neighbors-knn)
+- [TF-IDF Vectorizer](#tf-idf-vectorizer)
 - [Persistence](#persistence)
 - [Performance](#performance)
 - [Development](#development)
@@ -256,6 +257,77 @@ knn.categories
 
 **Why the size difference?** Bayes stores aggregate statistics—adding 10,000 documents just increments counters. kNN stores every example and compares against all of them during classification, so performance degrades with size.
 
+## TF-IDF Vectorizer
+
+Transform text documents into TF-IDF (Term Frequency-Inverse Document Frequency) weighted feature vectors. TF-IDF downweights common words and upweights discriminative terms—the foundation for most classic text classification approaches.
+
+### Quick Start
+
+```ruby
+require 'classifier'
+
+tfidf = Classifier::TFIDF.new
+tfidf.fit(["Dogs are great pets", "Cats are independent", "Birds can fly"])
+
+# Transform text to TF-IDF vector (L2 normalized)
+vector = tfidf.transform("Dogs are loyal")
+# => {:dog=>1.0} ("loyal" is not in the fitted vocabulary, so it is dropped)
+
+# Fit and transform in one step
+vectors = tfidf.fit_transform(["Dogs are great pets", "Cats are independent", "Birds can fly"])
+```
+
+### Options
+
+```ruby
+tfidf = Classifier::TFIDF.new(
+  min_df: 2,           # Minimum document frequency (Integer or Float 0.0-1.0)
+  max_df: 0.95,        # Maximum document frequency (filters very common terms)
+  ngram_range: [1, 2], # Extract unigrams and bigrams
+  sublinear_tf: true   # Use 1 + log(tf) instead of raw term frequency
+)
+```
+
+### Vocabulary Inspection
+
+```ruby
+tfidf.fit(documents)
+
+tfidf.vocabulary      # => {:dog=>0, :cat=>1, :bird=>2, ...}
+tfidf.idf             # => {:dog=>1.693, :cat=>1.693, ...}
+tfidf.feature_names   # => [:dog, :cat, :bird, ...]
+tfidf.num_documents   # => 3
+tfidf.fitted?         # => true
+```
+
+### N-gram Support
+
+```ruby
+# Extract bigrams only
+tfidf = Classifier::TFIDF.new(ngram_range: [2, 2])
+tfidf.fit(["quick brown fox", "lazy brown dog"])
+tfidf.vocabulary.keys
+# => [:quick_brown, :brown_fox, :lazi_brown, :brown_dog]
+
+# Unigrams through trigrams
+tfidf = Classifier::TFIDF.new(ngram_range: [1, 3])
+```
+
+### Serialization
+
+```ruby
+# Save to JSON
+json = tfidf.to_json
+File.write("tfidf.json", json)
+
+# Load from JSON
+loaded = Classifier::TFIDF.from_json(File.read("tfidf.json"))
+
+# Or use Marshal
+data = Marshal.dump(tfidf)
+loaded = Marshal.load(data)
+```
+
 ## Persistence
 
 Save and load classifiers with pluggable storage backends. Works with Bayes, LSI, and kNN classifiers.

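The weighting described in the README section can be checked without the gem. The sketch below is a standalone approximation, not the library's code: it assumes a naive regex tokenizer (the gem stems words and filters stop words instead) and uses the smoothed IDF formula `log((N + 1) / (df + 1)) + 1` from this PR's implementation, followed by L2 normalization. The helper names `naive_tokens`, `smoothed_idf`, and `tfidf_vector` are hypothetical.

```ruby
# Hypothetical helper: lowercase alphabetic tokens only (no stemming, no stop words).
def naive_tokens(text)
  text.downcase.scan(/[a-z]+/)
end

# Smoothed IDF per term: log((N + 1) / (df + 1)) + 1.
def smoothed_idf(documents)
  n = documents.size
  df = Hash.new(0)
  documents.each { |doc| naive_tokens(doc).uniq.each { |term| df[term] += 1 } }
  df.transform_values { |count| Math.log((n + 1).to_f / (count + 1)) + 1 }
end

# Raw term frequency times IDF, restricted to known terms, then L2-normalized.
def tfidf_vector(text, idf)
  counts = naive_tokens(text).tally
  weights = counts.each_with_object({}) do |(term, tf), acc|
    acc[term] = tf * idf[term] if idf.key?(term)
  end
  norm = Math.sqrt(weights.values.sum { |w| w * w })
  return weights if norm.zero?

  weights.transform_values { |w| w / norm }
end

docs = ["dogs bark", "cats purr", "dogs and cats play"]
idf = smoothed_idf(docs)
vector = tfidf_vector("dogs bark loudly", idf)
# "loudly" was never seen during fitting, so it is absent from `vector`.
# "bark" (1 of 3 documents) outweighs "dogs" (2 of 3 documents).
```

Because unseen terms are dropped before normalization, a query that shares only one term with the corpus always normalizes to a single weight of 1.0.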
diff --git a/lib/classifier.rb b/lib/classifier.rb
@@ -32,3 +32,4 @@
 require 'classifier/bayes'
 require 'classifier/lsi'
 require 'classifier/knn'
+require 'classifier/tfidf'
diff --git a/lib/classifier/errors.rb b/lib/classifier/errors.rb
@@ -13,4 +13,7 @@ class UnsavedChangesError < Error; end
 
   # Raised when a storage operation fails
   class StorageError < Error; end
+
+  # Raised when using an unfitted model
+  class NotFittedError < Error; end
 end
diff --git a/lib/classifier/tfidf.rb b/lib/classifier/tfidf.rb
@@ -0,0 +1,252 @@
+# rbs_inline: enabled
+
+# Author::    Lucas Carlson  (mailto:lucas@rufy.com)
+# Copyright:: Copyright (c) 2024 Lucas Carlson
+# License::   LGPL
+
+require 'json'
+
+module Classifier
+  # TF-IDF vectorizer: transforms text to weighted feature vectors.
+  # Downweights common words, upweights discriminative terms.
+  #
+  # Example:
+  #   tfidf = Classifier::TFIDF.new
+  #   tfidf.fit(["Dogs are great pets", "Cats are independent"])
+  #   tfidf.transform("Dogs are loyal")  # => {:dog=>1.0} ("loyal" is unseen, so it is dropped)
+  #
+  class TFIDF
+    # @rbs @min_df: Integer | Float
+    # @rbs @max_df: Integer | Float
+    # @rbs @ngram_range: Array[Integer]
+    # @rbs @sublinear_tf: bool
+    # @rbs @vocabulary: Hash[Symbol, Integer]
+    # @rbs @idf: Hash[Symbol, Float]
+    # @rbs @num_documents: Integer
+    # @rbs @fitted: bool
+
+    attr_reader :vocabulary, :idf, :num_documents
+
+    # Creates a new TF-IDF vectorizer.
+    # - min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)
+    # - ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams
+    # - sublinear_tf: use 1 + log(tf) instead of raw term frequency
+    #
+    # @rbs (?min_df: Integer | Float, ?max_df: Integer | Float,
+    #       ?ngram_range: Array[Integer], ?sublinear_tf: bool) -> void
+    def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
+      validate_df!(min_df, 'min_df')
+      validate_df!(max_df, 'max_df')
+      validate_ngram_range!(ngram_range)
+
+      @min_df = min_df
+      @max_df = max_df
+      @ngram_range = ngram_range
+      @sublinear_tf = sublinear_tf
+      @vocabulary = {}
+      @idf = {}
+      @num_documents = 0
+      @fitted = false
+    end
+
+    # Learns vocabulary and IDF weights from the corpus.
+    # @rbs (Array[String]) -> self
+    def fit(documents)
+      raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
+      raise ArgumentError, 'documents cannot be empty' if documents.empty?
+
+      @num_documents = documents.size
+      document_frequencies = Hash.new(0)
+
+      documents.each do |doc|
+        terms = extract_terms(doc)
+        terms.each_key { |term| document_frequencies[term] += 1 }
+      end
+
+      @vocabulary = {}
+      @idf = {}
+      vocab_index = 0
+
+      document_frequencies.each do |term, df|
+        next unless within_df_bounds?(df, @num_documents)
+
+        @vocabulary[term] = vocab_index
+        vocab_index += 1
+
+        # IDF: log((N + 1) / (df + 1)) + 1
+        @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
+      end
+
+      @fitted = true
+      self
+    end
+
+    # Transforms a document into a normalized TF-IDF vector.
+    # @rbs (String) -> Hash[Symbol, Float]
+    def transform(document)
+      raise NotFittedError, 'TFIDF has not been fitted. Call fit first.' unless @fitted
+
+      terms = extract_terms(document)
+      result = {} #: Hash[Symbol, Float]
+
+      terms.each do |term, tf|
+        next unless @vocabulary.key?(term)
+
+        tf_value = @sublinear_tf && tf.positive? ? 1 + Math.log(tf) : tf.to_f
+        result[term] = (tf_value * @idf[term]).to_f
+      end
+
+      normalize_vector(result)
+    end
+
+    # Fits and transforms in one step.
+    # @rbs (Array[String]) -> Array[Hash[Symbol, Float]]
+    def fit_transform(documents)
+      fit(documents)
+      documents.map { |doc| transform(doc) }
+    end
+
+    # Returns vocabulary terms in index order.
+    # @rbs () -> Array[Symbol]
+    def feature_names
+      @vocabulary.keys.sort_by { |term| @vocabulary[term] }
+    end
+
+    # @rbs () -> bool
+    def fitted?
+      @fitted
+    end
+
+    # @rbs (?untyped) -> Hash[Symbol, untyped]
+    def as_json(_options = nil)
+      {
+        version: 1,
+        type: 'tfidf',
+        min_df: @min_df,
+        max_df: @max_df,
+        ngram_range: @ngram_range,
+        sublinear_tf: @sublinear_tf,
+        vocabulary: @vocabulary,
+        idf: @idf,
+        num_documents: @num_documents,
+        fitted: @fitted
+      }
+    end
+
+    # @rbs (?untyped) -> String
+    def to_json(_options = nil)
+      JSON.generate(as_json)
+    end
+
+    # Loads a vectorizer from JSON.
+    # @rbs (String | Hash[String, untyped]) -> TFIDF
+    def self.from_json(json)
+      data = json.is_a?(String) ? JSON.parse(json) : json
+      raise ArgumentError, "Invalid vectorizer type: #{data['type']}" unless data['type'] == 'tfidf'
+
+      instance = new(
+        min_df: data['min_df'],
+        max_df: data['max_df'],
+        ngram_range: data['ngram_range'],
+        sublinear_tf: data['sublinear_tf']
+      )
+
+      instance.instance_variable_set(:@vocabulary, symbolize_keys(data['vocabulary']))
+      instance.instance_variable_set(:@idf, symbolize_keys(data['idf']))
+      instance.instance_variable_set(:@num_documents, data['num_documents'])
+      instance.instance_variable_set(:@fitted, data['fitted'])
+
+      instance
+    end
+
+    # @rbs () -> Array[untyped]
+    def marshal_dump
+      [@min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted]
+    end
+
+    # @rbs (Array[untyped]) -> void
+    def marshal_load(data)
+      @min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted = data
+    end
+
+    private
+
+    # @rbs (String) -> Hash[Symbol, Integer]
+    def extract_terms(document)
+      result = Hash.new(0)
+
+      if @ngram_range[0] <= 1
+        word_hash = document.clean_word_hash
+        word_hash.each { |term, count| result[term] += count }
+      end
+
+      return result if @ngram_range[1] <= 1
+
+      tokens = tokenize_for_ngrams(document)
+      (2..@ngram_range[1]).each do |n|
+        next if n < @ngram_range[0]
+
+        generate_ngrams(tokens, n).each { |ngram| result[ngram] += 1 }
+      end
+
+      result
+    end
+
+    # @rbs (String) -> Array[String]
+    def tokenize_for_ngrams(document)
+      document
+        .gsub(/[^\w\s]/, '')
+        .split
+        .map(&:downcase)
+        .reject { |w| w.length <= 2 || String::CORPUS_SKIP_WORDS.include?(w) }
+        .map(&:stem)
+    end
+
+    # @rbs (Array[String], Integer) -> Array[Symbol]
+    def generate_ngrams(tokens, n) # rubocop:disable Naming/MethodParameterName
+      return [] if tokens.size < n
+
+      tokens.each_cons(n).map { |gram| gram.join('_').intern }
+    end
+
+    # @rbs (Integer, Integer) -> bool
+    def within_df_bounds?(doc_freq, num_docs)
+      doc_freq.between?(
+        @min_df.is_a?(Float) ? (@min_df * num_docs).ceil : @min_df,
+        @max_df.is_a?(Float) ? (@max_df * num_docs).floor : @max_df
+      )
+    end
+
+    # @rbs (Hash[Symbol, Float]) -> Hash[Symbol, Float]
+    def normalize_vector(vector)
+      return vector if vector.empty?
+
+      magnitude = Math.sqrt(vector.values.sum { |v| v * v })
+      return vector if magnitude.zero?
+
+      vector.transform_values { |v| v / magnitude }
+    end
+
+    # @rbs (Integer | Float, String) -> void
+    def validate_df!(value, name)
+      raise ArgumentError, "#{name} must be an Integer or Float" unless value.is_a?(Float) || value.is_a?(Integer)
+      raise ArgumentError, "#{name} must be between 0.0 and 1.0" if value.is_a?(Float) && !value.between?(0.0, 1.0)
+      raise ArgumentError, "#{name} must be non-negative" if value.is_a?(Integer) && value.negative?
+    end
+
+    # @rbs (Array[Integer]) -> void
+    def validate_ngram_range!(range)
+      raise ArgumentError, 'ngram_range must be an array of two integers' unless range.is_a?(Array) && range.size == 2
+      unless range.all?(Integer) && range.all?(&:positive?)
+        raise ArgumentError, 'ngram_range values must be positive integers'
+      end
+      raise ArgumentError, 'ngram_range[0] must be <= ngram_range[1]' if range[0] > range[1]
+    end
+
+    # @rbs (Hash[String, untyped]) -> Hash[Symbol, untyped]
+    def self.symbolize_keys(hash)
+      hash.transform_keys(&:to_sym)
+    end
+    private_class_method :symbolize_keys
+  end
+end
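
For review, the two private helpers that decide which terms enter the vocabulary can be exercised in isolation. This is a rough sketch that assumes the same semantics as `within_df_bounds?` and `generate_ngrams` in the diff above: Float bounds are proportions of the corpus, Integer bounds are absolute document counts. The free-standing method names `df_in_bounds?` and `ngrams` are hypothetical and not part of the gem's API.

```ruby
# Float bounds scale with corpus size; Integer bounds are used as-is.
def df_in_bounds?(doc_freq, num_docs, min_df: 1, max_df: 1.0)
  lower = min_df.is_a?(Float) ? (min_df * num_docs).ceil : min_df
  upper = max_df.is_a?(Float) ? (max_df * num_docs).floor : max_df
  doc_freq.between?(lower, upper)
end

# Sliding window of n consecutive tokens, joined with underscores.
def ngrams(tokens, n)
  return [] if tokens.size < n

  tokens.each_cons(n).map { |gram| gram.join('_').to_sym }
end

too_rare   = df_in_bounds?(1, 10, min_df: 2)      # false: below the count floor
too_common = df_in_bounds?(10, 10, max_df: 0.95)  # false: floor(9.5) = 9
kept       = df_in_bounds?(9, 10, max_df: 0.95)   # true
bigrams    = ngrams(%w[quick brown fox], 2)       # [:quick_brown, :brown_fox]
trigrams   = ngrams(%w[quick brown], 3)           # []: too few tokens
```

One consequence worth noting: with a Float `max_df` below 1.0 and a very small corpus, the `floor` can push the upper bound to zero and filter out every term.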