Wozacosta/classificator

classificator

A fast, lightweight Naive Bayes classifier for Node.js with explainable predictions. Written in TypeScript with full type declarations. Ships dual CJS/ESM.

                    +-----------------+
   "great movie" -->|  classificator  |--> { predictedCategory: "positive", proba: 0.83 }
                    +-----------------+
                      |  trained on  |
                      |  your data   |
                      +--------------+

What can I use this for?

You can use this to sort any text into an arbitrary set of categories. For example:

  • Is an email spam or not spam?
  • Is a news article about technology, politics, or sports?
  • Does a piece of text express positive or negative emotions?
                          +----------+
                     +--->| positive | 0.72
   "awesome movie"   |   +----------+
         |            |   +----------+
         v            +-->| negative | 0.18
   [ tokenize ]      |   +----------+
         |            |   +----------+
         v            +-->|  neutral | 0.10
   [ calculate ]------+   +----------+
   [ probability ]

More here: https://en.wikipedia.org/wiki/Naive_Bayes_classifier

Installing

Requires Node v18.0.0 or later.

npm install classificator

Quick Start

// ESM (recommended)
import bayes from 'classificator'

// or with named imports:
// import { Naivebayes, fromJson } from 'classificator'

// or CJS (still works):
// const bayes = require('classificator')

const classifier = bayes()

// Train
classifier.learn('amazing, awesome movie!', 'positive')
classifier.learn('terrible, boring film', 'negative')

// Classify
const result = classifier.categorize('awesome movie')
console.log(result.predictedCategory) // => 'positive'

TypeScript

Full type declarations are included. All interfaces are exported:

import bayes from 'classificator'
import type {
  NaivebayesOptions,    // constructor options
  CategorizeResult,     // return type of categorize()
  Likelihood,           // single category likelihood entry
  InfluentialToken,     // return type of topInfluentialTokens()
  BatchItem,            // { text, category } for learnBatch()
  CategoryStats,        // per-category stats
  CategoryStatsResult,  // return type of getCategoryStats()
} from 'classificator'

const options: NaivebayesOptions = { alpha: 0.5, fitPrior: false }
const classifier = bayes(options)

classifier.learn('great movie', 'positive')
const result: CategorizeResult = classifier.categorize('great')

You can also import the class directly:

import { Naivebayes } from 'classificator'
const classifier = new Naivebayes({ alpha: 0.5 })

How It Works

Classificator uses the Naive Bayes algorithm with Laplace smoothing. Here's the pipeline:

  Input Text
      |
      v
+-------------+     +------------------+     +-------------------+
|  Tokenizer  |---->|  Preprocessor    |---->|  Frequency Table  |
| split words |     | stopwords/stem   |     |  count each word  |
+-------------+     +------------------+     +-------------------+
                                                      |
              +---------------------------------------+
              |
              v
+---------------------------+     +------------------+
|  For each category:       |     |  Normalize with  |
|  P(cat) * P(w1|cat) *    |---->|  logsumexp for   |
|  P(w2|cat) * ...          |     |  final proba     |
+---------------------------+     +------------------+
                                          |
                                          v
                                  +---------------------+
                                  |  Return sorted      |
                                  |  likelihoods +      |
                                  |  predictedCategory  |
                                  +---------------------+

Laplace smoothing prevents zero-probability issues: a word never seen in a category still gets a small probability instead of zeroing out the whole product.
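
The pipeline above can be sketched in plain JavaScript. This is a minimal illustration under stated assumptions, not classificator's source: the helper names (tokenize, train, score, logSumExp), the lowercasing tokenizer, and the model layout are all made up for the example.

```javascript
// Minimal Naive Bayes sketch (illustrative only; not the library's code).
// All helper names and the model layout are hypothetical.

// Lowercase, split on runs of non-word characters, drop empty tokens.
const tokenize = (text) => text.toLowerCase().split(/\W+/).filter(Boolean)

// Build per-category document counts, token counts, and a shared vocabulary.
function train(docs) {
  const model = { docCount: {}, tokenCount: {}, totalTokens: {}, vocab: new Set(), totalDocs: 0 }
  for (const { text, category } of docs) {
    model.docCount[category] = (model.docCount[category] || 0) + 1
    model.tokenCount[category] = model.tokenCount[category] || {}
    model.totalTokens[category] = model.totalTokens[category] || 0
    model.totalDocs += 1
    for (const token of tokenize(text)) {
      model.tokenCount[category][token] = (model.tokenCount[category][token] || 0) + 1
      model.totalTokens[category] += 1
      model.vocab.add(token)
    }
  }
  return model
}

// log(sum(exp(x_i))), computed stably by subtracting the maximum first.
function logSumExp(values) {
  const max = Math.max(...values)
  return max + Math.log(values.reduce((sum, v) => sum + Math.exp(v - max), 0))
}

// Log prior plus the sum of smoothed log likelihoods, normalized to probabilities.
function score(model, text, alpha = 1) {
  const categories = Object.keys(model.docCount)
  const logLikelihoods = categories.map((category) => {
    let logP = Math.log(model.docCount[category] / model.totalDocs)
    for (const token of tokenize(text)) {
      const count = model.tokenCount[category][token] || 0
      logP += Math.log((count + alpha) / (model.totalTokens[category] + alpha * model.vocab.size))
    }
    return logP
  })
  const norm = logSumExp(logLikelihoods)
  return categories
    .map((category, i) => ({ category, proba: Math.exp(logLikelihoods[i] - norm) }))
    .sort((a, b) => b.proba - a.proba)
}

const sketchModel = train([
  { text: 'amazing, awesome movie!', category: 'positive' },
  { text: 'terrible, boring film', category: 'negative' }
])
const sketchScores = score(sketchModel, 'awesome movie')
```

With these two training documents, each category holds 3 tokens and the vocabulary holds 6 words. The smoothed product for 'awesome movie' is (2/9)(2/9) for positive against (1/9)(1/9) for negative, which normalizes to 0.8 versus 0.2.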

Usage

Teach your classifier

classifier.learn('amazing, awesome movie! Had a good time', 'positive')
classifier.learn('Buy my free viagra pill and get rich!', 'spam')
classifier.learn('I really hate dust and annoying cats', 'negative')
classifier.learn('LOL this sucks so hard', 'troll')

Batch learning

classifier.learnBatch([
  { text: 'amazing, awesome movie!', category: 'positive' },
  { text: 'Buy my free viagra pill', category: 'spam' },
  { text: 'I really hate dust', category: 'negative' }
])

Make your classifier unlearn

classifier.learn('i hate mornings', 'positive');
// uh oh, that was a mistake. Time to unlearn
classifier.unlearn('i hate mornings', 'positive');

If the last document in a category is unlearned, the category is automatically removed.
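
The automatic removal can be pictured as simple count bookkeeping. The sketch below tracks only per-category document counts, whereas a real classifier also has to adjust token counts and vocabulary. The names learnDoc and unlearnDoc and the error message are assumptions for illustration.

```javascript
// Hypothetical bookkeeping sketch: document counts per category only.
const docCounts = new Map()

function learnDoc(category) {
  docCounts.set(category, (docCounts.get(category) || 0) + 1)
}

function unlearnDoc(category) {
  if (!docCounts.has(category)) {
    throw new Error(`Category "${category}" does not exist`)
  }
  const remaining = docCounts.get(category) - 1
  // Drop the category entirely once its last document is gone.
  if (remaining <= 0) docCounts.delete(category)
  else docCounts.set(category, remaining)
}

learnDoc('positive')
learnDoc('positive')
learnDoc('troll')
unlearnDoc('troll') // 'troll' had one document, so the category disappears
```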

Remove a category

classifier.removeCategory('troll');

Categorization

classifier.categorize("I've always hated Martians");
// => {
//      likelihoods: [
//        { category: 'negative', proba: 0.538, logLikelihood: -17.24, logProba: -0.62 },
//        { category: 'positive', proba: 0.269, logLikelihood: -17.94, logProba: -1.31 },
//        { category: 'spam',     proba: 0.193, logLikelihood: -18.27, logProba: -1.65 }
//      ],
//      predictedCategory: 'negative'
//    }

Categorize with confidence threshold

Reject low-confidence predictions instead of guessing:

classifier.categorizeWithConfidence('some ambiguous text', 0.7);
// => predictedCategory is null if the top probability is below 0.7
//    likelihoods array is always returned in full
   "ambiguous text"
         |
         v
   [ categorize ]
         |
    proba = 0.42
         |
    0.42 < 0.70 ?  --yes-->  predictedCategory: null    (rejected)
         |
        no
         |
         v
    predictedCategory: "spam"   (accepted)
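
The decision in the diagram can be written as a small post-processing step over a categorize()-shaped result. The helper name applyConfidenceThreshold, the sample probabilities, and the error message are assumptions for illustration; the library does this for you in categorizeWithConfidence().

```javascript
// Hypothetical helper mirroring the accept/reject decision.
// `result` has the shape returned by categorize(): likelihoods sorted highest first.
function applyConfidenceThreshold(result, threshold) {
  if (typeof threshold !== 'number' || Number.isNaN(threshold) || threshold < 0 || threshold > 1) {
    throw new TypeError('threshold must be a number between 0 and 1')
  }
  const top = result.likelihoods[0]
  const accepted = top !== undefined && top.proba >= threshold
  return {
    likelihoods: result.likelihoods, // always returned in full
    predictedCategory: accepted ? top.category : null
  }
}

// Made-up result for an ambiguous text.
const ambiguous = {
  predictedCategory: 'spam',
  likelihoods: [
    { category: 'spam', proba: 0.42 },
    { category: 'ham', proba: 0.35 },
    { category: 'troll', proba: 0.23 }
  ]
}

const rejected = applyConfidenceThreshold(ambiguous, 0.7) // predictedCategory: null
const acceptedResult = applyConfidenceThreshold(ambiguous, 0.4) // predictedCategory: 'spam'
```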

Get top N categories

classifier.categorizeTopN("I've always hated Martians", 2);
// => same as categorize(), but likelihoods array has at most 2 entries

Understand why a prediction was made

classifier.topInfluentialTokens("I've always hated Martians", 3);
// => [
//      { token: 'hated', probability: 0.42, frequency: 1 },
//      { token: 'always', probability: 0.21, frequency: 1 },
//      { token: 'Martians', probability: 0.12, frequency: 1 }
//    ]
  "I've always hated Martians"  -->  predicted: negative
                                          |
      Why?                                v
      +----------------------------------------------------+
      | Token     | P(token|negative) | Influence          |
      |-----------|-------------------|--------------------|
      | hated     | 0.42              | ################## |
      | always    | 0.21              | #########          |
      | Martians  | 0.12              | #####              |
      +----------------------------------------------------+
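
One plausible way to produce such a ranking is to compute the smoothed P(token | category) for each distinct token in the text and sort. This is a sketch, not the library's implementation: rankTokens, its parameters, and the counts below are made up, and the library may weigh tokens differently.

```javascript
// Hypothetical sketch: rank a text's tokens by smoothed P(token | category).
function rankTokens(tokens, categoryCounts, vocabSize, n = 5, alpha = 1) {
  const totalInCategory = Object.values(categoryCounts).reduce((a, b) => a + b, 0)

  // Count how often each token occurs in the input text.
  const frequency = new Map()
  for (const token of tokens) frequency.set(token, (frequency.get(token) || 0) + 1)

  return [...frequency.entries()]
    .map(([token, freq]) => ({
      token,
      probability: ((categoryCounts[token] || 0) + alpha) / (totalInCategory + alpha * vocabSize),
      frequency: freq
    }))
    .sort((a, b) => b.probability - a.probability)
    .slice(0, n)
}

// Made-up token counts for a 'negative' category and a vocabulary of 10 words.
const negativeCounts = { hate: 1, hated: 3, dust: 1, annoying: 1, cats: 1 }
const ranked = rankTokens(['always', 'hated', 'martians', 'hated'], negativeCounts, 10, 2)
```

Here 'hated' ranks first with probability (3 + 1) / (7 + 10) = 4/17 and frequency 2, because it is the only input token seen in the category.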

Serialize / Deserialize

// Save
let stateJson = classifier.toJson()

// Restore
let revivedClassifier = bayes.fromJson(stateJson)

The stateJson argument passed to fromJson() can be a JSON string or a plain (already parsed) object.

Important: Functions (tokenizer, tokenPreprocessor) can't be serialized to JSON. Pass them back when restoring:

let revivedClassifier = bayes.fromJson(stateJson, {
  tokenizer: myTokenizer,
  tokenPreprocessor: myPreprocessor
})
  Classifier                     JSON String                    Classifier
  (in memory)                    (on disk)                      (restored)
       |                              |                              |
       +--- toJson() --------------->|                              |
       |                              +--- fromJson(json, opts) --->|
       |                              |          ^                   |
       |    tokenizer: fn  -  LOST    |          |                   |
       |    alpha: 0.5     -  KEPT    |    pass functions            |
       |    fitPrior: true -  KEPT    |    back in opts              |
       |                              |                              |
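
The reason functions are lost is standard JSON behavior, which this sketch shows with plain JSON.stringify and no library calls. The variable names are made up for the example.

```javascript
// JSON.stringify omits function-valued properties, so they cannot survive a round trip.
const optionsBefore = {
  alpha: 0.5,
  fitPrior: true,
  tokenizer: (text) => text.split(' ')
}

const roundTripped = JSON.parse(JSON.stringify(optionsBefore))
// roundTripped has alpha and fitPrior, but no tokenizer property.

// Re-attach runtime-only options after parsing
// (the same idea as the second argument to fromJson).
const optionsAfter = { ...roundTripped, tokenizer: optionsBefore.tokenizer }
```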

Inspect your classifier

classifier.getCategories()
// => ['positive', 'spam', 'negative', 'troll']

classifier.getCategoryStats()
// => {
//      positive: { docCount: 1, wordCount: 7, vocabularySize: 7 },
//      spam:     { docCount: 1, wordCount: 8, vocabularySize: 8 },
//      ...
//      _total:   { docCount: 4, wordCount: 25, vocabularySize: 20 }
//    }

Reset the classifier

classifier.reset()
// clears all learned data but preserves options (tokenizer, alpha, fitPrior)

Method chaining

Most methods return this, so you can chain calls:

const result = bayes()
  .learn('happy fun', 'positive')
  .learn('sad bad', 'negative')
  .categorize('happy')

API

let classifier = bayes([options])

Returns an instance of a Naive-Bayes Classifier.

| Option            | Type     | Default                          | Description |
|-------------------|----------|----------------------------------|-------------|
| tokenizer         | Function | Splits on whitespace/punctuation | Custom tokenization function. Receives text (string), must return an array of string tokens. |
| tokenPreprocessor | Function | none                             | Transform tokens after tokenization (e.g. stopword removal, stemming, lowercasing). Receives and returns an array of tokens. |
| alpha             | number   | 1                                | Additive (Laplace) smoothing parameter. Higher values = more conservative predictions. 0 disables smoothing (can cause zero-probability issues). |
| fitPrior          | boolean  | true                             | If true, prior probability is proportional to learned document frequencies (categories with more training docs are favored). If false, uses uniform prior (all categories equally likely before seeing the text). |

let classifier = bayes({
    tokenizer: function (text) { return text.split(' ') },
    tokenPreprocessor: function (tokens) {
      var stopwords = new Set(['the', 'a', 'is', 'in'])
      return tokens
        .map(function (t) { return t.toLowerCase() })
        .filter(function (t) { return !stopwords.has(t) })
    },
    alpha: 0.5,
    fitPrior: false
})
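
To check what a tokenizer and tokenPreprocessor pair produces before handing it to the classifier, you can run the two functions by hand. The sketch below reuses the logic from the options above. The names splitOnSpaces and dropStopwords are made up, and it assumes the preprocessor receives exactly the array the tokenizer returns.

```javascript
// Same logic as the options above, pulled out as standalone functions.
const splitOnSpaces = (text) => text.split(' ')

const dropStopwords = (tokens) => {
  const stopwords = new Set(['the', 'a', 'is', 'in'])
  return tokens
    .map((t) => t.toLowerCase())
    .filter((t) => !stopwords.has(t))
}

// Tokenize first, then preprocess.
const processed = dropStopwords(splitOnSpaces('The movie is a masterpiece'))
console.log(processed) // [ 'movie', 'masterpiece' ]
```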

Understanding alpha (Laplace smoothing)

  alpha controls how much probability "leaks" to unseen words:

  alpha = 0     Unseen words get 0 probability. Risky.
  alpha = 0.5   Lidstone smoothing. Less aggressive.
  alpha = 1     Standard Laplace. Good default.      <-- default
  alpha = 10    Very conservative. Small datasets.

  Effect on P(word|category):

         P(word|cat) = (count + alpha) / (total + alpha * vocabSize)
                        ──────────────   ─────────────────────────────
                        numerator gets    denominator grows with alpha
                        a boost           spreading probability to all
                                          possible words
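
The formula can be tried directly with a few values of alpha. The counts below (a category with 100 tokens, a vocabulary of 50 words, a word seen 5 times) are made up, and smoothedProbability is a hypothetical helper, not a library function.

```javascript
// P(word | cat) = (count + alpha) / (total + alpha * vocabSize)
function smoothedProbability(count, total, vocabSize, alpha) {
  return (count + alpha) / (total + alpha * vocabSize)
}

// Made-up numbers: 100 tokens in the category, 50 words in the vocabulary.
for (const alpha of [0, 0.5, 1, 10]) {
  const seen = smoothedProbability(5, 100, 50, alpha)
  const unseen = smoothedProbability(0, 100, 50, alpha)
  console.log(`alpha=${alpha} seen=${seen.toFixed(4)} unseen=${unseen.toFixed(4)}`)
}
```

As alpha grows from 0 to 10, the seen word drops from 0.05 to 0.025 while the unseen word rises from 0 to about 0.0167, so larger values flatten the distribution.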

Understanding fitPrior

  fitPrior: true (default)         fitPrior: false
  ─────────────────────────         ────────────────────────
  P(cat) = docCount / total         P(cat) = 1  (uniform)

  900 positive docs + 100 negative   Same data, but:
  P(positive) = 0.9                  P(positive) = P(negative)
  P(negative) = 0.1                  Only word content matters

  Good when training data            Good when training data
  reflects real-world                 is imbalanced but you want
  distribution                       fair comparison
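
The two settings differ only in the prior term added to each category's score. A minimal sketch, assuming log-space priors: logPriors is a hypothetical helper, and the uniform case uses 1 / numCategories here, although any constant gives the same final probabilities because it cancels during normalization.

```javascript
// Hypothetical helper: log prior per category under both settings.
function logPriors(docCounts, fitPrior) {
  const categories = Object.keys(docCounts)
  const totalDocs = categories.reduce((sum, c) => sum + docCounts[c], 0)
  const priors = {}
  for (const c of categories) {
    priors[c] = fitPrior
      ? Math.log(docCounts[c] / totalDocs) // proportional to training frequency
      : Math.log(1 / categories.length) // uniform: same constant for every category
  }
  return priors
}

const fitted = logPriors({ positive: 900, negative: 100 }, true)
const uniform = logPriors({ positive: 900, negative: 100 }, false)
```

With fitPrior enabled, exp(fitted.positive) is 0.9 and exp(fitted.negative) is 0.1. With it disabled, both categories start from the same value, so only word content separates them.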

classifier.learn(text, category)

Teach your classifier what category should be associated with a text string.

Returns this for chaining. Throws TypeError if text or category is not a string.

classifier.learnBatch(items)

Learn from multiple text/category pairs at once. items is an array of { text, category } objects.

Returns this for chaining. Throws TypeError if items is not an array.

classifier.unlearn(text, category)

The classifier will unlearn the text that was associated with category. If the last document in a category is unlearned, the category is automatically removed.

Returns this for chaining. Throws Error if the category does not exist.

classifier.removeCategory(category)

The category is removed and the classifier data are updated accordingly. Vocabulary is cleaned up: tokens only present in the removed category are removed from the global vocabulary. No-op if the category does not exist.

Returns this for chaining.

classifier.categorize(text)

Returns an object with predictedCategory and a likelihoods array sorted by probability (highest first). Returns { predictedCategory: null, likelihoods: [] } if no categories have been learned.

{
    likelihoods: [
      { category: 'negative', logLikelihood: -17.24, logProba: -0.62, proba: 0.54 },
      { category: 'positive', logLikelihood: -17.94, logProba: -1.31, proba: 0.27 },
      ...
    ],
    predictedCategory: 'negative'
}

classifier.categorizeWithConfidence(text, threshold)

Like categorize(), but sets predictedCategory to null if the top category's probability is below threshold (a number between 0 and 1). The likelihoods array is always returned in full. Throws TypeError if threshold is invalid.

classifier.categorizeTopN(text, n)

Like categorize(), but returns only the top n most likely categories in the likelihoods array.

classifier.topInfluentialTokens(text[, n])

Returns the top n (default 5) tokens that most influenced the predicted category, sorted by probability. Each entry has { token, probability, frequency }.

classifier.getCategories()

Returns an array of all category names the classifier has learned.

classifier.getCategoryStats()

Returns an object with per-category stats (docCount, wordCount, vocabularySize) and a _total key with aggregate stats including total wordCount.

classifier.reset()

Resets the classifier to its initial untrained state, preserving configuration options.

Returns this for chaining.

classifier.toJson()

Returns the classifier's state as a JSON string.

let classifier = bayes.fromJson(jsonStr[, options])

Returns a classifier instance from the JSON representation. Use this with classifier.toJson().

jsonStr can be a JSON string or a plain object.

options is an optional object for runtime-only options (e.g. { tokenizer: fn, tokenPreprocessor: fn }) that cannot be serialized to JSON.


Typical Workflows

Spam Filter

  +-----------+     +-----------+     +-------------+     +--------+
  | Collect   |---->| Train     |---->| Serialize   |---->| Deploy |
  | emails    |     | classifier|     | to JSON     |     | in app |
  +-----------+     +-----------+     +-------------+     +--------+
                         |                                     |
                    learn('buy now        fromJson(saved) then
                     free!!!', 'spam')    categorize(newEmail)
                    learn('meeting at
                     3pm', 'ham')

Sentiment Analysis with Preprocessing

const classifier = bayes({
  tokenPreprocessor: (tokens) => {
    const stops = new Set(['the', 'a', 'is', 'it', 'and', 'of', 'to'])
    return tokens
      .map(t => t.toLowerCase())
      .filter(t => !stops.has(t) && t.length > 2)
  }
})

// Train on labeled reviews (`reviews` is your own array of { text, sentiment } objects)
reviews.forEach(r => classifier.learn(r.text, r.sentiment))

// Classify new review
const result = classifier.categorize('This product is absolutely amazing!')
if (result.likelihoods[0].proba > 0.7) {
  console.log(`Confident: ${result.predictedCategory}`)
} else {
  console.log('Uncertain, needs human review')
}

Model Persistence

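const fs = require('fs')
const bayes = require('classificator')

// Save trained model (toJson() returns a JSON string)
fs.writeFileSync('model.json', classifier.toJson())

// Load later; pass functions such as the tokenizer back in
const saved = fs.readFileSync('model.json', 'utf8')
const restoredClassifier = bayes.fromJson(saved, { tokenizer: myTokenizer })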

Test Suite

The library includes a comprehensive test suite with 121 tests (powered by Vitest):

  Unit tests (82)        - Individual method correctness, edge cases,
                           parameter validation, numerical stability

  Integration tests (7)  - Feature combinations: serialize/restore pipelines,
                           learn/unlearn/relearn cycles, preprocessor
                           consistency, method chaining workflows

  E2E tests (20)         - Real-world scenarios: spam detection, sentiment
                           analysis, multi-category topic classification,
                           incremental learning, mistake correction,
                           imbalanced dataset handling

  Dist tests (12)        - Verify compiled output: CJS require, ESM import,
                           named exports, type declarations, round-trips

Run with:

npm test

Changelog

1.0.0

TypeScript rewrite:

  • Full TypeScript source with exported types (NaivebayesOptions, CategorizeResult, Likelihood, InfluentialToken, CategoryStats, BatchItem)
  • Dual CJS/ESM output via tsup: require() and import both work
  • Type declarations (.d.ts) included for TypeScript consumers
  • ES6 class-based implementation (same API, better types)

Modern tooling:

  • Build: tsup (esbuild-based, fast)
  • Test: Vitest (replaces Mocha)
  • CI: Node 18/20/22 with typecheck + build + test steps

Breaking changes:

  • Minimum Node version raised to 18.0.0 (14 and 16 are EOL)
  • Named ESM imports available: import { Naivebayes, fromJson } from 'classificator'

0.5.0

New features:

  • tokenPreprocessor option for stopword removal, stemming, and custom token transforms
  • categorizeWithConfidence(text, threshold) for rejecting low-confidence predictions
  • topInfluentialTokens(text, n) for explainable classification
  • getCategories(), categorizeTopN(), learnBatch(), reset(), getCategoryStats()
  • Input validation on all public methods (throws TypeError for non-string inputs)

Bug fixes:

  • Fixed alpha: 0 being silently overridden to 1
  • Fixed fromJson(null) crash
  • Fixed unlearn() not cleaning up categories when last document is removed
  • Fixed unlearn() crash on non-existent category
  • Fixed categorize() crash on empty classifier (now returns predictedCategory: null)
  • Fixed default tokenizer returning empty tokens for empty strings
  • Fixed removeCategory() not guarding against negative vocabulary counts
  • Fixed wordCount going negative in unlearn() edge cases
  • Fixed logsumexp numerical instability (now uses max-subtraction trick)
  • Fixed fromJson() losing runtime options after state restoration
  • Fixed error message typo and inconsistent capitalization

Improvements:

  • Numerically stable logsumexp prevents underflow on large documents
  • Tokenizer and tokenPreprocessor validation at construction time
  • getCategoryStats() now includes wordCount in _total
  • GitHub Actions CI for Node 14/16/18/20
  • Comprehensive test suite (121 tests: unit + integration + E2E + dist)
  • Improved JSDoc and README documentation with diagrams

0.4.0

  • Allow custom tokenizer to be passed to fromJson()

0.3.4

  • Initial tracked version
