
Add streaming/incremental training support #69

@cardmagic

Description


Summary

The current implementation keeps the classifier's entire learned state in memory and offers no streaming or batched training API, so memory use grows with the size of the corpus. For large corpora, this is impractical.

Current Limitation

# The classifier's entire learned state lives in one in-memory object
classifier = Classifier::Bayes.new('Spam', 'Ham')
File.foreach('huge_corpus.txt') do |line|
  classifier.train_spam(line)  # word counts grow without bound as the corpus grows
end

Proposed Enhancement

1. Streaming training from IO

classifier.train_from_stream(:spam, File.open('spam_corpus.txt'))
classifier.train_from_stream(:ham, File.open('ham_corpus.txt'))
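
A hedged sketch of how this could be layered on the gem today (train_from_stream is the proposed name, not an existing method; it assumes the gem's existing Classifier::Bayes#train(category, text) entry point):

require 'classifier'

class Classifier::Bayes
  # Hypothetical helper (proposed API, not yet in the gem): read the IO lazily,
  # one line at a time, so the raw corpus text is never buffered in memory.
  # The learned word counts still live in memory and grow with the vocabulary.
  def train_from_stream(category, io)
    io.each_line { |line| train(category, line) }
    self
  end
end

Line-at-a-time is just the simplest chunking; a paragraph or fixed-size-read strategy could be offered as an option.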

2. Batch training with memory limits

classifier.train_batch(:spam, documents, batch_size: 1000) do |progress|
  puts "Trained #{progress.completed}/#{progress.total}"
end
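
One hedged way the internals could look (Progress is an illustrative Struct, not an existing class; documents is assumed to be an Array or other sized Enumerable of strings):

require 'classifier'

Progress = Struct.new(:completed, :total)

class Classifier::Bayes
  # Hypothetical helper (proposed API, not yet in the gem): train in fixed-size
  # slices so only batch_size documents are touched per pass, and yield a
  # progress snapshot after each slice.
  def train_batch(category, documents, batch_size: 1000)
    total = documents.respond_to?(:size) ? documents.size : nil
    completed = 0
    documents.each_slice(batch_size) do |slice|
      slice.each { |doc| train(category, doc) }
      completed += slice.size
      yield Progress.new(completed, total) if block_given?
    end
    self
  end
end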

3. Serialization checkpoints

# Save partial training state
classifier.save_checkpoint('classifier_50pct.dump')

# Resume training later
classifier = Classifier::Bayes.load_checkpoint('classifier_50pct.dump')
classifier.train_from_stream(:ham, remaining_docs)
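
A hedged sketch of the checkpoint pair (proposed names, not an existing API): the Bayes model is plain Ruby data, so Marshal can dump and reload the whole classifier mid-training.

require 'classifier'

class Classifier::Bayes
  # Hypothetical checkpoint helpers (proposed API, not yet in the gem).
  # Marshal is compact and fast, but Ruby-version-sensitive and unsafe for
  # untrusted input, so only load checkpoints you created yourself.
  def save_checkpoint(path)
    File.binwrite(path, Marshal.dump(self))
  end

  def self.load_checkpoint(path)
    Marshal.load(File.binread(path))
  end
end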

4. For LSI: Incremental SVD

Consider implementing incremental/online SVD algorithms that can update the decomposition without full recomputation (the basic rank-1 column update is sketched after the list below):

  • Brand's incremental SVD
  • Streaming PCA approaches
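
For orientation, here is the rank-1 column update at the core of Brand-style incremental SVD, in generic notation (nothing below refers to the gem's current LSI internals). With A ≈ U Σ V^T the current truncated decomposition of the term-document matrix and c a new document's term vector, only a small (r+1)×(r+1) SVD has to be computed:

\begin{aligned}
  l &= U^{\top} c, \qquad h = c - U l, \qquad k = \lVert h \rVert, \\
  K &= \begin{pmatrix} \Sigma & l \\ 0 & k \end{pmatrix} = U' \Sigma' V'^{\top}
       \quad \text{(small } (r+1)\times(r+1) \text{ SVD)}, \\
  [\, A \;\; c \,] &\approx \bigl( [\, U \;\; h/k \,]\, U' \bigr)\, \Sigma'\,
       \left( \begin{pmatrix} V & 0 \\ 0 & 1 \end{pmatrix} V' \right)^{\top}
\end{aligned}

Updating U, Σ, and V is then just a few small matrix multiplications, so documents can be folded in one at a time without refactoring the whole term-document matrix.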

Use Cases

  • Training on multi-GB text corpora
  • Continuous learning systems
  • Memory-constrained environments

Impact

Severity: Low - enhancement for specific use cases, not a bug

Labels

enhancement (New feature or request), priority: low (Low priority - nice to have)
