Summary
The current implementation requires loading all training data into memory before classification. For large corpora, this is impractical.
Current Limitation
```ruby
# Must load everything into memory
classifier = Classifier::Bayes.new('Spam', 'Ham')
File.foreach('huge_corpus.txt') do |line|
  classifier.train_spam(line) # Memory grows unbounded
end
```

Proposed Enhancement
1. Streaming training from IO
```ruby
classifier.train_from_stream(:spam, File.open('spam_corpus.txt'))
classifier.train_from_stream(:ham, File.open('ham_corpus.txt'))
```

2. Batch training with memory limits
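A sketch of how batched training with progress reporting might look. The `train_batch` name, the `batch_size:` keyword, and the `Progress` struct with `completed`/`total` are all taken from this issue's proposed interface, not from an existing API:

```ruby
# Hypothetical sketch: process documents in fixed-size slices, yielding
# a progress report after each batch so callers can log or checkpoint.
Progress = Struct.new(:completed, :total)

module BatchTraining
  def train_batch(category, documents, batch_size: 1000)
    total = documents.size
    completed = 0
    documents.each_slice(batch_size) do |batch|
      batch.each { |doc| train(category, doc) }
      completed += batch.size
      yield Progress.new(completed, total) if block_given?
    end
    self
  end
end
```

With a lazy enumerable in place of an array, `each_slice` would also keep only one batch in memory at a time.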
```ruby
classifier.train_batch(:spam, documents, batch_size: 1000) do |progress|
  puts "Trained #{progress.completed}/#{progress.total}"
end
```

3. Serialization checkpoints
```ruby
# Save partial training state
classifier.save_checkpoint('classifier_50pct.dump')

# Resume training later
classifier = Classifier::Bayes.load_checkpoint('classifier_50pct.dump')
classifier.train_from_stream(:ham, remaining_docs)
```

4. For LSI: Incremental SVD
Consider implementing incremental/online SVD algorithms that can update the decomposition without full recomputation:
- Brand's incremental SVD
- Streaming PCA approaches
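Short of a full incremental SVD, a cheaper and well-known LSI technique is "fold-in": project a new document into the existing latent space as d̂ = S_k⁻¹ U_kᵀ d, without touching the decomposition. A sketch using the stdlib `Matrix` class (the `u_k`/`s_k` names are assumptions; this approximates rather than updates the SVD, so the basis slowly goes stale and periodic retraining is still needed):

```ruby
require 'matrix'

# Fold-in: given a truncated SVD A ≈ U_k * S_k * V_k^T, map a new
# document's term vector d into the k-dimensional latent space via
# S_k^-1 * U_k^T * d, with no recomputation of the decomposition.
def fold_in(u_k, s_k, doc_vector)
  s_k.inverse * u_k.transpose * doc_vector
end
```

Brand's algorithm would go further and actually update U, S, and V with each new document, at higher implementation cost.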
Use Cases
- Training on multi-GB text corpora
- Continuous learning systems
- Memory-constrained environments
Impact
Severity: Low. This is an enhancement for specific large-scale use cases, not a bug.