Pre-trained machine learning models for the Cadmium NLP library.
This repository contains pre-trained models that can be easily loaded and used with Cadmium classifiers and other NLP components. Models are stored in an efficient binary format (MessagePack) for fast loading and minimal storage.
Each model may include:
- `.model` - Binary MessagePack format (recommended, fastest loading)
- `.model.json` - JSON format (human-readable, fallback)
- `metadata.yml` - Model information, training data stats, accuracy metrics
require "cadmium_classifier"
# Load from binary format (fastest)
bytes = File.read("path/to/model.model")
classifier = Cadmium::Classifier::Bayes.from_msgpack(bytes)
# Or load from JSON
classifier = Cadmium::Classifier::Bayes.from_json(File.read("path/to/model.model.json"))
# Use the classifier
result = classifier.classify("Your text here")- sentiment_twitter - English sentiment classification (positive/negative)
  - Trained on: Sentiment140 Twitter dataset (50K samples, 40K for training)
  - Accuracy: 75.05%
  - Precision: 79.04% (positive), 71.04% (negative)
  - Recall: 73.29% (positive), 77.12% (negative)
  - F1 Score: 76.06% (positive), 73.95% (negative)
  - Categories: 2 (positive, negative)
  - Vocabulary: 59,982 words
  - Training time: 1.12 seconds
  - Source: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip
require "cadmium_classifier"
# Load the model
bytes = File.read("models/sentiment/sentiment_twitter.model")
classifier = Cadmium::Classifier::Bayes.from_msgpack(bytes)
# Classify text
result = classifier.classify("I love this new feature!")
# => {"positive" => 96.67, "negative" => 3.33}
# Get just the top category
category = classifier.classify_category("This is terrible!")
# => "negative"The simplest way to train a new model:
```bash
# Prepare your training data (tab-separated: text<TAB>category)
cat > data/training_data.txt << EOF
I love this product!	positive
This is terrible.	negative
Amazing experience!	positive
Worst service ever	negative
EOF

# Train the model
crystal run src/cadmium_models.cr train data/training_data.txt my_model

# Move generated files to models directory
mv my_model.model my_model.model.json metadata.yml models/<category>/
```

Training data must be tab-separated, with the text content first and the category last:
```
Text content here	category1
Another text sample	category2
Multi-word text here	category1
```
Important: Use tabs (\t) to separate the text from the category, not commas or spaces.
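If you want to sanity-check a data file before training, a short script can catch stray commas or spaces early. The following is an illustrative, stdlib-only sketch (not part of the CLI), and the file path is just an example:

```crystal
# Illustrative sketch: verify that every non-empty line is "text<TAB>category".
# Uses only the Crystal standard library; the path is an example.
path = "data/training_data.txt"
malformed = [] of Int32

File.read_lines(path).each_with_index(1) do |line, number|
  next if line.strip.empty?
  parts = line.split('\t')
  malformed << number if parts.size != 2 || parts.any? { |part| part.strip.empty? }
end

if malformed.empty?
  puts "All lines look tab-separated."
else
  puts "Check these line numbers: #{malformed.join(", ")}"
end
```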
The CLI provides a simple interface for training models:
```bash
crystal run src/cadmium_models.cr train <data_file> <model_name>
```

Features:
- Automatic 80/20 train/test split
- Computes accuracy, precision, recall, F1 scores (see the metrics sketch below)
- Generates confusion matrix
- Auto-creates metadata.yml with all metrics
- Exports both MessagePack and JSON formats
Example:
```bash
crystal run src/cadmium_models.cr train data/spam_data.txt spam_detector
```

Output:
```
🚀 Starting model training...
📚 Total samples: 10000
📚 Training samples: 8000
🧪 Test samples: 2000
⏳ Training...
✅ Training completed in 0.45 seconds
📊 Accuracy: 98.2%
📦 Generated files:
  - spam_detector.model (245 KB)
  - spam_detector.model.json (312 KB)
  - metadata.yml
```
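The accuracy, precision, recall, and F1 values reported by the trainer follow the standard definitions. As a reference, here is a minimal sketch of how per-category metrics can be computed from parallel arrays of actual and predicted labels; it is illustrative only and does not reproduce the CLI's internals:

```crystal
# Illustrative sketch: standard per-category precision/recall/F1 from
# actual vs. predicted labels. Not the CLI's actual implementation.
def report_metrics(actual : Array(String), predicted : Array(String))
  pairs = actual.zip(predicted)
  accuracy = pairs.count { |a, p| a == p } / actual.size.to_f
  puts "accuracy: #{(accuracy * 100).round(2)}%"

  (actual + predicted).uniq.each do |category|
    tp = pairs.count { |a, p| a == category && p == category }
    fp = pairs.count { |a, p| a != category && p == category }
    fn = pairs.count { |a, p| a == category && p != category }
    precision = tp + fp == 0 ? 0.0 : tp / (tp + fp).to_f
    recall = tp + fn == 0 ? 0.0 : tp / (tp + fn).to_f
    f1 = precision + recall == 0 ? 0.0 : 2 * precision * recall / (precision + recall)
    puts "#{category}: precision #{(precision * 100).round(2)}%, recall #{(recall * 100).round(2)}%, F1 #{(f1 * 100).round(2)}%"
  end
end

report_metrics(["positive", "negative", "positive"], ["positive", "negative", "negative"])
```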
After training, verify your model works correctly:
```bash
crystal run src/cadmium_models.cr test models/sentiment/sentiment_twitter.model
```

This will load the model and run sample predictions to verify it's working.
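For a quick manual spot check you can also load the model yourself and compare predictions against a handful of labeled examples. This sketch only uses the `from_msgpack` and `classify_category` calls shown earlier; the sample sentences and expected labels are made up:

```crystal
require "cadmium_classifier"

# Illustrative sketch: spot-check a trained model against a few
# hand-labeled examples. Sample texts below are made up.
bytes = File.read("models/sentiment/sentiment_twitter.model")
classifier = Cadmium::Classifier::Bayes.from_msgpack(bytes)

samples = {
  "I love this new feature!" => "positive",
  "This is terrible!"        => "negative",
}

correct = 0
samples.each do |text, expected|
  predicted = classifier.classify_category(text)
  status = predicted == expected ? "ok" : "MISMATCH"
  puts "#{status}: #{text.inspect} => #{predicted}"
  correct += 1 if predicted == expected
end
puts "#{correct}/#{samples.size} samples matched"
```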
Organize trained models by category:
```
models/
├── sentiment/
│   ├── sentiment_twitter.model
│   ├── sentiment_twitter.model.json
│   └── metadata.yml
├── spam/
│   ├── email_spam.model
│   ├── email_spam.model.json
│   └── metadata.yml
└── language/
    ├── lang_detector.model
    ├── lang_detector.model.json
    └── metadata.yml
```
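Since every model directory ships a `metadata.yml` next to the binary, a consumer can discover available models by walking this tree. The following is an illustrative, stdlib-only sketch; the `accuracy` key is an assumption about the metadata schema, so adjust it to whatever your metadata.yml actually contains:

```crystal
require "yaml"

# Illustrative sketch: list every packaged model and a field from its metadata.
# The "accuracy" key is an assumed example, not a guaranteed schema.
Dir.glob("models/**/*.model").sort.each do |model_path|
  puts model_path
  meta_path = File.join(File.dirname(model_path), "metadata.yml")
  next unless File.exists?(meta_path)

  metadata = YAML.parse(File.read(meta_path))
  if accuracy = metadata["accuracy"]?
    puts "  accuracy: #{accuracy}"
  end
end
```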
- Data Quality - Use clean, labeled data. Remove duplicates and obvious errors.
- Dataset Size - More data is generally better, but 10K-50K samples are often sufficient for good results.
- Balanced Classes - Try to have roughly equal samples per category for best accuracy (a quick balance check is sketched after this list).
- Test Split - Always reserve 20-30% of data for testing to validate performance.
- Metadata - Keep metadata.yml accurate and complete for model discoverability.
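As a concrete illustration of the duplicate and class-balance points above, a small pre-processing pass can report both before training. This is a stdlib-only sketch; the path is an example:

```crystal
require "set"

# Illustrative sketch: report duplicate lines and class balance for a
# tab-separated training file. Stdlib only; the path is an example.
path = "data/training_data.txt"
seen = Set(String).new
duplicates = 0
counts = Hash(String, Int32).new(0)

File.read_lines(path).each do |line|
  line = line.strip
  next if line.empty?
  if seen.includes?(line)
    duplicates += 1
  else
    seen << line
    counts[line.split('\t').last] += 1
  end
end

total = counts.values.sum
puts "Duplicate lines: #{duplicates}"
counts.each do |category, count|
  puts "#{category}: #{count} samples (#{(count * 100.0 / total).round(1)}%)"
end
```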
See CONTRIBUTING.md for guidelines on:
- Submitting models for inclusion
- Model versioning and release process
- Code quality standards
Models are released under the MIT license unless otherwise noted in individual model metadata.