A small news classification project that uses a TF-IDF vectorizer and a trained classifier to predict article labels. This repository contains a Streamlit app for inference, pre-trained model artifacts, a sample dataset of scraped articles, and a notebook for experimentation.
- `streamlitApp.py` — Streamlit front-end to classify news articles using the shipped model artifacts.
- `best_classifier.joblib` — serialized trained classifier (scikit-learn compatible).
- `tfidf_vectorizer.joblib` — serialized TF-IDF vectorizer used to transform article text.
- `label_encoder.joblib` — serialized label encoder used to map predicted classes back to label strings.
- `scraped_articles.csv` — example dataset of collected articles used for experimentation or retraining.
- `Untitled.ipynb` — notebook with experiments, data exploration, and example code (may be a work in progress).
- `requirements.txt` — Python dependencies used by the project.
- Create a virtual environment (recommended) and activate it.
PowerShell (Windows):

```powershell
python -m venv .venv; .\.venv\Scripts\Activate.ps1
```

- Install dependencies:

```powershell
pip install -r requirements.txt
```

- Run the Streamlit app to classify news articles:

```powershell
streamlit run .\streamlitApp.py
```

Open the URL printed by Streamlit (usually http://localhost:8501) in your browser.
- The Streamlit app loads `tfidf_vectorizer.joblib`, `label_encoder.joblib`, and `best_classifier.joblib` to transform input text and produce a predicted label.
- To classify programmatically, a minimal example:
```python
from joblib import load

# Load the shipped artifacts (run from the repository root).
vectorizer = load('tfidf_vectorizer.joblib')
clf = load('best_classifier.joblib')
le = load('label_encoder.joblib')

text = "Your article text here"
X = vectorizer.transform([text])    # vectorize the raw text
pred = clf.predict(X)               # numeric class prediction
label = le.inverse_transform(pred)  # map back to the label string
print(label[0])
```

This repository includes model artifacts but not necessarily the full training scripts. To retrain:
- Inspect `Untitled.ipynb` for data preparation and model training hints.
- Use `scraped_articles.csv` as your dataset, or replace it with a cleaned dataset of your own.
- Typical steps:
  - Load and clean the dataset (text and label columns).
  - Fit a TF-IDF vectorizer (save with joblib).
  - Train a classifier (e.g., LogisticRegression, SGDClassifier, RandomForest) on the vectorized text.
  - Encode labels (LabelEncoder) and save all artifacts.
When retraining, preserve the same preprocessing steps used by `streamlitApp.py` so the new vectorizer and classifier remain compatible with the app.
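The typical retraining steps above can be sketched end to end. The tiny inline dataset, the category names, and the `LogisticRegression` choice below are illustrative assumptions, not the pipeline actually used to produce the shipped artifacts; in practice you would load `scraped_articles.csv` instead.

```python
from joblib import dump
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Tiny inline dataset so the sketch runs standalone; replace with the
# text/label columns of your real dataset.
texts = [
    "Stocks rallied after the central bank held interest rates steady",
    "The quarterback threw three touchdowns in the season opener",
    "New smartphone chips promise faster on-device machine learning",
    "Shares fell as inflation data came in above forecasts",
    "The striker scored twice to send her team to the final",
    "Researchers released an open-source large language model",
]
labels = ["business", "sports", "tech", "business", "sports", "tech"]

# Fit the TF-IDF vectorizer and transform the training text.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Encode string labels as integers for the classifier.
le = LabelEncoder()
y = le.fit_transform(labels)

# Train a simple classifier on the vectorized text.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Save the three artifacts the Streamlit app expects.
dump(vectorizer, "tfidf_vectorizer.joblib")
dump(clf, "best_classifier.joblib")
dump(le, "label_encoder.joblib")
```

Saving all three objects together is what keeps inference consistent: the app must transform new text with the exact vectorizer the classifier was trained on.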
- `best_classifier.joblib` — classifier used by the app.
- `tfidf_vectorizer.joblib` — vectorizer used to transform raw article text.
- `label_encoder.joblib` — maps numeric predictions to human-readable labels.
- `scraped_articles.csv` — sample data; useful for testing or retraining.
- `Untitled.ipynb` — exploratory notebook; check it for preprocessing and model examples.
- `requirements.txt` — required packages (install with pip).
There are no automated tests included. If you add training code or scripts, consider adding unit tests to validate:
- Preprocessing functions (cleaning, tokenization)
- Vectorizer serialization/deserialization
- Model prediction shape and label mapping
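Tests for the last two points could look like the sketch below. The function names are illustrative and pytest is assumed as the runner; the artifacts themselves are not loaded here, so the sketch runs without the repository's `.joblib` files.

```python
# Sketch of unit tests for serialization and label mapping; the
# fitted objects here are throwaway fixtures, not the shipped models.
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder


def test_vectorizer_roundtrip():
    # Serialization should preserve the fitted vocabulary.
    vec = TfidfVectorizer().fit(["a sample document", "another sample"])
    dump(vec, "vec_roundtrip_test.joblib")
    restored = load("vec_roundtrip_test.joblib")
    assert restored.vocabulary_ == vec.vocabulary_


def test_label_mapping_roundtrip():
    # Encoding then decoding should return the original label strings.
    le = LabelEncoder().fit(["business", "sports", "tech"])
    encoded = le.transform(["sports", "tech"])
    assert list(le.inverse_transform(encoded)) == ["sports", "tech"]


def test_prediction_input_shape():
    # One input document should yield exactly one feature row.
    vec = TfidfVectorizer().fit(["business news", "sports news"])
    X = vec.transform(["a single article"])
    assert X.shape[0] == 1
```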
- Missing package errors: run `pip install -r requirements.txt`.
- Streamlit cannot find the model files: run the app from the repository root where the `.joblib` files are located, or update the paths in `streamlitApp.py`.
- Different label mappings: if predictions look wrong after retraining, verify that `label_encoder.joblib` matches your training labels.
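For the model-file issue, one option is to resolve artifact paths relative to the script rather than the current working directory. The `load_artifacts` helper below is a sketch of that idea, not code currently in `streamlitApp.py`:

```python
from pathlib import Path
from joblib import load

# Resolve artifact paths relative to this file, so `streamlit run`
# works regardless of the current working directory.
BASE_DIR = Path(__file__).resolve().parent
ARTIFACTS = ("tfidf_vectorizer.joblib", "best_classifier.joblib", "label_encoder.joblib")


def load_artifacts(base_dir: Path = BASE_DIR):
    # Report all missing files at once instead of failing on the first load.
    missing = [name for name in ARTIFACTS if not (base_dir / name).exists()]
    if missing:
        raise FileNotFoundError(f"Missing model artifacts in {base_dir}: {missing}")
    return tuple(load(base_dir / name) for name in ARTIFACTS)
```

A clear error listing every missing file is easier to act on than the bare `FileNotFoundError` joblib raises for the first absent artifact.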