A small news classification project that uses a TF-IDF vectorizer and a trained classifier to predict article labels. This repository contains a Streamlit app for inference, pre-trained model artifacts, a sample dataset of scraped articles, and a notebook for experimentation.
- `streamlitApp.py` — Streamlit front-end to classify news articles using the shipped model artifacts.
- `best_classifier.joblib` — serialized trained classifier (scikit-learn compatible).
- `tfidf_vectorizer.joblib` — serialized TF-IDF vectorizer used to transform article text.
- `label_encoder.joblib` — serialized label encoder used to map predicted classes back to label strings.
- `scraped_articles.csv` — example dataset of collected articles used for experimentation or retraining.
- `Untitled.ipynb` — notebook with experiments, data exploration, and example code (may be a work in progress).
- `requirements.txt` — Python dependencies used by the project.
- Create a virtual environment (recommended) and activate it.
PowerShell (Windows):

```powershell
python -m venv .venv; .\.venv\Scripts\Activate.ps1
```

- Install dependencies:

```powershell
pip install -r requirements.txt
```

- Run the Streamlit app to classify news articles:

```powershell
streamlit run .\streamlitApp.py
```

Open the URL printed by Streamlit (usually http://localhost:8501) in your browser.
- The Streamlit app loads `tfidf_vectorizer.joblib`, `label_encoder.joblib`, and `best_classifier.joblib` to transform input text and produce a predicted label.
- To classify programmatically, a minimal example:
```python
from joblib import load

# Load the shipped artifacts (run from the repository root).
vectorizer = load('tfidf_vectorizer.joblib')
clf = load('best_classifier.joblib')
le = load('label_encoder.joblib')

text = "Your article text here"
X = vectorizer.transform([text])    # vectorize the raw text
pred = clf.predict(X)               # numeric class prediction
label = le.inverse_transform(pred)  # map back to the label string
print(label[0])
```

This repository includes model artifacts but not necessarily the full training scripts. To retrain:
- Inspect `Untitled.ipynb` for data preparation and model training hints.
- Use `scraped_articles.csv` as your dataset, or replace it with a cleaned dataset of your own.
- Typical steps:
  - Load and clean the dataset (text and label columns).
  - Fit a TF-IDF vectorizer (save with joblib).
  - Train a classifier (e.g., LogisticRegression, SGDClassifier, RandomForest) on the vectorized text.
  - Encode labels (LabelEncoder) and save all artifacts.
When retraining, preserve the same preprocessing steps used by `streamlitApp.py` so the new vectorizer and classifier remain compatible with the app.
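The typical retraining steps above can be sketched end to end. The tiny inline dataset, the category names, and the `LogisticRegression` choice below are illustrative assumptions, not the pipeline actually used to produce the shipped artifacts; in practice you would load `scraped_articles.csv` instead.

```python
from joblib import dump
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Tiny inline dataset so the sketch runs standalone; replace with the
# text/label columns of your real dataset.
texts = [
    "Stocks rallied after the central bank held interest rates steady",
    "The quarterback threw three touchdowns in the season opener",
    "New smartphone chips promise faster on-device machine learning",
    "Shares fell as inflation data came in above forecasts",
    "The striker scored twice to send her team to the final",
    "Researchers released an open-source large language model",
]
labels = ["business", "sports", "tech", "business", "sports", "tech"]

# Fit the TF-IDF vectorizer and transform the training text.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Encode string labels as integers for the classifier.
le = LabelEncoder()
y = le.fit_transform(labels)

# Train a simple classifier on the vectorized text.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Save the three artifacts the Streamlit app expects.
dump(vectorizer, "tfidf_vectorizer.joblib")
dump(clf, "best_classifier.joblib")
dump(le, "label_encoder.joblib")
```

Saving all three objects together is what keeps inference consistent: the app must transform new text with the exact vectorizer the classifier was trained on.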
- `best_classifier.joblib` — classifier used by the app.
- `tfidf_vectorizer.joblib` — vectorizer used to transform raw article text.
- `label_encoder.joblib` — maps numeric predictions to human-readable labels.
- `scraped_articles.csv` — sample data; useful for testing or retraining.
- `Untitled.ipynb` — exploratory notebook; check it for preprocessing and model examples.
- `requirements.txt` — required packages (install with pip).
There are no automated tests included. If you add training code or scripts, consider adding unit tests to validate:
- Preprocessing functions (cleaning, tokenization)
- Vectorizer serialization/deserialization
- Model prediction shape and label mapping
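Tests for the last two points could look like the sketch below. The function names are illustrative and pytest is assumed as the runner; the artifacts themselves are not loaded here, so the sketch runs without the repository's `.joblib` files.

```python
# Sketch of unit tests for serialization and label mapping; the
# fitted objects here are throwaway fixtures, not the shipped models.
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder


def test_vectorizer_roundtrip():
    # Serialization should preserve the fitted vocabulary.
    vec = TfidfVectorizer().fit(["a sample document", "another sample"])
    dump(vec, "vec_roundtrip_test.joblib")
    restored = load("vec_roundtrip_test.joblib")
    assert restored.vocabulary_ == vec.vocabulary_


def test_label_mapping_roundtrip():
    # Encoding then decoding should return the original label strings.
    le = LabelEncoder().fit(["business", "sports", "tech"])
    encoded = le.transform(["sports", "tech"])
    assert list(le.inverse_transform(encoded)) == ["sports", "tech"]


def test_prediction_input_shape():
    # One input document should yield exactly one feature row.
    vec = TfidfVectorizer().fit(["business news", "sports news"])
    X = vec.transform(["a single article"])
    assert X.shape[0] == 1
```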
- Missing package errors: run `pip install -r requirements.txt`.
- Streamlit cannot find the model files: run the app from the repository root where the `.joblib` files are located, or update the paths in `streamlitApp.py`.
- Different label mappings: if predictions look wrong after retraining, verify that `label_encoder.joblib` matches your training labels.
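For the model-file issue, one option is to resolve artifact paths relative to the script rather than the current working directory. The `load_artifacts` helper below is a sketch of that idea, not code currently in `streamlitApp.py`:

```python
from pathlib import Path
from joblib import load

# Resolve artifact paths relative to this file, so `streamlit run`
# works regardless of the current working directory.
BASE_DIR = Path(__file__).resolve().parent
ARTIFACTS = ("tfidf_vectorizer.joblib", "best_classifier.joblib", "label_encoder.joblib")


def load_artifacts(base_dir: Path = BASE_DIR):
    # Report all missing files at once instead of failing on the first load.
    missing = [name for name in ARTIFACTS if not (base_dir / name).exists()]
    if missing:
        raise FileNotFoundError(f"Missing model artifacts in {base_dir}: {missing}")
    return tuple(load(base_dir / name) for name in ARTIFACTS)
```

A clear error listing every missing file is easier to act on than the bare `FileNotFoundError` joblib raises for the first absent artifact.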