
NewsClassification

A small news classification project that uses a TF-IDF vectorizer and a trained classifier to predict article labels. This repository contains a Streamlit app for inference, pre-trained model artifacts, a sample dataset of scraped articles, and a notebook for experimentation.

Contents

  • streamlitApp.py — Streamlit front-end to classify news articles using the shipped model artifacts.
  • best_classifier.joblib — Serialized trained classifier (scikit-learn compatible).
  • tfidf_vectorizer.joblib — Serialized TF-IDF vectorizer used to transform article text.
  • label_encoder.joblib — Serialized label encoder used to map predicted classes back to label strings.
  • scraped_articles.csv — Example dataset of collected articles used for experimentation or retraining.
  • Untitled.ipynb — Notebook with experiments, data exploration, or example code (may be a work-in-progress).
  • requirements.txt — Python dependencies used by the project.

Quick start

  1. Create a virtual environment (recommended) and activate it.

PowerShell (Windows):

python -m venv .venv; .\.venv\Scripts\Activate.ps1

  2. Install dependencies:

pip install -r requirements.txt

  3. Run the Streamlit app to classify news articles:

streamlit run .\streamlitApp.py

Open the URL printed by Streamlit (usually http://localhost:8501) in your browser.

Usage

  • The Streamlit app loads the tfidf_vectorizer.joblib, label_encoder.joblib, and best_classifier.joblib files to transform input text and produce a predicted label.
  • If you want to classify programmatically, a minimal example:
from joblib import load

# Load the serialized artifacts shipped with the repository
vectorizer = load('tfidf_vectorizer.joblib')
clf = load('best_classifier.joblib')
le = load('label_encoder.joblib')

# Vectorize the raw text, predict, and map the numeric class back to its label string
text = "Your article text here"
X = vectorizer.transform([text])
pred = clf.predict(X)
label = le.inverse_transform(pred)
print(label[0])

Retraining (notes)

This repository includes model artifacts but not necessarily the full training scripts. To retrain:

  • Inspect Untitled.ipynb for data preparation and model training hints.
  • Use scraped_articles.csv as your dataset or replace it with a cleaned dataset of your own.
  • Typical steps:
    • Load and clean dataset (text, label columns).
    • Fit a TF-IDF vectorizer (save with joblib).
    • Train a classifier (e.g., LogisticRegression, SGDClassifier, RandomForest) on vectorized text.
    • Encode labels (LabelEncoder) and save artifacts.

When retraining, ensure you preserve the same preprocessing steps used by streamlitApp.py so the vectorizer and classifier are compatible.
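A minimal retraining sketch following those steps, assuming scraped_articles.csv has text and label columns (hypothetical column names; adjust to the actual CSV) and using LogisticRegression as one possible classifier:

import pandas as pd
from joblib import dump
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Assumed column names; adjust to match scraped_articles.csv
df = pd.read_csv('scraped_articles.csv').dropna(subset=['text', 'label'])

# Encode string labels as integers
le = LabelEncoder()
y = le.fit_transform(df['label'])

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], y, test_size=0.2, random_state=42, stratify=y)

# Fit the vectorizer on training text only, then transform both splits
vectorizer = TfidfVectorizer(stop_words='english', max_features=50000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)
print('Test accuracy:', clf.score(X_test_vec, y_test))

# Save under the same filenames streamlitApp.py expects
dump(vectorizer, 'tfidf_vectorizer.joblib')
dump(clf, 'best_classifier.joblib')
dump(le, 'label_encoder.joblib')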

Files and purpose

  • best_classifier.joblib — classifier used by the app.
  • tfidf_vectorizer.joblib — vectorizer used to transform raw article text.
  • label_encoder.joblib — maps numeric predictions to human-readable labels.
  • scraped_articles.csv — sample data; useful for testing or retraining.
  • Untitled.ipynb — exploratory notebook; check for preprocessing and model examples.
  • requirements.txt — packages required (install with pip).

Tests and validation

There are no automated tests included. If you add training code or scripts, consider adding unit tests (see the sketch after this list) to validate:

  • Preprocessing functions (cleaning, tokenization)
  • Vectorizer serialization/deserialization
  • Model prediction shape and label mapping
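For example, a minimal pytest-style sketch of the last two checks (a hypothetical test; assumes the .joblib artifacts are present in the working directory):

from joblib import load

def test_prediction_shape_and_label_mapping():
    vectorizer = load('tfidf_vectorizer.joblib')
    clf = load('best_classifier.joblib')
    le = load('label_encoder.joblib')

    texts = ["Stocks rallied after the earnings report.",
             "The team won the championship game."]
    X = vectorizer.transform(texts)
    preds = clf.predict(X)

    # One prediction per input, and every class id maps back to a known label
    assert preds.shape[0] == len(texts)
    labels = le.inverse_transform(preds)
    assert all(label in set(le.classes_) for label in labels)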

Troubleshooting

  • Missing package errors: run pip install -r requirements.txt.
  • Streamlit cannot find model files: ensure you run the app from the repository root where the .joblib files are located, or update paths in streamlitApp.py (one approach is sketched after this list).
  • Different label mappings: if predictions look wrong after retraining, verify label_encoder.joblib matches your training labels.
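
One way to make the artifact paths independent of the working directory, as a sketch assuming the .joblib files sit next to streamlitApp.py:

from pathlib import Path
from joblib import load

# Resolve artifact paths relative to this script, not the current working directory
BASE_DIR = Path(__file__).resolve().parent
vectorizer = load(BASE_DIR / 'tfidf_vectorizer.joblib')
clf = load(BASE_DIR / 'best_classifier.joblib')
le = load(BASE_DIR / 'label_encoder.joblib')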
