## Project Overview
- Purpose: Binary sentiment classification of tweets (positive / negative) using DistilBERT-based models trained with TensorFlow.
- Components: a Python `backend` providing a FastAPI inference service and a `frontend` (Next.js) UI. Training artifacts, notebooks, and evaluation figures live under `backend/Notebooks` and `backend/Report`.
## Repository Layout

- `backend/`: FastAPI inference app, model checkpoints, Dockerfile, and training notebooks.
  - `entry_main.py`: FastAPI app that loads the saved DistilBERT checkpoint and exposes a `/tweet` POST endpoint.
  - `requirements.txt`: Python dependency list for the backend runtime.
  - `Dockerfile`, `start.sh`: containerization and startup script for the backend service.
  - `Models/`: saved model checkpoints (e.g. `chunk_model1`, `chunk_model2`).
  - `Data/`: raw and processed datasets (chunked CSVs for training).
  - `Notebooks/`: training and preprocessing notebooks (contains `DistillBert_Model.ipynb`).
  - `Report/`: generated training report and evaluation figures.
- `frontend/`: Next.js application providing the UI (pages in `app/`, static assets in `public/`).
## Quick Start — Local (backend only)

- Clone the repository:

  ```bash
  git clone <repo-url> twitter-sentiment
  cd twitter-sentiment/backend
  ```

- Build the Docker image (recommended for parity with deployment):

  ```bash
  docker build -t twitter-sentiment-backend:latest .
  ```

- Run the container (mount a local model checkpoint if needed):

  ```bash
  docker run --rm -p 8080:8080 -e PORT=8080 -v "<path>:/app/Models/chunk_model2" twitter-sentiment-backend:latest
  ```

- Test the API:

  ```bash
  curl -X POST "http://localhost:8080/tweet" -H "Content-Type: application/json" -d '{"text":"I love this product!"}'
  ```

## Python Virtualenv (optional, without Docker)

See the virtual-environment steps under Getting Started below.
## Introduction

Social media platforms produce large volumes of user-generated text. Understanding overall public sentiment about products, events, or topics is valuable for analytics, monitoring, and decision making. This project performs binary sentiment classification of Twitter posts (tweets) into positive or negative classes using a DistilBERT-based transformer model fine-tuned with TensorFlow.
Goals:
- Provide a reproducible training pipeline and evaluation artifacts.
- Expose an inference API for integration with a front-end or other services.
- Provide containerized deployment instructions (including Hugging Face Spaces compatibility).
## Repository Structure

- `backend/` — FastAPI inference service, training notebooks, preprocessing utilities, model checkpoints, Dockerfile, and container start script.
- `frontend/` — Next.js application (UI) that can call the backend inference API.
- `backend/Data/` — raw and processed datasets. Processed data is chunked for memory-efficient training.
- `backend/Models/` and `backend/Trained_weights/` — saved model checkpoints and TF/H5 weights.
- `backend/Notebooks/` — Jupyter notebooks used for preprocessing, training, and evaluation.
- `backend/Report/` — generated report and evaluation figures (confusion matrix, PR curve, loss curves).
## Getting Started

- Clone the repo and open the `backend` folder:

  ```bash
  git clone <repo-url>
  cd Twitter-Sentiment/backend
  ```

- (Recommended) Build the Docker image and run the service (ensures environment parity):

  ```bash
  docker build -t twitter-sentiment-backend:latest .
  # If your model checkpoints are stored locally, mount them into the container:
  docker run --rm -p 8080:8080 -e PORT=8080 -v "<path>:/app/Models/chunk_model2" twitter-sentiment-backend:latest
  ```

- Test the inference endpoint:

  ```bash
  curl -X POST "http://localhost:8080/tweet" -H "Content-Type: application/json" -d '{"text":"I love this product!"}'
  ```

- If you prefer not to use Docker, use a virtual environment:

  ```powershell
  python -m venv .venv; .\.venv\Scripts\Activate.ps1
  pip install --upgrade pip
  pip install -r requirements.txt
  uvicorn entry_main:app --host 0.0.0.0 --port 8080
  ```

## Training and Evaluation Workflow

- Install dependencies:

  ```powershell
  # from backend/
  python -m venv .venv; .\.venv\Scripts\Activate.ps1
  pip install --upgrade pip
  pip install -r requirements.txt
  ```

- Prepare data:
  - Raw data: `backend/Data/raw/train.csv`.
  - Processed chunks: `backend/Data/processed/Chunks/chunk_*.csv`. Use the notebook `backend/Notebooks/preprocessing.ipynb` to reproduce the same cleaning/tokenization steps used for training.
- Train a model (notebook):
  - Open `backend/Notebooks/DistillBert_Model.ipynb` and follow the training pipeline. The notebook includes dataset loading (via `Notebooks/utils.py`), model instantiation, training loops, and evaluation.
  - Save checkpoints into `backend/Models/<checkpoint_name>/` using the `transformers` save methods, or export `.h5` weights to `backend/Trained_weights/`.
- Evaluate:
  - The notebook saves evaluation plots to `backend/Report/Figures/` (e.g., PR curve, F1 curve, confusion matrix). Inspect `backend/Report/Model_Training_Report.md` for interpretation and guidance.
- Run inference:
  - The inference FastAPI app is `backend/entry_main.py`. It expects to find model files under `backend/Models/chunk_model2` at startup. The app exposes `POST /tweet`, which accepts JSON with a `text` field and responds with `sentiment` and `confidence`.
## Model Details

- Model: `TFDistilBertForSequenceClassification` fine-tuned for binary classification.
- Tokenizer: `transformers.AutoTokenizer.from_pretrained('distilbert-base-uncased')` (loaded via `Notebooks/utils.py`).
- Input preprocessing: mentions replaced with `@user`, URLs removed, hashtags simplified, and whitespace normalized (see `entry_main.clean_text`).
- Inference: the model returns logits; the code applies `tf.nn.sigmoid` to the logits and thresholds at `0.5` to select `positive` or `negative`. A minimal standalone sketch follows.
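A minimal standalone sketch of this inference path, assuming the local checkpoint path used by `entry_main.py` (the app itself is the source of truth):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('./Models/chunk_model2')

def predict_sentiment(text: str) -> dict:
    # Same tokenizer settings as at training time.
    inputs = tokenizer(text, return_tensors='tf', padding='max_length',
                       truncation=True, max_length=128)
    logits = model(inputs).logits
    prob = float(tf.nn.sigmoid(logits)[0][0])  # single-logit head: P(positive)
    return {"sentiment": "positive" if prob >= 0.5 else "negative",
            "confidence": prob}

print(predict_sentiment("I love this product!"))
```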
## Deploying to Hugging Face Spaces

Two common ways to deploy to Spaces:

- Include model files in the Space repo (only for small models): commit `backend/Models/chunk_model2` into the Space repo and include the `backend/Dockerfile`. Spaces will build the Docker image and run `start.sh`.
- Host model checkpoints on the Hugging Face Hub and load them at container start (recommended for larger models): push your model and tokenizer to the Hub using `model.push_to_hub(...)` and `tokenizer.push_to_hub(...)` from a notebook, then load from the Hub in `entry_main.py` or in a short startup script.
Example to push to and load from the Hub:

```python
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification

# One-time push from a notebook: model.push_to_hub('your-username/twitter-sentiment-distilbert')
# and tokenizer.push_to_hub('your-username/twitter-sentiment-distilbert'); then load at startup:
tokenizer = AutoTokenizer.from_pretrained('your-username/twitter-sentiment-distilbert')
model = TFDistilBertForSequenceClassification.from_pretrained('your-username/twitter-sentiment-distilbert')
```

When targeting GPU Spaces, use a CUDA-enabled TensorFlow wheel and an appropriate base image; modify the Dockerfile to install GPU drivers and the correct TensorFlow package.
## Training a New Model

- Prepare and clean your dataset using `backend/Notebooks/preprocessing.ipynb`.
- Open `backend/Notebooks/DistillBert_Model.ipynb` and set hyperparameters (batch size, epochs, learning rate).
- Train and monitor the training/validation loss and metrics.
- Save the model checkpoint and tokenizer with `model.save_pretrained(...)` and `tokenizer.save_pretrained(...)` into `backend/Models/<name>/`.
- Optionally export `.h5` weights to `backend/Trained_weights/` for storage.
## API Usage

- Example curl request:

  ```bash
  curl -X POST "http://localhost:8080/tweet" -H "Content-Type: application/json" -d '{"text":"This is an awesome feature"}'
  # Example response
  # {"sentiment": "positive", "confidence": 0.936}
  ```

- The `frontend/` Next.js app can call the backend `/tweet` endpoint. Ensure CORS is allowed (the backend already includes permissive CORS middleware in `entry_main.py`). If you host the frontend and backend separately, set `allow_origins` to your frontend host for production.
## Reproducibility Tips

- Save a `training-config.json` with hyperparameters, random seeds, and dataset versions alongside each checkpoint.
- Use `transformers` `push_to_hub` for model artifacts to avoid large repository sizes.
- Add a small smoke-test dataset and a `train_smoke.sh` script to validate environment correctness in CI.
## Troubleshooting

- Model load errors: verify that `Models/chunk_model2` exists, or adjust `entry_main.py` to download from the Hub at startup.
- Slow startup / memory pressure: use smaller models, reduce `max_length` or batch sizes during training, or use GPU-accelerated instances.
- Windows Docker volume mounts: on Windows, use absolute Windows paths and ensure Docker has access to the drive.
## Contributing

- Fork the repo and create feature branches. Open PRs against `main`.
- Add tests for preprocessing and a small unit test for the API endpoint behavior.

## License

- (Add your license or company policy here.)
## Key Files

- `backend/entry_main.py` — inference API.
- `backend/Notebooks/DistillBert_Model.ipynb` — training notebook.
- `backend/Report/Model_Training_Report.md` — analysis and figures.
## Implementation Details

This section documents implementation-level details for the repository. If you plan to modify, reproduce, or debug any part of the project, read this section carefully.

Top-level layout (exact paths referenced below):

- `backend/` — main Python service, training notebooks, data, and model artifacts.
- `frontend/` — Next.js UI that interacts with the backend inference API.
### `backend/entry_main.py`

- Purpose: provide a lightweight HTTP API to classify a single tweet text.
- Key imports and why they matter:
  - `FastAPI` and `CORSMiddleware`: create the web app and configure CORS for browser-based frontends.
  - `pydantic`: enforces request payload structure; prevents malformed requests from reaching inference code.
  - `Notebooks.utils`: central utility module (tokenizer loading, plotting helpers, model loaders). Importing it allows code re-use between the training notebooks and the API.
  - `tensorflow as tf`: used for numerical ops (sigmoid) and for loading TF-based checkpoints.
  - `transformers`: `TFDistilBertForSequenceClassification` is the model class used for inference; `AutoTokenizer` is used by the utils helper.
- App initialization:
  - `app = FastAPI()` creates the ASGI app.
  - `app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_credentials=True, allow_methods=["*"], allow_headers=["*"])` sets permissive CORS for development; change `allow_origins` to a whitelist for production.
- Model & tokenizer loading (at import time):
  - `model = TFDistilBertForSequenceClassification.from_pretrained('./Models/chunk_model2')`
    - This call expects an HF-style directory containing either a TF checkpoint or the serialized weights/config. If files are missing, this raises an error and the process will not start.
    - If your checkpoint was saved as an HDF5 (`.h5`) file only, adapt this section to call `model.load_weights` and provide the path.
  - `tokenizer = utils.load_tokenizer('distilbert-base-uncased')`
    - This uses `Notebooks/utils.load_tokenizer`, which delegates to `AutoTokenizer.from_pretrained`.
    - Replace the argument with your HF Hub repo id if you pushed the tokenizer to the Hub.
- Input validation model:
  - `class TwitterTweet(pydantic.BaseModel): text: str` — FastAPI returns 422 errors automatically if `text` is missing or not a string.
- `clean_text(text)`:
  - Steps:
    - Replace mentions `@username` with `@user` via `re.sub(r'@\S+', '@user', text)` — reduces token sparsity and anonymizes.
    - Remove URLs via `re.sub(r'http\S+|www\S+', '', text)`.
    - Remove `#` from hashtags, preserving the word, via `re.sub(r'#(\w+)', r'\1', text)`.
    - Normalize whitespace with `re.sub(r'\s+', ' ', text).strip()`.
  - Rationale: these steps reduce noisy tokens, shrink the vocabulary, and keep training and inference text cleaning aligned. A sketch assembled from these steps follows.
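A sketch of `clean_text` assembled from exactly the regex steps listed above (the canonical version lives in `entry_main.py`):

```python
import re

def clean_text(text: str) -> str:
    """Normalize a raw tweet the same way as at training time."""
    text = re.sub(r'@\S+', '@user', text)        # anonymize mentions
    text = re.sub(r'http\S+|www\S+', '', text)   # strip URLs
    text = re.sub(r'#(\w+)', r'\1', text)        # '#topic' -> 'topic'
    return re.sub(r'\s+', ' ', text).strip()     # collapse whitespace

assert clean_text("@bob check https://x.co #wow  !") == "@user check wow !"
```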
- `POST /tweet` (detailed request/response flow):
  - Request JSON is validated against the `TwitterTweet` model.
  - Text is cleaned with `clean_text`.
  - Tokenization: `tokenizer(cleaned_text, return_tensors='tf', padding='max_length', truncation=True, max_length=128)`.
    - `max_length=128` is chosen as a balance between context and memory; adjust if needed.
  - Inference: `outputs = model.predict(inputs)`; `logits = outputs.logits`.
    - Note: `model.predict` runs eager inference; for higher throughput, consider `model.call` under `tf.function`, or use batching.
  - Probability: `prob = tf.nn.sigmoid(logits)[0][0].numpy()` — interprets the logits as a single-logit binary output. If your model returns two logits (class scores), use `tf.nn.softmax` and argmax instead.
  - Decision: `sentiment = 'positive' if prob >= 0.5 else 'negative'`, and the response is `{"sentiment": sentiment, "confidence": float(prob)}`. A condensed handler sketch follows.
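A condensed sketch of the handler, reusing the `model`, `tokenizer`, and `clean_text` shown earlier (see `entry_main.py` for the canonical version):

```python
import pydantic
import tensorflow as tf
from fastapi import FastAPI

app = FastAPI()

class TwitterTweet(pydantic.BaseModel):
    text: str  # FastAPI returns a 422 automatically if this is missing

@app.post("/tweet")
def classify_tweet(tweet: TwitterTweet):
    cleaned = clean_text(tweet.text)
    inputs = tokenizer(cleaned, return_tensors='tf', padding='max_length',
                       truncation=True, max_length=128)
    outputs = model.predict(dict(inputs))
    prob = tf.nn.sigmoid(outputs.logits)[0][0].numpy()  # single-logit head
    sentiment = 'positive' if prob >= 0.5 else 'negative'
    return {"sentiment": sentiment, "confidence": float(prob)}
```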
### `backend/Notebooks/utils.py`

- Plotting helpers (for notebooks):
  - `plot_accuracy(train_acc, val_acc)`, `plot_loss(train_loss, val_loss)`, `plot_confusion_matrix(y_true, y_pred, classes=[0,1])`, `plot_precision_recall(y_true, y_scores)`, `plot_f1_score(y_true, y_scores)`, and `plot_dataset_distribution(texts, targets)` wrap matplotlib/seaborn for quick visualization.
- Model loading helpers:
  - `load_bert_model(num_labels)` and `load_distilbert_model(num_labels)` load pre-trained models from the HF model hub and set `num_labels` (useful for training from scratch with custom label counts).
  - `load_bert_checkpoint(checkpoint_path, num_labels)` and `load_distilbert_checkpoint(checkpoint_path, num_labels)` are helpers to load from local checkpoint directories. Pay attention to the `from_pt=True` flag; if you trained in PyTorch and converted to TF, `from_pt=True` helps.
- Tokenizer loader:
  - `load_tokenizer(model_name)` returns `AutoTokenizer.from_pretrained(model_name)`; `model_name` may be a Hub id or a local folder path. A sketch of these helpers follows.
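A plausible shape for these helpers, inferred from the descriptions above — a sketch, not the repository code:

```python
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification

def load_tokenizer(model_name: str):
    # `model_name` may be a Hub id (e.g. 'distilbert-base-uncased') or a local path.
    return AutoTokenizer.from_pretrained(model_name)

def load_distilbert_model(num_labels: int):
    # Fresh pre-trained backbone with a new classification head.
    return TFDistilBertForSequenceClassification.from_pretrained(
        'distilbert-base-uncased', num_labels=num_labels)

def load_distilbert_checkpoint(checkpoint_path: str, num_labels: int,
                               from_pt: bool = False):
    # Set from_pt=True if the checkpoint was trained in PyTorch and converted.
    return TFDistilBertForSequenceClassification.from_pretrained(
        checkpoint_path, num_labels=num_labels, from_pt=from_pt)
```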
- `load_dataframe(file_index)`:
  - Reads a chunked CSV from a hard-coded Colab path, `/content/drive/MyDrive/Colab Notebooks/backend/Data/processed/Chunks/chunk_{file_index}.csv`, and maps labels `4 -> 1`.
  - Important: that path is Colab-specific. For local runs, update the path to `backend/Data/processed/Chunks/chunk_{file_index}.csv` or use a config variable, as in the sketch below.
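A local-friendly sketch of `load_dataframe`; the `text`/`target` column names are assumptions, so check the preprocessing notebook:

```python
import os
import pandas as pd

# Configurable base path instead of the hard-coded Colab path.
DATA_DIR = os.environ.get('DATA_DIR', 'Data/processed/Chunks')

def load_dataframe(file_index: int) -> pd.DataFrame:
    df = pd.read_csv(os.path.join(DATA_DIR, f'chunk_{file_index}.csv'))
    # Sentiment140-style labels: 0 = negative, 4 = positive -> remap to {0, 1}.
    # NOTE: 'target' is an assumed column name; verify against the notebook.
    df['target'] = df['target'].map({0: 0, 4: 1})
    return df
```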
### `backend/requirements.txt`

- `fastapi` — HTTP API framework.
- `uvicorn[standard]` — ASGI server for running FastAPI.
- `pydantic` — data validation for request bodies.
- `tensorflow==2.13.4` — model runtime; pinned to this version in the repo to ensure compatibility with saved checkpoints. If you need GPU support, install a CUDA-enabled TF wheel matching your GPU driver.
- `transformers` — model and tokenizer utilities from HF.
- `scikit-learn` — metrics and helpers used in notebooks (confusion matrix, PR curve).
- `pandas` — CSV reading and dataset handling.
- `matplotlib`, `seaborn` — plotting utilities used by the notebooks.

When editing `requirements.txt`:

- Pin versions for reproducibility (especially `tensorflow` and `transformers`).
- Use `pip freeze > requirements.txt` from a known-good environment after you verify that training/inference works.
### `backend/Dockerfile`

- `FROM python:3.10-slim` — minimal Python base image.
- `ENV PYTHONDONTWRITEBYTECODE=1` and `ENV PYTHONUNBUFFERED=1` — standard env flags to avoid `.pyc` files and force stdout/stderr flushing.
- `ENV PORT=8080` — default served port.
- `RUN apt-get update && apt-get install -y --no-install-recommends build-essential git wget curl ca-certificates libglib2.0-0 libgomp1` — system deps required by some Python packages; `libgomp1` is often required for optimized TF builds.
- `WORKDIR /app` — set the working dir.
- `COPY backend/requirements.txt ./requirements.txt` and `RUN pip install --upgrade pip setuptools wheel && pip install --no-cache-dir -r requirements.txt` — install Python packages.
- `COPY backend/ ./` — copy all backend files into the image.
- `RUN chmod +x ./start.sh || true` — ensures `start.sh` is executable.
- `EXPOSE 8080` — document the port.
- `HEALTHCHECK ... CMD curl -f http://localhost:8080/ || exit 1` — basic health check (requires `curl` to be installed; if not, adjust or remove).
- `CMD ["bash", "start.sh"]` — the default container command runs `start.sh`.

Important notes about the Docker image:

- If you need GPU support, use a CUDA-enabled base image and a GPU TensorFlow wheel; the slim image is CPU-only.
- Model artifacts are copied into the image by `COPY backend/ ./`. Large model files may inflate image size — best practice is to store the model on the HF Hub or mount it at runtime.
### `backend/start.sh`

- The script sets `PORT=${PORT:-8080}`, warns if `./Models/chunk_model2` is missing (a common runtime error), then executes `uvicorn entry_main:app --host 0.0.0.0 --port ${PORT} --workers 1`.
- For production scale, consider `--workers` > 1 (but be careful with TF models and multi-process memory usage). Alternatives: use an ASGI process manager or TensorFlow Serving for scaled inference.
### Model Checkpoint Formats

- HF-style checkpoints: a directory containing `config.json` and TF weights (e.g., `tf_model.h5`) can be loaded with `TFDistilBertForSequenceClassification.from_pretrained(path)`.
- `Trained_weights/*.h5` are Keras HDF5 weight files — to load them, instantiate the model architecture and call `model.load_weights(path)`. A sketch of both paths follows.
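A sketch of both loading paths (file names are illustrative):

```python
from transformers import TFDistilBertForSequenceClassification

# Path 1: HF-style checkpoint directory (config.json + tf_model.h5).
model = TFDistilBertForSequenceClassification.from_pretrained('./Models/chunk_model2')

# Path 2: bare Keras HDF5 weights — rebuild the architecture first, then load.
model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=1)  # num_labels must match training
model.load_weights('./Trained_weights/model_weights.h5')  # illustrative filename
```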
### `backend/Data/`

- `Data/raw/train.csv`: the original dataset (likely labeled with Twitter sentiment classes 0 and 4). Inspect the header and encoding when loading.
- `Data/processed/Chunks/chunk_*.csv`: chunked, preprocessed CSVs created for memory-efficient training. Use the preprocessing notebook to reproduce the chunking.
- Label mapping: `utils.load_dataframe` maps `4 -> 1`, so the final labels are binary `{0, 1}`.
### `backend/Notebooks/DistillBert_Model.ipynb`

Typical steps in the notebook (a hedged training sketch follows this subsection):

- Load the chunked dataset with `utils.load_dataframe` or a custom data loader.
- Create the tokenizer and encode datasets via `tokenizer(..., truncation=True, padding='max_length', max_length=128)`.
- Instantiate the model via `TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=1, from_pt=True)` (or `num_labels=2`), depending on the approach.
- Compile the model with an optimizer (e.g., AdamW via `transformers` or `tf.keras.optimizers.Adam`) and an appropriate loss (`binary_crossentropy` for single-logit or `sparse_categorical_crossentropy` for 2-logit outputs).
- Train with `model.fit(...)` and save the best checkpoints using callbacks.
- Evaluate and create plots saved to `backend/Report/Figures/`.
Important: the notebook contains the canonical hyperparameters — extract them and record in a JSON file for reproducibility.
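A minimal training sketch for the single-logit setup; every hyperparameter below is a placeholder, not the notebook's canonical value:

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=1)

df = load_dataframe(1)  # see the utils sketch above; labels already mapped to {0, 1}
enc = tokenizer(list(df['text']), truncation=True, padding='max_length',
                max_length=128, return_tensors='tf')
y = df['target'].values.astype('float32').reshape(-1, 1)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),    # placeholder
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])  # 0.0 on logits == 0.5 on probs

model.fit(dict(enc), y, validation_split=0.1, batch_size=32, epochs=2,  # placeholders
          callbacks=[tf.keras.callbacks.ModelCheckpoint(
              'checkpoints/best.h5', save_best_only=True, save_weights_only=True)])

model.save_pretrained('Models/my_checkpoint')  # HF-style checkpoint directory
```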
### Frontend Integration

- The Next.js app living under `frontend/` calls `POST /tweet` on the backend to get sentiment predictions. It uses `fetch` or similar methods to call the endpoint and display `sentiment` and `confidence`.
- CORS: the backend currently allows all origins; tighten this in production.
### `backend/Report/`

- `Model_Training_Report.md` contains the training summary, data notes, checkpoint locations, and embedded evaluation figures (PR curve, F1 curve, confusion matrix, loss curves).
- Figures: `Report/Figures/` contains `.jpeg`/`.png` visuals. Use them to select thresholds or to inspect class imbalance and error modes.
### Deployment Checklist

- Decide where to store model weights:
  - Option A: commit small checkpoints to the repo and build Docker images that include the weights. Pros: simple; cons: large repo and large images.
  - Option B (recommended): push the model and tokenizer to the Hugging Face Hub and download them at container startup, or load directly from the Hub in `entry_main.py` with `from_pretrained('username/model')`. A startup-download sketch is shown below.
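A sketch of Option B's startup download using `huggingface_hub` (the repo id is a placeholder; `snapshot_download` caches locally, so restarts are cheap):

```python
import os
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification

# Placeholder repo id; set via env so the image stays generic.
REPO_ID = os.environ.get('MODEL_REPO', 'your-username/twitter-sentiment-distilbert')

# Download (or reuse the local cache) before the app starts serving.
local_dir = snapshot_download(repo_id=REPO_ID)
tokenizer = AutoTokenizer.from_pretrained(local_dir)
model = TFDistilBertForSequenceClassification.from_pretrained(local_dir)
```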
- Build the Docker image (local test):

  ```bash
  cd backend
  docker build -t twitter-sentiment-backend:latest .
  ```

- Run the container and mount models at runtime (if not baked into the image):

  ```bash
  docker run --rm -p 8080:8080 -e PORT=8080 -v "<path>:/app/Models/chunk_model2" twitter-sentiment-backend:latest
  ```

- Validate the endpoint with a curl command (example):

  ```bash
  curl -X POST "http://localhost:8080/tweet" -H "Content-Type: application/json" -d '{"text":"I love this product!"}'
  ```

- Hugging Face Spaces: create a Space with Docker support, push the repo (or a dedicated Space repo), and configure it to build. If checkpoint size prevents a direct push, host the model on the HF Hub and load it on startup.
### Testing, Monitoring, and Scaling

- Unit tests: add tests for `clean_text`, tokenization behavior, and a small integration test for `POST /tweet` using `starlette.testclient.TestClient` (see the sketch below).
- Load testing: run a small load test with `hey` or `wrk` to understand latency under concurrent requests.
- Observability: log predictions (anonymized), confidences, and model latency. Save them to a central logging system.
- Scaling: for heavy traffic, consider model batching or a dedicated inference server (TF Serving or Triton). Alternatively, increase `uvicorn` workers, but watch memory usage.
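A minimal integration-test sketch (assumes `app` and `clean_text` are importable from `entry_main`; the test names are illustrative):

```python
from starlette.testclient import TestClient
from entry_main import app, clean_text

client = TestClient(app)

def test_clean_text_anonymizes_mentions():
    assert clean_text("@alice hi") == "@user hi"

def test_tweet_endpoint_returns_sentiment_and_confidence():
    resp = client.post("/tweet", json={"text": "I love this product!"})
    assert resp.status_code == 200
    body = resp.json()
    assert body["sentiment"] in {"positive", "negative"}
    assert 0.0 <= body["confidence"] <= 1.0
```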
### Recommended Reproducibility Artifacts

- A `training-config.json` containing: dataset version, training/validation split, seed, optimizer and learning-rate schedule, batch size, epochs, `max_length`, and tokenizer name (a writer sketch follows).
- A small subset of processed data (`smoke_dataset.csv`) for CI tests.
- A `requirements.txt` snapshot, and optionally a `pip-tools`/`poetry` spec for deterministic installs.
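A sketch that writes such a file (all values are placeholders to be replaced with the notebook's canonical hyperparameters):

```python
import json

# Placeholder values — replace with the hyperparameters used in the notebook.
training_config = {
    "dataset_version": "chunks-v1",
    "train_val_split": 0.9,
    "seed": 42,
    "optimizer": {"name": "adam", "learning_rate": 2e-5},
    "batch_size": 32,
    "epochs": 2,
    "max_length": 128,
    "tokenizer": "distilbert-base-uncased",
}

with open("Models/training-config.json", "w") as f:
    json.dump(training_config, f, indent=2)
```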
### Common Pitfalls

- Model load errors: ensure the checkpoint folder is present and contains the expected files (`config.json`, plus `tf_model.h5` or `pytorch_model.bin`, depending on format). If you only have an HDF5 `.h5` weights file, call `model.load_weights` after model instantiation instead of `from_pretrained`.
- OOM at inference: lower `max_length`, reduce concurrency, or use a larger machine with a GPU.
- Tokenizer mismatch: ensure the tokenizer used at inference matches the one used during training (vocabulary and pre-tokenization must match).
### Privacy Note

- Anonymize user identifiers (the code replaces `@username` with `@user`). If you log inputs, strip or hash personal data and follow data retention rules.
### Suggested Next Steps

- Run a syntax check across all `backend/*.py` files and fix any reported errors.
- Build and run the Docker image locally and run a smoke test against `/tweet`.
- Extract hyperparameters from the training notebook and generate `backend/Models/training-config.json`.
- Add unit tests and a GitHub Actions CI workflow to run them on pull requests.
End of detailed README.