A scalable, lightweight system that answers questions from the inference corpus, with performance accelerated by lightweight encoder transformers.
The system works as follows:
- The RoBERTa-base SQuAD2 checkpoint (deepset/roberta-base-squad2) from Hugging Face is fine-tuned on Natural Questions (short version) for adaptation to real-world, diverse QA.
- This zero-shot model is then fine-tuned on the NASA SMD training dataset for domain adaptation.
- A fused retriever combines BM25 and SimCSE (FAISS), inspired by RAG.
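For orientation, here is a minimal sketch of how BM25 and dense (SimCSE/FAISS) scores can be fused at retrieval time; the tiny corpus, the random stand-in embeddings, and the alpha weight are illustrative assumptions rather than the exact implementation in this repository.

```python
# Hypothetical sketch of BM25 + dense score fusion (not the repo's exact code).
import numpy as np
import faiss
from rank_bm25 import BM25Okapi

corpus = [
    "The Hubble Space Telescope observes mostly in visible and ultraviolet light.",
    "Mars rovers collect soil samples for onboard analysis.",
]
corpus_emb = np.random.rand(len(corpus), 768).astype("float32")  # stand-in for SimCSE embeddings
faiss.normalize_L2(corpus_emb)

index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product on normalized vectors = cosine
index.add(corpus_emb)
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve(query_text, query_emb, alpha=0.5, k=2):
    """Blend max-normalized BM25 scores with dense cosine scores and return top-k doc indices."""
    sparse = bm25.get_scores(query_text.lower().split())
    sparse = sparse / (sparse.max() + 1e-9)
    q = query_emb.reshape(1, -1).astype("float32")
    faiss.normalize_L2(q)
    dense, ids = index.search(q, len(corpus))
    dense_full = np.zeros(len(corpus), dtype="float32")
    dense_full[ids[0]] = dense[0]
    fused = alpha * sparse + (1 - alpha) * dense_full
    return np.argsort(-fused)[:k]

print(retrieve("mars soil samples", np.random.rand(768)))
```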
Benefits of this approach:
- Fast adaptation to smaller domain datasets while retaining the zero-shot baseline's answering capability.
- Preprocessing steps that turn a noisy, unknown dataset into a quality inference corpus.
- Includes UMAP + unsupervised HDBSCAN clustering, and coarse filtering via local inference with Meta LLaMA 3.1 8B (which can be swapped as needed); see the sketch after this list.
- Deployable backend (Flask API) and frontend (React + Tailwind).
- TensorRT optimization steps for the encoder transformers.
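For the clustering step mentioned above, a minimal sketch of UMAP reduction followed by unsupervised HDBSCAN, assuming precomputed document embeddings and illustrative parameter values:

```python
# Illustrative UMAP + HDBSCAN clustering over document embeddings (parameters are assumptions).
import numpy as np
import umap
import hdbscan

doc_embeddings = np.random.rand(500, 768)  # stand-in for sentence embeddings of the corpus

reducer = umap.UMAP(n_components=10, n_neighbors=15, metric="cosine", random_state=42)
reduced = reducer.fit_transform(doc_embeddings)  # reduce to a space HDBSCAN handles well

clusterer = hdbscan.HDBSCAN(min_cluster_size=20, metric="euclidean")
labels = clusterer.fit_predict(reduced)  # -1 marks noise points to drop or re-inspect

print("clusters found:", labels.max() + 1, "| noise points:", int((labels == -1).sum()))
```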
Refer to the System Architecture section for more details on how inference works.
Developed on Python 3.12.5.
Packages used: PyTorch v2.9 + CUDA 12.9, matplotlib, Hugging Face transformers v4.57.1, Hugging Face datasets, Flask, and Flask-smorest (Swagger UI).
We recommend using a Python virtual environment, which can be created with:
python -m venv dl
Then, on Windows, the virtual environment can be activated and pip upgraded with:
.\dl\Scripts\activate
python.exe -m pip install --upgrade pip
Afterwards, install the packages from requirements.txt:
pip install -r <project-dir>/requirements.txt
Then change into the project directory and install this package in editable mode:
pip install -e .
To download the models and datasets locally, acquire a read access token from your Hugging Face account and set it as an environment variable:
set HF_TOKEN=yourtoken
We uploaded our fine-tuned nq_nasa_v1 model to Hugging Face as quantaRoche/roberta-finetuned-nq-nasa-qa; it is referenced in models/models.json and can be downloaded as shown below.
python pysrc/spacey_dev/setup.py --deploy-only=true
Note: The Meta LLaMA 3.1 8B Instruct model is gated, so prior permission must be requested from the linked source before it can be downloaded.
This step is not required unless you plan to redo the coarse filtering; to skip the download, remove the Meta LLaMA object from models/models.json.
This repository already provides the coarse-filtering results in /reports, which is sufficient for running the preprocessing pipeline with the datasets in this project.
Then run the step below to download the models and datasets; this can take a while depending on network speed.
python pysrc/spacey_dev/setup.py
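If you prefer to fetch the released QA model manually rather than through setup.py, here is a sketch using huggingface_hub (the local target directory is an assumption):

```python
# Manual alternative to setup.py: pull the released QA model with huggingface_hub.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="quantaRoche/roberta-finetuned-nq-nasa-qa",
    token=os.environ.get("HF_TOKEN"),                   # the read token set earlier
    local_dir="models/roberta-finetuned-nq-nasa-qa",    # assumed target path
)
```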
Run the preprocessor to reproduce the data needed for deployment or retraining; LLaMA isn't needed here.
python pysrc/spacey_dev/preprocessor/pipeline.py
Check that inference works and the model returns answers:
python pysrc/spacey_dev/inference/run.py
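run.py is the supported entry point; as an independent sanity check you can also query the released checkpoint directly with a transformers pipeline (the question and context below are made up):

```python
# Quick sanity check with the question-answering pipeline (example question/context are made up).
from transformers import pipeline

qa = pipeline("question-answering", model="quantaRoche/roberta-finetuned-nq-nasa-qa")
result = qa(
    question="What does the Science Mission Directorate study?",
    context="NASA's Science Mission Directorate studies the Earth, the Sun, planets, and the universe.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```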
Create an API_SEC value to limit access to the backend; use any secret generator, or run:
python -c "import secrets; print(secrets.token_hex(32));"
Then copy the value and set the API_SEC environment variable, which is read by the backend server.py:
set API_SEC=secretkey
Also add this key to VITE_API_SEC in /frontend/.env.
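For reference, here is a minimal sketch of how such a shared-secret check can be wired in Flask; the header name and route are illustrative assumptions, not the repository's server.py:

```python
# Illustrative shared-secret check; the real server.py may differ (header name is an assumption).
import os
from flask import Flask, request, abort

app = Flask(__name__)
API_SEC = os.environ.get("API_SEC")

@app.before_request
def check_api_secret():
    # Reject any request that does not carry the shared secret.
    if request.headers.get("X-API-SEC") != API_SEC:
        abort(401)

@app.route("/api/health")
def health():
    return {"status": "healthy"}
```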
Then deploy the backend locally and test the deployment with the Swagger UI at http://127.0.0.1:5000/api/docs:
python pysrc/spacey_api/server.py
Ensure /api/health returns healthy and /api/device returns cuda; if it returns cpu, inference will be slow.
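Besides Swagger UI, the two status endpoints can be smoke-tested from a short script (carrying the secret in an X-API-SEC header mirrors the sketch above and is an assumption):

```python
# Smoke-test the two status endpoints (header name is an assumption).
import os
import requests

headers = {"X-API-SEC": os.environ["API_SEC"]}
print(requests.get("http://127.0.0.1:5000/api/health", headers=headers).json())
print(requests.get("http://127.0.0.1:5000/api/device", headers=headers).json())  # expect "cuda"
```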
We recommend deploying the backend on an AWS G6 instance such as g6f.2xlarge and checking its status over SSH, but it can also be deployed on G4 or earlier GPU architectures.
We also include the TensorRT optimization code, which converts the Torch model to ONNX and then to a TRT engine with output assertions. Feel free to check it out in /pysrc/spacey_dev/optimization/.
The frontend can be deployed anywhere; we deployed it on the Vercel Hobby plan and used their server-side functions as a proxy. Have a look at /frontend/api.
Here is the deployment architecture we followed for the demo:
Below are the steps for Vercel deployment; remember to sign up for an account there first.
Then, locally in the frontend root path, install the packages:
npm install
Then install the Vercel CLI and run:
vercel
This will require account authentication and then deploys a preview version.
Because our backend was deployed locally, we used a cloudflared tunnel v10.0 and a publicly exposed AWS S3 general purpose bucket to set the backend URL dynamically without redeploying the frontend.
To follow the same steps, add a file with this JSON to an AWS general purpose S3 bucket:
{
"BACKEND_BASE":"https://yourbackendurl"
}
Then, in the frontend project, set the S3 JSON file URL as CONFIG_URL in .env. The frontend caches this on initial load and refreshes it every 5 minutes (see /frontend/lib/config.ts).
If you want to try out TensorRT optimization on an NVIDIA GPU, download the SDK from developer.nvidia.com and install its wheel.
For our project we use the default cp312 wheel (Python 3.12):
pip install tensorrt-10.13.3.9-cp312-none-win_amd64.whl
Then run our optimizer code, which produces the ONNX model and the TensorRT engine:
python pysrc/spacey_dev/optimize/roberta.py
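roberta.py is the supported path; for orientation, here is a minimal sketch of the Torch to ONNX step followed by an engine build with trtexec (file names, opset, and flags are assumptions):

```python
# Illustrative Torch -> ONNX export for the QA model (file names and opset are assumptions).
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_id = "quantaRoche/roberta-finetuned-nq-nasa-qa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id).eval()
model.config.return_dict = False  # export plain (start_logits, end_logits) tuples

inputs = tokenizer("What is SMD?", "NASA Science Mission Directorate.", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "roberta_qa.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["start_logits", "end_logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=17,
)
# Then build the TensorRT engine, for example:
#   trtexec --onnx=roberta_qa.onnx --saveEngine=roberta_qa.engine --fp16
```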
Our local configuration:
| Component | Specification |
|---|---|
| OS | Windows 11 26200.6901 |
| CPU | Intel Core i5-12600K |
| RAM | 32GB DDR4 |
| Storage | NVMe SSD |
| GPU | 1x RTX 4080 Super |
| GPU Architecture | Ada Lovelace |
Task: Question Answering
Language: English
Local setup: 1x RTX 4080 Super
Evaluation metric: SQuAD v2
Evaluation dataset: NASA SMD QA validation split
Stage 1 (Natural Questions fine-tuning):
train_batch_size = 16
val_batch_size = 8
n_epochs = 2
base_LM_model = "roberta-base"
max_seq_len = 384
doc_stride = 128
optimizer = adamW
last_layer_learning_rate = 5e-6
qa_head_learning_rate = 3e-5
release_model = "roberta-finetuned-nq"
gradient_checkpointing = True
Stage 2 (NASA SMD domain adaptation):
train_batch_size = 16
val_batch_size = 8
n_epochs = 5
base_LM_model = "roberta-finetuned-nq"
max_seq_len = 384
doc_stride = 128
optimizer = adamW
layer_learning_rate = 1e-6
qa_head_learning_rate = 1e-5
release_model = "roberta-finetuned-nq-nasa"
gradient_checkpointing = True
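The two learning rates in each stage (encoder last layer vs. QA head) map naturally onto AdamW parameter groups; here is a sketch with the stage 1 values, assuming the standard RobertaForQuestionAnswering module names and leaving all other parameters out (e.g. frozen), which is itself an assumption:

```python
# Sketch of the stage 1 differential learning rates as AdamW parameter groups
# (module-name matching assumes the standard RobertaForQuestionAnswering layout).
from torch.optim import AdamW
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

last_layer_params = [p for n, p in model.named_parameters() if "encoder.layer.11." in n]
qa_head_params = [p for n, p in model.named_parameters() if n.startswith("qa_outputs")]

optimizer = AdamW([
    {"params": last_layer_params, "lr": 5e-6},  # last_layer_learning_rate
    {"params": qa_head_params, "lr": 3e-5},     # qa_head_learning_rate
])
```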
"exact": 66.0,
"f1": 79.86357273703948,
"total": 50,
"HasAns_exact": 53.333333333333336,
"HasAns_f1": 76.43928789506579,
"HasAns_total": 30,
"NoAns_exact": 85.0,
"NoAns_f1": 85.0,
"NoAns_total": 20
| Name | Role/Area |
|---|---|
| Matthew | Project Technical Lead |
| Sugam | Data Analyst |
| Yuxuan | NQ Data Preprocessing, Evaluation Metrics |
| Moe | Team Communications |
| Kritika | Benchmark Setup |
- Original RoBERTa paper authors
- Hugging Face deepset/roberta-base-squad2 authors
- Hugging Face cjlovering Natural Questions (short) dataset
- Hugging Face NASA SMD dataset authors
- Kaggle Spacenews dataset and all the amazing authors of space news articles
- Hugging Face Trainer documentation
- Developed as part of the Westcliff MSCS AIT500 course, 2025 Fall 1 Session 2. We thank our professor Desmond for his continuous guidance.