A scalable, lightweight system that answers questions from the inference corpus, with performance accelerated by lightweight encoder transformers.
The system works as follows:
- The RoBERTa-base SQuAD2 checkpoint (deepset/roberta-base-squad2) from Hugging Face is fine-tuned on Natural Questions (short version) for adaptation to real-world, diverse QA.
- This zero-shot model is then fine-tuned on the NASA SMD training dataset for domain adaptation.
- A fused retriever combines BM25 and SimCSE (FAISS), inspired by RAG.
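For orientation, here is a minimal sketch of how BM25 and dense (SimCSE/FAISS) scores can be fused at retrieval time; the tiny corpus, the random stand-in embeddings, and the alpha weight are illustrative assumptions rather than the exact implementation in this repository.

```python
# Hypothetical sketch of BM25 + dense score fusion (not the repo's exact code).
import numpy as np
import faiss
from rank_bm25 import BM25Okapi

corpus = [
    "The Hubble Space Telescope observes mostly in visible and ultraviolet light.",
    "Mars rovers collect soil samples for onboard analysis.",
]
corpus_emb = np.random.rand(len(corpus), 768).astype("float32")  # stand-in for SimCSE embeddings
faiss.normalize_L2(corpus_emb)

index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product on normalized vectors = cosine
index.add(corpus_emb)
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve(query_text, query_emb, alpha=0.5, k=2):
    """Blend max-normalized BM25 scores with dense cosine scores and return top-k doc indices."""
    sparse = bm25.get_scores(query_text.lower().split())
    sparse = sparse / (sparse.max() + 1e-9)
    q = query_emb.reshape(1, -1).astype("float32")
    faiss.normalize_L2(q)
    dense, ids = index.search(q, len(corpus))
    dense_full = np.zeros(len(corpus), dtype="float32")
    dense_full[ids[0]] = dense[0]
    fused = alpha * sparse + (1 - alpha) * dense_full
    return np.argsort(-fused)[:k]

print(retrieve("mars soil samples", np.random.rand(768)))
```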
Benefits of this approach:
- Fast adaptation to smaller domain datasets while retaining the zero-shot baseline's answering capability.
- Preprocessing steps that turn a noisy, unknown dataset into a quality inference corpus.
- Includes UMAP + unsupervised HDBSCAN clustering, and coarse filtering via local inference with Meta LLaMA 3.1 8B (which can be swapped as needed); see the sketch after this list.
- Deployable backend (Flask API) and frontend (React + Tailwind).
- TensorRT optimization steps for the encoder transformers.
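For the clustering step mentioned above, a minimal sketch of UMAP reduction followed by unsupervised HDBSCAN, assuming precomputed document embeddings and illustrative parameter values:

```python
# Illustrative UMAP + HDBSCAN clustering over document embeddings (parameters are assumptions).
import numpy as np
import umap
import hdbscan

doc_embeddings = np.random.rand(500, 768)  # stand-in for sentence embeddings of the corpus

reducer = umap.UMAP(n_components=10, n_neighbors=15, metric="cosine", random_state=42)
reduced = reducer.fit_transform(doc_embeddings)  # reduce to a space HDBSCAN handles well

clusterer = hdbscan.HDBSCAN(min_cluster_size=20, metric="euclidean")
labels = clusterer.fit_predict(reduced)  # -1 marks noise points to drop or re-inspect

print("clusters found:", labels.max() + 1, "| noise points:", int((labels == -1).sum()))
```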
Refer to the System Architecture section for more details on how inference works.
Developed on Python 3.12.5.
Packages used: PyTorch v2.9 + CUDA 12.9, matplotlib, Hugging Face transformers v4.57.1, Hugging Face datasets, Flask, and Flask-smorest (Swagger UI).
We recommend using a Python virtual environment, which can be created with:
python -m venv dl
Then, on Windows, the virtual environment can be activated and pip upgraded with:
.\dl\Scripts\activate
python.exe -m pip install --upgrade pip
Afterwards, install the packages from requirements.txt:
pip install -r <project-dir>/requirements.txt
Then change into the project directory and install this package in editable mode:
pip install -e .
To download the models and datasets locally, acquire a read access token from your Hugging Face account and set it as an environment variable:
set HF_TOKEN=yourtoken
We uploaded our fine-tuned nq_nasa_v1 model to Hugging Face as quantaRoche/roberta-finetuned-nq-nasa-qa; it is referenced in models/models.json and can be downloaded as shown below.
python pysrc/spacey_dev/setup.py --deploy-only=true
Note: The Meta LLaMA 3.1 8B Instruct model is gated, so prior permission must be requested from the linked source before it can be downloaded.
This step is not required unless you plan to redo the coarse filtering; to skip the download, remove the Meta LLaMA object from models/models.json.
This repository already provides the coarse-filtering results in /reports, which is sufficient for running the preprocessing pipeline with the datasets in this project.
Then run the step below to download the models and datasets; this can take a while depending on network speed.
python pysrc/spacey_dev/setup.py
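If you prefer to fetch the released QA model manually rather than through setup.py, here is a sketch using huggingface_hub (the local target directory is an assumption):

```python
# Manual alternative to setup.py: pull the released QA model with huggingface_hub.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="quantaRoche/roberta-finetuned-nq-nasa-qa",
    token=os.environ.get("HF_TOKEN"),                   # the read token set earlier
    local_dir="models/roberta-finetuned-nq-nasa-qa",    # assumed target path
)
```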
Run the preprocessor to reproduce the data needed for deployment or retraining; LLaMA isn't needed here.
python pysrc/spacey_dev/preprocessor/pipeline.py
Check that inference works and the model returns answers:
python pysrc/spacey_dev/inference/run.py
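run.py is the supported entry point; as an independent sanity check you can also query the released checkpoint directly with a transformers pipeline (the question and context below are made up):

```python
# Quick sanity check with the question-answering pipeline (example question/context are made up).
from transformers import pipeline

qa = pipeline("question-answering", model="quantaRoche/roberta-finetuned-nq-nasa-qa")
result = qa(
    question="What does the Science Mission Directorate study?",
    context="NASA's Science Mission Directorate studies the Earth, the Sun, planets, and the universe.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```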
Create an API_SEC value to limit access to the backend; use any secret generator, or run:
python -c "import secrets; print(secrets.token_hex(32));"
Then copy the value and set the API_SEC environment variable, which is read by the backend server.py:
set API_SEC=secretkey
Also add this key to VITE_API_SEC in /frontend/.env.
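For reference, here is a minimal sketch of how such a shared-secret check can be wired in Flask; the header name and route are illustrative assumptions, not the repository's server.py:

```python
# Illustrative shared-secret check; the real server.py may differ (header name is an assumption).
import os
from flask import Flask, request, abort

app = Flask(__name__)
API_SEC = os.environ.get("API_SEC")

@app.before_request
def check_api_secret():
    # Reject any request that does not carry the shared secret.
    if request.headers.get("X-API-SEC") != API_SEC:
        abort(401)

@app.route("/api/health")
def health():
    return {"status": "healthy"}
```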
Then deploy the backend locally and test the deployment with the Swagger UI at http://127.0.0.1:5000/api/docs:
python pysrc/spacey_api/server.py
Ensure /api/health returns healthy and /api/device returns cuda; if it returns cpu, inference will be slow.
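Besides Swagger UI, the two status endpoints can be smoke-tested from a short script (carrying the secret in an X-API-SEC header mirrors the sketch above and is an assumption):

```python
# Smoke-test the two status endpoints (header name is an assumption).
import os
import requests

headers = {"X-API-SEC": os.environ["API_SEC"]}
print(requests.get("http://127.0.0.1:5000/api/health", headers=headers).json())
print(requests.get("http://127.0.0.1:5000/api/device", headers=headers).json())  # expect "cuda"
```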
We recommend deploying the backend on an AWS G6 instance such as g6f.2xlarge and checking its status over SSH, but it can also be deployed on G4 or earlier GPU architectures.
We also include the TensorRT optimization code, which converts the Torch model to ONNX and then to a TRT engine with output assertions. Feel free to check it out in /pysrc/spacey_dev/optimization/.
The frontend can be deployed anywhere; we deployed it on the Vercel Hobby plan and used their server-side functions as a proxy. Have a look at /frontend/api.
Here is the deployment architecture we followed for the demo:
Below are the steps for Vercel deployment; remember to sign up for an account there first.
Then, locally in the frontend root path, install the packages:
npm install
Then install the Vercel CLI and run:
vercel
This will require account authentication and then deploys a preview version.
Because our backend was deployed locally, we used a cloudflared tunnel v10.0 and a publicly exposed AWS S3 general purpose bucket to set the backend URL dynamically without redeploying the frontend.
To follow the same steps, add a file with this JSON to an AWS general purpose S3 bucket:
{
"BACKEND_BASE":"https://yourbackendurl"
}
Then, in the frontend project, set the S3 JSON file URL as CONFIG_URL in .env. The frontend caches this on initial load and refreshes it every 5 minutes (see /frontend/lib/config.ts).
If you want to try out TensorRT optimization on an NVIDIA GPU, download the SDK from developer.nvidia.com and install its wheel.
For our project we use the default cp312 wheel (Python 3.12):
pip install tensorrt-10.13.3.9-cp312-none-win_amd64.whl
Then run our optimizer code, which produces the ONNX model and the TensorRT engine:
python pysrc/spacey_dev/optimize/roberta.py
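roberta.py is the supported path; for orientation, here is a minimal sketch of the Torch to ONNX step followed by an engine build with trtexec (file names, opset, and flags are assumptions):

```python
# Illustrative Torch -> ONNX export for the QA model (file names and opset are assumptions).
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_id = "quantaRoche/roberta-finetuned-nq-nasa-qa"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id).eval()
model.config.return_dict = False  # export plain (start_logits, end_logits) tuples

inputs = tokenizer("What is SMD?", "NASA Science Mission Directorate.", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "roberta_qa.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["start_logits", "end_logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=17,
)
# Then build the TensorRT engine, for example:
#   trtexec --onnx=roberta_qa.onnx --saveEngine=roberta_qa.engine --fp16
```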
Our local configuration:
| Component | Specification |
|---|---|
| OS | Windows 11 26200.6901 |
| CPU | Intel Core i5-12600K |
| RAM | 32GB DDR4 |
| Storage | NVMe SSD |
| GPU | 1x RTX 4080 Super |
| GPU Architecture | Ada Lovelace |
Task: Question Answering
Language: English
Local setup: 1x RTX 4080 Super
Evaluation metric: SQuAD v2
Evaluation dataset: NASA SMD QA validation split
Stage 1 (Natural Questions fine-tuning):
train_batch_size = 16
val_batch_size = 8
n_epochs = 2
base_LM_model = "roberta-base"
max_seq_len = 384
doc_stride = 128
optimizer = adamW
last_layer_learning_rate = 5e-6
qa_head_learning_rate = 3e-5
release_model = "roberta-finetuned-nq"
gradient_checkpointing = True
Stage 2 (NASA SMD domain adaptation):
train_batch_size = 16
val_batch_size = 8
n_epochs = 5
base_LM_model = "roberta-finetuned-nq"
max_seq_len = 384
doc_stride = 128
optimizer = adamW
layer_learning_rate = 1e-6
qa_head_learning_rate = 1e-5
release_model = "roberta-finetuned-nq-nasa"
gradient_checkpointing = True
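The two learning rates in each stage (encoder last layer vs. QA head) map naturally onto AdamW parameter groups; here is a sketch with the stage 1 values, assuming the standard RobertaForQuestionAnswering module names and leaving all other parameters out (e.g. frozen), which is itself an assumption:

```python
# Sketch of the stage 1 differential learning rates as AdamW parameter groups
# (module-name matching assumes the standard RobertaForQuestionAnswering layout).
from torch.optim import AdamW
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")

last_layer_params = [p for n, p in model.named_parameters() if "encoder.layer.11." in n]
qa_head_params = [p for n, p in model.named_parameters() if n.startswith("qa_outputs")]

optimizer = AdamW([
    {"params": last_layer_params, "lr": 5e-6},  # last_layer_learning_rate
    {"params": qa_head_params, "lr": 3e-5},     # qa_head_learning_rate
])
```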
"exact": 66.0,
"f1": 79.86357273703948,
"total": 50,
"HasAns_exact": 53.333333333333336,
"HasAns_f1": 76.43928789506579,
"HasAns_total": 30,
"NoAns_exact": 85.0,
"NoAns_f1": 85.0,
"NoAns_total": 20
| Name | Role/Area |
|---|---|
| Matthew | Project Technical Lead |
| Sugam | Data Analyst |
| Yuxuan | NQ Data Preprocessing, Evaluation Metrics |
| Moe | Team Communications |
| Kritika | Benchmark Setup |
- Original RoBERTa paper authors
- Hugging Face deepset/roberta-base-squad2 authors
- Hugging Face cjlovering Natural Questions (short) dataset
- Hugging Face NASA SMD dataset authors
- Kaggle Spacenews dataset and all the amazing authors of space news articles
- Hugging Face Trainer documentation
- Developed as part of the Westcliff MSCS AIT500 course, 2025 Fall 1 Session 2. We thank our professor Desmond for his continuous guidance.