PubMedSimplify

PubMedSimplify is a modern, full-stack web application that enables users to easily summarize and structure PubMed abstracts using a state-of-the-art deep learning model. The app is designed for medical researchers, clinicians, and students who want to quickly extract key information from scientific literature.

🚀 Project Overview

PubMedSimplify leverages a custom multimodal neural network, built on top of BERT embeddings trained on PubMed sentences, to classify each sentence in a medical abstract into its semantic section (e.g., Background, Methods, Results, Conclusions, Objective). The web interface allows users to paste, upload, or link PubMed abstracts and receive a structured, easy-to-read summary.

✨ Features

Paste, upload, or link PubMed abstracts for instant summarization.
Drag-and-drop support for .txt files.
Structured output: Each sentence is classified into sections (Background, Methods, Results, Conclusions, Objective).
Modern, responsive UI with dark/light mode, animated transitions, and floating cards.
Real-time progress and typing effects for a smooth user experience.
Recent abstracts gallery (optional).
Backend powered by Flask and TensorFlow/Keras.
Advanced data pipelines and training callbacks for robust model performance.
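
To illustrate the structured output, here is a minimal sketch of how per-sentence class probabilities could be grouped into labeled sections. The label order in `SECTION_LABELS`, the helper name `structure_abstract`, and the sample sentences and probabilities are all assumptions made up for illustration; they are not taken from the app's code.

```python
from collections import OrderedDict

# Assumed label order (alphabetical); the app's actual encoding may differ.
SECTION_LABELS = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]


def structure_abstract(sentences, probabilities):
    """Group sentences under the section with the highest predicted probability."""
    sections = OrderedDict()
    for sentence, probs in zip(sentences, probabilities):
        # argmax over the five softmax outputs
        best_index = max(range(len(probs)), key=probs.__getitem__)
        label = SECTION_LABELS[best_index]
        sections.setdefault(label, []).append(sentence)
    return sections


# Made-up sentences and probabilities, for illustration only.
sentences = [
    "Sleep loss is common in shift workers.",
    "We enrolled 120 nurses in a randomized trial.",
    "Sleep quality improved in the intervention group.",
]
probabilities = [
    [0.80, 0.02, 0.05, 0.10, 0.03],
    [0.05, 0.02, 0.85, 0.05, 0.03],
    [0.03, 0.10, 0.05, 0.02, 0.80],
]

structured = structure_abstract(sentences, probabilities)
for label, lines in structured.items():
    print(f"{label}: {' '.join(lines)}")
```

Sections appear in the order they are first predicted, which keeps the output close to the reading order of the original abstract.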

🧠 Model Architecture

The core of PubMedSimplify is a multimodal neural network inspired by the sentence classification models of Dernoncourt et al. (2017), who also released the PubMed 200k RCT dataset. The model combines token, character, and positional embeddings for sequential sentence classification.

Model Components

Token Inputs (BERT Embeddings)
- Input: Raw sentence strings.
- Preprocessing: Custom preprocessing layer.
- Embedding: PubMed-trained BERT model (bert_layer).
- Output: 128-dimensional dense layer.
Character Inputs
- Input: Sentence as a string.
- Vectorization: Character-level vectorizer and embedding.
- Sequence Modeling: Bidirectional LSTM (32 units).
- Output: Character-level sentence representation.
Line Number Inputs
- Input: One-hot encoded line number (shape: 15).
- Output: 32-dimensional dense layer.
Total Lines Inputs
- Input: One-hot encoded total lines (shape: 20).
- Output: 32-dimensional dense layer.
Hybrid Embedding
- Concatenate token and character embeddings.
- Dense layer (256 units, ReLU) + Dropout (0.5).
Tribrid Embedding
- Concatenate hybrid embedding with positional embeddings (line number, total lines).
Output Layer
- Dense layer with 5 units (softmax) for section classification.
Model Assembly
- All inputs and outputs are combined into a single Keras model.
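
The positional inputs listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's preprocessing code: the helper names `one_hot` and `positional_features` are hypothetical, line numbers are assumed to be zero-based, total lines is assumed to be the sentence count, and the space-separated character string is an assumed input format for the character branch.

```python
def one_hot(index, depth):
    """Return a one-hot list of length `depth` (all zeros if index is out of range)."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def positional_features(sentences, line_depth=15, total_depth=20):
    """Build per-sentence inputs: text, characters, and one-hot positions."""
    total = len(sentences)  # assumption: total lines = number of sentences
    features = []
    for line_number, sentence in enumerate(sentences):
        features.append({
            "text": sentence,
            "chars": " ".join(sentence),  # characters separated by spaces
            "line_number": one_hot(line_number, line_depth),
            "total_lines": one_hot(total, total_depth),
        })
    return features


features = positional_features(["A b.", "C d.", "E f."])
print(features[1]["line_number"])  # 1.0 at position 1, zeros elsewhere
```

In this sketch, abstracts longer than the one-hot depth get an all-zero position vector for the overflowing lines. That is a design choice of the example rather than a documented property of the app.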

Model Code (Summary)

```
import tensorflow as tf
from tensorflow.keras import layers

# preprocessing_layer, bert_layer, char_vectorizer, and char_embed are assumed
# to be built beforehand; they are not shown in this summary.

# Token input branch: raw sentence -> BERT pooled output -> 128-d dense
token_inputs = layers.Input(shape=[], dtype="string", name="token_inputs")
preprocessed_inputs = preprocessing_layer(token_inputs)
bert_token_embeddings = bert_layer(preprocessed_inputs)["pooled_output"]
token_outputs = layers.Dense(128, activation="relu")(bert_token_embeddings)
token_model = tf.keras.Model(inputs=token_inputs, outputs=token_outputs)

# Char input branch: characters -> embedding -> bidirectional LSTM
char_inputs = layers.Input(shape=(1,), dtype="string", name="char_inputs")
char_vectors = char_vectorizer(char_inputs)
char_embeddings = char_embed(char_vectors)
char_bi_lstm = layers.Bidirectional(layers.LSTM(32))(char_embeddings)
char_model = tf.keras.Model(inputs=char_inputs, outputs=char_bi_lstm)

# Line number branch: one-hot line number -> 32-d dense
line_number_inputs = layers.Input(shape=(15,), dtype=tf.float32, name="line_number_input")
x = layers.Dense(32, activation="relu")(line_number_inputs)
line_number_model = tf.keras.Model(inputs=line_number_inputs, outputs=x)

# Total lines branch: one-hot total lines -> 32-d dense
total_lines_inputs = layers.Input(shape=(20,), dtype=tf.float32, name="total_lines_input")
y = layers.Dense(32, activation="relu")(total_lines_inputs)
total_lines_model = tf.keras.Model(inputs=total_lines_inputs, outputs=y)

# Combine token and character embeddings (hybrid), then add positions (tribrid)
combined_embeddings = layers.Concatenate(name="token_char_hybrid_embedding")([token_model.output, char_model.output])
z = layers.Dense(256, activation="relu")(combined_embeddings)
z = layers.Dropout(0.5)(z)
z = layers.Concatenate(name="token_char_positional_embedding")([line_number_model.output, total_lines_model.output, z])

# Output: probabilities over the 5 section labels
output_layer = layers.Dense(5, activation="softmax", name="output_layer")(z)

# Final model
model = tf.keras.Model(
    inputs=[line_number_model.input, total_lines_model.input, token_model.input, char_model.input],
    outputs=output_layer
)
```
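
To make the tensor widths in the summary above easier to follow, here is a small arithmetic sketch of the feature sizes at each concatenation. The numbers come from the layer definitions above, and the variable names are illustrative only. It assumes the Keras default for `Bidirectional`, which concatenates the forward and backward LSTM states.

```python
# Feature widths per branch, taken from the layer definitions above.
token_dim = 128            # Dense(128) on top of the BERT pooled output
char_dim = 2 * 32          # Bidirectional LSTM(32): forward + backward states
line_number_dim = 32       # Dense(32) on the one-hot line number
total_lines_dim = 32       # Dense(32) on the one-hot total lines

hybrid_in = token_dim + char_dim                        # input to Dense(256)
hybrid_out = 256                                        # Dense(256) + Dropout
tribrid_dim = line_number_dim + total_lines_dim + hybrid_out

num_classes = 5
hybrid_dense_params = hybrid_in * hybrid_out + hybrid_out   # weights + biases
output_params = tribrid_dim * num_classes + num_classes     # weights + biases

print(hybrid_in)            # 192
print(tribrid_dim)          # 320
print(hybrid_dense_params)  # 49408
print(output_params)        # 1605
```

The two positional branches contribute only 64 of the 320 features reaching the output layer, so most of the capacity still comes from the token and character branches.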

🏋️ Training Details

Dataset: PubMed 200k RCT (about 2.3M sentences from 200k abstracts) [arXiv:1710.06071]; this project uses its 20k subset.
Mixed precision training for speed and efficiency.
Advanced data pipelines for efficient preprocessing and batching.
Callbacks:
- ModelCheckpoint for best model saving.
- EarlyStopping to prevent overfitting.
- ReduceLROnPlateau for dynamic learning rate adjustment.
Achieved ~90% accuracy on the test set.
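
To show what the three callbacks do conceptually, the sketch below replays a list of validation losses through simplified patience logic in plain Python. It is not Keras's implementation and not the project's training code. The function name `simulate_training`, the patience values, the learning-rate factor, and the loss values are hypothetical.

```python
def simulate_training(val_losses, lr=1e-3, stop_patience=3, lr_patience=2, lr_factor=0.1):
    """Replay validation losses and apply simplified callback logic."""
    best_loss, best_epoch = float("inf"), None
    stop_wait = lr_wait = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # ModelCheckpoint-style: remember the best epoch seen so far.
            best_loss, best_epoch = loss, epoch
            stop_wait = lr_wait = 0
            continue
        stop_wait += 1
        lr_wait += 1
        if lr_wait >= lr_patience:
            # ReduceLROnPlateau-style: shrink the learning rate on a plateau.
            lr *= lr_factor
            lr_wait = 0
        if stop_wait >= stop_patience:
            # EarlyStopping-style: give up after too many stale epochs.
            break
    return {
        "best_epoch": best_epoch,
        "best_loss": best_loss,
        "epochs_run": epochs_run,
        "final_lr": lr,
    }


# Made-up validation losses, for illustration only.
result = simulate_training([0.90, 0.70, 0.71, 0.72, 0.73, 0.50])
print(result)
```

With these made-up losses, training stops after the fifth epoch, so the final value of 0.50 is never reached. This is the usual trade-off of a short early-stopping patience.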

📊 Data & Model Sources

Where our data is coming from:
We use the PubMed 200k RCT dataset, specifically its 20k subset (PubMed 20k RCT), a publicly available dataset for sequential sentence classification in medical abstracts. The subset contains roughly 20,000 abstracts of randomized controlled trials, with each sentence labeled as background, objective, methods, results, or conclusions.

Dataset paper: PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts

Where our model is coming from:
Our model architecture is inspired by the work in Neural networks for joint sentence classification in medical paper abstracts (Dernoncourt et al., 2017), which explores deep learning approaches for classifying sentences in medical abstracts.

Model paper: Neural networks for joint sentence classification in medical paper abstracts

Dataset repository:

https://github.com/Franck-Dernoncourt/pubmed-rct
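
As a rough guide to working with the dataset files, the sketch below parses text in the layout used by the repository above, as we understand it: a `###` line with the abstract ID, one `LABEL<TAB>sentence` line per sentence, and a blank line between abstracts. The function name `parse_rct_text` and the sample text are made up for illustration. Check the layout against the downloaded files before relying on it.

```python
def parse_rct_text(raw_text):
    """Parse pubmed-rct style text into one dict per labeled sentence."""
    samples, current = [], []

    def flush():
        # Emit all sentences of the abstract collected so far.
        for number, entry in enumerate(current):
            label, _, text = entry.partition("\t")
            samples.append({
                "target": label,
                "text": text.strip(),
                "line_number": number,          # zero-based position
                "total_lines": len(current),    # sentences in this abstract
            })
        current.clear()

    for line in raw_text.splitlines():
        if line.startswith("###") or not line.strip():
            flush()  # an ID line or a blank line closes the previous abstract
        else:
            current.append(line)
    flush()  # handle a final abstract with no trailing blank line
    return samples


# Made-up sample in the assumed file layout.
sample = (
    "###00000001\n"
    "OBJECTIVE\tTo test a made-up intervention .\n"
    "METHODS\tPatients were randomized into two groups .\n"
    "RESULTS\tOutcomes improved in the treatment group .\n"
    "\n"
    "###00000002\n"
    "BACKGROUND\tA second made-up abstract .\n"
    "CONCLUSIONS\tMore research is needed .\n"
)

rows = parse_rct_text(sample)
print(len(rows), rows[0]["target"], rows[2]["line_number"])
```

Each row carries the label, the sentence text, and the two positional values that the line number and total lines inputs are derived from.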

🖥️ Tech Stack

Frontend: HTML, Tailwind CSS, JavaScript (with modern UX features)
Backend: Python, Flask, TensorFlow/Keras, TensorFlow Hub
Deployment: Easily run locally or deploy to any WSGI-compatible server

📦 Installation

Clone the repository
```
git clone https://github.com/Anand-Velpuri/PubMedSimplify.git
cd PubMedSimplify
```

Install dependencies
```
pip install -r requirements.txt
```
Run the app
```
python app.py
```
Visit http://localhost:5000 in your browser.

📚 References

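Dernoncourt, F., & Lee, J. Y. (2017). PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. arXiv preprint arXiv:1710.06071.

Dernoncourt, F., Lee, J. Y., & Szolovits, P. (2017). Neural networks for joint sentence classification in medical paper abstracts. arXiv preprint arXiv:1612.05251.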

🤝 Contributing

Contributions, issues, and feature requests are welcome!
Feel free to check the issues page.

📝 License

This project is licensed under the MIT License.
