IEEE Papers Mapper is a comprehensive tool for retrieving, processing, classifying, and visualizing research papers from the IEEE Xplore API. It automates data ingestion, applies machine learning for classification, and offers interactive dashboards for insights.
The IEEE Papers Mapper is a comprehensive pipeline that automates the retrieval, processing, classification, and visualization of academic papers from the IEEE Xplore digital library. It streamlines research management by fetching papers for user-defined queries, preprocessing the raw data to extract key metadata, and classifying each paper into predefined categories with an encoder-only machine learning model. Results are stored in an SQLite database and visualized through a Plotly Dash web app. The project integrates APScheduler for scheduled data retrieval, keeping the pipeline up to date. It is highly configurable, allowing users to define custom thresholds, categories, and schedules, which makes it a valuable resource for researchers and data professionals who need to organize large volumes of academic literature efficiently.
- Automated Data Retrieval: Scheduled fetching of research papers using APScheduler.
- Data Processing: Cleans, formats, and prepares data for analysis.
- Machine Learning Classification: Zero-shot classification using transformer models.
- Interactive Dashboard: Visualize categorized papers and insights using Plotly Dash.
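At its core, the classification feature keeps every candidate category whose zero-shot score clears a user-configured threshold. A minimal sketch of that selection logic, with hypothetical category names, mock scores, and an assumed cutoff (the real model inference lives in `classify_papers.py`):

```python
THRESHOLD = 0.7  # hypothetical confidence cutoff; the project makes this configurable

def assign_categories(scores: dict[str, float], threshold: float = THRESHOLD) -> list[str]:
    """Keep every category whose score clears the threshold, highest score first."""
    kept = [(label, s) for label, s in scores.items() if s >= threshold]
    return [label for label, _ in sorted(kept, key=lambda p: p[1], reverse=True)]

# Mock scores, shaped like a zero-shot model's per-label output.
mock_scores = {"Machine Learning": 0.91, "Power Systems": 0.12, "Signal Processing": 0.74}
print(assign_categories(mock_scores))  # ['Machine Learning', 'Signal Processing']
```

A paper can therefore receive multiple labels, or none at all if no score clears the cutoff.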
- Python 3.12+
- Virtual Environment (optional but recommended)
- Required tools: pip, git
1. Create a project directory:

   ```shell
   mkdir ~/workspace/my_project
   cd ~/workspace/my_project
   ```

2. Create and activate a virtual environment:

   ```shell
   python3 -m venv venv
   source venv/bin/activate  # For Linux/Mac
   venv\Scripts\activate     # For Windows
   ```

3. Install the package with `pip` and start using it at will:

   ```shell
   pip install ieee-papers-mapper
   ```
1. Clone the repository:

   ```shell
   git clone https://github.com/alex-anast/ieee-papers-mapper.git
   cd ieee-papers-mapper
   ```

2. Create and activate a virtual environment:

   ```shell
   python3 -m venv venv
   source venv/bin/activate  # For Linux/Mac
   venv\Scripts\activate     # For Windows
   ```

3. Install the required packages:

   ```shell
   pip install -r requirements.txt
   ```

4. Install the package locally:

   ```shell
   pip install .
   ```
To launch the dashboard, run:

```shell
python ieee_papers_mapper/app/dash_webapp.py
```

Then visit http://localhost:8050 to view the dashboard.
To run the pipeline that retrieves, processes, and classifies papers automatically, execute:

```shell
python ieee_papers_mapper/main.py --days 1
```

NOTE: The scheduler is currently commented out, so pipeline runs must be triggered manually.
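Until the built-in scheduler is re-enabled, the daily run can be automated externally. For example, a crontab entry along these lines (the install path is hypothetical; adjust to your checkout):

```shell
# Hypothetical crontab entry: run the pipeline every day at 02:00,
# using the project's virtual environment interpreter.
0 2 * * * cd /path/to/ieee-papers-mapper && venv/bin/python ieee_papers_mapper/main.py --days 1
```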
- Data Retrieval: Automatically fetches new papers based on categories from IEEE Xplore.
- Data Processing: Handles missing columns and formats data for classification.
- Classification: Uses a DeBERTa-v3 model for zero-shot classification into predefined categories.
- Data Storage: Uses SQLite3 to store the data in an SQL database (better scalability and modularity than CSV files).
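The storage step can be pictured as a small sketch, assuming a hypothetical table layout (the real schema lives in `database.py`):

```python
# Illustrative sketch of the SQLite storage step. Table name, columns, and the
# sample row are assumptions, not the project's actual schema.
import sqlite3

conn = sqlite3.connect(":memory:")  # the project uses ieee_papers.db on disk
conn.execute(
    """CREATE TABLE IF NOT EXISTS papers (
           doi TEXT PRIMARY KEY,
           title TEXT NOT NULL,
           publication_year INTEGER,
           category TEXT
       )"""
)
# INSERT OR IGNORE keyed on DOI keeps repeated pipeline runs idempotent.
conn.execute(
    "INSERT OR IGNORE INTO papers VALUES (?, ?, ?, ?)",
    ("10.1109/EXAMPLE.2024.0001", "A Hypothetical Paper", 2024, "Machine Learning"),
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0])  # 1
```

Deduplicating on a stable key such as the DOI is what lets the scheduled fetches re-run safely over overlapping date windows.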
Complete documentation is available at: https://alex-anast.com/ieee-papers-mapper/
```
./ieee-papers-mapper
├── conftest.py
├── docs                     # MkDocs
│   ├── about.md
│   ├── developer_guide
│   │   ├── api_reference.md
│   │   └── code_structure.md
│   ├── index.md
│   └── user_guide
│       ├── installation.md
│       ├── overview.md
│       └── usage.md
├── LICENSE
├── mkdocs.yml               # MkDocs config
├── pyproject.toml
├── README.md
├── requirements.txt
├── setup.py
├── src
│   └── ieee_papers_mapper
│       ├── app              # Web app (Plotly Dash)
│       │   ├── assets
│       │   │   └── styles.css
│       │   ├── callbacks.py
│       │   ├── dash_webapp.py
│       │   └── __init__.py
│       ├── config           # Config and util files
│       │   ├── config.py
│       │   ├── progress.json
│       │   └── scheduler.py # Custom scheduler wrapper class
│       ├── data
│       │   ├── classify_papers.py  # Classification
│       │   ├── database.py         # Custom Database wrapper class
│       │   ├── get_papers.py       # Paper retrieval
│       │   ├── __init__.py
│       │   ├── pipeline.py         # Pipeline actions
│       │   └── process_papers.py   # Paper (pre)processing
│       ├── ieee_papers.db
│       ├── __init__.py
│       └── main.py
└── tests
    ├── __init__.py
    ├── test_classify_papers.py
    ├── test_database.py
    ├── test_get_papers.py
    └── test_process_papers.py
```

Run the tests with:

```shell
python -m pytest
```
- get_papers.py: Validates API integration and error handling.
- process_papers.py: Ensures data cleaning and formatting.
- classify_papers.py: Verifies ML classification accuracy and runtime performance.
- database.py: Checks database initialization and CRUD operations.
- Fork the repository and submit a pull request.
- Adhere to PEP 8 code style.
- Include unit tests for new core functionality.
- Format code with the `black` formatter.
- The `author index terms` field is currently inconsistent and therefore commented out; fix this.
- The scheduler is not enabled.
- Add more advanced ML models for classification.
- Enhance the dashboard with dynamic filtering.
- Introduce versioning and a changelog; see if this can be automated.
- Tests: test everything, especially the core parts, with unit and integration tests.
- Set up CI on GitHub once the tests are ready.
- Make it deployable with Docker.
- Use Terraform Infrastructure-as-Code (IaC) to define the required cloud infrastructure: a database instance, a container orchestration service, and the necessary networking and security groups.
- Deploy to the cloud (yes, this will cost a little; see if the existing personal website can be used).
- Replace SQLite with a production-grade database, probably PostgreSQL, and explain why this change is needed.
- Refactor the pipeline from a single script into a more robust, event-driven architecture. Instead of a monolithic `run_pipeline` function, break it into discrete services that communicate via a message queue: a Fetcher service, a Processing service, and a Classifier service. This is like a DAG; maybe it is time to apply graph theory and data engineering skills here.
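The Fetcher/Processing/Classifier split described above can be sketched in miniature with in-process queues standing in for a real message broker (stage logic and paper payloads are placeholders, not the project's code):

```python
# Toy sketch of the event-driven split: three stages pass messages through
# queues instead of one run_pipeline call. A None sentinel ends each stream.
from queue import Queue

def fetcher(out_q: Queue) -> None:
    for title in ["paper A", "paper B"]:    # stand-in for an IEEE Xplore fetch
        out_q.put({"title": title})
    out_q.put(None)                         # sentinel: no more work

def processor(in_q: Queue, out_q: Queue) -> None:
    while (msg := in_q.get()) is not None:  # stand-in for cleaning/formatting
        msg["title"] = msg["title"].strip().title()
        out_q.put(msg)
    out_q.put(None)

def classifier(in_q: Queue, results: list) -> None:
    while (msg := in_q.get()) is not None:  # placeholder for the ML step
        msg["category"] = "Unclassified"
        results.append(msg)

raw_q, clean_q, results = Queue(), Queue(), []
fetcher(raw_q)
processor(raw_q, clean_q)
classifier(clean_q, results)
print([p["title"] for p in results])  # ['Paper A', 'Paper B']
```

In production each function would be its own service reading from a broker such as RabbitMQ or SQS, so stages can fail, retry, and scale independently.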
- Data validation (data quality, Pydantic): before a record is inserted into the database, it must conform to an expected schema (e.g., `publication_year` is a valid year, `title` is a non-empty string).
- Configure the logger to output logs in JSON format.
- In the cloud, ship these logs to a centralised logging service so they can be viewed online.
- Add monitoring dashboards (maybe with Grafana), e.g. `papers_processed_per_minute`, `api_error_rate`, and `classification_time_seconds`.
- Move hidden environment variables from `.env` to a dedicated secrets-management service; the application should fetch each secret at runtime.
- Create verbose, detailed design docs. This is easy using Gemini; make sure to capture all the core system design decisions and include diagrams.
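The validation item above could look like the following stdlib stand-in for a Pydantic schema (field names follow the examples given; the year range is an assumption):

```python
# Stdlib sketch of pre-insert validation, mirroring what a Pydantic model
# would enforce: reject malformed records before they reach the database.
from dataclasses import dataclass

@dataclass
class PaperRecord:
    title: str
    publication_year: int

    def __post_init__(self) -> None:
        if not self.title.strip():
            raise ValueError("title must be a non-empty string")
        if not 1900 <= self.publication_year <= 2100:  # assumed sane range
            raise ValueError("publication_year must be a valid year")

PaperRecord(title="Deep Learning for X", publication_year=2023)  # passes
try:
    PaperRecord(title="   ", publication_year=2023)              # rejected
except ValueError as err:
    print(err)  # title must be a non-empty string
```

Pydantic would add type coercion and structured error reports on top of this, which is why it is the tool named in the roadmap.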
The items above make the whole app a very robust product; anything beyond them is more ML engineering. Some more abstract ideas:

- Feed each retrieved paper into a paid, high-quality LLM for data labelling to create a comprehensive dataset.
- Experiment in a notebook with different classification techniques. The pretrained model could work, but it may be interesting to try SVMs (given the small amount of data) and provide a real-time classification capability.
- Investigate whether vectorisation and RAG could help here. A north star could be some form of optimised retrieval: build our own database, then add Q&A so that discussions can be grounded in the papers already classified.
Due to IEEE Xplore API restrictions, the tool is limited to 20 API calls per day and a maximum of 200 papers per call (i.e., at most 4,000 papers per day).
This project is licensed under the MIT License. See the LICENSE file for details.
- Owner: Alexandros Anastasiou
- Email: anastasioyaa@gmail.com
- Website: TODO
- LinkedIn: TODO
