A deep learning-based pipeline for predicting and annotating Antibiotic Resistance Genes (ARGs) from protein sequences.

muneebdev7/ARGPrism_dev

 
 

ARG-PRISM

Conda pip GPU

Python 3.13 PyTorch CUDA

License: MIT

ARGPrism is a deep learning-based pipeline for predicting and annotating Antibiotic Resistance Genes (ARGs) from protein sequences using transformer embeddings and neural networks.

Key Features

  • Deep Learning Classification: ProtAlbert transformer embeddings + neural network classifier
  • GPU Accelerated: Fast processing with CUDA support
  • Reference Mapping: DIAMOND alignment of predicted ARGs against a reference ARG database
  • Simple Interface: Easy-to-use command line tool
  • Flexible Deployment: CPU or GPU execution

Installation

Prerequisites

  • Linux operating system (Ubuntu 20.04+)
  • Conda, Miniconda, or Mamba (Mamba recommended)
  • 8+ GB RAM (16 GB recommended)
  • NVIDIA GPU with CUDA 11.8+ or 12.x (optional, for acceleration)

Option 1: Install from Conda (Recommended)

# Install from Bioconda
mamba install -c bioconda argprism

# Verify installation
argprism --version

Option 2: Install from Source

# Clone repository
git clone https://github.com/haseebmanzur/ARGPrism.git
cd ARGPrism

# Create environment
mamba env create -f environment.yml

# Activate environment  
mamba activate argprism

# Verify installation
argprism --version

Quick Start

# Activate environment
mamba activate argprism

# Run on test data
argprism Test_dataset/Test_data.faa --output-dir results/

Usage

Command Line

argprism INPUT_FILE.faa [OPTIONS]

Options

Option              Description                     Default
-o, --output-dir    Output directory                argprism_output
--device            Force CPU or CUDA execution     Auto-detect
--quiet             Reduce output verbosity         False
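
When ARGPrism is called from a script, the options in the table can be assembled into an argument list. The following is a minimal sketch: `build_argprism_command` is a hypothetical helper, not part of ARGPrism, and the values accepted by `--device` (for example `cpu` or `cuda`) are an assumption not confirmed by this README.

```python
import shlex
from typing import List, Optional


def build_argprism_command(
    input_fasta: str,
    output_dir: str = "argprism_output",
    device: Optional[str] = None,
    quiet: bool = False,
) -> List[str]:
    """Build an argv list for the argprism CLI from the documented options."""
    cmd = ["argprism", input_fasta, "--output-dir", output_dir]
    if device is not None:
        # Assumption: --device takes a value such as "cpu" or "cuda".
        cmd += ["--device", device]
    if quiet:
        cmd.append("--quiet")
    return cmd


cmd = build_argprism_command(
    "Test_dataset/Test_data.faa", "results/", device="cpu", quiet=True
)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `argprism` is on the `PATH`. Leaving `device` unset keeps the tool's auto-detection.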

Python API

from argprism import run_pipeline

# Run pipeline
result = run_pipeline(
    input_fasta="input.faa",
    output_dir="results/",
    verbose=True
)

print(f"Predictions: {len(result.predictions)}")
print(f"Predicted ARG FASTA: {result.predicted_fasta}")
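
To process several FASTA files, the call above can be wrapped in a loop that gives each input its own output directory. This is a sketch under assumptions: `run_batch` is a hypothetical helper, and it uses only the three keyword arguments shown above (`input_fasta`, `output_dir`, `verbose`). The pipeline function is passed in as `runner` so the helper itself does not import `argprism`.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def run_batch(
    fasta_paths: Iterable[str],
    output_root: str,
    runner: Callable[..., object],
    verbose: bool = False,
) -> Dict[str, object]:
    """Call `runner` once per FASTA file, each with its own subdirectory."""
    results = {}
    for fasta in fasta_paths:
        # One output directory per input, named after the file stem.
        out_dir = Path(output_root) / Path(fasta).stem
        out_dir.mkdir(parents=True, exist_ok=True)
        results[fasta] = runner(
            input_fasta=fasta,
            output_dir=str(out_dir),
            verbose=verbose,
        )
    return results
```

With ARGPrism installed, a call might look like `run_batch(["a.faa", "b.faa"], "results", run_pipeline)`, where `run_pipeline` is imported from `argprism` as shown above. Inputs that share a file stem would write to the same subdirectory, so distinct file names are assumed.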

Pipeline Overview

ARGPrism processes protein sequences through the following steps:

Input FASTA → ProtAlbert Embeddings → Neural Classifier → ARG Prediction → DIAMOND Mapping → Report

Process Details

  1. Embedding Generation: ProtAlbert generates 4096-dimensional embeddings
  2. Classification: Neural network predicts ARG/Non-ARG for each sequence
  3. Reference Mapping: DIAMOND aligns predicted ARGs to the reference database
  4. Report Generation: Creates an annotated CSV with ARG names and drug classes

Input/Output

Input

  • FASTA file: Protein sequences to analyze
  • Built-in models and databases are included
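
Because the input must contain protein sequences, a quick pre-check can catch nucleotide FASTA files passed by mistake. The sketch below is not part of ARGPrism: `read_fasta`, `looks_like_nucleotide`, and the 0.9 threshold are illustrative assumptions, and the heuristic only flags records made up almost entirely of A/C/G/T/U/N.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def looks_like_nucleotide(seq: str, threshold: float = 0.9) -> bool:
    """Heuristic: True if at least `threshold` of the residues are A/C/G/T/U/N."""
    seq = seq.upper().rstrip("*")  # ignore a trailing stop-codon marker
    if not seq:
        return False
    nucleotide_count = sum(seq.count(base) for base in "ACGTUN")
    return nucleotide_count / len(seq) >= threshold


def suspicious_records(lines: Iterable[str]) -> List[str]:
    """Return IDs of records that look like nucleotide sequences."""
    return [rid for rid, seq in read_fasta(lines) if looks_like_nucleotide(seq)]


example = """>prot1 example protein
MSIQHFRVALIPFFAAFCLPVFA
>dna1
ATGAGTATTCAACATTTCCGTGTCGCC
""".splitlines()
print(suspicious_records(example))
```

For a real file, the same function accepts an open file handle, for example `suspicious_records(open("input.faa"))`.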

Output Files

All results are saved to the output directory:

  • predicted_ARGs.fasta - Sequences classified as ARGs
  • predicted_ARGs_vs_ref.tsv - DIAMOND alignment results
  • final_ARG_prediction_report.csv - Annotated predictions with ARG names and drug classes
  • diamond_arg_db.dmnd - DIAMOND database index
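
The alignment file can be post-processed with a few lines of Python, for example to keep only the best reference hit per predicted ARG. This sketch assumes that `predicted_ARGs_vs_ref.tsv` uses DIAMOND's default 12-column tabular layout (BLAST outfmt 6); the README does not state the columns, so check the file before relying on it. `best_hits` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable

# Assumption: DIAMOND default tabular columns (BLAST outfmt 6).
# ARGPrism may write a different set of columns.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tsv_lines: Iterable[str]) -> Dict[str, dict]:
    """Return the highest-bitscore hit for each query sequence."""
    best: Dict[str, dict] = {}
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Made-up sample rows standing in for a real alignment file.
sample = io.StringIO(
    "seq1\tARG_A\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410.0\n"
    "seq1\tARG_B\t75.0\t180\t45\t2\t5\t184\t3\t182\t1e-60\t220.5\n"
    "seq2\tARG_C\t88.2\t150\t18\t0\t1\t150\t1\t150\t1e-80\t290.1\n"
)
for query, hit in best_hits(sample).items():
    print(query, hit["sseqid"], hit["pident"], hit["bitscore"])
```

With real results, replace `sample` with an open file handle such as `open("results/predicted_ARGs_vs_ref.tsv", newline="")`.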

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions or support, please open an issue on GitHub.

Project PI: Dr. Masood Ur Rehman
Email: m.kayani@sines.nust.edu.pk

Author: Haseeb Manzoor
GitHub: @haseebmanzur

Package Maintainer: Muhammad Muneeb Nasir
GitHub: @muneebdev7

Acknowledgments

Citation

If you use ARGPrism in your research, please cite:
