Skip to content

bnjenner/impaqt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

155 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Impaqt C/C++ CI

Currently performs transcript identification, gene assignment, and naive quantification. A more sophisticated quantification method remain on the TO-DO list.

Introduction

IMPAQT (Identifies Multiple Peaks and Quantifies Transcripts) is a transcript identification and gene expression quantification method for TAGseq and 3' mRNAseq experiments. It operates on assumptions about the distribution of reads along the 3' UTR of expressed genes. Clustering these reads enables pseudo-transcript identification and quantification of expression at the gene and transcript level for isoforms utilizing distinct 3' ends.

It generates a GTF file defining the boundaries of each transcript and their expression level as well as, optionally, a gene expression counts table if a reference annotation is provided.

This method is particularly useful in non-model organisms where 3' UTRs for most genes are poorly annotated (resulting in massive data loss). Increased gene density also tends to hurt the assignment of transcripts by this aglorithm, as it increases assignment ambiguity. Reads for which a reasonable transcript of origin cannot be identified are handled individually.

Installation

  1. Make sure cmake and make are installed on your machine.
# Linux
sudo apt install cmake zlib1g-dev

# Mac
brew install cmake zlib
  1. Clone this repository and change into it.
git clone https://github.com/bnjenner/impaqt.git
cd impaqt
  1. Create a build directory and change into it.
mkdir build
cd build
  1. Compile
cmake ../
make
  1. Install
sudo make install
  1. Give it a go!
impaqt input.sorted.bam

Usage

SYNOPSIS
    impaqt input.sorted.bam [options]

DESCRIPTION
    Identifies Multiple Peaks and Qauntifies Transcripts. Identifies and quantifies isoforms utilizing distinct 3'
    ends. Generates a GTF file of identified transcripts and optionally a counts file written to stdout if a reference
    annotation is provided.

REQUIRED ARGUMENTS
    BAM INPUT_FILE

OPTIONS
    -h, --help
          Display the help message.
    -t, --threads INTEGER
          Number of processers for multithreading. Default: 1.
    -a, --annotation INPUT_FILE
          Annotation File (GTF or GFF). If specified, a counts table will be output through standard out. NOTICE: File
          type identified by file extension. Default: .
    -s, --strandedness STRING
          Strandedness of library. One of forward and reverse. Default: forward.
    -n, --nonunique-alignments
          Count primary and secondary read alignments.
    -q, --mapq-min INTEGER
          Minimum mapping quality score to consider. Default: 1.
    -w, --window-size INTEGER
          Window size to use to parition genome for read collection. Default: 1000.
    -m, --min-count INTEGER
          Minimum read count to initiate DBSCAN transcript identification algorithm. (Hard minimum of 10) Default: 25.
    -p, --count-percentage INTEGER
          Minimum read count percentage for identifying core reads in DBSCAN algorithm. This will be the threshold
          unless number of reads is less than 10. Default: 5.
    -e, --epsilon INTEGER
          Distance (in base pairs) for neighboring reads in DBSCAN algorithm. This should generally be 0.5-1.5x the
          read length, depending on desired isoform sensitivity (lower = more sensitive). Default: 50.
    -d, --density-threshold DOUBLE
          Read density threshold (# reads / # bps) to skip transcript identification. Assignment in super dense
          regions (usually the mitochrondria) doesn't really benefit from transcript identificaiton. Default is unset.
          Default: 0.
    -f, --feature-tag STRING
          Name of feature in GTF for assignment. Default: exon.
     -u, --utr-tag STRING
          Name of UTR feature in GTF for assignment. Default: UTR.
    -i, --feature-id STRING
          ID of feature to use for feature assignment. Default: gene_id.
    -o, --output-gtf STRING
          Specify name of cluster GTF file. Default is BAM name + ".gtf".
    --version
          Display version information.

VERSION
    Last update: August 2025
    impaqt version: beta
    SeqAn version: 2.4.0

Dependencies

Utilizes libraries like bamtools and seqan and the DBSCAN algorithm is inspired by EmbeddedArtistry and github user Eleobert.

Contact

For questions or comments, please contact Bradley Jenner at bnjenner@bu.edu

About

Identifies Multiple Peaks and Qauntifies Transcripts. Quantifies gene expression from TAGseq experiments by identifying transcript isoforms containing distinct 3' UTRs/terminal exons.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors