Skip to content

regulomics/mgplot

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 

Repository files navigation

MGPLOT

MGPLOT is a script for visualizing metagenomic data. The script generates average profiles and heatmaps of selected genomic features based on a provided bedgraph file.

Dependencies


  1. Python 3.6
  2. NumPy (tested on 1.12.1)
  3. SciPy (tested on 0.19.0)
  4. Seaborn (tested on 0.7.1)

Usage


  1. Input Options

  2. Plot Options

  3. Region Declaration

  4. Output

  5. Detailed Feature Selection

Input Options

Argument Description Default Values
-g --gfile bed or tsv file containing information about the genomic features None file name or path
-i --infile bedgraph file None file name or path
-gt
--gfiletype
type of the gfile 'bed' 'bed' , 'scoretsv'
-cf
--configFile
configuration file None file name or path
-gi --gindx indicies representing the location of fields in a bed file 0,1,2,3,-1 sequence of five integers
-rp --replot numpy binary file containing a matrix to be replotted None file name or path

Input files cannot have headers.

Console arguments override configuration file arguments

For the configuration file to be read correctly it should have only two columns separated by TAB. In each row the first column is the argument, the second is the value. When an argument can take multiple values such as in --chrmomit and -gomit the values should be in the second column separated by commas. Example of a correctly formatted configuration file:

region      genebody
plottype    both
nticks      2
sort        true
chrmonly    chr1,chr2,chrX
scorerange  1000,10000

The --gindx sequence indicates in which columns in the .bed file the script will find information about chromosome name, start position, end position, feature name, strand respectively.

--replot allows you to plot previously saved data (using --matfile) without reading other input files again, with different plot settings. When using replot the region and flank arguments must remain unchanged.

Plot Options

Argument Description Default Values
-p
--plottype
type of plot to be generated 'avgprof' 'avgprof' , 'heatmap' , 'both'
-nt
--nticks
number of ticks on the x-axis 1 a non negative integer
-cm --cmap colormap to be used when generating a heatmap 'Reds' matplotlib colormap name
-s --sort whether to sort the matrix used for generating a heatmap False no value in console, 'true' or 'false' in config file
-sm
--smooth
whether to smooth the average profile curve before ploting False boolean or a non negative float
-hn --hnorm whether to use linear or symmetric logarithmic normalization for the colorscale 'lin' 'lin' , 'log'
-ht
--hmtitle
title of the heatmap None string
-at
--avgtitle
title of the average profile None string
-cb
--cbar
whether to show a colorbar next to the heatmap False no value in console,
'true' or 'false' in config file

--nticks determines how many ticks surround the plotted region. For TSS and TSE nticks is the number of ticks on each side of the region of interest. For genebody it is the number of ticks left of TSS, between TSS and TSE (nticks-1), and right of TSE. For example a run with argument --nticks 2 will generate plots with 5 ticks for TSS and TSE, and 7 ticks for genebody.

When --smooth is set to true the curve is smoothed with a spline with a smoothing parameter equal to flank_length * 1e-4. To change the value of the smoothing parameter set the value of --smooth to a float.

Example average profile with --smooth set to 0.08: alt text

The same average profile with --smooth set to 0: alt text

Matplotlib colormap names can be found here.

--hnorm allows you to use a logarithmic normalization when linear normalization generates a poorly visable heatmap.

Example heatmap with --hnorm set to 'lin': alt text

The same heatmap with --hnorm set to 'log': alt text

Region Declaration

Argument Description Default Values
-r --region location of the genomic regions to be plotted 'TSS' 'TSS' , 'TSE' , 'genebody'
-fl --flank length of the flanking regions 1000 positive integer

The script will generate plots of the selected location with surrounding flanking regions.

When --region is set to genebody the regions will be normalized with a cubic spline to have length equal to the value of --flank.

Output

Argument Description Default Values
-oa --avgfile average profile output file None file name or path
-oh --hmfile heatmap output file None file name or path
-om --matfile matrix output file None file name or path

The format of the output file will be deduced from the extension of the provided filename. The supported formats depend on the matplotlib backend you're using. Most backends support png, pdf, ps, eps and svg. If a output filename is provided the corresponding plot will not be displayed.

The --matfile argument allows you to save a matrix for replotting data without reading other input files again. The matrix is saved in a numpy binary file. A .npy extension will be appended if the provided filename does not have one.

Detailed Feature Selection

Argument Description Default Values
-co
--chrmomit
chromosomes to be omitted None chromosome names
-only
--chrmonly
chromosomes to be considered exclusively None chromosome names
--go
--gomit
genomic regions to be omitted None first characters of feature names
-nb
--nbest
number of regions with the best scores to be considered (tsv files) None positive integer
-sr
--scorerange
range of score values from which features should be considered (tsv file) None two numbers seperated by a comma
-of
--ofirst
whether to use only the first occurence of a region in a bed file False no value in console,
'true' or 'false' in config file

The --gomit argument makes the script omit all genomic features that begin with one of the given strings.

The --nbest argument selects genes after chromosomes and genes are excluded.

When using --scorerange the first value should be smaller than the other. --scorerange overrides --nbest

Examples


Example 1

The configuration file (example1.cfg):

gfile		RNAseq_counts.tsv
infile		H3K4me3.bedgraph
gfiletype	scoretsv
flank 		2000
region 		TSS
nticks		2
plottype 	both
smooth		true
matfile		out1.npy
hnorm		log
sort		false
nbest		2000

Running the script will generate an average profile and heatmap of H3K4me3 enrichment around the transcripton start site on the first 2000 most expressed genes in the tsv file. The matrix will be saved for replotting as out1.npy.

python mgplot.py -cf example1.cfg

The result:

alt_text

alt_text

Replotting the heatmap with --sort set to True:

python mgplot.py -cf example1.cfg -rp out1.npy --sort

alt_text

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 100.0%