The GRN-Pipeline

This project downloads, parses and combines Gene Regulatory Networks (GRNs), also called Gene Interaction Networks, from three publicly available GRN databases. It uses the following databases: GRNdb, HumanBase and GRAND.

Motivation

In another coding project in our research group at TUM, we were investigating how to build more meaningful SNP-SNP interaction models. We needed additional information sources to improve the accuracy of the SNP-SNP interaction prediction model, so I built a data pipeline to download predicted GRNs from the mentioned databases. Because some of the databases do not offer an API, I had to web-scrape the individual URLs and then parse the datasets into the right format to feed the SNP-SNP interaction model. Another idea was to build a "general" GRN to construct a baseline.

Installation/Setup

Use the package manager pip3 to install the required packages. Python 3 version: python@3.9

#Recommendation: Use a local environment and ensure that the local environment is activated
python3 -m venv venv
source venv/bin/activate

pip3 install --upgrade cython
pip3 install -r requirements.txt

Examples

These examples demonstrate the pipeline directly from the terminal. For this purpose, I provided short, constructed input files. If you want to download all the GRNs and run the transformation and analysis, please use the files in the resources folder, which contain all the scraped links to the datasets.

#Download two GRNs
python3 cli.py download example_download_links

#Formatting of the GRNs
python3 cli.py format example_dataset_list grndb 

#Create the union of the GRNs
python3 cli.py union example_cleaned_datasets 

# Complete data pipeline from Download to Concatenating
python3 cli.py complete example_download_links grndb 

#Visualizing the data
python3 cli.py visualize final_GRN.txt plot

Usage

The pipeline downloads the GRNs from the specific databases and transforms the datasets into the right format for the SNP-SNP interaction project.

Downloading the specific Dataset or GRNs

#Runs the script with an input file which contains all of the links
python3 cli.py download <input_file> <output_folder>

input_file: defines the path to the input file. The file contains one link to a file/dataset/GRN per line
output_folder: optional argument to provide a path to store the downloaded datasets
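
The input-file convention above can be sketched as follows. This is a minimal sketch and not the code in cli.py: the helper names read_links, target_path and download_all are hypothetical, and it assumes that blank lines are skipped and that the last segment of the URL path is used as the local file name.

```python
import os
import urllib.request
from urllib.parse import urlparse


def read_links(input_file):
    """Return the non-empty lines of the input file, one link per line."""
    with open(input_file, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def target_path(url, output_folder="."):
    """Derive a local file path from the last segment of the URL path (assumption)."""
    name = os.path.basename(urlparse(url).path) or "dataset"
    return os.path.join(output_folder, name)


def download_all(input_file, output_folder="."):
    """Download every link listed in input_file into output_folder."""
    os.makedirs(output_folder, exist_ok=True)
    paths = []
    for url in read_links(input_file):
        path = target_path(url, output_folder)
        urllib.request.urlretrieve(url, path)  # network access happens here
        paths.append(path)
    return paths
```

Keeping the link parsing separate from the actual download makes it possible to check an input file without any network access.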

Formatting the downloaded Dataset

In the following step, the downloaded GRNs are parsed into the right format. Because the databases use different gene IDs inconsistently, I used the biodbnet API to map all gene IDs to their Gene Symbol and omitted all other information in the datasets. Because the API does not scale to large requests, every big request needed to be batched. In addition, the GRAND database provides its GRNs only in adjacency format, so they needed special parsing.
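
The batching idea can be sketched as follows. This is only an illustration under stated assumptions: the names batched and map_ids are hypothetical, the batch size of 500 is an arbitrary placeholder, and lookup stands in for whatever function performs the actual biodbnet request (no real API call is shown here).

```python
def batched(ids, batch_size):
    """Yield consecutive slices of ids with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def map_ids(ids, lookup, batch_size=500):
    """Map gene IDs to gene symbols batch by batch.

    lookup is a caller-supplied function (a hypothetical stand-in for the
    biodbnet request) that takes a list of IDs and returns a dict
    {id: gene_symbol} for the IDs it could resolve.
    """
    mapping = {}
    for batch in batched(list(ids), batch_size):
        mapping.update(lookup(batch))
    return mapping
```

Passing the lookup function in as a parameter keeps the batching logic testable without contacting the API.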

#Conversion from one ID to another
python3 cli.py format <input_file> <input_db> <output_folder>

input_file: every line in the file should store a file-path to a dataset
input_db: the name of the source database, which tells the program the format of the input datasets. Only the following arguments are valid: 'grndb', 'grand', 'humanbase'
output_folder: optional path to the folder where the formatted datasets should be stored
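
The special parsing of adjacency-format networks can be illustrated with the sketch below. It is not the parser used in cli.py. It assumes a hypothetical layout in which the first row lists the target genes, every following row starts with a TF name followed by one numeric weight per gene, and a weight above threshold counts as an edge. The function names, the delimiter and the threshold are all assumptions.

```python
import csv


def adjacency_to_edges(handle, threshold=0.0, delimiter=","):
    """Convert an adjacency matrix into a list of (TF, Gene) pairs.

    Assumed layout: the first row holds the gene names after a corner
    cell; each later row holds a TF name and one weight per gene.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    genes = header[1:]
    edges = []
    for row in reader:
        if not row:
            continue  # skip empty lines
        tf, weights = row[0], row[1:]
        for gene, weight in zip(genes, weights):
            if float(weight) > threshold:
                edges.append((tf, gene))
    return edges


def write_edge_list(edges, handle):
    """Write edges in the formatted layout: header line, then TF<TAB>Gene."""
    handle.write("TF\tGene\n")
    for tf, gene in edges:
        handle.write(f"{tf}\t{gene}\n")
```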

Build Union of Datasets

To build a "general GRN", I provided the functionality to build the union of several datasets. It is important that the datasets were formatted before this step.

#Builds the union of the formatted datasets
python3 cli.py union <input_file> <output_file>

input_file: every line in the file should store a file-path to a 'formatted' dataset
output_file: defines the name of the finished output_file
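
A union over formatted files can be sketched like this. It is a minimal sketch, not the implementation behind the union command: read_edges and union_of_files are hypothetical names, edges are treated as ordered (TF, Gene) pairs, and lines that do not consist of exactly two tab-separated fields are ignored.

```python
def read_edges(path):
    """Read a formatted file: header 'TF<TAB>Gene', then one edge per line."""
    edges = set()
    with open(path, encoding="utf-8") as handle:
        next(handle, None)  # skip the header line
        for line in handle:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2 and all(parts):
                edges.add((parts[0], parts[1]))
    return edges


def union_of_files(paths, output_file):
    """Write the union of all edges in paths to output_file and return it."""
    union = set()
    for path in paths:
        union |= read_edges(path)
    with open(output_file, "w", encoding="utf-8") as handle:
        handle.write("TF\tGene\n")
        for tf, gene in sorted(union):
            handle.write(f"{tf}\t{gene}\n")
    return union
```

Using a set means that an interaction reported by several datasets appears only once in the combined network.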

Complete Pipeline

Conducts every single step sequentially.

#Runs the complete pipeline from download to union
python3 cli.py complete <input_file> <input_db> <output_folder>

input_file: every line in the file should store one link to a file/dataset/GRN, as for the download step
input_db: the name of the source database, which tells the program the format of the input datasets. Only the following arguments are valid: 'grndb', 'grand', 'humanbase'
output_file: defines the name of the finished output_file

Visualize final GRN

Many real-world networks do not follow a normal distribution. This also applies to the degree distribution of a GRN: when plotting the degree distribution of the nodes (genes), it should follow a power-law distribution. That is why I built the visualization feature. It can plot the degree distribution of the nodes and the log-log-transformed degree distribution, and it can print a fitting summary. For the fitting summary, I applied a log-log transformation and tested how well an OLS regression fits the transformed data.

#Plots the degree distribution
python3 cli.py visualize <input_file> plot

#Conducts log-log-transformed degree distribution
python3 cli.py visualize <input_file> log_plot

#Prints fitting characteristics
python3 cli.py visualize <input_file> fitting_summary

input_file: a file which was formatted before this step. Only use plot, log_plot and fit_summary as the last argument
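
The idea behind the log-log fit can be sketched with the standard library alone. This is not the code behind the visualize command: the function names are hypothetical, the degree is assumed to be the total degree (incoming plus outgoing edges, so a self-loop counts twice), and the regression is a plain ordinary-least-squares line through the points (log degree, log count). For a power law p(k) ∝ k^(-γ), the fitted slope approximates -γ.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {degree: number of nodes with that degree} for an edge list."""
    degree = Counter()
    for tf, gene in edges:
        degree[tf] += 1
        degree[gene] += 1
    return Counter(degree.values())


def loglog_ols(distribution):
    """Fit log(count) = intercept + slope * log(degree) by least squares.

    Returns (slope, intercept, r_squared).
    """
    xs = [math.log(k) for k in distribution]
    ys = [math.log(distribution[k]) for k in distribution]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared
```

A slope close to a negative constant together with a high R² on the log-log scale is consistent with a power-law-like degree distribution, although it does not prove one.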

Example formatted File

All output files should have the same format: a header line with TF and Gene, followed by one interaction per line, in which the two genes are given by their IDs and separated by a tab (ID \t next ID \n).

TF	Gene
ARID3A	ARID3A
ARID3A	PLA2G15
ARNT	RORA
.   
.

Project-Structure

data_pipeline-🗂: this folder serves as a cache for all the individual datasets that were downloaded
resources-🗂: contains the already scraped links to all the datasets from the three databases
cli.py: contains the command line tool interface
example_*.txt: raw files to demonstrate the examples
test-🗂: contains some tests

Tasks

Downloading files
Concatenating all datasets
Works with GRNdb, HumanBase & GRAND
Mapping between HGNC_ID, Ensembl and GeneSymbols
EntrezID is missing in the mapping part
Optimization for large amounts of data
Works with adjacency matrices
Extension for Mapping: at the moment the mapping only executes API calls to biodbnet. It could be extended so that IDs which cannot be mapped are additionally checked in the genenames database

Contact

If you have any questions, please do not hesitate to contact me! :)

Name	Email
Paul Wissenberg	paul.wissenberg@tum.de
