This project downloads, parses, and combines Gene Regulatory Networks (GRNs), also known as gene interaction networks, from three publicly available GRN databases. It uses the following databases: GRNdb, HumanBase, and GRAND.
In another coding project at our research group at TUM, we investigated how to build more meaningful SNP-SNP interaction models. To find additional information sources that could improve the accuracy of the SNP-SNP interaction prediction model, I built a data pipeline that downloads predicted GRNs from the databases mentioned above. Because some of the databases do not offer an API, I had to web-scrape the individual dataset URLs and then parse the datasets into the format expected by the SNP-SNP interaction model. Another idea was to build a "general" GRN to serve as a baseline.
Use the package manager pip3 to install the packages. Python 3 version: python@3.9
# Recommendation: use a local virtual environment and make sure it is activated
python3 -m venv venv
source venv/bin/activate
pip3 install --upgrade cython
pip3 install -r requirements.txt

These examples demonstrate the pipeline directly from the terminal, using short, constructed input files. To download all the GRNs and run the transformation and analysis on them, use the files in the resources folder, which contain all the scraped links to the datasets.
# Download two GRNs
python3 cli.py download example_download_links
#Formatting of the GRNs
python3 cli.py format example_dataset_list grndb
#Create the union of the GRNs
python3 cli.py union example_cleaned_datasets
# Complete data pipeline from Download to Concatenating
python3 cli.py complete example_download_links grndb
#Visualizing the data
python3 cli.py visualize final_GRN.txt plot

Downloads the GRNs from the specific databases and transforms the datasets into the right format for the SNP-SNP interaction project.
# Runs the script with an input file that contains all of the links
python3 cli.py download <input_file> <output_folder>

- input_file: path to the input file, which contains one link to a file/dataset/GRN per line
- output_folder: optional path to the folder in which the downloaded datasets are stored
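
The download step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in cli.py; the function names `read_links` and `download_all` are hypothetical, and deriving the local file name from the last part of the URL is an assumption.

```python
import os
import urllib.request
from urllib.parse import urlparse


def read_links(input_file):
    """Return the non-empty lines of input_file (one link per line)."""
    with open(input_file, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def download_all(links, output_folder="."):
    """Download every link into output_folder and return the local paths."""
    os.makedirs(output_folder, exist_ok=True)
    paths = []
    for link in links:
        # Assumption: the last path component of the URL is a usable file name.
        name = os.path.basename(urlparse(link).path) or "dataset"
        target = os.path.join(output_folder, name)
        urllib.request.urlretrieve(link, target)
        paths.append(target)
    return paths
```

Two links that end in the same file name would overwrite each other in this sketch, so a real pipeline would need a naming scheme that keeps them apart.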
In the following step, the downloaded GRNs are parsed into the right format. Because the databases use different gene IDs, I used the bioDBnet API to map all gene IDs to their gene symbols and dropped all other information from the datasets. Because the API does not scale to large requests, every large request had to be split into batches. In addition, the GRAND database provides its GRNs only in adjacency format, so they required special parsing.
# Conversion from one ID to another
python3 cli.py format <input_file> <input_db> <output_folder>

- input_file: every line in the file should contain the file path to a dataset
- input_db: name of the source database, which tells the program the format of the input datasets. Only the following values are accepted: 'grndb', 'grand', 'humanbase'
- output_folder: optional path to the folder in which the formatted datasets are stored
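
The batching of large mapping requests mentioned above could look roughly like this. It is a sketch under assumptions: `lookup_batch` is a hypothetical callable standing in for the actual API call (it takes a list of IDs and returns a dict from ID to gene symbol), and the batch size of 500 is an arbitrary placeholder, not a documented API limit.

```python
def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def map_ids(ids, lookup_batch, batch_size=500):
    """Map gene IDs to gene symbols, one batch at a time.

    lookup_batch is a hypothetical callable: list of IDs -> {id: symbol}.
    In a real pipeline it would wrap the request to the mapping service.
    """
    mapping = {}
    for chunk in batched(list(ids), batch_size):
        mapping.update(lookup_batch(chunk))
    return mapping
```

Passing the lookup in as a function keeps the batching logic testable without network access.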
To build a "general" GRN, I added functionality to build the union of several datasets. The datasets must be formatted before this step.
# Builds the union of the formatted datasets
python3 cli.py union <input_file> <output_file>

- input_file: every line in the file should contain the file path to a formatted dataset
- output_file: name of the finished output file
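
As an illustration of the union step, the following sketch merges several formatted datasets into one edge set. It assumes tab-separated files with a single header line; the function names are hypothetical, and treating the union as a set (so duplicate edges are kept once) is an assumption about the intended behavior.

```python
import csv


def read_edges(path):
    """Read a formatted dataset: one header line, then TF<TAB>Gene rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header line
        return {(row[0], row[1]) for row in reader if len(row) >= 2}


def union_edges(paths):
    """Return the set union of the edges of all given datasets."""
    edges = set()
    for path in paths:
        edges |= read_edges(path)
    return edges


def write_edges(edges, output_file):
    """Write the edges, sorted, in the same TF<TAB>Gene format."""
    with open(output_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["TF", "Gene"])
        writer.writerows(sorted(edges))
```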
Runs every step sequentially, from download to union.
# Complete data pipeline from download to union
python3 cli.py complete <input_file> <input_db> <output_folder>

- input_file: path to the input file, which contains one link to a file/dataset/GRN per line (as in the download step)
- input_db: name of the source database, which tells the program the format of the input datasets. Only the following values are accepted: 'grndb', 'grand', 'humanbase'
- output_folder: defines where the finished output file is stored
The degree distributions of many real-world networks do not follow a normal distribution, and this also applies to the degree distribution of a GRN. Instead, when we plot the degree distribution of the nodes (genes) in a GRN, we expect it to follow a power-law distribution. A power law appears as a straight line on a log-log plot, which makes it easy to check. That is why I built the visualization feature. It can plot the degree distribution of the nodes, plot the log-log-transformed degree distribution, and print summary statistics of the fit. For the fitting summary, I applied a log-log transformation to the degree distribution and tested how well an OLS regression fits it.
# Plots the degree distribution
python3 cli.py visualize <input_file> plot
# Plots the log-log-transformed degree distribution
python3 cli.py visualize <input_file> log_plot
# Prints the fitting characteristics
python3 cli.py visualize <input_file> fitting_summary

- input_file: a file that was formatted before this step. Only use plot, log_plot and fit_summary
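
The idea behind the log-log fit can be sketched with the standard library alone. This is an illustration, not the code in cli.py: the function names are hypothetical, each edge is assumed to add one to the degree of both of its endpoints, and the fit is a plain least-squares line through the points (log degree, log count).

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {degree: number of nodes with that degree} for an edge list."""
    degree = Counter()
    for tf, gene in edges:
        # Assumption: undirected degree, so both endpoints are counted.
        degree[tf] += 1
        degree[gene] += 1
    return Counter(degree.values())


def loglog_fit(distribution):
    """OLS fit of log(count) on log(degree); returns (slope, intercept).

    Requires at least two distinct degrees, otherwise the slope is undefined.
    """
    xs = [math.log(k) for k in distribution]
    ys = [math.log(c) for c in distribution.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

For a power-law degree distribution the points lie close to a straight line, and the fitted slope is the negative of the power-law exponent.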
All the output files should have the same format: a header line with TF and Gene, followed by one line per interaction in which each gene is represented by its ID: ID \t next ID \n
TF Gene
ARID3A ARID3A
ARID3A PLA2G15
ARNT RORA
.
.
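
As an illustration of how a network in adjacency format could be converted into the two-column format shown above, here is a minimal sketch. The layout is an assumption, not the documented GRAND format: rows are TFs, columns are genes, the first column holds the TF names, and every entry whose weight exceeds a hypothetical `threshold` parameter counts as an edge.

```python
import csv
import io


def adjacency_to_edges(handle, delimiter=",", threshold=0.0):
    """Yield (TF, Gene) for every matrix entry with weight > threshold."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    genes = header[1:]  # the first cell belongs to the row-label column
    for row in reader:
        tf, weights = row[0], row[1:]
        for gene, weight in zip(genes, weights):
            if float(weight) > threshold:
                yield tf, gene


# Usage with a small in-memory matrix instead of a downloaded file.
example = io.StringIO(",G1,G2\nTF1,1,0\nTF2,0.5,2\n")
edges = list(adjacency_to_edges(example))
```

Which threshold is meaningful depends on how the edge weights of the source network were computed, so it is left as a parameter here.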
- data_pipline-🗂: this folder serves as a cache for all the individual datasets that were downloaded
- resources-🗂: contains the already scraped links to all the datasets from the three databases
- cli.py: contains the command-line interface
- example_*.txt: raw files used to demonstrate the examples
- test -🗂: contains some tests
- Downloading files
- Concatenating all datasets
- Works with GRNdb, HumanBase & GRAND
- Mapping between HGNC_ID, Ensembl and GeneSymbols
- EntrezID is missing in the mapping part
- Optimization for large amounts of data
- Works with adjacency matrices
- Extension for mapping: at the moment, the pipeline only executes API calls to bioDBnet; it could be extended so that outliers are additionally checked against the genenames database
If you have any questions, please do not hesitate to contact me! :)
| Name | Email |
|---|---|
| Paul Wissenberg | paul.wissenberg@tum.de |