zihengc/GEO_ENCODE_Crawler

######################################################################################
crawler: Python package for retrieving metadata and downloading datasets from ENCODE/GEO and conducting natural language processing on the metadata
######################################################################################

The program was tested on Python 3.7+.

Link to the demonstration (an Andrew ID is required to access the Google Drive file): https://drive.google.com/file/d/18-O1GkruHvokf-TW6kR-EMK2sE4Cljvn/view?usp=sharing

To run the program, first cd into the program folder and run "python setup.py" to install the dependent packages: "numpy", "matplotlib", "pandas", "requests", "beautifulsoup4", "download", "biopython", "tabulate", "tqdm", "requests_ftp", "nltk", "networkx", "scikit-learn", "wordcloud"
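If the setup step fails partway, it can help to see which of the listed packages are still missing. The following is a minimal sketch, not part of the repository: the helper name missing_packages and the REQUIRED mapping are hypothetical, and the mapping from pip names to import names (for example "beautifulsoup4" to "bs4") is filled in from the packages' usual conventions.

```python
import importlib.util

# pip name -> import name. The two differ for a few of the listed packages.
# This mapping is an illustration, not something shipped with the program.
REQUIRED = {
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "pandas": "pandas",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "download": "download",
    "biopython": "Bio",
    "tabulate": "tabulate",
    "tqdm": "tqdm",
    "requests_ftp": "requests_ftp",
    "nltk": "nltk",
    "networkx": "networkx",
    "scikit-learn": "sklearn",
    "wordcloud": "wordcloud",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Any package reported as missing can then be installed by hand with pip before running main.py.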

The TextRank function depends on the "glove.6B.100d.txt" file, so make sure the file is in the program folder before running it.
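To check the embedding file before a long TextRank run, a small loader can be used. This is a sketch under the assumption that the file follows the standard GloVe text layout (one word per line followed by space-separated floats); the function name load_glove and its limit parameter are hypothetical and may not match how the program reads the file.

```python
import os


def load_glove(path="glove.6B.100d.txt", limit=None):
    """Parse a GloVe-style text file into {word: [float, ...]}.

    limit (hypothetical convenience) stops after that many lines,
    which is enough for a quick sanity check of a large file.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} not found; place it in the program folder")
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            # First field is the word, the rest are the vector components.
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

Calling load_glove(limit=1000) from the program folder reads only the first 1000 lines, so a missing or malformed file shows up quickly.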

To retrieve genomic data on ENCODE:
    1. "python main.py encode inquire"
    2. Input a keyword of interest

To retrieve genomic data on GEO:
    1. "python main.py geo inquire"
    2. Input a keyword of interest

To download genomic data on ENCODE:
    1. "python main.py encode download"
    2. Input an experiment ID of interest

To download genomic data on GEO:
    1. "python main.py geo download"
    2. Input "geos_keyword.csv" (the CSV file in the program folder)

To perform natural language processing on GEO entry titles:

To rank text:
    1. "python main.py nls textrank"
    2. Input "geos_keyword.csv" (the CSV file in the program folder). This step takes a long time if the CSV file has several thousand entries.
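For readers unfamiliar with TextRank, the idea is to build a similarity graph over the titles and score each title with PageRank. The sketch below illustrates that idea only and is not the program's implementation: it swaps in bag-of-words cosine similarity for the GloVe embeddings and a hand-written power iteration for a graph library, and the names tokenize, cosine, and textrank are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def textrank(titles, damping=0.85, iterations=50):
    """Return (score, title) pairs, most central title first."""
    n = len(titles)
    if n == 0:
        return []
    bags = [Counter(tokenize(t)) for t in titles]
    # Weighted similarity graph without self-loops.
    weights = [
        [cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score
            # proportional to the edge weight between j and i.
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return sorted(zip(scores, titles), reverse=True)
```

The similarity matrix is quadratic in the number of titles, which is consistent with the long run time noted above for CSV files with several thousand entries. A title that shares no words with any other keeps only the baseline score and sorts last.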

To generate WordCloud:
    1. "python main.py nls wordcloud"
    2. Input "geos_keyword.csv" (the CSV file in the program folder)
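A word cloud is a visual rendering of word frequencies, so the counts behind it can be inspected directly. The sketch below is an illustration using only the standard library, not the program's code: the column name "title", the small STOPWORDS set, and the function title_word_counts are all assumptions, and the actual CSV header may differ.

```python
import csv
import re
from collections import Counter

# A deliberately small, hypothetical stop-word list for illustration.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "for", "by", "on", "with"}


def title_word_counts(csv_path, column="title", top=20):
    """Count words in one CSV column and return the most common ones."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            text = (row.get(column) or "").lower()
            words = re.findall(r"[a-z0-9]+", text)
            counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(top)
```

For example, title_word_counts("geos_keyword.csv", column="title") would list the words that dominate the cloud, provided the file has a column with that name.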

To generate topics:
    1. "python main.py nls lda"
    2. Input "geos_keyword.csv" (the CSV file in the program folder)

About

Genomic data crawler and NLP analyses for GEO and ENCODE
