######################################################################################
crawler: Python package for retrieving metadata and downloading datasets from ENCODE/GEO and conducting natural language processing on the metadata
######################################################################################

The program was tested on Python 3.7+.

Link to the demonstration (an Andrew ID is required to access the Google Drive file): https://drive.google.com/file/d/18-O1GkruHvokf-TW6kR-EMK2sE4Cljvn/view?usp=sharing

To run the program, first cd into the program folder and run "python setup.py" to install the dependent packages: "numpy", "matplotlib", "pandas", "requests", "beautifulsoup4", "download", "biopython", "tabulate", "tqdm", "requests_ftp", "nltk", "networkx", "scikit-learn", "wordcloud"
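
After installation, a quick check can confirm that the packages are importable. The sketch below is not part of the program. "missing_modules" is a hypothetical helper, and the import names are assumptions based on how these packages are usually imported (several differ from the install names: "beautifulsoup4" imports as "bs4", "biopython" as "Bio", "scikit-learn" as "sklearn").

```python
from importlib.util import find_spec

# Assumed import names for the packages listed above; some differ
# from the names used at install time.
IMPORT_NAMES = [
    "numpy", "matplotlib", "pandas", "requests", "bs4", "download",
    "Bio", "tabulate", "tqdm", "requests_ftp", "nltk", "networkx",
    "sklearn", "wordcloud",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(IMPORT_NAMES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

If the script reports missing modules, rerunning the installation step is the first thing to try.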

The TextRank function depends on the "glove.6B.100d.txt" file, so please check that the file is in the program folder before running it.
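
For reference, a GloVe text file stores one word per line followed by its space-separated vector values. The sketch below checks that the file is present and parses it into a dictionary. How the program itself loads the file is not stated here, so "load_glove" is a hypothetical helper for illustration only.

```python
from pathlib import Path


def load_glove(path):
    """Parse a GloVe text file into {word: [float, ...]}."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


if __name__ == "__main__":
    glove_path = Path("glove.6B.100d.txt")  # expected in the program folder
    if glove_path.is_file():
        print(len(load_glove(glove_path)), "word vectors loaded")
    else:
        print("glove.6B.100d.txt was not found in the current folder")
```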

To retrieve genomic data on ENCODE:
    1. "python main.py encode inquire"
    2. Input a keyword of interest

To retrieve genomic data on GEO:
    1. "python main.py geo inquire"
    2. Input a keyword of interest

To download genomic data on ENCODE:
    1. "python main.py encode download"
    2. Input an experiment ID of interest

To download genomic data on GEO:
    1. "python main.py geo download"
    2. Input "geos_keyword.csv", a CSV file located in the program folder

To perform natural language processing on GEO entry titles:

To rank text:
    1. "python main.py nls textrank"
    2. Input "geos_keyword.csv", a CSV file located in the program folder. This step takes a long time to run if the CSV file has several thousand entries.
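
To illustrate the idea behind TextRank, here is a minimal standard-library sketch. It is not the program's implementation: the program depends on GloVe vectors, whereas this sketch swaps in a simple word-overlap similarity so that it runs without extra files. The function names and default parameters are assumptions. The scoring is a pairwise comparison of all titles, which is one reason this kind of ranking slows down on several thousand entries.

```python
import math


def similarity(a, b):
    """Word-overlap similarity, normalized by the log of each title's size."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    denom = math.log(len(words_a)) + math.log(len(words_b))
    # denom is 0 when both titles contain a single distinct word
    return len(words_a & words_b) / denom if denom > 0 else 0.0


def textrank(titles, damping=0.85, iterations=50):
    """Score titles with weighted PageRank over the similarity graph."""
    n = len(titles)
    weights = [
        [similarity(titles[i], titles[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

Titles that share words with many other titles receive higher scores, while a title with no overlap keeps the baseline score of 1 - damping.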

To generate a WordCloud:
    1. "python main.py nls wordcloud"
    2. Input "geos_keyword.csv", a CSV file located in the program folder
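
A word cloud sizes each word by how often it occurs, so the underlying data is a word-frequency table. The sketch below computes such a table from a list of titles using only the standard library. It is an illustration rather than the program's code: "word_frequencies" and the short stopword list are assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not the program's list).
STOPWORDS = {"of", "in", "the", "and", "a", "an", "to", "for", "on", "by", "with"}


def word_frequencies(titles):
    """Count lowercase word tokens across titles, skipping stopwords."""
    counts = Counter()
    for title in titles:
        # Keep hyphenated terms such as "rna-seq" as single tokens.
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts
```

Calling "word_frequencies(titles).most_common(10)" lists the ten most frequent words, which are the ones a word cloud would draw largest.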

To generate topics:
    1. "python main.py nls lda"
    2. Input "geos_keyword.csv", a CSV file located in the program folder