Word clouds were generated as follows:
- Gene to PubMed ID (PMID) associations were downloaded from NCBI (ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz);
- For each PMID, the number of occurrences of each word in its abstract was retrieved using the R package PubMedWordcloud (punctuation and “stop words” were not considered);
- For each JASPAR taxon (except for urochordates), the inverse document frequency (IDF) of each word was computed as the number of transcription factors in that taxon over the number of transcription factors (in that taxon) associated with an abstract comprising that word;
- For each transcription factor, the term frequency-IDF (TF-IDF) of each word was calculated as the number of abstracts associated with that transcription factor comprising that word, divided by the number of words associated with that transcription factor, and multiplied by the taxon-specific IDF of that word; and
- Transcription factor word clouds were generated from the ranks of the top 50 words (in terms of TF-IDF). Note that words that resembled the gene name of the transcription factor (or one of its synonyms) and redundant words (e.g. “insulator” and “insulator-binding”) were removed using the Python module fuzzywuzzy.
install.packages('PubMedWordcloud', repos='http://cran.us.r-project.org')
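
The IDF, TF-IDF and top-50 ranking steps listed above can be illustrated with a minimal Python sketch. It is not the code used to build the word clouds. The function names (`idf_per_taxon`, `tf_idf`, `top_words`) and the input layout (one set of words per abstract, already stripped of punctuation and stop words) are assumptions made for illustration. "The number of words associated with that transcription factor" is read here as the number of distinct words across its abstracts, which is also an assumption. The fuzzywuzzy-based removal of gene-name-like and redundant words is not covered.

```python
from collections import Counter


def idf_per_taxon(tf_abstracts):
    """Compute the taxon-specific IDF of every word.

    tf_abstracts maps each transcription factor of ONE taxon to a list of
    abstracts, each abstract being a set of words (hypothetical layout).
    IDF = (number of TFs in the taxon) / (number of TFs with the word).
    """
    n_tfs = len(tf_abstracts)
    tfs_with_word = Counter()
    for abstracts in tf_abstracts.values():
        # Count each word once per transcription factor.
        tfs_with_word.update(set().union(*abstracts))
    return {word: n_tfs / count for word, count in tfs_with_word.items()}


def tf_idf(abstracts, idf):
    """Compute the TF-IDF of every word for one transcription factor.

    TF = (abstracts containing the word) / (distinct words for this TF).
    The denominator is an assumption; the text does not specify it further.
    """
    abstracts_with_word = Counter()
    for words in abstracts:
        abstracts_with_word.update(set(words))
    n_words = len(abstracts_with_word)
    return {
        word: (count / n_words) * idf[word]
        for word, count in abstracts_with_word.items()
    }


def top_words(scores, n=50):
    """Return the n highest-scoring words; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Toy data (invented for illustration, not taken from PubMed).
toy_taxon = {
    "CTCF": [{"insulator", "chromatin", "binding"}, {"insulator", "loop"}],
    "GATA1": [{"erythroid", "binding"}, {"erythroid", "chromatin"}],
}
toy_idf = idf_per_taxon(toy_taxon)
ctcf_scores = tf_idf(toy_taxon["CTCF"], toy_idf)
print(top_words(ctcf_scores, n=2))
```

In this toy example, "insulator" appears in both CTCF abstracts but in none of the GATA1 abstracts, so it outranks words such as "binding" that are shared by both transcription factors.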