Initial pull request for the first prototyped rapid data generating pipeline #23

AidanMar · 2021-07-13T11:56:30Z

The primary focus should be on the files located within pipeline3. The others were earlier iterations that can be ignored. This pipeline is a major overhaul of the original pipeline. It is built using Snakemake, which calls the files in the scripts directory to synthesise the data set. This version removes a lot of the unnecessary writing to disk and the ad hoc data structures, and vectorises many of the processes that previously relied on too many loops. The heavy lifting is done by pandas and NumPy.
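As a rough illustration of the kind of loop-to-vector rewrite described above, here is a minimal sketch. The gene table, its column names, and both function names are hypothetical and are not taken from the scripts in pipeline3; the sketch only shows how a row-by-row loop can be replaced by a single pandas column operation.

```python
import numpy as np
import pandas as pd

# Hypothetical gene table; the columns are illustrative, not the pipeline's own.
genes = pd.DataFrame(
    {
        "gene_id": ["g1", "g2", "g3", "g4"],
        "start": [100, 450, 900, 1500],
        "end": [300, 800, 1200, 1900],
    }
)


def gap_to_next_loop(df):
    """Loop version: distance from each gene's end to the next gene's start."""
    gaps = []
    for i in range(len(df)):
        if i + 1 < len(df):
            gaps.append(df["start"].iloc[i + 1] - df["end"].iloc[i])
        else:
            gaps.append(np.nan)  # the last gene has no neighbour
    return gaps


def gap_to_next_vectorised(df):
    """Vectorised version: shift the start column up one row and subtract."""
    return df["start"].shift(-1) - df["end"]


loop_gaps = gap_to_next_loop(genes)
vector_gaps = gap_to_next_vectorised(genes)
```

Both functions give the same gaps; the vectorised form leaves the iteration to pandas, which is the general pattern the overhaul relies on.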

The next steps are to process the data the pipeline has output with CNNs. In addition, the current pipeline should have more extensive and cleaner commenting, along with documentation, so that it can be read by future users.

…at not all databases have the same sets of species available. New snakemake re-config and adaptation of the ftpy.py script. The user should now run ftpy.py before running the snakefile. After that point, the pipeline should be deterministic
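The idea of restricting the run to species that every database provides can be sketched as a plain set intersection. This is not the code in ftpy.py; the function name, the dictionary layout, and the species listings are assumptions made for illustration only.

```python
def shared_species(species_by_database):
    """Return the sorted species names present in every database listing."""
    listings = [set(names) for names in species_by_database.values()]
    if not listings:
        return []
    return sorted(set.intersection(*listings))


# Made-up listings, for illustration only.
available = {
    "cds": ["homo_sapiens", "mus_musculus", "danio_rerio"],
    "pep": ["homo_sapiens", "mus_musculus"],
    "gtf": ["mus_musculus", "homo_sapiens", "gallus_gallus"],
}
species = shared_species(available)
```

Sorting the result gives a stable species list, so every later step sees the same input in the same order on each run.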

AidanMar added 30 commits May 27, 2021 08:06

Snakemake pipeline initial commit

d7ca465

Download script split into two files, one for finding the file links and one for performing the downloads. The snakefile has been adapted to incorporate both of these steps into the procedure

fa3c90d

Snakefile updated to include download of homology databases

df5555b

Remove useless tsv files

fb80e66

Alternative pipeline2 utilising wget to pre-requisite data downloads

089c903

Workable snakefile with wget downloads. Currently contains error as it expects directory output.

bcd3f47

Modified the downloading part of the pipeline. The snakefile does 3 separate jobs to download cds, pep and gtf files. Each job uses the new downloader.sh script, using different input arguments

35c6217

Initial Work with original create genome map scripts

02583c0

Edited create_genome_maps.py to take input_dir as arg. This is run from snakefile

b6263c0

Added directed output to the create_genome_maps.py script

51a6584

Moved scripts to handle steps from create genome_maps to prepare synteny matrix. Scripts used to use dir_name of data_homology for homology databases. I have renamed this to homology_databases and modified the python scripts to work with the new naming convention

8835aec

snakefile updated to include all steps from select records to create pfam matrices

a5e8be6

Final scripts all the way up to the finalize_dataset.py script

bd0dfc8

readme updated to show link to negative example datasets

73c1894

The hmmer-scan user guide added to the repo

c73e49b

update the scripts and snakefile which now runs from scratch all the way up to the run process_negative.py scripts successfully

4950fe3

Updated notebook, and added the current dag.png to the repo

73e1a96

Updated project readme and added a new explanatory readme

ea33553

project_notebook.md updated

c6a84a2

pipeline2 README.md initial commit

ba489d0

Incorporate the latest pipeline2 dag visualisation into the README.md

aaf20c7

snakefile now includes all pre-processing steps and the dag image has been updated

ba5b424

Explanatory document outlines the structure of the genome_maps

0d3838e

Final pipeline commit before switching to diamond

49de2cd

Add the list of all currently available homology databases to the repo. Date: Sat Jun 12 07:27:00 BST 2021

7a9ba5e

Updated the ftpy downloader to find the intersection species in the different databases and supply those download links

25256fa

Project notebook updated

c2c6786

Pipeline Re-written to handle all a single input list of species. This should be a list of the species which are in the intersection of all the different input databases

db1f573

initial incorporation of Diamond into the workflow

3537e06

Aiden Marshall and others added 30 commits August 15, 2021 11:56

Sampling stage of the pipeline failed. This occurred as the sampling jobs kept being submitted with too little memory. Rectified the snakefile to fix this

7532d84

The best performing model attaining 99% accuracy

aa82653

A complete yml file containing all packages necessary to run the conda environment

bbb1f47

Project README initial commit

316cc51

A small of the pipelines DAG

032dadb

Update README

8055a09

Update README

246a850

Readme update

cf3f757

initial commit of medium_dag.png

c01e325

README update

152ae15

dag.png

e1a2624

CNN_model.png initial commit

03ee727

README update

0a0dbb3

MLP_layers.png initial commit

4116c79

The readme now contains MLP illustration

2c34a1e

README update

27d3ab9

The latest iteration of the pipeline

5574678

config.json update

93696e5

Added primary descriptive data to the pipeline

04ddb05

initial commit of the valid species path

5cd695f

README with elaboration about matrix construction within the pipeline

927930c

Final functional end-to-end snakefile, with prepped for submission

4737a73

Initial commit of the hmm table

b0ac705

Valid_species with ids for the species list initial commit

c06e393

new_sampler.py

3c08c4f

Initial commit

Add files via upload

5d09b95

remove long read output from the file

pipeline checkpoint

e50b015

Merge branch 'master' of https://github.com/AidanMar/compara-deep-learning

0fe138d

generate_homology_pairs.py initial commit

9b8a6ab

New version of the pipeline produce gene pairs with the corrected homology types

3f9440a
