Hello,
First off, huge kudos to everyone involved in the creation of this amazing tool — it's really impressive work! The reports it outputs are super slick and I can't wait to generate them for my own datasets.
In trying to get started, I would say that most of the documentation was quite helpful, except when it came to understanding the various input files that are needed to actually run the pipeline. For context, I am fairly familiar with Nextflow, Docker, the CLI, and Python, and am running LAAVA on AWS. And yes, I did read this section of the documentation covering the inputs.
At a high level, my biggest suggestions are to list all possible input files, describe each of their fields, indicate how each file and its fields relate to other files and their fields wherever applicable, and explain how the presence or absence of the files (or their fields) affects the subsequent analysis and pipeline outputs.
To illustrate my confusion given the current documentation, here is a more specific set of questions that came up for me as I was trying to piece together how to get started with LAAVA. I know it's a lot, but if you can answer any of them, especially the bolded questions, that would be hugely appreciated!
- The current repo includes 4 `params.*.json` files. Are there parameters beyond the ones listed in those files? If so, what are they and where are they described?
- In some cases, the `params.*.json` files include a `seq_reads_folder` parameter, but in other cases they include a `seq_reads_file` parameter. Which should you use when? Where else should your configuration be adjusted if you use one vs. the other? (The first sketch below this list shows the kind of file I mean.)
- All the params files included with the repo include `local` in their filename. What needs to change if you're running the pipeline in the cloud?
- Some of the `params.*.json` files have a `sample_in_metadata` parameter. It seems like that parameter is potentially needed when you use `seq_reads_folder` as the way to identify the samples. Is that accurate? What are the possible fields of the metadata file? The example file includes `sample_unique_id` and `sample_display_name`. How do those relate to the BAM or FASTQ filenames in the `seq_reads_folder`, and where do they get used and/or output by the pipeline? (See the second sketch below this list.)
- Other parameters that specify files include `vector_bed`, `vector_fa`, `packaging_fa`, `host_fa`, and `flipflop_fa`. Why are the `vector_bed` files all called `...annotation.bed`? Why are the `vector_fa` files all called `...construct.fasta`? Why do the `vector_fa` files have a `.fasta` extension while the `packaging_fa` files have a `.fa` extension? Why are the `vector_fa` files in the `test/samples` folder rather than the `test/fasta` folder?
- Do the rows/coordinates in the `...annotation.bed` file pertain only to the `...construct.fasta` file? Do any of those features need to be included in the `...packaging.fa` file?
- I expected that adding features to the `...annotation.bed` file would be reflected in the resulting reports. For example, I am interested in gathering stats about how many reads include the various portions of the plasmid, but including the coordinates for those features in the annotation file seemingly had no impact on the outputs of the pipeline. What is the `...annotation.bed` file actually used for?
- If you look in the `test/samples` folder, there's also a `sc.reference_names.tsv` file. Does that file ever get used as an input? I'm not seeing any documentation for this file, and it seems to be an intermediate output of the pipeline when I run it.
- Other parameters in the `params.*.json` files include `itr_label_1`, `itr_label_2`, `mitr_label`, `repcap_name`, `helper_name`, `flipflop_name`, `flipflop_fa`, `target_gap_threshold`, `max_allowed_outside_vector`, `max_allowed_missing_flanking`, and `container_version`. Where are these documented? Which ones are optional? What are the possible values of each, and how is the analysis affected by each value? What other files or file fields must be kept in sync with the values of these parameters? (For example, presumably the `repcap_name` parameter must match the name of the repcap sequence used in the `Packaging plasmids and other sequences` FASTA file; are there any other files/values to be aware of? The third sketch below this list shows my current guess.)
- Looking at the inputs documentation, it's not explicitly clear which row of the table corresponds to which parameter/input file. I was able to piece it together, but I'd recommend making it explicit.
- Again, looking at the inputs documentation: what changes if you use a BED4 vs. BED12 file for the `Vector genome annotation`, and does this have any effect on the outputs of the pipeline? (See the fourth sketch below this list.)
- Again, looking at the inputs documentation: how is the analysis affected by adding additional `Vector genome annotation` rows?
- Again, looking at the inputs documentation: it looks like you expect one or more FASTA files, each with one or more records. Why not combine these into a single input file?
- Again, looking at the inputs documentation: the `Packaging plasmids and other sequences` file is listed as optional. How does the analysis change if this file is present or absent, and how does adding additional sequences to it affect the outputs of the analysis?
- Again, looking at the inputs documentation: the `Expected Source` column of the table is confusing at best.
- Again, looking at the inputs documentation: below the table, there's a section that starts with "Labels used to guide the handling of the input files above:". What do you mean by labels? Presumably you're referring to some of the fields of the input files, but how exactly do those fields guide the handling of the files? Don't the files guide the interpretation of the fields/labels?
- Again, looking at the inputs documentation: in the text below the table, you frequently say that the labels "are case-sensitive and must match exactly." Match what exactly? Each other? Fields from other files? If so, which other files?
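To make a few of these questions concrete, here are some sketches. First, the kind of `params.*.json` file I'm asking about: the parameter names are taken from the repo's example files, but all the paths/values below are placeholders I made up, and I'm assuming `seq_reads_folder` and `seq_reads_file` are mutually exclusive:

```json
{
  "seq_reads_folder": "s3://my-bucket/laava-inputs/bams/",
  "sample_in_metadata": "s3://my-bucket/laava-inputs/samples.tsv",
  "vector_fa": "s3://my-bucket/laava-inputs/my.construct.fasta",
  "vector_bed": "s3://my-bucket/laava-inputs/my.annotation.bed",
  "packaging_fa": "s3://my-bucket/laava-inputs/packaging.fa"
}
```

My guess is that the single-sample variant swaps `seq_reads_folder` for a `seq_reads_file` pointing at one BAM/FASTQ and perhaps drops `sample_in_metadata`, but that is exactly what I'd like confirmed.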
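Second, my best guess at a minimal `sample_in_metadata` file, using the two fields from the example file. I've written it as a TSV with made-up values; whether `sample_unique_id` has to match the BAM/FASTQ basenames in `seq_reads_folder` is the part I'm unsure about:

```tsv
sample_unique_id	sample_display_name
sample01	Construct A, replicate 1
sample02	Construct A, replicate 2
```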
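Third, how I currently assume the label parameters tie the files together. Every name below is invented for illustration; the assumption I want checked is that `itr_label_1`/`itr_label_2`/`mitr_label` refer to the name column of the BED file, while `repcap_name`/`helper_name` refer to FASTA record headers:

```text
# my.annotation.bed (BED4: chrom, start, end, name)
# Assumption: the first column must match the record name in my.construct.fasta
myVector	0	145	ITR-L
myVector	145	2310	transgene
myVector	2310	2455	ITR-R

# packaging.fa -- the header I assume repcap_name must match, case-sensitively
>pRepCap
ACGT...
```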
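Finally, for the BED4 vs. BED12 question, here is the same made-up feature written both ways. BED12 just adds the standard score, strand, thick-region, color, and block columns; what I can't tell from the docs is whether the pipeline reads any of those extra columns:

```text
# BED4
myVector	145	2310	transgene
# BED12 (adds: score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts)
myVector	145	2310	transgene	0	+	145	2310	0	1	2165,	0,
```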
Anyway, there's plenty more I could ask, but that should give a sense of the kinds of things I've been getting stuck on while trying to get started. Any help here is definitely appreciated and will likely benefit other researchers. I'm also happy to share the details of my AWS setup if that would help anyone.