Request: Improved Documentation for Pipeline Input Files #96

@CharlesARoy

Description


Hello,

First off, huge kudos to everyone involved in the creation of this amazing tool — it's really impressive work! The reports it outputs are super slick and I can't wait to generate them for my own datasets.

In trying to get started, I would say that most of the documentation was quite helpful, except when it came to understanding the various input files that are needed to actually run the pipeline. For context, I am fairly familiar with Nextflow, Docker, the CLI, and Python, and am running LAAVA on AWS. And yes, I did read this section of the documentation covering the inputs.

At a high level, my biggest suggestions are to list all possible input files, describe each file's fields, indicate how each file and its fields relate to other files and their fields wherever applicable, and explain how the presence or absence of each file (or its fields) affects the subsequent analysis and pipeline outputs.


To illustrate my confusion given the current documentation, here is a more specific set of questions that came up for me as I was trying to piece together how to get started with LAAVA. I know it's a lot, but if you can answer any of them, especially the bolded questions, that would be hugely appreciated!

  1. The current repo includes 4 params....json files. Are there parameters beyond the ones listed in those files? If so, what are they and where are they described?
  2. In some cases, the params....json files include a seq_reads_folder parameter, but in other cases they include a seq_reads_file parameter. Which should you use when? Where else should your configuration be adjusted if you use one vs the other? (A stripped-down sketch of the params file I've been testing with appears after this list.)
  3. All the params files included with the repo include local in their filename. What needs to change if you're running the pipeline in the cloud? (The launch sketch after this list shows how I've been invoking it on AWS.)
  4. Some of the params....json files have a sample_in_metadata parameter. It seems like that parameter is potentially needed when you use seq_reads_folder as a way to identify the samples. Is that accurate? What are the possible fields of the metadata file? The example file includes sample_unique_id and sample_display_name. How do those relate to the BAM or FASTQ filenames in the seq_reads_folder, and where do they get used and/or output by the pipeline? (See the metadata sketch after this list for my best guess.)
  5. Other parameters that specify files include vector_bed, vector_fa, packaging_fa, host_fa, and flipflop_fa. Why are the vector_bed files all called ...annotation.bed? Why are the vector_fa files all called ...construct.fasta? Why do the vector_fa files have a .fasta extension while the packaging_fa files have a .fa extension? Why are the vector_fa files in the test/samples folder rather than the test/fasta folder?
  6. Do the rows/coordinates in the ...annotation.bed file only pertain to the ...construct.fasta file? Do any of those features need to be included in the ...packaging.fa file?
  7. I expected that adding features to the ...annotation.bed file would be reflected in the resulting reports. For example, I'm interested in gathering stats on how many reads include the various portions of the plasmid, but adding the coordinates for those features to the annotation file seemingly had no impact on the pipeline's outputs. What is the ...annotation.bed file actually used for? (The BED sketch after this list shows the extra features I tried.)
  8. If you look in the test/samples folder, there's also a sc.reference_names.tsv file. Does that file ever get used as an input? I'm not seeing any documentation about this file and it seems like it's an intermediate output of the pipeline when I run it.
  9. Other parameters in the params....json files include itr_label_1, itr_label_2, mitr_label, repcap_name, helper_name, flipflop_name, flipflop_fa, target_gap_threshold, max_allowed_outside_vector, max_allowed_missing_flanking, and container_version. Where are these documented? Which ones are optional? What are the possible values of each, and how does each value affect the analysis? What other files or file fields must be in sync with the values of these parameters? (For example, presumably the repcap_name parameter must match the name of the repcap sequence used in the Packaging plasmids and other sequences FASTA file — are there any other files/values to be aware of? My working assumption here is illustrated in the packaging FASTA sketch after this list.)
  10. Looking at the inputs documentation, it's not made explicit which row of the table corresponds to which parameter/input file. I was able to piece it together, but I'd recommend spelling it out.
  11. Again, looking at the inputs documentation, what changes if you use a BED4 vs BED12 file for the Vector genome annotation and does this have any effect on the outputs of the pipeline?
  12. Again, looking at the inputs documentation, how is the analysis affected by adding additional Vector genome annotation rows?
  13. Again, looking at the inputs documentation, it looks like you expect one or more FASTA files, each of which has one or more records. Why not combine these into a single input file?
  14. Again, looking at the inputs documentation, the Packaging plasmids and other sequences file is listed as optional. How does the analysis change if this file is or is not present and/or how does adding additional sequences to this file affect the outputs of the analysis?
  15. Again, looking at the inputs documentation, the Expected Source column of the table is confusing at best; what is that column meant to convey?
  16. Again, looking at the inputs documentation: below the table, there's a section that starts with "Labels used to guide the handling of the input files above:". What do you mean by labels? Presumably you're referring to some of the fields of the input files, but how exactly do those fields guide the handling of the files? Don't the files guide the interpretation of the fields/labels?
  17. Again, looking at the inputs documentation: in the section below the table, you frequently say that the labels "are case-sensitive and must match exactly." Match what, exactly? Each other? Fields from other files? If so, which other files?
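
To make a few of these questions concrete, here are the sketches referenced in the list above. Everything in them comes from my own experimental setup, so all names, paths, and values are illustrative guesses rather than anything the docs confirm.

First, for questions 1 and 2, a stripped-down version of the params file I've been testing with. The parameter names are copied from the repo's params....json examples; the S3 paths are mine, and the numeric thresholds are just the values I tried, not recommendations:

```json
{
  "seq_reads_folder": "s3://my-bucket/laava/reads/",
  "sample_in_metadata": "s3://my-bucket/laava/sample_metadata.tsv",
  "vector_bed": "s3://my-bucket/laava/myvector.annotation.bed",
  "vector_fa": "s3://my-bucket/laava/myvector.construct.fasta",
  "packaging_fa": "s3://my-bucket/laava/packaging.fa",
  "host_fa": "s3://my-bucket/laava/hg38.fa",
  "itr_label_1": "ITR-L",
  "itr_label_2": "ITR-R",
  "repcap_name": "pRepCap",
  "helper_name": "pHelper",
  "target_gap_threshold": 200,
  "max_allowed_outside_vector": 100,
  "max_allowed_missing_flanking": 100
}
```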
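
For question 3, this is roughly how I've been launching the pipeline on AWS. The -params-file option is standard Nextflow, and Nextflow resolves s3:// paths natively, which is why the params above point at S3; whether LAAVA needs anything beyond that for cloud execution is exactly what I'm unsure about:

```sh
nextflow run <path-to-laava> -params-file params.my_aws.json
```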
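
For question 4, this is the shape of metadata TSV I've been assuming, with the two columns from the example file. My guess is that sample_unique_id has to match the BAM/FASTQ basenames in seq_reads_folder, but that's exactly the kind of thing I'd like the docs to confirm or correct:

```
sample_unique_id	sample_display_name
sample01	Vector Lot A
sample02	Vector Lot B
```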
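
For questions 6, 7, 11, and 12, here is the kind of BED4 annotation file I experimented with. I'm assuming the first column has to match the record name in the ...construct.fasta file and that the ITR feature names have to match itr_label_1/itr_label_2 in the params; the promoter and transgene rows are the extra features I added hoping they would show up in the report stats. Coordinates refer to my own construct and are illustrative only:

```
myvector	0	145	ITR-L
myvector	145	1800	CAG_promoter
myvector	1800	4300	transgene
myvector	4300	4445	ITR-R
```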
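
Finally, for questions 9 and 17, this is my working assumption about how the name parameters relate to the packaging FASTA: that repcap_name and helper_name must exactly match record IDs in the packaging_fa file, and that this is what "case-sensitive and must match exactly" refers to. If that's right, stating it explicitly in the docs would answer most of my question; if it's wrong, that's precisely my confusion:

```
>pRepCap  record ID assumed to match the repcap_name parameter
ACGTACGTACGT...
>pHelper  record ID assumed to match the helper_name parameter
ACGTACGTACGT...
```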

Anyway, there's plenty more I could ask, but that should give a sense of the kinds of things I've been getting stuck on while trying to get started. Any help here is definitely appreciated and will likely benefit other researchers too. I'm also happy to share the details of my AWS setup if that would help anyone.
