Request: Improved Documentation for Pipeline Input Files #96

@CharlesARoy

Description


Hello,

First off, huge kudos to everyone involved in the creation of this amazing tool — it's really impressive work! The reports it outputs are super slick and I can't wait to generate them for my own datasets.

In trying to get started, I would say that most of the documentation was quite helpful, except when it came to understanding the various input files that are needed to actually run the pipeline. For context, I am fairly familiar with Nextflow, Docker, the CLI, and Python, and am running LAAVA on AWS. And yes, I did read this section of the documentation covering the inputs.

At a high level, my biggest suggestions are to list all possible input files, describe each file's fields, indicate how each file and its fields relate to other files and their fields wherever applicable, and explain how the presence or absence of each file (or its fields) affects the subsequent analysis and pipeline outputs.


To illustrate my confusion given the current documentation, here is a more specific set of questions that came up for me as I was trying to piece together how to get started with LAAVA. I know it's a lot, but if you can answer any of them, especially the bolded questions, that would be hugely appreciated!

  1. The current repo includes 4 params....json files. Are there parameters beyond the ones listed in those files? If so, what are they and where are they described?
  2. In some cases, the params....json files include a seq_reads_folder parameter, but in other cases they include a seq_reads_file parameter. Which should you use when? Where else should your configuration be adjusted if you use one vs the other? (A stripped-down sketch of the params file I've been testing with appears after this list.)
  3. All the params files included with the repo include local in their filename. What needs to change if you're running the pipeline in the cloud? (The launch sketch after this list shows how I've been invoking it on AWS.)
  4. Some of the params....json files have a sample_in_metadata parameter. It seems like that parameter is potentially needed when you use seq_reads_folder as a way to identify the samples. Is that accurate? What are the possible fields of the metadata file? The example file includes sample_unique_id and sample_display_name. How do those relate to the BAM or FASTQ filenames in the seq_reads_folder, and where do they get used and/or output by the pipeline? (See the metadata sketch after this list for my best guess.)
  5. Other parameters that specify files include vector_bed, vector_fa, packaging_fa, host_fa, and flipflop_fa. Why are the vector_bed files all called ...annotation.bed? Why are the vector_fa files all called ...construct.fasta? Why do the vector_fa files have a .fasta extension while the packaging_fa files have a .fa extension? Why are the vector_fa files in the test/samples folder rather than the test/fasta folder?
  6. Do the rows/coordinates in the ...annotation.bed file only pertain to the ...construct.fasta file? Do any of those features need to be included in the ...packaging.fa file?
  7. I expected that adding features to the ...annotation.bed file would be reflected in the resulting reports. For example, I'm interested in gathering stats on how many reads include the various portions of the plasmid, but adding the coordinates for those features to the annotation file seemingly had no impact on the pipeline's outputs. What is the ...annotation.bed file actually used for? (The BED sketch after this list shows the extra features I tried.)
  8. If you look in the test/samples folder, there's also a sc.reference_names.tsv file. Does that file ever get used as an input? I'm not seeing any documentation about this file and it seems like it's an intermediate output of the pipeline when I run it.
  9. Other parameters in the params....json files include itr_label_1, itr_label_2, mitr_label, repcap_name, helper_name, flipflop_name, flipflop_fa, target_gap_threshold, max_allowed_outside_vector, max_allowed_missing_flanking, and container_version. Where are these documented? Which ones are optional? What are the possible values of each, and how does each value affect the analysis? What other files or file fields must be in sync with the values of these parameters? (For example, presumably the repcap_name parameter must match the name of the repcap sequence used in the Packaging plasmids and other sequences FASTA file — are there any other files/values to be aware of? My working assumption here is illustrated in the packaging FASTA sketch after this list.)
  10. Looking at the inputs documentation, it's not made explicit which row of the table corresponds to which parameter/input file. I was able to piece it together, but I'd recommend spelling it out.
  11. Again, looking at the inputs documentation, what changes if you use a BED4 vs BED12 file for the Vector genome annotation and does this have any effect on the outputs of the pipeline?
  12. Again, looking at the inputs documentation, how is the analysis affected by adding additional Vector genome annotation rows?
  13. Again, looking at the inputs documentation, it looks like you expect one or more FASTA files, each of which has one or more records. Why not combine these into a single input file?
  14. Again, looking at the inputs documentation, the Packaging plasmids and other sequences file is listed as optional. How does the analysis change if this file is or is not present and/or how does adding additional sequences to this file affect the outputs of the analysis?
  15. Again, looking at the inputs documentation, the Expected Source column of the table is confusing at best; what is that column meant to convey?
  16. Again, looking at the inputs documentation: below the table, there's a section that starts with "Labels used to guide the handling of the input files above:". What do you mean by labels? Presumably you're referring to some of the fields of the input files, but how exactly do those fields guide the handling of the files? Don't the files guide the interpretation of the fields/labels?
  17. Again, looking at the inputs documentation: in the section below the table, you frequently say that the labels "are case-sensitive and must match exactly." Match what, exactly? Each other? Fields from other files? If so, which other files?
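
To make a few of these questions concrete, here are the sketches referenced in the list above. Everything in them comes from my own experimental setup, so all names, paths, and values are illustrative guesses rather than anything the docs confirm.

First, for questions 1 and 2, a stripped-down version of the params file I've been testing with. The parameter names are copied from the repo's params....json examples; the S3 paths are mine, and the numeric thresholds are just the values I tried, not recommendations:

```json
{
  "seq_reads_folder": "s3://my-bucket/laava/reads/",
  "sample_in_metadata": "s3://my-bucket/laava/sample_metadata.tsv",
  "vector_bed": "s3://my-bucket/laava/myvector.annotation.bed",
  "vector_fa": "s3://my-bucket/laava/myvector.construct.fasta",
  "packaging_fa": "s3://my-bucket/laava/packaging.fa",
  "host_fa": "s3://my-bucket/laava/hg38.fa",
  "itr_label_1": "ITR-L",
  "itr_label_2": "ITR-R",
  "repcap_name": "pRepCap",
  "helper_name": "pHelper",
  "target_gap_threshold": 200,
  "max_allowed_outside_vector": 100,
  "max_allowed_missing_flanking": 100
}
```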
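
For question 3, this is roughly how I've been launching the pipeline on AWS. The -params-file option is standard Nextflow, and Nextflow resolves s3:// paths natively, which is why the params above point at S3; whether LAAVA needs anything beyond that for cloud execution is exactly what I'm unsure about:

```sh
nextflow run <path-to-laava> -params-file params.my_aws.json
```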
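
For question 4, this is the shape of metadata TSV I've been assuming, with the two columns from the example file. My guess is that sample_unique_id has to match the BAM/FASTQ basenames in seq_reads_folder, but that's exactly the kind of thing I'd like the docs to confirm or correct:

```
sample_unique_id	sample_display_name
sample01	Vector Lot A
sample02	Vector Lot B
```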
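
For questions 6, 7, 11, and 12, here is the kind of BED4 annotation file I experimented with. I'm assuming the first column has to match the record name in the ...construct.fasta file and that the ITR feature names have to match itr_label_1/itr_label_2 in the params; the promoter and transgene rows are the extra features I added hoping they would show up in the report stats. Coordinates refer to my own construct and are illustrative only:

```
myvector	0	145	ITR-L
myvector	145	1800	CAG_promoter
myvector	1800	4300	transgene
myvector	4300	4445	ITR-R
```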
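
Finally, for questions 9 and 17, this is my working assumption about how the name parameters relate to the packaging FASTA: that repcap_name and helper_name must exactly match record IDs in the packaging_fa file, and that this is what "case-sensitive and must match exactly" refers to. If that's right, stating it explicitly in the docs would answer most of my question; if it's wrong, that's precisely my confusion:

```
>pRepCap  record ID assumed to match the repcap_name parameter
ACGTACGTACGT...
>pHelper  record ID assumed to match the helper_name parameter
ACGTACGTACGT...
```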

Anyway, there's plenty more I could ask, but that should give a sense of the kinds of things I've been getting stuck on while trying to get started. Any help here is definitely appreciated and will likely benefit other researchers too. I'm also happy to share the details of my AWS setup if that would help anyone.
