Added output/input comparison file#6
Added output/input comparison file#6lyla-kumari wants to merge 6 commits intodrtconway:developmentfrom
Conversation
|
Tested missing variants in input files and output files separately (without running svelt again) Haven't done coverage testing yet |
compare_svelt_output_to_inputs.py
Outdated
| for rec in vcf: | ||
| original_ids = rec.info.get("ORIGINAL_IDS") | ||
| if not original_ids: | ||
| continue |
There was a problem hiding this comment.
If a row in the output of svelt does not contain ORIGINAL_IDS, that is an error in the output, so the script should complain, even if it keeps going. I'd consider writing out rec to stdout or you could even make a file containing the problem records.
compare_svelt_output_to_inputs.py
Outdated
| ids = set() | ||
| for rec in vcf: | ||
| if rec.id and rec.id != ".": | ||
| clean_id = rec.id.split('_', 1)[-1] if '_' in rec.id else rec.id |
There was a problem hiding this comment.
I'm pretty worried about this line. If it's not empty (Does pysam return "." or None for absent IDs?), we shouldn't be modifying it, otherwise it might collide with other IDs in this file!
Can you explain why this line is there? There may be an excellent reason, but I can't see it.
compare_svelt_output_to_inputs.py
Outdated
| for rec in vcf: | ||
| if rec.id and rec.id != ".": | ||
| clean_id = rec.id.split('_', 1)[-1] if '_' in rec.id else rec.id | ||
| ids.add(clean_id) |
There was a problem hiding this comment.
It would be a very good idea to check to see if the ID is in the set already, so you can produce a warning about duplicate IDs in the input.
compare_svelt_output_to_inputs.py
Outdated
|
|
||
| def extract_input_ids(input_files): | ||
| sample_to_ids = {} | ||
| all_ids = set() |
There was a problem hiding this comment.
IDs are not guaranteed to be unique between different VCFs. So the same ID value could refer to two different variants in two VCF files. Accordingly, I don''t understand why you make a set of all the IDs in the inputs.
compare_svelt_output_to_inputs.py
Outdated
| all_ids = set() | ||
| for path in input_files: | ||
| sample = os.path.basename(path).split('.')[0] | ||
| vcf = pysam.VariantFile(path) |
There was a problem hiding this comment.
I would consider using a with statement here with pysam.VariantFile(path) as vcf: to make sure the file gets closed. With the 3 samples we're using it's not going to be a problem, but it's a good defensive coding practice to use, to make sure that you don't get strange errors from running out of file descriptors.
If you're not familiar with with statements, I can explain them, or you can read up on them.
compare_svelt_output_to_inputs.py
Outdated
| import os | ||
| from collections import defaultdict | ||
|
|
||
| def extract_input_ids(input_files): |
There was a problem hiding this comment.
I would usually organise the code slightly differently, and make a function for reading the IDs from 1 file, and then either do the loop in the caller, or add an extra function. A function that does something with a file kind of makes logical sense, so helps guide the reader through the code.
Cargo.toml
Outdated
| [package] | ||
| name = "svelt" | ||
| version = "0.1.14" | ||
| version = "0.1.15" |
There was a problem hiding this comment.
This PR isn''t changing svelt itself, so we don't really need a version bump.
compare_svelt_output_to_inputs.py
Outdated
| original_ids = rec.info.get("ORIGINAL_IDS") | ||
| if not original_ids: | ||
| continue | ||
| if isinstance(original_ids, (list, tuple)): |
There was a problem hiding this comment.
This code is wrong, because it loses track of which ID came from which original file.
If you've got a bare string, then split on ,, then whether it was originally a list or tuple, or you split it, you want to iterate and put the ID into an appropriate corresponding set.
So you should return a list of sets corresponding to the original files. Does that make sense?
compare_svelt_output_to_inputs.py
Outdated
| svelt_ids.update(oid for oid in original_ids.split(',') if oid != '.') | ||
| return svelt_ids | ||
|
|
||
| def compare_ids(sample_to_ids, all_input_ids, svelt_ids): |
There was a problem hiding this comment.
This function doesn't do at all what we need. Because the IDs are not unique between different files, you can't process them this way.
You need to compare the IDs from the first input file with the IDs in the first position of ORIGINAL_IDS, from the second input file with the ones from the second position of ORIGINAL_IDS, and so on.
drtconway
left a comment
There was a problem hiding this comment.
Tat looks pretty good. Does it work? :)
No description provided.