Skip to content

Added output/input comparison file#6

Open
lyla-kumari wants to merge 6 commits intodrtconway:developmentfrom
lyla-kumari:testing
Open

Added output/input comparison file#6
lyla-kumari wants to merge 6 commits intodrtconway:developmentfrom
lyla-kumari:testing

Conversation

@lyla-kumari
Copy link
Copy Markdown

No description provided.

@lyla-kumari
Copy link
Copy Markdown
Author

Tested missing variants in input files and output files separately (without running svelt again)

Haven't done coverage testing yet

@lyla-kumari lyla-kumari changed the base branch from main to development July 18, 2025 04:15
for rec in vcf:
original_ids = rec.info.get("ORIGINAL_IDS")
if not original_ids:
continue
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a row in the output of svelt does not contain ORIGINAL_IDS, that is an error in the output, so the script should complain, even if it keeps going. I'd consider writing out rec to stdout or you could even make a file containing the problem records.

ids = set()
for rec in vcf:
if rec.id and rec.id != ".":
clean_id = rec.id.split('_', 1)[-1] if '_' in rec.id else rec.id
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty worried about this line. If it's not empty (Does pysam return "." or None for absent IDs?), we shouldn't be modifying it, otherwise it might collide with other IDs in this file!

Can you explain why this line is there? There may be an excellent reason, but I can't see it.

for rec in vcf:
if rec.id and rec.id != ".":
clean_id = rec.id.split('_', 1)[-1] if '_' in rec.id else rec.id
ids.add(clean_id)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a very good idea to check to see if the ID is in the set already, so you can produce a warning about duplicate IDs in the input.


def extract_input_ids(input_files):
sample_to_ids = {}
all_ids = set()
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDs are not guaranteed to be unique between different VCFs. So the same ID value could refer to two different variants in two VCF files. Accordingly, I don''t understand why you make a set of all the IDs in the inputs.

all_ids = set()
for path in input_files:
sample = os.path.basename(path).split('.')[0]
vcf = pysam.VariantFile(path)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider using a with statement here with pysam.VariantFile(path) as vcf: to make sure the file gets closed. With the 3 samples we're using it's not going to be a problem, but it's a good defensive coding practice to use, to make sure that you don't get strange errors from running out of file descriptors.

If you're not familiar with with statements, I can explain them, or you can read up on them.

import os
from collections import defaultdict

def extract_input_ids(input_files):
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would usually organise the code slightly differently, and make a function for reading the IDs from 1 file, and then either do the loop in the caller, or add an extra function. A function that does something with a file kind of makes logical sense, so helps guide the reader through the code.

Cargo.toml Outdated
[package]
name = "svelt"
version = "0.1.14"
version = "0.1.15"
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR isn''t changing svelt itself, so we don't really need a version bump.

original_ids = rec.info.get("ORIGINAL_IDS")
if not original_ids:
continue
if isinstance(original_ids, (list, tuple)):
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is wrong, because it loses track of which ID came from which original file.

If you've got a bare string, then split on ,, then whether it was originally a list or tuple, or you split it, you want to iterate and put the ID into an appropriate corresponding set.

So you should return a list of sets corresponding to the original files. Does that make sense?

svelt_ids.update(oid for oid in original_ids.split(',') if oid != '.')
return svelt_ids

def compare_ids(sample_to_ids, all_input_ids, svelt_ids):
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't do at all what we need. Because the IDs are not unique between different files, you can't process them this way.

You need to compare the IDs from the first input file with the IDs in the first position of ORIGINAL_IDS, from the second input file with the ones from the second position of ORIGINAL_IDS, and so on.

Copy link
Copy Markdown
Owner

@drtconway drtconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tat looks pretty good. Does it work? :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants