@cjbattey cjbattey commented Oct 4, 2020

This PR adds more informative error messages for merging sample metadata. The old version worked correctly but didn't report which sample IDs were mismatched. It also had a slightly odd behavior: extra rows added to the metadata table were silently dropped when we reindexed on the genotype sample names. The new version checks that the metadata and genotype sample vectors are the same length and that all genotype IDs are in the table. Any mismatches or duplicates are printed to the screen.

Adding a duplicate row to sample_data now prints:

reading VCF  
[read_vcf] 11527 rows in 0.49s; chunk in 0.49s (23444 rows/s)   
[read_vcf] all done (23442 rows/s)   
error: problem merging genotypes and sample_data  
duplicate sample_data entries: ['msp_458']  

Changing one of the sample_data IDs prints:

[read_vcf] 11527 rows in 0.48s; chunk in 0.48s (24112 rows/s)  
[read_vcf] all done (24110 rows/s)  
error: problem merging genotypes and sample_data  
 vcf samples missing from sample_data: ['msp_458']   
 sample_data samples missing from vcf: ['msp_typo_test']  

and deleting a row in sample_data prints:

[read_vcf] 11527 rows in 0.48s; chunk in 0.48s (24112 rows/s)  
[read_vcf] all done (24110 rows/s)  
error: problem merging genotypes and sample_data  
 vcf samples missing from sample_data: ['msp_458']   
 sample_data samples missing from vcf: []  
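
For readers who want to see the logic in one place, the checks described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it uses plain lists and the standard library instead of numpy/pandas, the helper name `check_sample_ids` and its return value are hypothetical, and `msp_457` is a made-up ID used only for the demo.

```python
from collections import Counter


def check_sample_ids(vcf_samples, metadata_ids):
    """Return a list of error lines; an empty list means the IDs match.

    Hypothetical helper: the name, signature, and return value are
    illustrative and are not taken from the PR.
    """
    problems = []

    # IDs that occur more than once in the metadata table.
    counts = Counter(metadata_ids)
    duplicates = sorted(s for s, n in counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate sample_data entries: {duplicates}")
        return problems  # stop early, as the diff does after reporting duplicates

    # IDs present on one side only (order follows the input lists).
    metadata_set = set(metadata_ids)
    vcf_set = set(vcf_samples)
    missing_from_metadata = [s for s in vcf_samples if s not in metadata_set]
    missing_from_vcf = [s for s in metadata_ids if s not in vcf_set]
    if missing_from_metadata or missing_from_vcf:
        problems.append(f"vcf samples missing from sample_data: {missing_from_metadata}")
        problems.append(f"sample_data samples missing from vcf: {missing_from_vcf}")
    return problems


# Demo with made-up inputs mirroring the three cases above.
vcf = ["msp_457", "msp_458"]
for metadata in (
    ["msp_457", "msp_458", "msp_458"],  # duplicate row
    ["msp_457", "msp_typo_test"],       # one ID changed
    ["msp_457"],                        # one row deleted
):
    for line in check_sample_ids(vcf, metadata):
        print(line)
```

Building sets once keeps each membership test constant-time, which may matter for large sample lists; the list comprehensions preserve input order so the reported IDs are easy to find in the original files.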

np.unique(sample_data['sampleID2'])[np.where(sample_data_counts>1)])
return
missing_from_metadata=[x not in np.array(sample_data['sampleID2']) for x in samples]
missing_from_vcf=[x not in samples for x in sample_data['sampleID2']]
Member

I'm working on a version of locator that can run on a subset of samples within a zarr file - that might be good to bounce to from here (or make that an option)?

np.array(sample_data['sampleID2'])[missing_from_vcf])

def sort_samples(samples):
samples = np.array(samples.astype('str'))
Member

thumbs up
