-
Notifications
You must be signed in to change notification settings - Fork 79
Description
#1825 added support for alignments, which passed through arguments for supporting missing data to the haplotypes method. However, the initial implementation was wrong. Consider the following example:
# 2.00┊ 4 ┊
# ┊ ┏━┻┓ ┊
# 1.00┊ ┃ 3 ┊
# ┊ ┃ ┏┻┓ ┊
# 0.00┊ 0 1 2 5 ┊
# 0 10
# | |
# pos 2 9
# anc A T
@tests.cached_example
def ts(self):
ts = tskit.Tree.generate_balanced(3, span=10).tree_sequence
tables = ts.dump_tables()
tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
tables.sites.add_row(2, ancestral_state="A")
tables.sites.add_row(9, ancestral_state="T")
tables.mutations.add_row(site=0, node=0, derived_state="G")
tables.mutations.add_row(site=1, node=3, derived_state="C")
return tables.tree_sequence()Then if we call ts.as_fasta(reference_sequence="0123456789") we get
>n0
01G345678T
>n1
01A345678C
>n2
01A345678C
>n5
01N345678N
I think this is wrong because we're filling in the sites with Ns for n5, but we're returning the reference for all other sites, despite the topology stating that n5 is unknown for the full sequence length. So, I think n5 should be all "N"s here.
In general, I think we should return missing data for any part of the sequence where a node is isolated (there's a corner case here to consider regarding mutations above those nodes at sites in the missing region, though). This won't be possible without implementing alignments in C. If we're doing that, we should also probably think about how to implement indels etc (see #1775), which are now logically possible within the discrete-genome alignments framework.