Skip to content

Support missing data in alignments #1896

@jeromekelleher

Description

@jeromekelleher

#1825 added support for alignments, which passed through arguments for supporting missing data to the haplotypes method. However, the initial implementation was wrong. Consider the following example:

    # 2.00┊   4     ┊
    #     ┊ ┏━┻┓    ┊
    # 1.00┊ ┃  3    ┊
    #     ┊ ┃ ┏┻┓   ┊
    # 0.00┊ 0 1 2 5 ┊
    #     0        10
    #      |      |
    #  pos 2      9
    #  anc A      T
    @tests.cached_example
    def ts(self):
        ts = tskit.Tree.generate_balanced(3, span=10).tree_sequence
        tables = ts.dump_tables()
        tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
        tables.sites.add_row(2, ancestral_state="A")
        tables.sites.add_row(9, ancestral_state="T")
        tables.mutations.add_row(site=0, node=0, derived_state="G")
        tables.mutations.add_row(site=1, node=3, derived_state="C")
        return tables.tree_sequence()

Then if we call ts.as_fasta(reference_sequence="0123456789") we get

>n0
01G345678T
>n1
01A345678C
>n2
01A345678C
>n5
01N345678N

I think this is wrong because we're filling in the sites with Ns for n5, but we're returning the reference for all other sites, despite the topology stating that n5 is unknown for the full sequence length. So, I think n5 should be all "N"s here.

In general, I think we should return missing data for any part of the sequence where a node is isolated (there's a corner case here to consider regarding mutations above those nodes at sites in the missing region, though). This won't be possible without implementing alignments in C. If we're doing that, we should also probably think about how to implement indels etc (see #1775), which are now logically possible within the discrete-genome alignments framework.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions