Support missing data in alignments

#1825 added support for alignments, which passed through arguments for supporting missing data to the haplotypes method. However, the initial implementation was wrong. Consider the following example:

```python
    # 2.00┊   4     ┊
    #     ┊ ┏━┻┓    ┊
    # 1.00┊ ┃  3    ┊
    #     ┊ ┃ ┏┻┓   ┊
    # 0.00┊ 0 1 2 5 ┊
    #     0        10
    #      |      |
    #  pos 2      9
    #  anc A      T
    @tests.cached_example
    def ts(self):
        ts = tskit.Tree.generate_balanced(3, span=10).tree_sequence
        tables = ts.dump_tables()
        tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
        tables.sites.add_row(2, ancestral_state="A")
        tables.sites.add_row(9, ancestral_state="T")
        tables.mutations.add_row(site=0, node=0, derived_state="G")
        tables.mutations.add_row(site=1, node=3, derived_state="C")
        return tables.tree_sequence()
```

Then if we call ``ts.as_fasta(reference_sequence="0123456789")`` we get
```
>n0
01G345678T
>n1
01A345678C
>n2
01A345678C
>n5
01N345678N
```
I think this is wrong because we're filling in the sites with Ns for ``n5``, but we're returning the reference for all other sites, despite the topology stating that ``n5`` is unknown for the full sequence length. So, I think ``n5`` should be all "N"s here.

In general, I think we should return missing data for any part of the sequence where a node is isolated (there's a corner case here to consider regarding mutations above those nodes at sites in the missing region, though). This won't be possible without implementing ``alignments`` in C. If we're doing that, we should also probably think about how to implement indels etc (see #1775), which are now logically possible within the discrete-genome alignments framework.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support missing data in alignments #1896

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Support missing data in alignments #1896

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions