Skip to content

Nasty gotcha in the AFS calculation - mutation parents #2729

@lukashuebner

Description

@lukashuebner

I get strange results when trying to calculate the AFS for the following dataset:

// Same as the example from the PLOS paper but with back and recurring mutations
/*
0.25┊     8   ┊         ┊         ┊
    ┊   ┏━┻━┓ ┊         ┊         ┊
0.20┊   ┃   ┃ ┊         ┊   7     ┊
    ┊   ┃   ┃ ┊         ┊ ┏━┻━┓   ┊
0.17┊   6   ┃ ┊   6     ┊ ┃   ┃   ┊
    ┊ ┏━┻┓  ┃ ┊ ┏━┻━┓   ┊ ┃   ┃   ┊
0.09┊ ┃  5  ┃ ┊ ┃   5   ┊ ┃   5   ┊
    ┊ ┃ ┏┻┓ ┃ ┊ ┃ ┏━┻┓  ┊ ┃ ┏━┻┓  ┊
0.07┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃  4  ┊ ┃ ┃  4  ┊
    ┊ ┃ ┃ ┃ ┃ ┊ ┃ ┃ ┏┻┓ ┊ ┃ ┃ ┏┻┓ ┊
0.00┊ 0 1 3 2 ┊ 0 1 2 3 ┊ 0 1 2 3 ┊
  0.00      2.00      7.00      10.00
*/
char const* multi_tree_back_recurrent_nodes = "1  0       -1   0\n"
                                              "1  0       -1   0\n"
                                              "1  0       -1   1\n"
                                              "1  0       -1   1\n"
                                              "0  0.071   -1   -1\n"
                                              "0  0.090   -1   -1\n"
                                              "0  0.170   -1   -1\n"
                                              "0  0.202   -1   -1\n"
                                              "0  0.253   -1   -1\n";

char const* multi_tree_back_recurrent_edges = "2 10 4 2\n"
                                              "2 10 4 3\n"
                                              "0 10 5 1\n"
                                              "0 2  5 3\n"
                                              "2 10 5 4\n"
                                              "0 7  6 0,5\n"
                                              "7 10 7 0,5\n"
                                              "0 2  8 2,6\n";

char const* multi_tree_back_recurrent_sites = "1      0\n"
                                              "4.5    0\n"
                                              "8.5    0\n";

/* site, node, derived_state, [parent, time] */
char const* multi_tree_back_recurrent_mutations = "0    6   1\n"  // to derived state
                                                  "0    5   0\n"  // back to ancestral state
                                                  "0    3   1\n"  // and once more to the derived state
                                                  "1    5   1\n"  // to derived state
                                                  "1    4   0\n"  // back to ancestral state
                                                  "2    4   1\n"; // to derived state

/* Two (diploid) individuals */
char const* multi_tree_back_recurrent_individuals = "0      0.2,1.5    -1,-1\n"
                                                    "0      0.0,0.0    -1,-1\n";

// [...]
tsk_treeseq_t tskit_tree_sequence;
tsk_id_t      samples[]          = {0, 1, 2, 3};
tsk_size_t    sample_set_sizes[] = {4, 0};
double        ref_result[5];
int           ret;

tsk_treeseq_from_text(
      &tskit_tree_sequence,
      10,
      multi_tree_back_recurrent_nodes,
      multi_tree_back_recurrent_edges,
      NULL,
      multi_tree_back_recurrent_sites,
      multi_tree_back_recurrent_mutations,
      multi_tree_back_recurrent_individuals,
      NULL,
      0
);

// Polarised AFS
ret = tsk_treeseq_allele_frequency_spectrum(
      &tskit_tree_sequence,
      1,
      sample_set_sizes,
      samples,
      0,
      NULL,
      TSK_STAT_POLARISED,
      ref_result
);
assert(ret == 0);
fmt::print("tskit AFS: {}\n", fmt::join(ref_result, ", "));
// Output:          tskit AFS: 0, 0, 1, 1, 0
// Expected output: tskit AFS: 0, 1, 2, 0, 0

Am I doing something wrong when calling the tskit functions? Shouldn't the entries in the AFS histogram always sum up to the number of sites (3 in this case)?

I also find something else unintuitive, maybe you can help: If all sites are in the derived state, why is the site counted as if all sites were in ancestral state?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions