Skip to content

Push window_len down into the core e-divisive computation #84

@henrikingo

Description

@henrikingo

In Otava, Piotr introduced window_len as a user facing parameter, and for a given data point, only window_len/2 data points are included in the computation. This gives better quality results, in particular it makes it possible to find two adjacent change points. This is because computation is focused on the local neighborhood of each candidate change point, and data points far away are completely ignored.

In Piotr's version, the input data is split into 2*datapoints/window_len windows and the regular e-divisive is run on each window. After this there is a process to merge the small windows together. The windows also overlap, so the same point could be found once or twice, with different p-value for each.

Mathematically it would be more elegant, and consistent, and potentially also a performance improvement (because we don't count every point twice because no need for overlap) to rather just do a single pass, as in the original e-divisive, but incorporate window length into the core e-divisive code. This could be something like this:

https://github.com/mongodb/signal-processing-algorithms/blob/74d6a8fa3a377ee5dce4821ba9dfd2f2528790b2/src/signal_processing_algorithms/e_divisive.py#L222-L250

    length = window_len
    n = 1   # Note: this is 2 in the quoted code, but that is unnecessarily conservative.
    w_start = max( n - window_len / 2, 0)
    w_stop = min( n + window_len / 2, len(series) )
    m = w_stop - n

    # term1 = sum(diffs[i][j] for i in range(w_start, n) for j in range(n, w_stop))
    term1 = np.sum(diffs[w_start:n, n:w_stop])

    # term2 = sum(diffs[i][k] for i in range(w_start, n) for k in range(i + 1, n))
    term2 = np.sum(np.triu(diffs[w_start:n, w_start:n], 0))

    # term3 = sum(diffs[j][k] for j in range(n, w_stop)
    #                         for k in range(j + 1, w_stop)) 
    term3 = np.sum(np.triu(diffs[n:w_stop, n + 1 : w_stop], 0))

    qhat_values[n] = self.calculate_q(term1, term2, term3, m, n)

    for n in range(3, (length - 2)):
        w_start = max( n - window_len / 2, 0)
        w_stop = min( n + window_len / 2, len(series) )
        m = w_stop - n

        # TODO: may need a +1 and / or -1 somewhere
        column_delta = np.sum(diffs[n - 1, w_start : n - 1])
        row_delta = np.sum(diffs[n:w_stop, n - 1])

        term1 = term1 - column_delta + row_delta
        term2 = term2 + column_delta
        term3 = term3 - row_delta

        qhat_values[n] = self.calculate_q(term1, term2, term3, m, n-w_start)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions