-
Notifications
You must be signed in to change notification settings - Fork 21
Description
In Otava, Piotr introduced window_len as a user facing parameter, and for a given data point, only window_len/2 data points are included in the computation. This gives better quality results, in particular it makes it possible to find two adjacent change points. This is because computation is focused on the local neighborhood of each candidate change point, and data points far away are completely ignored.
In Piotr's version, the input data is split into 2*datapoints/window_len windows and the regular e-divisive is run on each window. After this there is a process to merge the small windows together. The windows also overlap, so the same point could be found once or twice, with different p-value for each.
Mathematically it would be more elegant, and consistent, and potentially also a performance improvement (because we don't count every point twice because no need for overlap) to rather just do a single pass, as in the original e-divisive, but incorporate window length into the core e-divisive code. This could be something like this:
length = window_len
n = 1 # Note: this is 2 in the quoted code, but that is unnecessarily conservative.
w_start = max( n - window_len / 2, 0)
w_stop = min( n + window_len / 2, len(series) )
m = w_stop - n
# term1 = sum(diffs[i][j] for i in range(w_start, n) for j in range(n, w_stop))
term1 = np.sum(diffs[w_start:n, n:w_stop])
# term2 = sum(diffs[i][k] for i in range(w_start, n) for k in range(i + 1, n))
term2 = np.sum(np.triu(diffs[w_start:n, w_start:n], 0))
# term3 = sum(diffs[j][k] for j in range(n, w_stop)
# for k in range(j + 1, w_stop))
term3 = np.sum(np.triu(diffs[n:w_stop, n + 1 : w_stop], 0))
qhat_values[n] = self.calculate_q(term1, term2, term3, m, n)
for n in range(3, (length - 2)):
w_start = max( n - window_len / 2, 0)
w_stop = min( n + window_len / 2, len(series) )
m = w_stop - n
# TODO: may need a +1 and / or -1 somewhere
column_delta = np.sum(diffs[n - 1, w_start : n - 1])
row_delta = np.sum(diffs[n:w_stop, n - 1])
term1 = term1 - column_delta + row_delta
term2 = term2 + column_delta
term3 = term3 - row_delta
qhat_values[n] = self.calculate_q(term1, term2, term3, m, n-w_start)