Conversation

Contributor

@Sowiks Sowiks commented Nov 11, 2025

This is a PR to replace the MongoDB signal_processing_algorithms package with an internal implementation, as discussed in https://lists.apache.org/thread/4vwp79kmsjd3zbf4fjcgkggf33jot65c . I tried to make minimal changes to the existing code (analysis.py and series.py). Better integration is possible, but it can wait till the next PR.

There is, however, an issue that I identified during my implementation of the methodology from A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data, [Matteson and James](https://arxiv.org/abs/1306.4933). Long story short, it doesn't seem that signal_processing_algorithms==1.3.5 has a correct implementation. Namely, let's look at section 2.2, Estimating the Location of a Change Point, from the paper, more specifically formula (7) and the following discussion in the last paragraph of that section. I quote them here for your convenience:

... Let $Z_1, \cdots, Z_T \in \mathbb{R}^d$ be an independent sequence of observations and let $1 \leq \tau < \kappa \leq T$ be constants. Now define the following sets $X_\tau = \{ Z_1, Z_2, \cdots , Z_\tau \}$ and $Y_\tau(\kappa) = \{ Z_{\tau + 1}, Z_{\tau + 2} , \cdots , Z_\kappa \}$. A change point location $\hat{\tau}$ is then estimated as
$$(\hat{\tau}, \hat{\kappa}) = \text{arg}\max\limits_{(\tau, \kappa)} \hat{Q} (X_\tau, Y_\tau(\kappa); \alpha).$$
... If it is known that at most one change point exists, we set $\kappa = T$. Otherwise, the variable $\kappa$ is introduced to alleviate a weakness of bisection, as mentioned in Venkatraman (1992), in which it may be more difficult to detect certain types of distributional changes in the multiple change point setting using only bisection. For example, if we fix $\kappa = T$ and the set $Y_\tau(T)$ contains observations across multiple change points (e.g., distinct distributions), then it is possible that the resulting mixture distribution in $Y_\tau(T)$ is indistinguishable from the distribution of the observations in $X_\tau$, even when $\tau$ corresponds to a valid change point. We avoid this confounding by allowing $\kappa$ to vary, with minimal computational cost by storing the distances mentioned above. This modification to bisection is similar to that taken in Olshen and Venkatraman (2004).

The main idea of that section is to allow $\kappa$ to vary, not to simply set it to the end of the series, $\kappa=T$. However, when I implemented the methodology as in the paper (allowing $\tau < \kappa \leq T$ to vary), the tigerbeetle tests failed. With a little experimentation I found that the erroneous implementation (with fixed $\kappa=T$) resolves the issues with the tigerbeetle tests, which is unlikely to be a coincidence. I think this is because the signal_processing_algorithms package has a mistake/typo in it where $\kappa$ is fixed at $T$ (at least in version 1.3.5). Moreover, this would also explain observations from Hunter: Using Change Point Detection to Hunt for Performance Regressions [Fleming et al.] that caught my eye. In section 3.3, Fixed-Sized Windows, the authors say:

As we began using Hunter on larger and larger data series, we discovered that change points identified in previous runs would suddenly disappear from Hunter’s results. This issue turned out to be caused by performance regressions that were fixed shortly after being introduced. This is a known issue with E-divisive means and is discussed in [5]. Because E-divisive means divides the time series into two parts, most of the data points on either side of the split showed similar values. The algorithm, therefore, by design, would treat the two nearby changes as a temporary anomaly, rather than a persistent change, and therefore filter it out.

The issue they discuss in that section seems to be related to the same idea: if you don't allow $\kappa$ to vary, the algorithm might miss some change points if they are within the interval. I wonder if they also used a fixed $\kappa$, which led to the described issues.
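To make the difference concrete, here is a minimal sketch (not the PR's code) of the two estimators; qhat(tau, kappa) stands in for $\hat{Q}$ evaluated on the split series[:tau] vs series[tau:kappa] and is assumed to be given:

def best_tau_fixed_kappa(series, qhat):
    '''Erroneous variant: kappa pinned to T, as signal_processing_algorithms 1.3.5 appears to do.'''
    T = len(series)
    return max(range(1, T), key=lambda tau: qhat(tau, T))

def best_tau_varying_kappa(series, qhat):
    '''Formula (7) from the paper: maximize over all pairs tau < kappa <= T.'''
    T = len(series)
    pairs = [(tau, kappa) for kappa in range(2, T + 1) for tau in range(1, kappa)]
    tau, _kappa = max(pairs, key=lambda p: qhat(*p))
    return tau

The second form corresponds to commit-2 below; the first reproduces the old behavior.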

Nevertheless, going back to this PR. It contains three commits:

  1. The first commit matches the output of signal_processing_algorithms in all tests. If there is an error in my logic somewhere, this is the commit with which to replace the signal_processing_algorithms package.
  2. The second commit also corrects the fixed-$\kappa$ issue. This produces different results in the tigerbeetle tests, which were corrected here.
  3. Edited: added support for Python 3.8 and 3.9.

Finally, I included some visualizations of the tigerbeetle tests for commit-1 vs commit-2 so you can see whether they make sense.

import matplotlib.pyplot as plt
import numpy as np
series = [26705, 26475, 26641, 26806, 26835, 26911, 26564, 26812, 26874, 26682, 15672, 26745, 26460, 26977, 2685
 23547, 23674, 23519, 23670, 23662, 23462, 23750, 23717, 23524, 23588, 23687, 23793, 23937, 23715, 23570, 23730, 23690, 23699, 23670, 23860, 23988, 23652, 23681, 23798, 23728, 23604, 23523, 23412, 23685, 23773, 23771, 23718, 23409, 23739, 23674, 23597, 23682, 23680, 23711, 23660, 23990, 23938, 23742, 23703, 23536, 24363, 24414, 24483, 24509, 24944, 24235, 24560, 24236, 24667, 24730, 28346, 28437, 28436, 28057, 28217, 28456, 28427, 28398, 28250, 28331, 28222, 28726, 28578, 28345, 28274, 28514, 28590, 28449, 28305, 28411, 28788, 28404, 28821, 28580, 27483, 26805, 27487, 27124, 26898, 27295, 26951, 27312, 27660, 27154, 27050, 26989, 27193, 27503, 27326, 27375, 27513, 27057, 27421, 27574, 27609, 27123, 27824, 27644, 27394, 27836, 27949, 27702, 27457, 27272, 28207, 27802, 27516, 27586, 28005, 27768, 28543, 28237, 27915, 28437, 28342, 27733, 28296, 28524, 28687, 28258, 28611, 29360, 28590, 29641, 28965, 29474, 29256, 28611, 28205, 28539, 27962, 28398, 28509, 28240, 28592, 28102, 28461, 28578, 28669, 28507, 28535, 28226, 28536, 28561, 28087, 27953, 28398, 28007, 28518, 28337, 28242, 28607, 28545, 28514, 28377, 28010, 28412, 28633, 28576, 28195, 28637, 28724, 28466, 28287, 28719, 28425, 28860, 28842, 28604, 28327, 28216, 28946, 28918, 29287, 28725, 29148, 29541, 29137, 29628, 29087, 28612, 29154, 29108, 28884, 29234, 28695, 28969, 28809, 28695, 28634, 28916, 29852, 29389, 29757, 29531, 29363, 29251, 29552, 29561, 29046, 29795, 29022, 29395, 28921, 29739, 29257, 29455, 29376, 29528, 28909, 29492, 28984, 29621, 29026, 29457, 29102, 29114, 28924, 29162, 29259, 29554, 29616, 29211, 29367, 29460, 28836, 29645, 29586, 28848, 29324, 28969, 29150, 29243, 29081, 29312, 28923, 29272, 29117, 29072, 29529, 29737, 29652, 29612, 29856, 29012, 30402, 29969, 29309, 29439, 29285, 29421, 29023, 28772, 29692, 29416, 29267, 29542, 29904, 30045, 29739, 29945, 29141, 29163, 29765, 29197, 29441, 28910, 29504, 29614, 29643, 29506, 29420, 29672, 29432, 29784, 29888, 29309, 29247, 29816, 29254, 29813, 29451, 29382, 29618, 28558, 29845, 29499, 29283, 29184, 29246, 28790, 29952, 29145, 29415, 30437, 29227, 29605, 29859, 29156, 29807, 29406, 29734, 29861, 29140, 29983, 29832, 29919, 29896, 29991, 29266, 29001, 29459, 29548, 29310, 29042, 29303, 29894, 29091, 29018, 29537, 29614, 29180, 29736, 29500, 29218, 29581, 28906, 28542, 29306, 28987, 29878, 28865, 30272, 29707, 29662, 29815, 30492, 29347, 30096, 29054, 30238, 28813, 31895, 28915]
def plot(old, new):
    plt.style.use('ggplot')
    plt.plot(series)
    plt.plot(old, np.take(series, old), 'o')
    plt.plot(new, np.take(series, new), 'kx')
    plt.legend(['Data', 'Old', 'New'])
    plt.show()
# window_len=30, max_pvalue=0.01, min_magnitude=0.05
plot(old=[27, 71], new=[15, 71])
# window_len=30, max_pvalue=0.05, min_magnitude=0.05
plot(old=[16, 71], new=[15, 71])
# window_len=30, max_pvalue=0.1, min_magnitude=0.05
plot(old=[16, 71], new=[10, 11, 15, 71, 363])
# window_len=30, max_pvalue=0.2, min_magnitude=0.05
plot(old=[16, 71], new=[10, 11, 15, 71])
# window_len=30, max_pvalue=0.2, min_magnitude=0.0
plot(
    old=[16, 27, 29, 56, 58, 60, 61, 69, 71, 82, 83, 91, 95, 108, 114, 116, 117, 131, 138, 142, 148, 165, 167, 178, 187, 189, 190, 192, 206, 212, 213, 220, 241, 243, 244, 246, 247, 249, 260, 266, 268, 272, 274, 275, 278, 282, 284, 288, 295, 297, 311, 314, 325, 330, 347, 351],
    new=[3, 6, 7, 10, 11, 13, 15, 16, 28, 29, 35, 37, 39, 41, 44, 48, 49, 56, 58, 61, 65, 66, 69, 71, 74, 76, 82, 95, 108, 117, 125, 126, 129, 131, 136, 137, 142, 148, 165, 169, 187, 190, 192, 197, 200, 212, 220, 241, 243, 246, 247, 249, 250, 260, 265, 266, 268, 278, 282, 288, 305, 306, 325, 330, 337, 338, 340, 347, 349, 363]
)
# window_len=30, max_pvalue=0.1, min_magnitude=0.0
plot(
    old=[16, 27, 29, 56, 58, 61, 71, 82, 95, 113, 116, 117, 131, 138, 142, 148, 157, 165, 167, 178, 187, 189, 192, 206, 212, 213, 220, 246, 247, 249, 260, 266, 268, 272, 278, 282, 311, 312, 325, 330, 347, 351],
    new=[3, 6, 10, 11, 15, 16, 28, 29, 35, 37, 39, 41, 44, 48, 49, 61, 71, 95, 117, 131, 142, 148, 165, 169, 192, 206, 212, 260, 265, 268, 278, 282, 288, 305, 363]
)
# window_len=30, max_pvalue=0.01, min_magnitude=0.0
plot(
    old=[27, 61, 71, 82, 95, 131, 142, 148, 192, 212, 249, 260, 265, 353],
    new=[15, 26, 61, 71, 95, 117, 131, 142, 148, 165, 169, 192, 212, 260]
)
# window_len=30, max_pvalue=0.001, min_magnitude=0.0
plot(
    old=[71, 95, 113, 131, 142, 148, 192, 212, 260],
    new=[15, 61, 71, 95, 117, 131, 142, 148, 192, 212, 260]
)
# window_len=30, max_pvalue=0.0001, min_magnitude=0.0
plot(
    old=[71, 95, 113, 131, 192, 212],
    new=[71, 95, 117, 131, 142, 148, 192, 212]
)
# window_len=30, max_pvalue=0.00001, min_magnitude=0.0
plot(old=[71, 95, 131, 192, 212], new=[71, 95, 131, 192, 212])

@Sowiks Sowiks marked this pull request as ready for review November 11, 2025 22:19
@henrikingo
Contributor

Thanks a lot @Sowiks for this! You have a valuable skill in being able to grasp the academic-level math and then still explain your findings to normal people with simple pictures. Btw, this is why I like this tigerbeetle demo dataset from 2023. In 200+ points it exercises many of the phenomena you might encounter in this field, and so it captured your bug, or fix rather, too.

Amazingly, I vaguely remember how this happened at MongoDB back then. I remember asking about this kappa, and the people who had read the Matteson and James paper (I would read it much later) explained that we can choose a value for it freely. So we did, and I never thought of it again. We thought of it as a parameter we could choose, not that we were supposed to use all values. Since the by-the-book algorithm ends in a Monte Carlo simulation, we apparently accepted the fact that the reference implementation in R often produced different change points.

So it seems with your fix the algorithm will perform even better than it ever did. (And even now Otava has outperformed all alternatives by a good margin!) It now seems to hit the blind spots that always annoyed me. In a way, Piotr's approach of applying small windows kind of achieves the same behavior.

Do I understand correctly that running this Kappa from 0 to T is exactly the same as if I would start with two points, then append one point at a time to the timeseries, re-running otava between each step, and then keeping all change points found along the way? If yes, then it means that storing the previous results becomes the norm and we should pay more attention to a format and api for doing that.

Will review code over the weekend but from the text and pictures I can already tell this is good stuff. Thanks for contributing!

@henrikingo
Contributor

Btw, your illustrations also nicely show that with the bugs fixed, unless you're really lax about perf regressions, min_magnitude is actually unnecessary. It has IMO historically been used to cover up bugs. (Of which this is not the first one.)

Contributor

@henrikingo henrikingo left a comment


It was a joy to review this. My comments are on understandability, code comments, and naming.

Oh, please coordinate with Aleksander and the 0.7.0 release on when to merge this.

left_interval = interval
right_interval = intervals[i + 1]
break
elif (interval.start is None or interval.start < candidate.index) and (interval.stop is None or candidate.index < interval.stop):
Contributor

How can interval start ever be None? If the end points aren't known I'd say it's not an interval at all?

Contributor Author

Here, None is not "unknown" but a value corresponding to either the start or the end of the list. When you call array[i:j], it creates a slice slice(i, j), and you get the result of array[slice(i, j)]. However, in Python you can omit the starting and/or ending parameter of a slice: array[0:i] == array[:i] == array[slice(None, i)] and array[i:len(array)] == array[i:] == array[slice(i, None)]. This code is just to support such slices.
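For illustration, such None bounds can be normalized before doing comparisons like the one above; a minimal sketch, not the PR's code:

from typing import Tuple

def bounds(interval: slice, length: int) -> Tuple[int, int]:
    # None start means "from the beginning"; None stop means "to the end".
    start = 0 if interval.start is None else interval.start
    stop = length if interval.stop is None else interval.stop
    return start, stop

assert bounds(slice(None, 5), 10) == (0, 5)   # array[:5]
assert bounds(slice(3, None), 10) == (3, 10)  # array[3:]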

Contributor

Ah, makes sense. Could you add a short comment somewhere in the class definition?

Contributor Author

Will do

1. Divisive algorithm.
if the candidate is a new potential change point, i.e., its index is inside any interval, then
we split the interval by the candidate's index to get left and right subseries.
2. Merge step in t-test algorithm.
Contributor

Terminology: I would say any variant of the algorithm can use a t-test, a permutation test, or something else. The Merge step is part of what I'd call the split-merge strategy, or perhaps the weak change points version. I guess the Hunter paper just called this Fixed size windows, but I could think of many kinds of windows (such as sliding) that don't need to be merged. And this is at least closely related to weak change points, but I'm unsure whether weak change points will be needed now? Otoh, maybe we'll soon get rid of the split-merge too, in which case we can ignore this discussion.

Contributor

Oh, maybe we haven't been introduced properly. I'm the one who is serious about naming things. David is the one who knows all about cache invalidation.

Contributor Author

I agree that the terminology is sloppy here :( By "Divisive algorithm" I meant the process of splitting the interval (0, len(series)) into subintervals via change points. By "Merge step" I meant the process of merging split intervals when we eliminate weak change points. It doesn't have to be a t-test specifically, correct.

pts = algo.get_change_points(series)
return pts, None
def compute_change_points_orig(series: Sequence[SupportsFloat], max_pvalue: float = 0.001, seed: Optional[int] = None) -> Tuple[PermCPList, Optional[PermCPList]]:
tester = PermutationsSignificanceTester(alpha=max_pvalue, permurations=100, calculator=PairDistanceCalculator, seed=seed)
Contributor

permutations

Contributor Author

Thanks! Will correct the typo. But at least it's a consistent typo :)

@dataclass
class CandidateChangePoint:
'''Candidate for a change point. The point that maximizes Q-hat function on [start:end+1] slice'''
index: int
Contributor

Maybe it would be clearer if somewhere you also use either:

start < index <= end

or the equivalent

(start, end]

and then say, "...which corresponds to the slice [start:end+1] in Python, as well as range(start, end+1)".

Could add a mention that indexes are always the original index from the full series [0,T] and not zero-based for each interval. (start < index <= end, never 0 < index <= end-start)

A change point at index 0 is impossible, because a single point just cannot change. But this IMO follows logically; it is not a pre-condition. (There's a fun philosophical debate here: where exactly are the change points? In the case of a series of git commits, it is clear that the change point is the commit that causes the regression/improvement. Others could argue change is what happens in the gaps between the points of measurement.) In Otava we index change points from 1 to T, because this way they match their corresponding test result in the input series. So you could conclude that Otava is in the camp that change happens, or is observed at least, at the point after the change.

Anyway, should we add an invariant here that index > 0?
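(For concreteness, such an invariant could look like the sketch below, reusing the two fields visible in the diff; the __post_init__ hook is just one way to do it:)

from dataclasses import dataclass

@dataclass
class CandidateChangePoint:
    '''Candidate for a change point; index is into the full series, so index > 0.'''
    index: int
    qhat: float

    def __post_init__(self):
        # A change point cannot sit at index 0: there is nothing before it to change from.
        assert self.index > 0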

Contributor Author

  1. I will correct the comments so they consistently talk in terms of slices, not indexes. (I will keep indexes in calculator.py because it's explicit there where we switch from slices to indexes, and, in my opinion, the formulas are easier to follow in index notation there.)
  2. I was thinking of doing that in the next PR, when I make a better integration across Otava. One of the criteria for the current implementation was to minimize changes to the existing code in Otava for easier review.

Contributor

Ok with 2.

change_points[index] = tester.change_point(cp.to_candidate(), series, intervals)

recompute(weakest_cp_index)
recompute(weakest_cp_index + 1)
Contributor

Do we test for the case where weakest_cp_index == max(index)?

Contributor Author

We don't. Good catch! It should be

if weakest_cp_index == len(change_points):
    recompute(weakest_cp_index - 1)
else:
    recompute(weakest_cp_index)
    recompute(weakest_cp_index + 1)

kappas = np.arange(start + 2, end + 2)[None, :]

A = np.zeros((end - start, end - start))
A_coefs = 2 / (kappas - start)
Contributor

Just for discussion... but it is my opinion that these coefficients are added to the formula only to pass some robustness "almost certainly in the limit" proof. They are never really explained or justified the way every other term is. They have the effect of muting the q values at the ends of an interval, and significantly inflating those in the middle. As a result, given many good candidate q values, the algorithm tends to pick change points first in the middle of a series.

Anyway, it's for the future, but to me those are fair game once we want to make any mods to this implementation.

Contributor Author

They scale the values of the estimate to be unbiased (in the statistical sense). I strongly suggest keeping them if we care about the theoretical support behind the methods. With that being said, the Hunter paper introduces the use of the t-test without any theory behind it (unless I missed something), and it's still being used.
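For reference, this is the sample divergence from Section 2 of the Matteson and James paper, as I read it, with the coefficients in question:

$$\hat{\mathcal{E}}(X_n, Y_m; \alpha) = \frac{2}{mn}\sum_{i=1}^{n}\sum_{j=1}^{m} |X_i - Y_j|^{\alpha} - \binom{n}{2}^{-1}\sum_{1 \leq i < k \leq n} |X_i - X_k|^{\alpha} - \binom{m}{2}^{-1}\sum_{1 \leq j < k \leq m} |Y_j - Y_k|^{\alpha},$$

$$\hat{Q}(X_n, Y_m; \alpha) = \frac{mn}{m+n} \hat{\mathcal{E}}(X_n, Y_m; \alpha).$$

Each coefficient is the reciprocal of the number of pairs in its sum (e.g., $\binom{n}{2}^{-1} = \frac{2}{n(n-1)}$), which is exactly what makes each term an unbiased estimate of the corresponding expected distance; the 2 / (kappas - start)-style factors in the diff appear to play this role.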

Contributor

There's some value in the theoretical soundness, yes. And you are correct that the use of the t-test was purely based on empirical observations coupled with subjective judgement: it's faster, deterministic, and finds more things that upon inspection a human agrees were valid bugs. (The Hunter paper also introduced a method for generating real-world but still objective test data sets, but the chronology of development is that the changes to the algorithm were done first.) That said, the project was done by two PhDs, one of whom was in math, and they tested other significance tests too. So it wasn't as uneducated as just me looking at some graphs and picking the one I like. (My above comment is in that category for sure, and like I said, this is just discussion.)

And theoretically speaking I think the t-test is wrong, because performance test results are not known to be normally distributed. Unless of course the thinking is that ultimately everything is.

C[:-1, 1:] = C_coefs[:-1, 1:] * np.flipud(np.cumsum(np.flipud(H[1:, 1:]), axis=0))

# Element of matrix `Q_{i, j}` is equal to `Q(τ, κ) = Q(i + 1, j + 2) = QQ(sequence[start : i + 1], sequence[i + 1 : j + 2])`.
# So, critical point is `τ = i + 1`.
Contributor

The fact that you end up with parameters i+1, j+2 here, rather than the more common i, j+1, suggests to me you need to shift your indexing to the left one step: (0,T) or )0,T), not (1,T+1).

Contributor

If you don't have time to do that in the near future, the shifting of indexes can be a separate follow-up task too.

Contributor Author

Here it is because the values of Q are shifted with respect to the series. The function Q is defined on non-empty consecutive subseries, so the shortest possible case is Q(series[0:1], series[1:2]), which corresponds to Q[0, 0]. The reason I didn't keep Q[i, j] ~ Q(series[0:i], series[i:j]) was to reduce the matrix sizes by cutting out columns and rows with only zeros. I guess I tried optimizing what I could :)
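A toy sanity check of the mapping described above (illustrative only, not code from the PR):

# Q[i, j] ~ Q(series[start : tau], series[tau : kappa]) with
# tau = start + i + 1 and kappa = start + j + 2 (for j >= i).
start = 0
for i in range(3):
    for j in range(i, 3):
        tau, kappa = start + i + 1, start + j + 2
        assert tau - start >= 1 and kappa - tau >= 1  # both subseries non-empty
# The shortest case i = j = 0 gives Q(series[0:1], series[1:2]) ~ Q[0, 0],
# so no all-zero rows or columns need to be stored.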

Contributor Author

Let me know if you would prefer to pad matrices with zeros for the sake of indexing.

Contributor

No this is ok. I think it's clearer now on second read.


def get_candidate_change_point(self, interval: slice) -> CandidateChangePoint:
'''For a given `slice(start, stop)` finds potential critical point in subsequence series[slice],
i.e., from index `start` to `stop - 1` inclusive. For simplicity, we'll use `end = stop - 1`.
Contributor

To be super clear, maybe use "interval" and (,] notation if you are talking math, and slice with [] and range() in a Python context.

Contributor Author

Will do


Q = self._get_Q_vals(start, end)
i, j = np.unravel_index(np.argmax(Q), Q.shape)
return CandidateChangePoint(index=i + 1 + start, qhat=Q[i][j])
Contributor

Also this feels reassuringly familiar <3

@Sowiks
Contributor Author

Sowiks commented Nov 16, 2025


Thank you for the flattering review :)

Regarding your question:

> Do I understand correctly that running this Kappa from 0 to T is exactly the same as if I would start with two points, then append one point at a time to the timeseries, re-running otava between each step, and then keeping all change points found along the way?

It's kind of a loaded question, but the short answer is no. However, I think you'll be interested in the long answer.

It's not exactly the same, because as we add a point to the end of the series, it might cause a different point to become the best candidate. The minimal example that I came up with is [0, 29, 60] and adding 27 to it:

>>> series = np.array([0, 29, 60])
>>> calculator.get_next_candidate(slice(0, None)) #  whole series
CandidateChangePoint(index=2, qhat=41.33333333333333) # it's x_2 = 60

However,

>>> series = np.array([0, 29, 60, 27])
>>> calculator.get_next_candidate(slice(0, None)) #  whole series
CandidateChangePoint(index=1, qhat=41.5) # it's x_1 = 29

Now, the intuition here is that when we had only three points, the jump 0 -> 29 is smaller than 29 -> 60, so 60 has the best potential to be a change point. When we added the 4th point, we had enough evidence to see that the jump 0 -> [29, 60, 27] has the most potential now. So, the answer is "no" in general. However, I'm not sure if this remains possible as the length of the sequence increases, and whether one point can still make a difference.

Next, if I understand correctly why you are asking this question, it is because of the issues described in the Hunter paper. Namely, the claim:

> As we began using Hunter on larger and larger data series, we discovered that change points identified in previous runs would suddenly disappear from Hunter’s results. This issue turned out to be caused by performance regressions that were fixed shortly after being introduced. This is a known issue with E-divisive means and is discussed in [5]. Because E-divisive means divides the time series into two parts, most of the data points on either side of the split showed similar values. The algorithm, therefore, by design, would treat the two nearby changes as a temporary anomaly, rather than a persistent change, and therefore filter it out.
> Figure 1 illustrates this issue.

[Figure 1 from the Hunter paper]

I tried to generate data similar to the one on the picture and run a few tests:

>>> def figure1_test(N):
...     base = 440 + np.random.randn(N) * 5
...     drop = 400 + np.random.randn(N) * 5
...     recover = 445 + np.random.randn(N) * 5
...     series = np.concatenate((base, drop, recover))
...     tester = PermutationsSignificanceTester(alpha=0.00001, permurations=100, calculator=PairDistanceCalculator, seed=1)
...     detector = ChangePointDetector(tester, PairDistanceCalculator)
...     points = detector.get_change_points(series)
...     return [p.index for p in points]
...
>>> figure1_test(10)
[10, 20]
>>> figure1_test(100)
[100, 200]
>>> figure1_test(1000) #  took quite a while because of the permutation tester
[1000, 2000]

As you can see, (1) the change points are correctly identified for this pattern, and (2) over a wide range of sequence lengths.

@henrikingo
Contributor

> Regarding your question:
>
> > Do I understand correctly that running this Kappa from 0 to T is exactly the same as if I would start with two points, then append one point at a time to the timeseries, re-running otava between each step, and then keeping all change points found along the way?
>
> It's kind of a loaded question, but the short answer is no. However, I think you'll be interested in the long answer.
>
> It's not exactly the same, because as we add a point to the end of the series, it might cause a different point to become the best candidate. The minimal example that I came up with is [0, 29, 60] and adding 27 to it:

When asking the question, I had slightly misunderstood where this happens: this is about generating the set of Q-values, not the set of change points found (weak or regular...).

So I think the correct use of kappa increases the set of q-values, and therefore candidate change points, so that the change points that the Hunter paper describes as missing, or disappearing rather, could be found. But you're right that only the best one will be picked, and then of course in the next iteration things have changed, so I guess it is not guaranteed that varying kappa will generate all the same change points as would be found by computing the algorithm over all {series[:1], series[:2] ... series[:N]} and just keeping the union of all change points.

Even so, the effect of:

0 < tau < kappa <= T, where kappa goes from 2 to T (your implementation)

seems to me very close to

0 < tau < kappa = t, where t goes from 2 to T (my question)

But as you point out, in the first case we may not actually pick all the change points that would be generated in the second case. Still, I feel like the potential is there, as the first case should generate the same "peaks" of q-values; it's not guaranteed, only more likely.
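A toy formulation of the second procedure, with detect standing in for a full Otava run (illustrative names, not a real API):

def union_over_prefixes(series, detect):
    '''Re-run detection on every prefix and keep the union of all change points found.'''
    found = set()
    for t in range(2, len(series) + 1):
        found |= set(detect(series[:t]))
    return found

Varying kappa sweeps the same prefixes inside a single run but commits only to the argmax at each step, which is why the two are close but not identical.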

> Next, if I understand correctly why you are asking this question, it is because of the issues described in the Hunter paper. Namely, the claim:

My motivation for the question was to understand whether this fully explains the phenomenon of change points that first are found and then disappear. It seems to me it mostly does, but we cannot say for certain it "fully" does so in all scenarios.

> def figure1_test(N):
> ...     base = 440 + np.random.randn(N) * 5
> ...     drop = 400 + np.random.randn(N) * 5
> ...     recover = 445 + np.random.randn(N) * 5
> ...     series = np.concatenate((base, drop, recover))

I think to generate a data set that the Hunter paper was concerned with, you need the drop to be short, maybe even 1-2 points only:

 drop = 400 + np.random.randn(2) * 5

@Sowiks
Contributor Author

Sowiks commented Nov 17, 2025

> I think to generate a data set that the Hunter paper was concerned with, you need the drop to be short, maybe even 1-2 points only:
>
> drop = 400 + np.random.randn(2) * 5

Correct me if I'm wrong, but my understanding was that there are two separate problems:

  1. Disappearing of previously found critical points.
  2. Not detecting the critical points in the first place (because the number of abnormal points is small).

@henrikingo
Contributor

> > I think to generate a data set that the Hunter paper was concerned with, you need the drop to be short, maybe even 1-2 points only:
> >
> > drop = 400 + np.random.randn(2) * 5
>
> Correct me if I'm wrong, but my understanding was that there are two separate problems:
>
> 1. Disappearing of previously found critical points.
> 2. Not detecting the critical points in the first place (because the number of abnormal points is small).

No, these are the same problem. The change points disappear when the interval/window they are in grows larger. I always assumed this was a feature: in a short timeseries, say 50-100 points, MongoDB e-divisive with typical parameters would ignore spikes that last a single point only, and might alert for a plateau of 2-3 points that then returns to the original level. (But even then it would only produce 1 change point, because the original MongoDB implementation needed a hard-coded 3 points before it would alert anything at all, so it is not possible to find 2 neighboring change points. This is from the Matteson paper; their R reference implementation I believe defaulted to a leading 30 points or so. Which would be a long time to wait for a Jira ticket if it was nightly builds!)

...where was I... So then, if the series keeps growing, my interpretation is that the short-lived change becomes less significant compared to the entire series, so eventually it is ignored by the algorithm, just as if it were a single point. Conversely, even a single point could trigger an alert if it was large enough. (At least assuming that the series on both of its sides aren't perfectly constant.)

The fix of adding a window is based on the above understanding: it creates a situation where the local computation doesn't take into account more than a small number of local points.

And this is why I asked earlier whether Kappa is now equivalent to observing a series grow from 1 point and computing the algorithm for every added point.

@Sowiks
Contributor Author

Sowiks commented Nov 18, 2025

I see, thank you for the clarification. I'll need to think about it.

@Sowiks
Contributor Author

Sowiks commented Nov 23, 2025

  1. Brought comments, code, and variable names across all files to the same indexing notation for sub-series. Now everything is defined in Python slice notation, i.e., array[start : end]: the first index start is always included, the last index end is always excluded. The usage of the variables start and end is also consistent throughout the files. The variable name stop is not used, except when working with the Python built-in slice object directly (in those cases the slice object's stop field is equal to the variable end).
  2. Added comments describing the original and split-merge change point detection algorithms. Got rid of MongoDB implementation references in the comments.
  3. Corrected an assert statement in analysis.py.
  4. Added comments in analysis.py explaining None values for intervals.start and intervals.stop.
  5. Renamed the _calculated_distances method to _calculate_pairwise_differences in the PairDistanceCalculator class in calculator.py.
  6. Renamed the significance threshold variable alpha to max_pvalue across all files for consistency and clarity.
  7. Corrected the typo in the variable name permurations to permutations across all files.
  8. Corrected an annotation typo for the variable power in calculator.py (int to float).
  9. Corrected a typo in calculator.py (matix to matrix).

@Sowiks
Contributor Author

Sowiks commented Nov 23, 2025

I didn't fix the recompute calls for the case weakest_cp_index == max(index) in analysis.py. I want to do this in the next PR. I have a suspicion that there is a bug there (both in the master and current implementations), but I need more time to investigate.

@henrikingo
Contributor

Thank you @Sowiks, I appreciate the attention to commenting the algorithm as understandably as possible, and the attention to detail in keeping a certain consistency in both variable naming and indexing. You already have my approval from the previous review, but I wanted to re-affirm here that I don't think I have any further comments on this PR.

We should, however, wait until Tuesday to hear from @Gerrrr how we proceed with the 0.7.0 release. I'm sensing we might create a separate branch for the backward-compatible releases. If not, then we'll just have to wait a few weeks while we iterate through the ASF voting process to get the release out of the way.

@henrikingo
Contributor

> I didn't fix the recompute calls for the case weakest_cp_index == max(index) in analysis.py. I want to do this in the next PR. I have a suspicion that there is a bug there (both in the master and current implementations), but I need more time to investigate.

First question is whether this split-merge-recompute is even needed after the re-implementation. If the problem that it fixes goes away with your addition of Kappa, then we should remove it.

@Gerrrr
Contributor

Gerrrr commented Nov 28, 2025

Great work, @Sowiks! FYI I cut a separate branch for the next 0.7.0 release - https://github.com/apache/otava/tree/0.7, so feel free to merge this PR whenever you are ready.

@henrikingo
Contributor

I guess we'll have to merge it, but @Sowiks do you have the contributor agreement (ICLA) signed with the ASF? (I didn't find a place where I could check myself.)

@dave2wave
Member

I checked and an ICLA has not yet been filed.

@henrikingo
Contributor

@Sowiks : https://www.apache.org/licenses/contributor-agreements.html

Short version is: download the ICLA PDF, sign it either with a pen or GnuPG, and email it to secretary@apache.org.

@Gerrrr
Contributor

Gerrrr commented Dec 2, 2025

@dave2wave @michaelsembwever @henrikingo to my knowledge, a casual contributor does not have to sign an ICLA. For example, I recall signing the ICLA only right before becoming an Apache Cassandra committer. Isn't that so?

@henrikingo
Contributor

At least the guide I link to above uses words like "all contributors" and "every developer". We should maybe take this to the mailing list if you want to debate it more thoroughly.

@Sowiks
Contributor Author

Sowiks commented Dec 4, 2025

Sent the form.

@henrikingo henrikingo merged commit 5c5abc7 into apache:master Dec 4, 2025
4 checks passed
@henrikingo
Contributor

Yay!
