Replacing signal_processing_algorithms with internal implementation #96
Conversation
Thanks a lot @Sowiks for this! You have a valuable skill in being able to grasp academic-level math and then still explain your findings to normal people with simple pictures. Btw, this is why I like this tigerbeetle demo dataset from 2023: in 200+ points it exercises many of the phenomena you might encounter in this field, and so it captured your bug, or fix rather, too.

Amazingly, I vaguely remember how this happened at MongoDB back then. I remember asking about this kappa, and the people who had read the Matteson and James paper (I would read it much later) explained that we could choose a value for it freely. So we did, and I never thought of it again. We thought of it as a parameter we could choose, not that we were supposed to use all values. Since the by-the-book algorithm ends in a Monte Carlo simulation, we apparently accepted the fact that the reference implementation in R often produced different change points.

So it seems with your fix the algorithm will perform even better than it ever did. (And even now Otava has outperformed all alternatives with a good margin!) It now seems to hit the blind spots that always annoyed me. In a way, Piotr's approach of applying small windows kind of achieves the same behavior.

Do I understand correctly that running this kappa from 0 to T is exactly the same as if I would start with two points, then append one point at a time to the timeseries, re-running Otava between each step, and then keeping all change points found along the way? If yes, then it means that storing the previous results becomes the norm and we should pay more attention to a format and API for doing that.

Will review the code over the weekend, but from the text and pictures I can already tell this is good stuff. Thanks for contributing!
Btw, your illustrations also nicely show that with the bugs fixed, unless you're really lax about perf regressions, min_magnitude is actually unnecessary. It has IMO historically been used to cover up bugs. (Of which this is not the first one.)
henrikingo
left a comment
It was a joy to review this. Comments are on understandability, code comments, and naming.
Oh, and please coordinate with Aleksander on the 0.7.0 release and when to merge this.
left_interval = interval
right_interval = intervals[i + 1]
break
elif (interval.start is None or interval.start < candidate.index) and (interval.stop is None or candidate.index < interval.stop):
How can interval start ever be None? If the end points aren't known I'd say it's not an interval at all?
Here, None is not "unknown", but a value corresponding to either the start or the end of the list. When you call array[i:j] it creates a slice slice(i, j) and you are getting the result of array[slice(i, j)]. However, in Python you can omit the start and/or stop parameter of a slice: array[0:i] == array[:i] == array[slice(None, i)] and array[i:len(array)] == array[i:] == array[slice(i, None)]. This code is just to support such slices.
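A minimal sketch of the containment check this enables (helper name and data are illustrative, not the PR's exact code):

def contains(interval: slice, index: int) -> bool:
    # None bounds mean "start of the series" / "end of the series"
    after_start = interval.start is None or interval.start < index
    before_stop = interval.stop is None or index < interval.stop
    return after_start and before_stop

s = [10, 20, 30, 40]
assert s[slice(None, 2)] == s[:2] == [10, 20]   # a None start behaves like 0
assert contains(slice(None, 2), 1)
assert not contains(slice(None, 2), 3)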
Ah, makes sense. Could you add a short comment somewhere in the class definition?
Will do
otava/analysis.py
Outdated
1. Divisive algorithm.
   if the candidate is a new potential change point, i.e., its index is inside any interval, then
   we split the interval by the candidate's index to get left and right subseries.
2. Merge step in t-test algorithm.
Terminology: I would say any variant of the algorithm can use a t-test, permutation test, or something else. The merge step is part of what I'd call the split-merge strategy, or perhaps the weak change points version. I guess the Hunter paper just called this fixed-size windows, but I could think of many kinds of windows (such as sliding) that don't need to be merged. And this is at least closely related to the weak change points, but I'm unsure whether weak change points will be needed now? Otoh maybe we'll soon get rid of the split-merge too, in which case we can ignore this discussion.
Oh, maybe we haven't been introduced properly. I'm the one who is serious about naming things. David is the one who knows all about cache invalidation.
I agree that the terminology is sloppy here :( By "Divisive algorithm" I meant the process of splitting the interval (0, len(series)) into subintervals via change points. By "Merge step" I meant the process of merging split intervals when we eliminate weak change points. It doesn't have to be the t-test specifically, correct.
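For illustration, a sketch of the two operations as described above (helper names are hypothetical, not the PR's API):

intervals = [slice(None, None)]  # start with the whole series, i.e. (0, len(series))

def split_at(intervals, cp_index):
    # divisive step: split the interval containing cp_index into two
    for i, iv in enumerate(intervals):
        inside = (iv.start is None or iv.start < cp_index) and \
                 (iv.stop is None or cp_index < iv.stop)
        if inside:
            return intervals[:i] + [slice(iv.start, cp_index), slice(cp_index, iv.stop)] + intervals[i + 1:]
    return intervals

def merge_at(intervals, i):
    # merge step: change point i was eliminated as weak, so rejoin its two neighbours
    merged = slice(intervals[i].start, intervals[i + 1].stop)
    return intervals[:i] + [merged] + intervals[i + 2:]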
otava/analysis.py
Outdated
pts = algo.get_change_points(series)
return pts, None

def compute_change_points_orig(series: Sequence[SupportsFloat], max_pvalue: float = 0.001, seed: Optional[int] = None) -> Tuple[PermCPList, Optional[PermCPList]]:
    tester = PermutationsSignificanceTester(alpha=max_pvalue, permurations=100, calculator=PairDistanceCalculator, seed=seed)
permutations
Thanks! Will correct the typo. But at least it's a consistent typo :)
@dataclass
class CandidateChangePoint:
    '''Candidate for a change point. The point that maximizes Q-hat function on [start:end+1] slice'''
    index: int
Maybe it would be clearer if somewhere you also use either
start < index <= end
or the equivalent
(start, end]
and then say "...which corresponds to the slice [start:end+1] in Python, as well as range(start, end+1)".

Could add a mention that indexes are always the original index from the full series [0,T] and not zero-based for each interval (start < index <= end, never 0 < index <= end-start).

A change point at index 0 is impossible, because a single point just cannot change. But this IMO follows logically; it is not a pre-condition. (There's a fun philosophical debate here about where exactly the change points are. In the case of a series of git commits, it is clear that the change point is the commit that causes the regression/improvement. Others could argue change is what happens in the gaps between the points of measurement.) In Otava we index change points from 1 to T, because this way they match their corresponding test result in the input series. So you could conclude Otava is in the camp that change happens, or is observed at least, at the point after the change.
Anyway, should we add an invariant here that index > 0?
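A minimal sketch of what that invariant could look like (this extends the quoted dataclass; an assumption, not the PR's code):

from dataclasses import dataclass

@dataclass
class CandidateChangePoint:
    '''Candidate for a change point; `index` is into the full series,
    with start < index <= end, i.e. the half-open interval (start, end].'''
    index: int
    qhat: float

    def __post_init__(self):
        # a change point at index 0 is impossible: a single point cannot change
        assert self.index > 0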
- I will correct the comments so they consistently talk in terms of slices, not indexes. (I will keep indexes in calculator.py because it's explicit there where we switch from slices to indexes; plus, in my opinion, the formulas there are easier to follow in index notation.)
- I was thinking of doing that in the next PR, when I make a better integration across Otava. One of the criteria for the current implementation was to minimize changes to the existing code in Otava for easier review.
Ok with 2.
change_points[index] = tester.change_point(cp.to_candidate(), series, intervals)

recompute(weakest_cp_index)
recompute(weakest_cp_index + 1)
Do we test for the case where weakest_cp_index == max(index)?
We don't. Good catch! It should be

if weakest_cp_index == len(change_points):
    recompute(weakest_cp_index - 1)
else:
    recompute(weakest_cp_index)
    recompute(weakest_cp_index + 1)

kappas = np.arange(start + 2, end + 2)[None, :]

A = np.zeros((end - start, end - start))
A_coefs = 2 / (kappas - start)
Just for discussion... but it is my opinion that these coefficients are added to the formula only to pass some robustness "almost certainly at the limit" proof. They are never really explained or justified the way every other term is. They have the effect of muting the q values at the ends of an interval and significantly inflating those in the middle. As a result, given many good candidate q values, the algorithm tends to pick change points first in the middle of a series.
Anyway, it's for the future, but to me those are fair game once we want to make any mods to this implementation.
They scale the values of the estimate to be unbiased (in the statistical sense). I strongly suggest keeping them if we care about the theoretical support behind the methods. With that being said, the Hunter paper introduces the use of the t-test without any theory behind it (unless I missed something), and it's still being used.
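For reference, this is my reading of the sample divergence from the Matteson and James paper (reproduced from memory here, so worth double-checking against the paper):

$\hat{\mathcal{E}}(X, Y; \alpha) = \frac{2}{\tau(\kappa-\tau)} \sum_{i=1}^{\tau} \sum_{j=\tau+1}^{\kappa} |X_i - Y_j|^{\alpha} - \binom{\tau}{2}^{-1} \sum_{1 \le i < k \le \tau} |X_i - X_k|^{\alpha} - \binom{\kappa-\tau}{2}^{-1} \sum_{\tau < j < k \le \kappa} |Y_j - Y_k|^{\alpha}$

$\hat{Q}(\tau, \kappa; \alpha) = \frac{\tau(\kappa-\tau)}{\kappa} \, \hat{\mathcal{E}}(X, Y; \alpha)$

The inverse-count coefficients ($2/(\tau(\kappa-\tau))$ and the inverse binomials) turn the pairwise-distance sums into averages over pairs, which is what makes the estimate unbiased.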
There's some value in the theoretical soundness, yes. And you are correct that the use of the t-test was purely based on empirical observations coupled with subjective judgement: it's faster, deterministic, and finds more things that upon inspection a human agrees were valid bugs. (The Hunter paper also introduced a method for generating real-world but still objective test data sets, but the chronology of development is that changes to the algorithm were done first.) That said, the project was done by two PhDs, one of which was in math, and they tested other significance tests too. So it wasn't as uneducated as just me looking at some graphs and picking the one I like. (My above comment is in that category for sure, and like I said, this is just discussion.)
And theoretically speaking I think the t-test is wrong, because performance test results are not known to be normally distributed. Unless of course the thinking is that ultimately everything is.
C[:-1, 1:] = C_coefs[:-1, 1:] * np.flipud(np.cumsum(np.flipud(H[1:, 1:]), axis=0))

# Element of matrix `Q_{i, j}` is equal to `Q(τ, κ) = Q(i + 1, j + 2) = QQ(sequence[start : i + 1], sequence[i + 1 : j + 2])`.
# So, critical point is `τ = i + 1`.
The fact that you end up with parameters i+1, j+2 here, rather than the more common i, j+1, suggests to me you need to shift your indexing to the left one step: (0,T) or (0,T], not (1,T+1).
If you don't have time to do that in the near future, the shifting of indexes can be a separate follow-up task too.
Here it is because the values of Q are shifted with respect to the series. Function Q is defined on non-empty consecutive subseries, so the shortest possible subsequences are Q(series[0:1], series[1:2]), which would correspond to Q[0, 0]. The reason I didn't keep Q[i, j] ~ Q(series[0:i], series[i:j]) was to reduce matrix sizes by cutting out columns and rows with only zeros. I guess I tried optimizing what I could :)
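Spelling out that offset (my reading of the comment quoted above, for illustration only):

# matrix entry Q[i, j] holds Q(tau, kappa) with tau = i + 1 and kappa = j + 2,
# so Q[0, 0] is the shortest admissible split Q(series[0:1], series[1:2])
def tau_kappa_of(i: int, j: int) -> tuple[int, int]:
    return i + 1, j + 2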
Let me know if you would prefer to pad matrices with zeros for the sake of indexing.
No this is ok. I think it's clearer now on second read.
def get_candidate_change_point(self, interval: slice) -> CandidateChangePoint:
    '''For a given `slice(start, stop)` finds potential critical point in subsequence series[slice],
    i.e., from index `start` to `stop - 1` inclusive. For simplicity, we'll use `end = stop - 1`.
To be super clear, maybe use "interval" and the (,] notation if you are talking math, and slice with [] and range with () in a Python context.
Will do
Q = self._get_Q_vals(start, end)
i, j = np.unravel_index(np.argmax(Q), Q.shape)
return CandidateChangePoint(index=i + 1 + start, qhat=Q[i][j])
Also this feels reassuringly familiar <3
Thank you for the flattering review :) Regarding your question:
It's kind of a loaded question, but the short answer is no. However, I think you'll be interested in the long answer. It's not exactly the same, because as we add a point to the end of the series it might cause a different point to become the best candidate. The minimal example that I came up with is

>>> series = np.array([0, 29, 60])
>>> calculator.get_next_candidate(slice(0, None))  # whole series
CandidateChangePoint(index=2, qhat=41.33333333333333)  # it's x_2 = 60

However,

>>> series = np.array([0, 29, 60, 27])
>>> calculator.get_next_candidate(slice(0, None))  # whole series
CandidateChangePoint(index=1, qhat=41.5)  # it's x_1 = 29

Now, the intuition here is that when we had only three points, the jump to 60 looked like the change; once we append 27, the points 29, 60, 27 group together as one level, and the jump right after x_0 = 0 becomes the stronger candidate.

Next, if I understand correctly why you are asking this question, it is because of the issues described in the Hunter paper. Namely, the claim:

[quoted claim and figure from the Hunter paper, about change points disappearing as the series grows]
I tried to generate data similar to the one in the picture and ran a few tests:

>>> def figure1_test(N):
...     base = 440 + np.random.randn(N) * 5
...     drop = 400 + np.random.randn(N) * 5
...     recover = 445 + np.random.randn(N) * 5
...     series = np.concatenate((base, drop, recover))
...     tester = PermutationsSignificanceTester(alpha=0.00001, permurations=100, calculator=PairDistanceCalculator, seed=1)
...     detector = ChangePointDetector(tester, PairDistanceCalculator)
...     points = detector.get_change_points(series)
...     return [p.index for p in points]
...
>>> figure1_test(10)
[10, 20]
>>> figure1_test(100)
[100, 200]
>>> figure1_test(1000)  # took quite a while because of the permutation tester
[1000, 2000]

As you can see, (1) the change points are correctly identified for this pattern, and (2) over a wide range of sequence lengths.
When asking the question, I had slightly misunderstood where this happens: this is about generating the set of Q-values, not the set of change points found (weak or regular...). So I think the correct use of kappa increases the set of Q-values, and therefore candidate change points, so that the change points that the Hunter paper describes as missing, or disappearing rather, could be found. But you're right that only the best one will be picked, and then of course in the next iteration things have changed, so I guess it is not guaranteed that varying kappa will generate all the same change points as would be found by computing the algorithm over all {series[:1], series[:2] ... series[:N]} and just keeping the union of all change points.

Even so, the effect of

0 < tau < kappa <= T, where kappa goes from 2 to T (your implementation)

seems to me very close to

0 < tau < kappa = t, where t goes from 2 to T (my question)

But as you point out, in the first case we may not actually pick all the change points that would be generated in the second case. Still, I feel like the potential is there, as the first case should generate the same "peaks" of Q-values; it's just not guaranteed, only more likely.
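For clarity, the procedure in my question as a sketch (a hypothetical loop around the detector, not real Otava API):

def change_points_over_prefixes(series, detector):
    # run the detector on every prefix of the series and keep the union
    all_points = set()
    for t in range(2, len(series) + 1):
        for p in detector.get_change_points(series[:t]):
            all_points.add(p.index)
    return sorted(all_points)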
My motivation for the question was to understand whether this fully explains the phenomenon of change points that first are found and then disappear. It seems to me it mostly does, but we cannot say for certain it "fully" does so in all scenarios.
I think to generate a data set that the Hunter paper was concerned with, you need the drop to be short, maybe even 1-2 points only:
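For example, a hypothetical variant of figure1_test above that keeps the surrounding levels long but the drop short:

>>> def short_drop_test(N, drop_len=2):
...     base = 440 + np.random.randn(N) * 5
...     drop = 400 + np.random.randn(drop_len) * 5   # the drop lasts only drop_len points
...     recover = 445 + np.random.randn(N) * 5
...     return np.concatenate((base, drop, recover))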
Correct me if I'm wrong, but my understanding was that there are two separate problems:
No, these are the same problem. The change points disappear when the interval/window they are in grows larger. I always assumed this was a feature: in a short timeseries, say 50-100 points, MongoDB e-divisive with typical parameters would ignore spikes that last a single point only, and might alert for a plateau of 2-3 points that then returns to the original level. (But even then it would only produce 1 change point, because the original MongoDB implementation needed a hard-coded 3 points before it would alert anything at all, so it is not possible to find 2 neighboring change points. This is from the Matteson paper, and their R reference implementation I believe defaulted to a leading 30 points or so. Which would be a long time to wait for a jira ticket if it was nightly builds!)

...where was I... So then if the series keeps growing, my interpretation is that the short-lived change becomes less significant compared to the entire series, so eventually it is ignored by the algorithm, just as if it was a single point. Conversely, also a single point could trigger an alert if it was large enough. (At least assuming that the series on both of its sides aren't perfectly constant.)

The fix of adding a window is based on the above understanding: it creates a situation where the local computation doesn't take into account more than a small number of local points. And this is why I asked earlier whether kappa is now equivalent to observing a series grow from 1 point and computing the algorithm for every added point.
I see, thank you for the clarification. I'll need to think about it.
…o match slice object notations.
I didn't fix the
Thank you @Sowiks, I appreciate the attention to commenting the algorithm as understandably as possible, and the attention to detail in keeping a certain consistency in both variable naming and indexing. You already have my approval from the previous review, but I wanted to re-affirm here that I don't think I have any further comments on this PR. We should however wait until Tuesday to hear from @Gerrrr how we proceed with the 0.7.0 release. I'm sensing we might create a separate branch for the backward compatible releases. If not, then we'll just have to wait a few weeks while we iterate through the ASF voting process to get the release out of the way.
The first question is whether this split-merge-recompute is even needed after the re-implementation. If the problem that it fixes goes away with your addition of kappa, then we should remove it.
Great work, @Sowiks! FYI I cut a separate branch for the next 0.7.0 release - https://github.com/apache/otava/tree/0.7, so feel free to merge this PR whenever you are ready.
I guess we'll have to merge it, but @Sowiks do you have the contributor agreement (ICLA) signed with the ASF? (I didn't find a place where I could check myself.)
I checked and an ICLA has not yet been filed.
@Sowiks: https://www.apache.org/licenses/contributor-agreements.html The short version is: download the ICLA PDF, sign it either with a pen or GnuPG, and email it to secretary@apache.org.
@dave2wave @michaelsembwever @henrikingo to my knowledge, a casual contributor does not have to sign an ICLA. For example, I recall signing an ICLA only right before becoming an Apache Cassandra committer. Isn't it so?
At least the guide I link to above uses words like "all contributors" and "every developer". We should maybe take this to the mailing list if you want to debate it more thoroughly.
Sent the form.
Yay!

This is a PR to replace the mongodb signal_processing_algorithms package with an internal implementation, as discussed in https://lists.apache.org/thread/4vwp79kmsjd3zbf4fjcgkggf33jot65c . I tried to do minimal changes to the existing code (analysis.py and series.py). Better integration is possible, but it can wait till the next PR.

There is, however, an issue that I identified during my implementation of the methodology from A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data [Matteson and James](https://arxiv.org/abs/1306.4933). Long story short, it doesn't seem that signal_processing_algorithms==1.3.5 has a correct implementation. Namely, let's look at section 2.2 Estimating the Location of a Change Point from the paper, more specifically formula (7) and the following discussion in the last paragraph of that section. I add them here for your convenience:

[screenshot of formula (7) and the surrounding discussion from the paper]

The main idea of that section is to allow $\kappa$ to vary, not to be simply set to the end of the series, $\kappa = T$. However, when I implemented the methodology as in the paper (allowing $\tau < \kappa \leq T$ to vary), tigerbeetle tests failed. With a little experimentation I found that the erroneous implementation (with fixed $\kappa = T$) resolves the issues with the tigerbeetle tests, which is unlikely to be a coincidence. I think this is because the signal_processing_algorithms package has a mistake/typo in it where $\kappa$ is fixed at $T$ (at least in version 1.3.5).

Moreover, this would also explain arguments from Hunter: Using Change Point Detection to Hunt for Performance Regressions [Fleming et al.] that caught my eye. In section 3.3 Fixed-Sized Windows the authors say:

[quote from section 3.3 of the Hunter paper]

The issues they discussed in that section seem to be related to the same idea: if you don't allow $\kappa$ to vary, the algorithm might miss some change points if they are within the interval. I wonder if they also used a fixed $\kappa$, which led to the described issues.
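The difference, as a sketch (here Q is a stand-in for the paper's divergence statistic, not the package's actual code):

# erroneous behaviour: kappa pinned to the end of the series, only tau varies
def argmax_fixed_kappa(Q, T):
    return max(((tau, T) for tau in range(1, T)), key=lambda tk: Q(*tk))

# per formula (7): both vary, with tau < kappa <= T
def argmax_varying_kappa(Q, T):
    return max(((tau, kappa) for kappa in range(2, T + 1) for tau in range(1, kappa)),
               key=lambda tk: Q(*tk))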
Nevertheless, going back to this PR. It contains three commits:

- The first commit matches the behavior of signal_processing_algorithms in all tests. If there is an error in my logic somewhere, this is the commit to replace the signal_processing_algorithms package.
- The second commit lets $\kappa$ vary, which changed the results of the tigerbeetle tests; this was corrected here.

Finally, I included some visualization of the tigerbeetle tests for commit-1 vs commit-2 for you to see if they make sense.