Refine the doubling logic #418 #420
duyhuynhdev commented Dec 10, 2025
- Update the doubling logic according to the first approach, as the second approach still has corner cases where a single log event can dominate others if it has a large event count.
- Remove the timer callback and use the profiler flush callback instead. In addition, expose the new maxsample setting to support the flush callback.
- Update the relevant classes and unit tests to align with the new workflow and doubling logic.
- Add new unit tests to cover the updated behaviour.
- Fix CI complaints, including unsorted language files.
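As a rough illustration of the doubling described in the first bullet, here is a minimal sketch. It assumes a running sample count compared against a limit, with the leftover carried over, as in the diff under review. The class `doubling_counter` and its methods are hypothetical and are not the plugin's actual code.

```php
<?php
// Hypothetical sketch only: names are illustrative, not the plugin's classes.
final class doubling_counter {
    private int $logcount = 0;
    private int $samplelimit;
    private int $samplingperiod;

    public function __construct(int $samplelimit, int $samplingperiod) {
        $this->samplelimit = $samplelimit;
        $this->samplingperiod = $samplingperiod;
    }

    // Record $count newly flushed samples and double the period at the limit.
    public function add(int $count): void {
        $this->logcount += $count;
        if ($this->logcount >= $this->samplelimit) {
            $this->samplingperiod *= 2;
            // Carry the overshoot forward so it counts towards the next doubling.
            $this->logcount -= $this->samplelimit;
        }
    }

    public function period(): int {
        return $this->samplingperiod;
    }

    public function pending(): int {
        return $this->logcount;
    }
}
```

Like the diff, this uses a single `if`, so one very large flush doubles the period only once per call.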
67e5801 to 70663e9
Thanks @duyhuynhdev
This is a big change so we're going to need to test this thoroughly. I've started with comments from a code review, but I haven't tested this personally yet.
My main concerns so far are how this affects task detection and whether we lose any sample data when we change the structure while combining components.
| // We want to prevent doubling up of processing, so skip if an existing process is still executing. | ||
| // The profile logs will be kept and processed the next time. | ||
| self::$logs[] = $log; | ||
| $this->logcount += $log->count(); | ||
| // Doubling sampling period if it reaches the limit. | ||
| if ($this->logcount >= $this->samplelimit) { | ||
| $this->on_reach_limit($manager); | ||
| $this->logcount = $this->logcount - $this->samplelimit; | ||
| } | ||
| if (self::$alreadyprofiling) { | ||
| debugging('tool_excimer: starting cron_processor::on_interval when previous call has not yet finished'); | ||
| if ($isfinal) { | ||
| // The final flush call when profiler is destroyed. | ||
| if ($log->count() < $this->samplelimit) { | ||
| // This should never happen. | ||
| debugging('tool_excimer: alreadyprofiling is true during final on_interval.'); | ||
| } | ||
| return; | ||
| } | ||
| self::$alreadyprofiling = true; |
The logic to prevent doubling up of processing may not be needed anymore now that we've switched from timers to flush callbacks (I'm not sure whether profiling is paused during the callbacks). I think it's worth looking into further, because it would be great if we could remove self::$logs entirely.
At this stage, the exact underlying scenario isn’t clear because the documentation doesn’t cover this case. However, this likely happens when maxSamples is set too low, causing the process to continue running while callbacks are triggered repeatedly. In this scenario, self::$logs can be used to store newly arriving samples.
Previously there were edge cases of the 10s interval overlapping, see #377
With your changes these should be much rarer as in the worst case the number of possible samples in on_interval should be very low. I'm not even sure if the profiler would be taking samples here.
I'm interested in what happens when max samples is set too low. The previous issue this fixed shouldn't be a concern if it only happens with bad config, but is there a chance of endless loops if this is set to something ridiculous like 1?
Yes, it should be possible, because we don’t have any constraint on the maxSamples value. That’s why I introduced the $logs array as a waiting list. New logs from the callback are collected in this array until the current process completes. I’m not sure how often overlaps will occur with the new logic, but $logs won’t impact the existing flow. On the contrary, it helps mitigate overlapping if it does occur.
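The waiting-list idea can be sketched as follows. This is a hedged illustration, not the PR's implementation: the class `queued_processor`, its properties, and the `$during` hook (used here only to simulate a callback firing mid-processing) are all hypothetical. It also drains the queue inside the running call, whereas the diff keeps queued logs for the next call.

```php
<?php
// Hypothetical sketch only: a re-entrancy guard with a waiting list.
final class queued_processor {
    public array $processed = [];
    private array $pending = [];
    private bool $busy = false;

    // $during simulates work that may trigger another flush callback.
    public function process(string $log, ?callable $during = null): void {
        // Always queue first, so nothing is lost if we are already busy.
        $this->pending[] = $log;
        if ($this->busy) {
            return; // The call already running will pick this log up.
        }
        $this->busy = true;
        try {
            while ($this->pending) {
                $next = array_shift($this->pending);
                if ($during !== null) {
                    $during($next);
                }
                $this->processed[] = $next;
            }
        } finally {
            $this->busy = false;
        }
    }
}
```

The drain loop only ends once no new logs arrive during processing, so a sketch like this does not by itself rule out the long-running case raised above for a very low maxSamples.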
| public function process($log, manager $manager, bool $isfinal = false) { | ||
| // We want to prevent overlapping of processing, so skip if an existing process is still executing. | ||
| // The profile logs will be kept and processed the next time. | ||
| self::$logs[] = $log; |
See comments about overlapping processing in cron.
| // Doubling sampling period if it reaches the limit. | ||
| $this->logcount += $log->count(); | ||
| if ($this->partialsave && $this->logcount >= $this->samplelimit) { |
Do we want this applied to non-partial saves too? I believe it was previously applied there as well.
With non-partial saves we only collect the samples once, which means we don't set the callback for them (as before), so the sampling period doubling cannot be applied in this case. However, the merge is still applied when we process the samples.
Thanks, that clarifies things and looks correct.
In that case, what happens if partial save is enabled and this is called by the flush callback when the ExcimerProfiler object is destroyed? Restarting it there seems bad.
$this->logcount is used to manage increases to the samplingperiod. Even in the final flush callback, the only affected element is the samplingperiod, so it should not impact any profile, because the samplerate is no longer derived from the samplingperiod.
I'm less concerned about the profile and more about the harm of restarting the profiler while it's being destroyed. It could be handled gracefully behind the scenes, but we should be able to detect this ourselves and explicitly handle the logic for not restarting it in this case.
Thanks, I separated the on_flush and profile process functions.
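One way to make the "don't restart during the final flush" rule explicit is sketched below. It assumes the caller can tell the handler that a flush is the final one, as the `$isfinal` parameter in the diff suggests. The class `flush_handler` and its counters are hypothetical stand-ins, not the plugin's code.

```php
<?php
// Hypothetical sketch only: names and counters are illustrative.
final class flush_handler {
    public int $processed = 0; // Samples handed on to profile processing.
    public int $restarts = 0;  // Times the profiler would have been restarted.
    private int $logcount = 0;
    private int $samplelimit;

    public function __construct(int $samplelimit) {
        $this->samplelimit = $samplelimit;
    }

    public function on_flush(int $count, bool $isfinal): void {
        // Samples are always processed, including those from the final flush.
        $this->processed += $count;
        if ($isfinal) {
            // The profiler is being destroyed: skip the doubling and restart.
            return;
        }
        $this->logcount += $count;
        if ($this->logcount >= $this->samplelimit) {
            $this->logcount -= $this->samplelimit;
            $this->restarts++;
        }
    }
}
```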
bwalkerl
left a comment
I have a couple more comments from testing. The transformations of the graphs after merging also don't look right to me (tested with partial saves to see before and after), so I will need to test that more.
| $manager->get_timer()->setCallback(function () use ($manager) { | ||
| $manager->get_profiler()->setFlushCallback(function ($log) use ($manager) { | ||
| // Once overlapping has happened once, we prevent all future partial saving. | ||
| if (!$this->hasoverlapped) { | ||
| $this->process($manager, false); | ||
| $this->process($log, $manager); | ||
| } | ||
| }); | ||
| }, $this->maxsamples); |
With this change partial save never marks the profile as finished.
I'm also thinking that with this patch it might be OK to re-enable partial saves by default (and lighten the old warnings), as the number of samples won't rise quickly when the DB is having issues, which was the main reason we disabled it.
Yeah, it is a bug. Let me fix it.
After the update, partial save is no longer saving partial profiles every time there is a flush callback.
I will update it in the next commit.
Now it's updating (or rather, will once you fix the typo ')'), but it will save the last profile twice. We want to avoid this redundancy - we shouldn't be adding extra DB updates when they're not needed.
Hey Ben, can you explain more about the typo and the issue with saving the last profile twice?
| $string['field_month'] = 'Month'; | ||
| $string['field_name'] = 'Name'; | ||
| $string['field_numsamples'] = 'Number of samples'; | ||
| $string['field_numsamples_value'] = '{$a->samples} samples ({$a->events} events) @ ~{$a->samplerate}ms'; |
We need to make this consistent with the hover display on the graph, which currently only uses 'samples' for events. We probably need to iron out the terminology, since samples will have more meaning than events to most end users.
I also think it would be better to keep the old display without events when the number of events is 0 (old profiles etc).
The number of events should not be 0. If there is no numevents, it will use numsamples instead:
'events' => number_format($data['numevents'] ?? $data['numsamples'], 0, $decsep, $thousandssep),
Old profiles prior to the upgrade are showing as 0 in the database for me, but this isn't a big concern as old profiles will be flushed out eventually. The main issue here is making the terminology consistent with the graph.
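The two observations above are compatible: PHP's `??` operator only falls back when the left side is unset or null, so a stored 0 is passed through unchanged. The sketch below contrasts the two behaviours. The field names `numevents` and `numsamples` come from the thread, while both helper functions are hypothetical.

```php
<?php
// Hypothetical helpers; only the array keys come from the discussion above.

// Mirrors the quoted expression: falls back only when numevents is unset or null.
function events_with_null_coalescing(array $data): int {
    return (int) ($data['numevents'] ?? $data['numsamples']);
}

// Also treats a stored 0 (e.g. profiles saved before the upgrade) as "no event data".
function events_with_zero_fallback(array $data): int {
    $numevents = (int) ($data['numevents'] ?? 0);
    return $numevents > 0 ? $numevents : (int) $data['numsamples'];
}
```

For an old profile such as `['numevents' => 0, 'numsamples' => 120]`, the first helper yields 0 and the second yields 120.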
70663e9 to 6137ab7
6137ab7 to 52cfb71
bwalkerl
left a comment
Thanks for the updates @duyhuynhdev
I've left a couple more comments, plus responses to some of your other comments that have actions.
classes/sample_set.php
Outdated
| $trace = $this->samples[$i + 1]['trace']; | ||
| } | ||
| $newsamples[] = [ | ||
| 'eventcount' => ceil(($this->samples[$i]['eventcount'] + $this->samples[$i + 1]['eventcount']) / 2), |
Is ceiling preferred here? Always rounding up will increase the margin of error. The number of samples won't be perfect unless we allow decimals, but maybe we can introduce some logic here to keep it more balanced.
I also still think we need more comments about what we're doing here.
Sorry. I've updated it to the round() function. I intended to use the round function but somehow used the ceil() function instead.
Round makes more sense, but it's going to run into the same problem, as any fractional part here will always be .5.
Is rounding to the nearest 0.5 a problem? For example, rounding 3.5 to 4.0 or 3.4 to 3.0 is acceptable to me, because the small difference should not materially affect the analysis.
By itself it's not an issue, but since this will always be rounding up, the number of samples is going to be slightly overestimated, which can have flow-on effects on duration estimates.
I'm not sure how much of a problem this is in practice - if there's only one or two with more samples it's not an issue, but if say a tenth of the samples have this then it could have a larger impact.
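One possible way to keep the halved counts balanced is to carry the dropped half forward instead of always rounding up. The sketch below is only an illustration of that idea, not what either PR implements. The function `merge_eventcounts` is hypothetical, works on a plain list of non-negative integer counts, and ignores an unpaired trailing element.

```php
<?php
// Hypothetical sketch only: average adjacent pairs with a carried remainder.
function merge_eventcounts(array $counts): array {
    $merged = [];
    $carry = 0; // 1 when the previous odd sum was rounded down, else 0.
    for ($i = 0; $i + 1 < count($counts); $i += 2) {
        $sum = $counts[$i] + $counts[$i + 1] + $carry;
        // Round down now and remember the leftover for the next pair.
        $merged[] = intdiv($sum, 2);
        $carry = $sum % 2;
    }
    return $merged;
}
```

With `[3, 4, 3, 4]` this gives `[3, 4]`, a total of 7, which is exactly half of the original 14. Applying `round()` to each pair would give 4 and 4, a total of 8.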
d246be2 to ecee240
Closing this in favour of #423 - This PR does still have some different code that can be reused in the future.