
Refactor analysis processing to reduce script memory footprint #11

Open
emontmas wants to merge 14 commits into main from refactor-analysis-process

Conversation


@emontmas emontmas commented Mar 4, 2026

Improve the sv-timestamp-analysis script to process large amounts of data.

This script's input data is an exhaustive list of timestamps for each IEC 61850 SV packet sent by a publisher and received by a subscriber machine during a SEAPATH latency test.
Because a single SV channel sends over 4,000 SV packets per second and the data is saved as text, the resulting data files can be quite large.
For example, the file size for 15 minutes of SV data with a single stream is roughly 106 MB.
With tests lasting multiple hours (e.g. 5 h, 8 h, ...), data files would weigh multiple gigabytes.
Moreover, the amount of data is doubled for an SV latency test, as SVs are recorded on both the publisher and subscriber sides.

Currently, when the sv-timestamp-analysis script processes these data files, it loads them entirely into RAM, then processes the data. This does not scale well with the large volumes of data produced by hours-long SV latency tests.
A script that loads multiple gigabytes of data into RAM isn't desirable.

This PR refactors in depth the way SV data is read and processed.

  • Instead of reading the full SV data (both publisher and subscriber) at once, process the input data in parts, or chunks. That way, the script only needs the current chunk of data to be loaded in RAM.
    Chunk boundaries are aligned with SV iteration boundaries, as SV drops can only be recovered within a given SV iteration.
  • Refactor the way latency and pacing results are stored.
    Instead of keeping every data point in RAM (e.g. storing 10,000 integers for 10,000 occurrences of a 150 µs latency), keep only the total count of occurrences of each latency value (e.g. recording that a 150 µs latency happened 10,000 times needs only 2 integers).
    The former approach is extremely inefficient for long latency tests. For example, with a 5 h test and a single SV stream, 5*60*60*4,000 = 72,000,000 SVs will be sent and the same number of latencies will be computed.
    Assuming microsecond values are stored as 64-bit integers (the default for numpy integer arrays), this amounts to 72,000,000*8 = 576,000,000 bytes ≈ 549 MiB, even though most of the latency values are equal.
    • Use pandas DataFrames to store and increment latency and pacing counts across data chunks with a minimal memory footprint.
  • Remove an unneeded SV data validity check.
  • Check SV iteration consistency in the SV data by reading only the end of the files with tail, instead of requiring the entire SV data.
  • Make various code and docstring improvements to ease maintenance of the script.
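The count-based storage described above can be sketched as follows (a minimal sketch; `accumulate_latency_counts` is a hypothetical helper name, not code from this PR):

```python
import numpy as np
import pandas as pd

def accumulate_latency_counts(totals, chunk_latencies_us):
    """Fold one chunk's latency values (in µs) into running per-value counts.

    Storing one integer count per distinct latency value replaces storing
    every individual sample, so memory is bounded by the number of distinct
    values rather than the number of SV packets.
    """
    chunk_counts = pd.Series(chunk_latencies_us).value_counts()
    return totals.add(chunk_counts, fill_value=0).astype("int64")

totals = pd.Series(dtype="int64")
totals = accumulate_latency_counts(totals, np.array([150, 150, 151]))
totals = accumulate_latency_counts(totals, np.array([150, 152]))
# totals now maps 150 -> 3, 151 -> 1, 152 -> 1
```

Memory then grows with the number of distinct latency values, typically a few thousand, not with the number of SV packets.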

Note that parts of this PR temporarily break the script while the processing logic is refactored; changes were added incrementally to avoid large commits. The PR as a whole preserves all functionality of the script.

emontmas added 6 commits March 4, 2026 10:11
Add a class dedicated to extracting SV data from a file.

This class aims to replace the current "extract_sv()" function, which
loads the entire SV data file into memory to parse it.
As SV data files can contain multiple gigabytes of data with long latency
tests or with multiple recorded SV channels, this approach isn't
scalable on most hardware and leads to Out Of Memory errors.

This new SvExtractor class offers an "extract_sv()" method that can
provide only part of the data. It parses a given number of SV iterations
from the data file, starting where the previous call to "extract_sv()"
stopped.
This will allow refactoring the data processing to sequentially load and
analyze only parts of the data instead of loading everything at once.
This should reduce the RAM footprint of the script.

Note that the concept of "SV iteration" comes from the
sv-timestamp-logger tool [1] which is used to generate the SV data
files.

[1]: https://github.com/seapath/sv_timestamp_logger

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
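The chunked-extraction idea could look roughly like this (a sketch only: the record layout, one "<iteration> <stream> <timestamp>" triple per line, and the class internals are assumptions; the real SvExtractor and the sv-timestamp-logger file format may differ):

```python
class SvExtractor:
    """Parse an SV timestamp file a fixed number of SV iterations at a time.

    Assumes one "<iteration> <stream> <timestamp>" record per line
    (hypothetical layout, for illustration only).
    """

    def __init__(self, path):
        self._file = open(path)
        self._pending = None  # first line of the next chunk, if already read

    def extract_sv(self, nb_iterations):
        """Return records for the next `nb_iterations` SV iterations."""
        records, seen = [], set()
        line = self._pending or self._file.readline()
        self._pending = None
        while line:
            iteration, stream, ts = line.split()
            if int(iteration) not in seen and len(seen) == nb_iterations:
                self._pending = line  # this line belongs to the next chunk
                break
            seen.add(int(iteration))
            records.append((int(iteration), stream, int(ts)))
            line = self._file.readline()
        return records
```

Each call resumes where the previous one stopped, so only one chunk of records ever lives in memory at a time.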
Replace the old "extract_sv()" function with the new SvExtractor class.
For the moment, SV data files are still fully loaded into memory.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
In "verify_sv_logs_consistency()", remove the check that both SV data
structs contain the same number of stream arrays.

"SvExtractor.extract_sv()" will always return a data array
for each stream given in the "streams" parameter, even if no SV was
parsed for a given stream. In that case, the data array is simply empty.
Therefore, the check in "verify_sv_logs_consistency()" is unnecessary.

If stream data is empty in, for example, the subscriber data compared to
the publisher data, it simply means that the full stream is considered
as SV drops. This doesn't prevent analyzing the rest of the data.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
To verify the consistency of SV data, use the utility program "tail"
to efficiently read the end of both files and compare only the final
iteration.

This removes the need to get the full parsed and loaded data, which
won't be possible in a future commit to improve the script efficiency.

Also, there is no need to check the last iteration for each SV stream;
checking the iteration of the last line is enough.
In sv-timestamp-logger [1], the SV iteration is incremented globally
whenever the smpCnt is detected to "go backward". There is no concept of
a "per-stream SV iteration". Therefore, if the SV iterations of two
files differ, this can be detected from the last line alone.

[1] https://github.com/seapath/sv_timestamp_logger/blob/514dfe82341f9c46ce65e9851df7673dba34eecd/sv_timestamp_logger.c#L189

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
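Reading only the file tail can also be done in pure Python with a seek to the end, equivalent in spirit to `tail -n 1` (a sketch; the ":"-separated line layout with the iteration as the first field is an assumption, as are the function names):

```python
def last_iteration(path, field_separator=":"):
    """Return the SV iteration of the last line without loading the file.

    Assumes a non-empty file whose last line fits in the tail window, and
    that the iteration is the first separator-delimited field (hypothetical
    layout, for illustration only).
    """
    with open(path, "rb") as f:
        f.seek(0, 2)                 # jump to the end of the file
        size = f.tell()
        window = min(size, 4096)     # read only a small tail window
        f.seek(-window, 2)
        last_line = f.read().splitlines()[-1]
    return int(last_line.decode().split(field_separator)[0])

def last_iterations_match(pub_path, sub_path):
    """Consistency check: do both files end on the same SV iteration?"""
    return last_iteration(pub_path) == last_iteration(sub_path)
```

The cost is one small read per file, independent of file size, instead of a full parse of multi-gigabyte data.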
There is no need to parse the SV data if it isn't consistent.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
At the end of data processing, store the latencies in pandas DataFrames
instead of raw numpy arrays. That way, instead of storing every SV
latency individually, store the total count of occurrences of each
latency value.

In future commits, the processing logic will be refactored to process SV
data by chunks instead of loading and processing all the data at once.
Therefore, to limit memory usage for very large data sets, global
latency storage must be refactored to store only the total count of
observed latencies for each value.

Pandas DataFrames are a good way to store such data, as they will also
ease merging the latency results of different data chunks in the future.

Also adapt save_latency_histogram() to work with latency DataFrames.
For better code organisation, make this function generate a plot for a
single SV stream; the calling context is then responsible for calling
it for every SV stream.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
@emontmas emontmas force-pushed the refactor-analysis-process branch from 5b5299d to 523ccb4 Compare March 4, 2026 14:23
@emontmas emontmas marked this pull request as ready for review March 4, 2026 14:40
emontmas added 8 commits March 4, 2026 17:37
Instead of processing the entire SV data at once, process it chunk by
chunk to limit the RAM footprint of the script.
Latencies are now computed chunk by chunk, and results are accumulated
in DataFrames to limit the size of the stored latency data.

For now, the chunk size is set to 100 SV iterations. An option to
configure it will be added in a future commit.

Also, to avoid making too many changes in one commit, only change the
processing of the SV data file and the latency processing.
This means that the report won't be generated correctly for now.
Pacing, as well as adapting the report generation, will be fixed in a
future commit.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Re-introduce pacing computation with the new chunk-by-chunk strategy.
It is handled like the latencies.

* Global pacing counts are stored per stream in pandas DataFrames.
  This limits the memory footprint of the result storage with very
  large data sets.
* An additional parameter "prepend" is added to compute_pacing().
  For a given chunk of SV data, this allows passing the last timestamps
  of the previous chunk to compute the pacing across chunk boundaries.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
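The prepend mechanism can be sketched like this (a hypothetical signature; the real compute_pacing() in this PR may differ):

```python
import numpy as np
import pandas as pd

def compute_pacing(timestamps_ns, prepend=None):
    """Return per-value counts of inter-packet gaps for one chunk.

    `prepend` carries the last timestamp of the previous chunk so the gap
    across the chunk boundary is not lost.
    """
    if prepend is not None:
        timestamps_ns = np.concatenate(([prepend], timestamps_ns))
    gaps = np.diff(timestamps_ns)          # one gap per consecutive pair
    return pd.Series(gaps).value_counts()  # store counts, not raw gaps
```

Per-chunk count Series can then be merged into the global totals with `Series.add(..., fill_value=0)`, keeping memory bounded by the number of distinct gap values.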
This operation is done for latency and pacing, and will also be used in
a future commit to process hypervisor SV data with publisher SV data.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Move the processing of the hypervisor SV data in with the subscriber
and publisher processing. That way, all the SV data is processed in
chunks, which greatly reduces the memory footprint of the script.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Adapt report generation to the latest changes made to the data
processing and to the data structures storing result values.
In particular, adapt the minimum, maximum and average computation, as
we no longer use arrays holding the full latency and pacing data points.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
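With only per-value counts available, the minimum, maximum and average can still be derived; the mean becomes a count-weighted average (a sketch of the idea; `stats_from_counts` is a hypothetical helper name, not code from this PR):

```python
import pandas as pd

def stats_from_counts(counts):
    """Derive min, max and mean from per-value counts instead of raw samples.

    `counts` is a Series mapping each observed value (index) to its number
    of occurrences, so the mean is a count-weighted average.
    """
    values = counts.index.to_numpy()
    weights = counts.to_numpy()
    minimum, maximum = values.min(), values.max()
    mean = (values * weights).sum() / weights.sum()
    return minimum, maximum, mean
```

This gives exact results for min, max and mean; only order statistics that need the full sample sequence (e.g. exact timing of outliers) are lost.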
* Add function docstrings.
* Use zip() with strict=True to iterate over multiple lists of the same
  length instead of relying on indexing. That way, zip() will raise a
  ValueError immediately if the lists are not of the same length.
* Fix typos in comments.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Add the --processing-window option to control the size, in SV
iterations, of the SV data chunk to process at once.
A lower value reduces the memory usage of the script.

Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
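With argparse, the new option could be declared roughly as follows (a sketch; the default of 100 iterations comes from the earlier commit message, the rest is an assumption):

```python
import argparse

parser = argparse.ArgumentParser(description="SV timestamp analysis")
parser.add_argument(
    "--processing-window",
    type=int,
    default=100,  # chunk size in SV iterations, per the earlier commit
    help="number of SV iterations to process at once; "
         "a lower value reduces memory usage",
)

# e.g. invoked as: sv-timestamp-analysis --processing-window 50
args = parser.parse_args(["--processing-window", "50"])
```

The option trades throughput for memory: smaller windows mean more parsing passes but a smaller peak RAM footprint.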
@emontmas emontmas force-pushed the refactor-analysis-process branch from 523ccb4 to 01009e3 Compare March 4, 2026 16:43