Refactor analysis processing to reduce script memory footprint#11
Open
Add a class dedicated to extracting SV data from a file. This class aims to replace the current "extract_sv()" function, which loads the entire SV data file into memory to parse it. As an SV data file can contain multiple gigabytes of data with long latency tests or with multiple recorded SV channels, this approach isn't scalable on most hardware and leads to Out Of Memory errors. The new SvExtractor class offers an "extract_sv()" method that can provide only part of the data: it parses a given number of SV iterations from the data file, starting where the previous call to "extract_sv()" stopped. This will allow refactoring the data processing to sequentially load and analyze only parts of the data instead of loading everything at once, which should reduce the RAM footprint of the script. Note that the concept of "SV iteration" comes from the sv-timestamp-logger tool [1], which is used to generate the SV data files. [1]: https://github.com/seapath/sv_timestamp_logger Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
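A minimal sketch of what such a stateful extractor could look like. The class and method names follow the commit message, but the line format ("iteration:stream:smpCnt:timestamp") and the parsing details are assumptions for illustration, not the script's actual implementation:

```python
# Hypothetical sketch of the SvExtractor described above: it keeps the
# file handle open between calls and parses at most `max_iterations`
# SV iterations per call, resuming where the previous call stopped.
# The "<iteration>:<stream>:<smpCnt>:<timestamp>" line format is an
# assumption for this example.
class SvExtractor:
    def __init__(self, path, streams):
        self._file = open(path, "r")
        self._streams = streams
        self._pending = None  # read-ahead line belonging to the next chunk

    def extract_sv(self, max_iterations):
        """Return {stream: [(iteration, smp_cnt, timestamp), ...]} for
        the next `max_iterations` SV iterations, empty lists at EOF."""
        data = {stream: [] for stream in self._streams}
        seen_iterations = set()
        line = self._pending or self._file.readline()
        self._pending = None
        while line:
            iteration, stream, smp_cnt, timestamp = line.strip().split(":")
            iteration = int(iteration)
            seen_iterations.add(iteration)
            if len(seen_iterations) > max_iterations:
                # This line starts the next chunk: keep it for the next call.
                self._pending = line
                break
            if stream in data:
                data[stream].append((iteration, int(smp_cnt), int(timestamp)))
            line = self._file.readline()
        return data
```

Each call returns a dict with one (possibly empty) list per requested stream, so callers never have to special-case missing streams.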
Replace the old "extract_sv()" function with the new SvExtractor class. For the moment, SV data files are still fully loaded in memory. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
In "verify_sv_logs_consistency()", remove the check that both SV data structs contain the same number of stream arrays. "SvExtractor.extract_sv()" will always return a data array for each stream given in the "streams" parameter, even if no SV was parsed for a given stream; in that case, the data array is simply empty. Therefore, the check in "verify_sv_logs_consistency()" is unnecessary. If the stream data is empty, for example in the subscriber data compared to the publisher, it simply means that the full stream data is considered as SV drops. It doesn't prevent analyzing the rest of the data. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
To verify the consistency of SV data, use the utility program "tail" to efficiently read the end of both files and only compare the last iteration. This removes the need for the fully parsed and loaded data, which won't be available in a future commit that improves the script's efficiency. Also, there is no need to check the last iteration of each SV stream: checking the iteration of the last line is enough. In sv-timestamp-logger [1], the SV iteration is globally incremented whenever the smpCnt is detected to "go backward"; there is no concept of a "per-stream SV iteration". Therefore, if the SV iterations of two files differ, this can be detected from the last line alone. [1] https://github.com/seapath/sv_timestamp_logger/blob/514dfe82341f9c46ce65e9851df7673dba34eecd/sv_timestamp_logger.c#L189 Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
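A sketch of such a tail-based consistency check. The function names are illustrative, and the assumption that the SV iteration is the first ":"-separated field of each line is made up for this example:

```python
import subprocess

def last_sv_iteration(path):
    # Read only the last line of the (possibly multi-gigabyte) file
    # with the `tail` utility, instead of parsing the whole file.
    last_line = subprocess.run(
        ["tail", "-n", "1", path], capture_output=True, text=True, check=True
    ).stdout
    # Assumed line format: the SV iteration is the first ":"-separated field.
    return int(last_line.split(":")[0])

def verify_sv_logs_consistency(publisher_file, subscriber_file):
    # The SV iteration is a global counter, so comparing the last line
    # of each file is enough to detect inconsistent captures.
    if last_sv_iteration(publisher_file) != last_sv_iteration(subscriber_file):
        raise ValueError("publisher and subscriber SV logs are inconsistent")
```

`tail` seeks from the end of the file, so this check stays cheap regardless of file size.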
There is no need to parse the SV data if it isn't consistent. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
At the end of data processing, store the latencies in pandas DataFrames instead of only using raw numpy arrays. That way, instead of storing every SV latency individually, store the total count of occurrences of each latency value. In future commits, the processing logic will be refactored to process SV data by chunks instead of loading and processing all the data at once. Therefore, to limit the memory usage for very large data sets, global latency storage must be refactored to only store the total count of observed latencies for each value. Pandas DataFrames are a good way to store such data, as they will also ease merging the latency results of different data chunks in the future. Also adapt save_latency_histogram() to work with latency DataFrames. For better code organisation, also make this function generate a plot for a single SV stream; the calling context is then responsible for calling it for every SV stream. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
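One way such per-value counts can be built and merged with pandas (the function names are illustrative, not the script's actual API):

```python
import numpy as np
import pandas as pd

def latency_counts(latencies_us):
    # Collapse a chunk of raw latencies into (value, count) pairs.
    values, counts = np.unique(latencies_us, return_counts=True)
    return pd.DataFrame({"count": counts},
                        index=pd.Index(values, name="latency_us"))

def merge_counts(total, chunk):
    # Accumulate a chunk's counts into the running totals; fill_value=0
    # handles latency values present in only one of the two frames.
    return total.add(chunk, fill_value=0).astype(int)

chunk1 = latency_counts(np.array([150, 150, 152]))
chunk2 = latency_counts(np.array([150, 151]))
total = merge_counts(chunk1, chunk2)
# total holds one row per latency value: 150 -> 3, 151 -> 1, 152 -> 1
```

The index alignment of `DataFrame.add` is what makes merging results from different chunks straightforward.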
Instead of processing the entire SV data at once, process it chunk by chunk to limit the RAM footprint of the script. Latencies are now computed chunk by chunk and results are accumulated in DataFrames to limit the size of the stored latency data. For now, the chunk size is set to 100 SV iterations; an option to configure it will be added in a future commit. Also, to avoid making too many changes in one commit, only the processing of the SV data file and the latency processing are changed. This means that the report won't generate correctly for now: pacing as well as report generation will be fixed in a future commit. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
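A generic skeleton of this chunked accumulation loop, self-contained for illustration (the helper names and the 100-iteration window match the commit text, everything else is an assumption):

```python
import numpy as np
import pandas as pd

PROCESSING_WINDOW = 100  # SV iterations processed at once, as in this commit

def process_in_chunks(read_chunk, compute_counts):
    """Accumulate per-value counts chunk by chunk instead of loading
    everything at once. `read_chunk(n)` returns the next chunk of data
    (empty at EOF); `compute_counts` turns a chunk into a
    (value -> count) DataFrame."""
    totals = None
    while True:
        chunk = read_chunk(PROCESSING_WINDOW)
        if len(chunk) == 0:
            break  # end of data
        counts = compute_counts(chunk)
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    return totals.astype(int) if totals is not None else None
```

Only the running totals and the current chunk live in memory at any time, which is the whole point of the refactor.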
Re-introduce the pacing computation with the new chunk-by-chunk strategy. It is handled like the latencies. * Global pacing counts are stored per stream in pandas DataFrames. This limits the memory footprint of the result storage with very large data sets. * An additional parameter "prepend" is added to compute_pacing(). For a given chunk of SV data, this allows passing the last timestamps of the previous chunk to compute the pacing between data chunks. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
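A sketch of what the "prepend" parameter could look like, assuming pacing is the delta between consecutive SV timestamps (the signature and this definition are assumptions, not the script's actual code):

```python
import numpy as np

def compute_pacing(timestamps, prepend=None):
    """Compute the pacing (delta between consecutive SV timestamps) of
    a chunk. `prepend` carries the last timestamp of the previous chunk
    so the delta across the chunk boundary is not lost."""
    if prepend is not None:
        # np.diff with prepend= also yields the chunk-boundary delta.
        return np.diff(timestamps, prepend=prepend)
    return np.diff(timestamps)

first = np.array([100, 350, 600])
second = np.array([850, 1100])
pacing = np.concatenate([
    compute_pacing(first),
    compute_pacing(second, prepend=first[-1]),
])
# pacing covers every consecutive pair, including the 600 -> 850 boundary
```

Without `prepend`, the delta between the last timestamp of one chunk and the first timestamp of the next would silently disappear.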
This operation is done for latency and pacing, and will also be used in a future commit to process hypervisor SV data with publisher SV data. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Move the processing of the hypervisor SV data alongside the subscriber and publisher processing. That way, all the SV data is processed chunk by chunk, which greatly reduces the memory footprint of the script. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Adapt the report generation to the latest changes in the data processing and in the data structures storing result values. In particular, adapt the minimum, maximum and average value computation, as we no longer use arrays with the full latency and pacing value points. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
* Add function docstrings. * Use zip(strict=True) to iterate over multiple lists of the same length instead of relying on indexing. That way, zip() will immediately raise a ValueError if the two lists are not of the same length. * Fix typos in comments. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Add the --processing-window option to control the size, in SV iterations, of the SV data computed at once. A lower value reduces the memory usage of the script. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
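A minimal sketch of such an option with argparse. The commit only names the flag; the default value (matching the 100-iteration chunk size introduced earlier in this PR) and the help text are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(description="SV timestamp analysis")
parser.add_argument(
    "--processing-window",
    type=int,
    default=100,  # assumed default, matching the chunk size in this PR
    help="number of SV iterations processed at once; "
         "lower values reduce memory usage",
)

# argparse converts the dashed flag to the attribute `processing_window`.
args = parser.parse_args(["--processing-window", "500"])
```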
Improve the sv-timestamp-analysis script to process large amounts of data.
This script's input data is an exhaustive list of timestamps for each IEC 61850 SV packet that was sent by a publisher and received by a subscriber machine during a SEAPATH latency test.
Because a single SV channel sends over 4,000 SV packets per second and the data is saved as text, the resulting data files can be quite large.
For example, the file size for 15 minutes of SV data with a single stream is roughly 106 MB.
With tests of multiple hours (e.g. 5 h, 8 h, ...), data files would weigh multiple gigabytes.
Moreover, the amount of data is doubled for an SV latency test, as SVs are recorded on both the publisher and subscriber sides.
Currently, when the sv-timestamp-analysis script processes these data files, it loads them entirely into RAM at once, then processes the data. This does not scale well with the large volumes of data produced by hours-long SV latency tests.
Having a script loading multiple gigabytes of data in RAM isn't desirable.
This PR refactors in depth the way SV data is read and processed.
Chunk boundaries are defined by SV iteration boundaries, as SV drops can only be detected within a given SV iteration.
Instead of keeping every data point in RAM (e.g. storing 10,000 integers for 150 µs latencies), keep only the total count of occurrences of each latency value (e.g. storing that a 150 µs latency occurred 10,000 times only needs 2 integers).
The former approach is indeed extremely inefficient for long latency tests. For example, with a 5-hour test on a single SV stream, 5 * 60 * 60 * 4,000 = 72,000,000 SVs will be sent and the same number of latencies will be computed. Assuming values in microseconds are stored as 64-bit integers (the default for numpy integer arrays), this would be equivalent to 72,000,000 * 8 = 576,000,000 bytes ~= 549 MiB, even though most of the latencies are equal. Storing only per-value counts requires far less memory, instead of requiring the entire SV data.
Note that part of this PR temporarily breaks the script while the processing logic is refactored, to allow adding changes incrementally and avoid large commits. The PR as a whole preserves all functionalities of the script.
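The arithmetic above can be checked quickly:

```python
# Memory needed to store every latency of a 5-hour, single-stream test
# as raw 64-bit (8-byte) integers, using the figures from the
# description above.
sv_count = 5 * 60 * 60 * 4000       # SV packets sent over 5 hours
raw_bytes = sv_count * 8            # one 64-bit integer per latency
raw_mib = raw_bytes / (1024 ** 2)   # bytes -> MiB

assert sv_count == 72_000_000
assert raw_bytes == 576_000_000
assert round(raw_mib) == 549
```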