Refactor analysis processing to reduce script memory footprint#11
Open
Add a class dedicated to extracting SV data from a file. This class aims to replace the current "extract_sv()" function, which loads the entire SV data file into memory to parse it. As an SV data file can contain multiple gigabytes of data with long latency tests or with multiple recorded SV channels, this approach isn't scalable on most hardware and leads to Out Of Memory errors. The new SvExtractor class offers an "extract_sv()" method that can provide only part of the data: it parses a given number of SV iterations from the data file, starting where the previous call to "extract_sv()" stopped. This will allow refactoring the data processing to sequentially load and analyze only parts of the data instead of loading everything at once, which should reduce the RAM footprint of the script. Note that the concept of "SV iteration" comes from the sv-timestamp-logger tool [1], which is used to generate the SV data files. [1]: https://github.com/seapath/sv_timestamp_logger Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
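A minimal sketch of what such a stateful extractor could look like. The class and method names follow the commit message, but the line format ("iteration:stream:smpCnt:timestamp") and the parsing details are assumptions for illustration, not the script's actual implementation:

```python
# Hypothetical sketch of the SvExtractor described above: it keeps the
# file handle open between calls and parses at most `max_iterations`
# SV iterations per call, resuming where the previous call stopped.
# The "<iteration>:<stream>:<smpCnt>:<timestamp>" line format is an
# assumption for this example.
class SvExtractor:
    def __init__(self, path, streams):
        self._file = open(path, "r")
        self._streams = streams
        self._pending = None  # read-ahead line belonging to the next chunk

    def extract_sv(self, max_iterations):
        """Return {stream: [(iteration, smp_cnt, timestamp), ...]} for
        the next `max_iterations` SV iterations, empty lists at EOF."""
        data = {stream: [] for stream in self._streams}
        seen_iterations = set()
        line = self._pending or self._file.readline()
        self._pending = None
        while line:
            iteration, stream, smp_cnt, timestamp = line.strip().split(":")
            iteration = int(iteration)
            seen_iterations.add(iteration)
            if len(seen_iterations) > max_iterations:
                # This line starts the next chunk: keep it for the next call.
                self._pending = line
                break
            if stream in data:
                data[stream].append((iteration, int(smp_cnt), int(timestamp)))
            line = self._file.readline()
        return data
```

Each call returns a dict with one (possibly empty) list per requested stream, so callers never have to special-case missing streams.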
Replace the old "extract_sv()" function with the new SvExtractor class. For the moment, SV data files are still fully loaded in memory. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
In "verify_sv_logs_consistency()", remove the check that both SV data structs contain the same number of stream arrays. "SvExtractor.extract_sv()" will always return a data array for each stream given in the "streams" parameter, even if no SV was parsed for a given stream; in that case, the data array is simply empty. Therefore, the check in "verify_sv_logs_consistency()" is unnecessary. If the stream data is empty, for example in the subscriber data compared to the publisher, it simply means that the full stream data is considered as SV drops. It doesn't prevent analyzing the rest of the data. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
To verify the consistency of SV data, use the utility program "tail" to efficiently read the end of both files and only compare the last iteration. This removes the need for the fully parsed and loaded data, which won't be available in a future commit that improves the script's efficiency. Also, there is no need to check the last iteration of each SV stream: checking the iteration of the last line is enough. In sv-timestamp-logger [1], the SV iteration is globally incremented whenever the smpCnt is detected to "go backward"; there is no concept of a "per-stream SV iteration". Therefore, if the SV iterations of two files differ, this can be detected from the last line alone. [1] https://github.com/seapath/sv_timestamp_logger/blob/514dfe82341f9c46ce65e9851df7673dba34eecd/sv_timestamp_logger.c#L189 Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
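A sketch of such a tail-based consistency check. The function names are illustrative, and the assumption that the SV iteration is the first ":"-separated field of each line is made up for this example:

```python
import subprocess

def last_sv_iteration(path):
    # Read only the last line of the (possibly multi-gigabyte) file
    # with the `tail` utility, instead of parsing the whole file.
    last_line = subprocess.run(
        ["tail", "-n", "1", path], capture_output=True, text=True, check=True
    ).stdout
    # Assumed line format: the SV iteration is the first ":"-separated field.
    return int(last_line.split(":")[0])

def verify_sv_logs_consistency(publisher_file, subscriber_file):
    # The SV iteration is a global counter, so comparing the last line
    # of each file is enough to detect inconsistent captures.
    if last_sv_iteration(publisher_file) != last_sv_iteration(subscriber_file):
        raise ValueError("publisher and subscriber SV logs are inconsistent")
```

`tail` seeks from the end of the file, so this check stays cheap regardless of file size.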
There is no need to parse the SV data if it isn't consistent. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
At the end of data processing, store the latencies in pandas DataFrames instead of only using raw numpy arrays. That way, instead of storing every SV latency individually, store the total count of occurrences of each latency value. In future commits, the processing logic will be refactored to process SV data by chunks instead of loading and processing all the data at once. Therefore, to limit the memory usage for very large data sets, global latency storage must be refactored to only store the total count of observed latencies for each value. Pandas DataFrames are a good way to store such data, as they will also ease merging the latency results of different data chunks in the future. Also adapt save_latency_histogram() to work with latency DataFrames. For better code organisation, also make this function generate a plot for a single SV stream; the calling context is then responsible for calling it for every SV stream. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
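One way such per-value counts can be built and merged with pandas (the function names are illustrative, not the script's actual API):

```python
import numpy as np
import pandas as pd

def latency_counts(latencies_us):
    # Collapse a chunk of raw latencies into (value, count) pairs.
    values, counts = np.unique(latencies_us, return_counts=True)
    return pd.DataFrame({"count": counts},
                        index=pd.Index(values, name="latency_us"))

def merge_counts(total, chunk):
    # Accumulate a chunk's counts into the running totals; fill_value=0
    # handles latency values present in only one of the two frames.
    return total.add(chunk, fill_value=0).astype(int)

chunk1 = latency_counts(np.array([150, 150, 152]))
chunk2 = latency_counts(np.array([150, 151]))
total = merge_counts(chunk1, chunk2)
# total holds one row per latency value: 150 -> 3, 151 -> 1, 152 -> 1
```

The index alignment of `DataFrame.add` is what makes merging results from different chunks straightforward.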
Instead of processing the entire SV data at once, process it chunk by chunk to limit the RAM footprint of the script. Latencies are now computed chunk by chunk and results are accumulated in DataFrames to limit the size of the stored latency data. For now, the chunk size is set to 100 SV iterations; an option to configure it will be added in a future commit. Also, to avoid making too many changes in one commit, only the processing of the SV data file and the latency processing are changed. This means that the report won't generate correctly for now: pacing as well as report generation will be fixed in a future commit. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
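A generic skeleton of this chunked accumulation loop, self-contained for illustration (the helper names and the 100-iteration window match the commit text, everything else is an assumption):

```python
import numpy as np
import pandas as pd

PROCESSING_WINDOW = 100  # SV iterations processed at once, as in this commit

def process_in_chunks(read_chunk, compute_counts):
    """Accumulate per-value counts chunk by chunk instead of loading
    everything at once. `read_chunk(n)` returns the next chunk of data
    (empty at EOF); `compute_counts` turns a chunk into a
    (value -> count) DataFrame."""
    totals = None
    while True:
        chunk = read_chunk(PROCESSING_WINDOW)
        if len(chunk) == 0:
            break  # end of data
        counts = compute_counts(chunk)
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    return totals.astype(int) if totals is not None else None
```

Only the running totals and the current chunk live in memory at any time, which is the whole point of the refactor.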
Re-introduce the pacing computation with the new chunk-by-chunk strategy. It is handled like the latencies. * Global pacing counts are stored per stream in pandas DataFrames. This limits the memory footprint of the result storage with very large data sets. * An additional parameter "prepend" is added to compute_pacing(). For a given chunk of SV data, this allows passing the last timestamps of the previous chunk to compute the pacing between data chunks. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
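A sketch of what the "prepend" parameter could look like, assuming pacing is the delta between consecutive SV timestamps (the signature and this definition are assumptions, not the script's actual code):

```python
import numpy as np

def compute_pacing(timestamps, prepend=None):
    """Compute the pacing (delta between consecutive SV timestamps) of
    a chunk. `prepend` carries the last timestamp of the previous chunk
    so the delta across the chunk boundary is not lost."""
    if prepend is not None:
        # np.diff with prepend= also yields the chunk-boundary delta.
        return np.diff(timestamps, prepend=prepend)
    return np.diff(timestamps)

first = np.array([100, 350, 600])
second = np.array([850, 1100])
pacing = np.concatenate([
    compute_pacing(first),
    compute_pacing(second, prepend=first[-1]),
])
# pacing covers every consecutive pair, including the 600 -> 850 boundary
```

Without `prepend`, the delta between the last timestamp of one chunk and the first timestamp of the next would silently disappear.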
This operation is done for latency and pacing, and will also be used in a future commit to process hypervisor SV data with publisher SV data. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Move the processing of the hypervisor SV data alongside the subscriber and publisher processing. That way, all the SV data is processed chunk by chunk, which greatly reduces the memory footprint of the script. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Adapt the report generation to the latest changes in the data processing and in the data structures storing result values. In particular, adapt the minimum, maximum and average value computation, as we no longer use arrays with the full latency and pacing value points. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
* Add function docstrings. * Use zip(strict=True) to iterate over multiple lists of the same length instead of relying on indexing. That way, zip() will immediately raise a ValueError if the two lists are not of the same length. * Fix typos in comments. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
Add the --processing-window option to control the size, in SV iterations, of the SV data computed at once. A lower value reduces the memory usage of the script. Signed-off-by: Elinor Montmasson <elinor.montmasson@savoirfairelinux.com>
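A minimal sketch of such an option with argparse. The commit only names the flag; the default value (matching the 100-iteration chunk size introduced earlier in this PR) and the help text are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(description="SV timestamp analysis")
parser.add_argument(
    "--processing-window",
    type=int,
    default=100,  # assumed default, matching the chunk size in this PR
    help="number of SV iterations processed at once; "
         "lower values reduce memory usage",
)

# argparse converts the dashed flag to the attribute `processing_window`.
args = parser.parse_args(["--processing-window", "500"])
```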
Improve the sv-timestamp-analysis script to process large amounts of data.
This script's input data is an exhaustive list of timestamps for each IEC 61850 SV packet that was sent by a publisher and received by a subscriber machine during a SEAPATH latency test.
Because a single SV channel sends over 4,000 SV packets per second and the data is saved as text, the resulting data files can be quite large.
For example, the file size for 15 minutes of SV data with a single stream is roughly 106 MB.
With tests of multiple hours (e.g. 5 h, 8 h, ...), data files would weigh multiple gigabytes.
Moreover, the amount of data is doubled for an SV latency test, as SVs are recorded on both the publisher and subscriber sides.
Currently, when the sv-timestamp-analysis script processes these data files, it loads them entirely into RAM at once, then processes the data. This does not scale well with the large volumes of data produced by hours-long SV latency tests.
Having a script loading multiple gigabytes of data in RAM isn't desirable.
This PR refactors in depth the way SV data is read and processed.
Chunk boundaries are defined by SV iteration boundaries, as SV drops can only be detected within a given SV iteration.
Instead of keeping every data point in RAM (e.g. storing 10,000 integers for 150 µs latencies), keep only the total count of occurrences of each latency value (e.g. storing that a 150 µs latency occurred 10,000 times only needs 2 integers).
The former approach is indeed extremely inefficient for long latency tests. For example, with a 5-hour test on a single SV stream, 5 * 60 * 60 * 4,000 = 72,000,000 SVs will be sent and the same number of latencies will be computed. Assuming values in microseconds are stored as 64-bit integers (the default for numpy integer arrays), this would be equivalent to 72,000,000 * 8 = 576,000,000 bytes ~= 549 MiB, even though most of the latencies are equal. Storing only per-value counts requires far less memory, instead of requiring the entire SV data.
Note that part of this PR temporarily breaks the script while the processing logic is refactored, to allow adding changes incrementally and avoid large commits. The PR as a whole preserves all functionalities of the script.
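The arithmetic above can be checked quickly:

```python
# Memory needed to store every latency of a 5-hour, single-stream test
# as raw 64-bit (8-byte) integers, using the figures from the
# description above.
sv_count = 5 * 60 * 60 * 4000       # SV packets sent over 5 hours
raw_bytes = sv_count * 8            # one 64-bit integer per latency
raw_mib = raw_bytes / (1024 ** 2)   # bytes -> MiB

assert sv_count == 72_000_000
assert raw_bytes == 576_000_000
assert round(raw_mib) == 549
```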