
[Performance] Virtual memory addresses reusage in a filter chain frame buffers #445

@DTL2020

Description


As was answered at
https://stackoverflow.com/questions/79633556/
https://stackoverflow.com/questions/34038241/
and hinted at https://stackoverflow.com/questions/31023260 :

  1. There is currently no valid (easy) way to invalidate cache lines from user mode (and doing it incorrectly can cause a significant system performance drop or even instability).

  2. The only legal way to reduce the large amount of useless store traffic on the slow host RAM bus is to reuse the same virtual addresses (mapped to the same cache lines), and thereby reduce the time the CPU stalls on store operations waiting for free cache lines while already-read, no-longer-needed data is written back to host RAM. The CPU cannot overwrite dirty cache lines, and the data production rate of a SIMD core is much higher than the write-back speed from caches to RAM. So once the free cache lines are exhausted, store operations stall, and no optimization of the processing function can help performance.

  3. It is a rather sad fact that in the 202x years, with 64-bit addressing and a 'flat' memory model, high-performance computing effectively has only a very small usable (total) virtual address space, about the size of the CPU cache (except on Xeons with integrated HBM onboard). If the actively used virtual address space is significantly larger, the program takes a significant performance hit from the slow RAM bus. In some use cases that traffic is completely useless, as in the example of chained AVS filters where each filter can store into a separate virtual address buffer (with the current AVS frame cache design?).

Some hints from the stackoverflow answers that make a programming solution somewhat easier (compared to writing a custom virtual memory manager in the AVS core):
 On free(), TCMalloc simply puts the freed block of memory into its cache. That block will be returned by malloc() the next time the application asks for a block of memory. That is the essence of a caching allocator like TCMalloc.

But unfortunately it looks like the TCMalloc library is Linux-only, with no Windows support. So it is a subject for research whether the same behaviour exists in other malloc/free C API implementations available for Windows compilers (MSVC?, clang?, Intel C?).

Possible solution to test with AVS chained filters:
Source_filter(output_buff1)
Process1_filter(output_buff2)
Process2_filter(output_buff3)
...
ProcessN_filter(output_buffN)

If all buffers are pre-allocated and always live in a common virtual address space, processing sequentially rewrites data from buff1 via buff2 to buffN, and the total virtual address space used is buff1+buff2+..+buffN. This may be much larger than the largest CPU cache, so each frame's trip from buff1 to buffN causes the data at every filter interconnection to be written back from cache to host RAM. But a single pair of buffers, buff1+buff2, which is enough for processing two filters at a time, may fit in the CPU caches.
So if we dynamically allocate only the pair of buffers required by the currently active (data-processing) filter (or, in the best case, even a single buffer for in-place transform filters), we may get a performance benefit. But the malloc/free library must be shared by all filters in the chain and must, as far as possible, hand out the same virtual address range for each new allocation (as TCMalloc does). This should keep filter design simpler than explicitly sharing the same lowest pair of buffers in virtual address space between all filters in a chain.

The next step in data-locality optimization is making (at least the core) filters able to process only a limited-size part of a frame and send it through the whole chained sequence of filters (where all/some sequential filters support such frame partitioning). For filters without spatial processing this is typically possible (like Levels/ColorYUV/Tweak and some others). Data would then travel between filters via a small buffer (about 1/2 of the CPU cache or lower in size), freed and allocated again for the next pair of filters to run.

Short description of the issue:

  1. Small enough temporary writes from the CPU core can be cached (about 1/2 of the total CPU cache size or less) at full cache write speed without causing long CPU stalls.
  2. Data written to the cache cannot be discarded (to prevent write-back to host RAM) by a user-mode software instruction, even after it has been fully read and is no longer required for further processing.
  3. Too many writes to different virtual addresses (several temporary buffers in a sequence) exhaust the free cache lines; once the write buffers are full, the CPU finally stalls waiting for the cached data to be written back to host RAM.
  4. The only way to avoid useless writes of short-lived temporary data to host RAM (filter internals and intermediate data between filters in a chain) is constant reuse of the same virtual addresses for store buffers. This is a task for both the AVS+ core memory-management design and filter design.
