Skip to content

Prototype for Mimick-based memory recording#14

Draft
cottsay wants to merge 11 commits intorollingfrom
cottsay/mimick
Draft

Prototype for Mimick-based memory recording#14
cottsay wants to merge 11 commits intorollingfrom
cottsay/mimick

Conversation

@cottsay
Copy link
Member

@cottsay cottsay commented Nov 21, 2020

There are several parts in this change, which when combined, yield a better mechanism for memory measurements than we're currently using in this package.

Here's a breakdown of the components:

  1. MimickMemoryManager - This class implements the benchmark::MemoryManager interface by using Mimick to stub out malloc, realloc, and free. Since Mimick is making changes that affect the whole executable, it should be noted that you should NEVER try to use more than one instance of this class at the same time. Internally, Mimick is using un-stubbed variants of these allocation functions it refers to as "vital". The MimickMemoryManager uses these same functions to do the actual allocation operations before recording the relevant statistics.
    It should be noted that we are not currently using the benchmark::MemoryManager model for memory tracking in this package. We're collecting statistics in the test fixture and adding the results as "user counters". One of the biggest advantages of taking the approach presented here is that memory instrumentation isn't active during timing runs, whereas our current approach is impacting allocation performance during timing runs quite substantially.
  2. MimickPerformanceTest - Unlike the currently used PerformanceTest fixture, this fixture isn't responsible for performing any memory recording. Instead, it's used to ensure that the MimickMemoryManager gets registered and unregistered for a particular run. From what I can tell, the Benchmark API makes it look like we're supposed to register the benchmark::MemoryManager before any of the tests run, meaning that all tests, whether they use fixtures or not, would have memory measurements enabled. This fixture essentially bends the API to act the way we're currently expecting memory measurements to be enabled.
    Honestly, because this new implementation should work on all of our supported platforms, I'm tempted to instead introduce a custom version of libbenchmark_main.so that registers MimickMemoryManager for all of the tests, so everything always gets memory measurements and we don't have to use a fixture to turn it on. Unfortunately that's not possible at the moment because the benchmark main library needs to be SHARED, and libmimick.a was specifically compiled STATIC without PIC. We'd need a change to Mimick for sure, but it might be doable.
  3. MemoryAwareConsoleReporter - The benchmark::ConsoleReporter implementation doesn't display any of the memory statistics. This isn't an enormous problem because the JSON reporter DOES include the memory statistics, so they'll get displayed in Jenkins anyway, but I think that seeing the statistics on the console is really important to local development. This implementation augments the run data, and if memory statistics are present, it fakes those statistics as user counters, which ARE displayed by benchmark::ConsoleReporter.
  4. memory_aware_benchmark_main - The only way I could find to inject the MemoryAwareConsoleReporter was to re-implement BENCHMARK_MAIN manually. I don't see too much of a problem doing this, since the process is very straightforward, but there is a downside: the command line arguments include modifiers for the behavior of benchmark::ConsoleReporter, and the flags that get parsed from those arguments aren't exposed in the Benchmark API. Since I'm instantiating my custom reporter class manually, and I don't have the flags that Benchmark parsed, I can't communicate them to the reporter. This isn't a big deal because the flags only seem to affect formatting, and we haven't used them in the past. I'm pretty sure we don't actually have a way to affect those command line arguments without hard-coding them into ament_cmake_google_benchmark anyway.
  5. mimick_utils - There are two things here that are used to improve Mimick's abilities. They might be candidates for cleanup and submission upstream.
    • mmk_wrapper and mmk_stub_create_wrapped are used to create a non-member function that uses a class instance passed through the Mimick stub's context to invoke a public member function on the instance. There is some pretty thick template code here, but I used it as a learning exercise to generalize the concept I had originally implemented manually for each of the three functions we needed to stub.
    • mmk_allocator is a simple C++ allocator that should redirect allocation operations directly to the Mimick "vital" allocation methods, thereby ensuring that those operations aren't redirected to stub functions. I'm using this to make sure that the pointer map that's used for allocation size tracking doesn't affect the memory statistics when it grows.
  6. benchmark_malloc_realloc_mimick - This is really just a proof-of-concept to show that everything mentioned above is working, and measure the performance to compare with our current solution. Since it isn't using the add_performance_test macro, it's running on all platforms. It's mostly a copy/paste from benchmark_malloc_realloc, but I modified it to run a third test where the memory measurements are forced on, so that we can see how much overhead MimickMemoryManager is causing. Keep in mind that registering a benchmark::MemoryManager is not the same as starting it. Normally, the Benchmark test runner specifically starts the benchmark::MemoryManager to do a small number of runs that are not timed. In this "forced" test, I'm turning them on explicitly. That should never happen during typical use, but I wanted to see the performance.

I think we need to do two things before we can move forward here:

  1. We need to decide whether we want to push for an upstream change to benchmark::ConsoleReporter to show the memory statistics or if we're OK with the current solution based on MemoryAwareConsoleReporter and memory_aware_benchmark_main.
  2. We need to decide whether we want to continue using a fixture to enable/disable memory recording. If we opt to go down the path that benchmark::MemoryManager seems to be designed to do, where the memory statistics are always collected, we'll need to figure out how to get Mimick to play nice with SHARED libraries. The fixture-based approach doesn't strictly require that because unlike the libbenchmark_main replacement which must be SHARED, the fixture can be STATIC, but it might be nice to build the fixture as SHARED anyway.
  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@cottsay cottsay added the enhancement New feature or request label Nov 21, 2020
@cottsay cottsay requested a review from brawner November 21, 2020 08:30
@cottsay cottsay self-assigned this Nov 21, 2020
@cottsay cottsay force-pushed the cottsay/mimick branch 2 times, most recently from e61c4b8 to 3d60945 Compare November 22, 2020 19:34
More testing is needed before we switch over to this new memory
recording approach, but I added a test demonstrating how to use it right
now.

The Benchmark MemoryManager system exposes the metrics under different
names, and unfortunately, only the JSON reporter is currently writing
them out.

Signed-off-by: Scott K Logan <logans@cottsay.net>
The provided console reporter in Benchmark doesn't display memory
statistics at all. This change adds two things:
1. A BenchmarkReporter which augments the run data so that the memory
   statistics are shown as user counters (even though they're not).
2. A modified version of libbenchmark_main.so which utilizes (1) in
   place of the default console reporter.

Note that use of (2) means that the command line arguments that augment
the behavior of the console output won't work. The API doesn't provide a
mechanism to get those options after they've been parsed.

Also note that the JSON reporter does process memory statistics, so no
change is necessary there.

Signed-off-by: Scott K Logan <logans@cottsay.net>
@cottsay
Copy link
Member Author

cottsay commented Nov 22, 2020

I had to drop the lock_guards because it appears that Microsoft's implementation performs heap allocation in the constructor or something, without any API for changing to the custom allocator. None of these function should be throwing exceptions anyway, so I just switched it to plain ol' lock()/unlock().

Also, the Microsoft implementation of std::map performs a heap allocation in the constructor (where the GNU implementation does not appear to), and this happens long before Start() is ever called, so I needed to perform a Mimick operation just to get it to initialize it's "vital" functions. It would be nice if an explicit method existed to do this.

I'm still having issues with aborts in Windows in my local builds, but ci.ros2.org seems to show that it's working as expected in both Release and Debug modes. Maybe we need to update Visual Studio on icecube.

Copy link
Contributor

@brawner brawner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really excited about those results. Showing practically no timing overhead during the timing runs, but also getting correct memory allocation results. Very cool. It might be helpful to run a benchmark with thousands of allocations to match creating/destroying a node to see if the pointer map is prohibitively expensive.

memory_manager->Reset();
}

void MimickPerformanceTest::set_are_allocation_measurements_active(bool value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth renaming this function since it's no longer a simple setter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about PauseRecording/ResumeRecording to mirror benchmark::State's PauseTiming/ResumeTiming?

@brawner
Copy link
Contributor

brawner commented Nov 23, 2020

I also don't have any good feedback to provide about the two main questions you are seeking answers to. Unfortunately.

Tagging @hidmic to get his eyes on this PR to see if he had any general recommendations.

Signed-off-by: Scott K Logan <logans@cottsay.net>
@cottsay
Copy link
Member Author

cottsay commented Nov 23, 2020

It might be helpful to run a benchmark with thousands of allocations to match creating/destroying a node to see if the pointer map is prohibitively expensive.

Good idea, I'll look into that. I'm not sure I have an alternative if it isn't - I don't see any other way to track the high water mark memory usage without it (unless I'm missing something obvious).

The map is the only reason I'm using a mutex, otherwise atomic variables would be sufficient.

@brawner
Copy link
Contributor

brawner commented Nov 23, 2020

I don't see any other way to track the high water mark memory usage without it (unless I'm missing something obvious).

What about allocating sizeof(size_t) + size and using part of the allocated space for the highwater mark tracking? Then the on_malloc/on_realloc would just return ptr + sizeof(size_t)

@cottsay
Copy link
Member Author

cottsay commented Nov 23, 2020

Tagging @hidmic to get his eyes on this PR to see if he had any general recommendations.

There are four Mimick or mimick_vendor changes that would benefit this effort:

  1. Ability to build a SHARED library. Mimick is explicitly STATIC, which I don't really care about, but it doesn't build with PIC, so it forces downstream artifacts to also be STATIC (again, unless I'm missing something obvious). Maybe it's a technical limitation that it needs to be static - I'm not sure.
  2. Explicit initialization function. I need the "vital" functions to be initialized for the mmk_allocator to work. What I have right now is hacky - it would be nice if mmk_init was callable. Alternatively, we could submit the mmk_allocator upstream - it might be useful to others.
  3. There are symbols getting defined when you include mimick.h. You can reproduce this by creating two empty cpp files that both #include "mimick.h" and try to compile them into the same executable or library. There is a symbol collision (only C++ - C works just fine).
  4. The headers getting installed don't appear to have the right hierarchy. I would expect /opt/ros/rolling/include/mimick.h and /opt/ros/rolling/include/mimick/<others>.h so that I could compile the Mimick sample as-is. Right now, we're doing #include "mimick/mimick.h", where the sample makes it look like we should just #include <mimick.h>.

@cottsay
Copy link
Member Author

cottsay commented Nov 23, 2020

What about allocating sizeof(size_t) + size and using part of the allocated space for the highwater mark tracking? Then the on_malloc/on_realloc would just return ptr + sizeof(size_t)

We'd have to reverse it, but maybe. If we allocated sizeof(size_t) + ptr and stored the size BEFORE the allocation, and then whenever we get a free we subtract sizeof(size_t) from the pointer to figure how much memory we're about to free, we could do it without the map.

There's one huge catch though - if ANY memory is freed during the test that wasn't allocated during the test, we'll probably segfault. Which is really not ideal.

@cottsay
Copy link
Member Author

cottsay commented Nov 23, 2020

@brawner - we'd have to make explicit implementations for each platform, but this may be an option for getting rid of the map: https://stackoverflow.com/a/1281720

If you think about it though, our high water mark statistic is still going to get messed up if you try to free something that wasn't allocated when timing was enabled. The whole "pause recording" think could make it even worse. I'm not sure we can avoid at least maintaining a list of what allocations were performed while recording is enabled...

@brawner
Copy link
Contributor

brawner commented Nov 23, 2020

That's a pretty good concern I didn't fully think through. I agree that it gets tricky during paused recording. It might still have to allocate the space and just set the memory size to 0 so we know it's an untracked pointer. That doesn't solve the allocate before tracking and free during issue.

I'm not expecting std::map to be a problem for something less that 10,000 items though, just thought it would be good to get a rough handle on impact.

@cottsay
Copy link
Member Author

cottsay commented Nov 23, 2020

Another thing to think about is that we've talked about the need to filter memory recording based on which thread is performing the operation. Depending on exactly which way we implement this (i.e. allow-list or block-list of thread IDs), we'll probably have a std::set that will need to be protected as well. Might be easier to just keep the mutex.

Also store the size of the allocation at the beginning of the
allocation. Note that this could fail catastrophically if the code under
test overwrites that block, but this approach appears to be more
performant.

Signed-off-by: Scott K Logan <logans@cottsay.net>
Signed-off-by: Scott K Logan <logans@cottsay.net>
The calloc function isn't one of the Mimick "vital" functions, so we
don't have a safe function to call in our stub. Since timing performance
isn't really a concern here, and our underlying infrastructure can't
handle operations bigger than size_t anyway, I just made the stub call
malloc and memset to get similar behavior.

Signed-off-by: Scott K Logan <logans@cottsay.net>
Signed-off-by: Scott K Logan <logans@cottsay.net>
Copy link

@hidmic hidmic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ability to build a SHARED library. Mimick is explicitly STATIC, which I don't really care about, but it doesn't build with PIC, so it forces downstream artifacts to also be STATIC (again, unless I'm missing something obvious). Maybe it's a technical limitation that it needs to be static - I'm not sure.

I wouldn't expect havoc if we flip -fPIC on, but I haven't tried. Perhaps I'm missing some obscure detail too.

Explicit initialization function. I need the "vial" functions to be initialized for the mmk_allocator to work. What I have right now is hacky - it would be nice if mmk_init was callable. Alternatively, we could submit the mmk_allocator upstream - it might be useful to others.

Pushing mmk_allocator into mimick sounds reasonable.

There are symbols getting defined when you include mimick.h. You can reproduce this by creating two empty cpp files that both #include "mimick.h" and try to compile them into the same executable or library. There is a symbol collision (only C++ - C works just fine).

I bet it's bad inlining within literal.h.

The headers getting installed don't appear to have the right hierarchy. I would expect /opt/ros/rolling/include/mimick.h and /opt/ros/rolling/include/mimick/.h so that I could compile the Mimick sample as-is. Right now, we're doing #include "mimick/mimick.h", where the sample makes it look like we should just #include <mimick.h>.

Hmm, not sure I follow you here. I can see #include "mimick/mimick.h" in the sample. This is something we did to please cpplint.

&std::remove_reference<decltype(*inst)>::type::func, \
std::remove_reference<decltype(*inst)>::type, \
__VA_ARGS__>, \
inst);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cottsay meta: some of this machinery is already available at brawner/test_mocking_utils#1. This one's neat though, perhaps we can mix and match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any plans of making test_mocking_utils official. So if you want to make use of the ideas and steal code, feel free to incorporate them here.

decltype(benchmark::MemoryManager::Result::max_bytes_used) max_bytes_used;
decltype(benchmark::MemoryManager::Result::num_allocs) num_allocs;
std::unique_ptr<std::unordered_set<void *, std::hash<void *>, std::equal_to<void *>,
mmk_allocator<void *>>> ptr_set;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cottsay why the extra level of indirection using std::unique_ptr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that so that I could forward-declare mmk_allocator and avoid including "mimick.h" in this public API header.

@cottsay
Copy link
Member Author

cottsay commented Dec 1, 2020

Hmm, not sure I follow you here. I can see #include "mimick/mimick.h" in the sample. This is something we did to please cpplint.

I didn't know we'd forked Mimick. I was looking at the upstream: https://github.com/Snaipe/Mimick/blob/87e9898ebc9f1644c00df74dc1b34bf20391661b/sample/strdup/test.c#L1-L2

@hidmic
Copy link

hidmic commented Dec 1, 2020

I didn't know we'd forked Mimick.

Oh, yes. We did.

I should spend time submitting our changes upstream though...

Signed-off-by: Scott K Logan <logans@cottsay.net>
After showing that this approach works, this commit actually swaps the
PerformanceTestFixture from using osrf_testing_tools_cpp to using
Mimick.

Signed-off-by: Scott K Logan <logans@cottsay.net>
Signed-off-by: Scott K Logan <logans@cottsay.net>
@cottsay
Copy link
Member Author

cottsay commented Dec 3, 2020

After switching:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@cottsay
Copy link
Member Author

cottsay commented Dec 3, 2020

Alright, this isn't working out to be quite as easy as I thought it would. In the main test case, stubbing the four memory functions works just fine, but it doesn't appear to work in all cases. I tried running the downstream benchmarks for other packages with this change, and the stubbing fails with -ENOENT. It appears to do this across all platforms.

I tried a rather contrived change to iterate over each of the loaded libraries and specifically tried stubbing the memory functions in each one, but that seemed to be a little TOO aggressive. I had to specifically skip libstdc++.so to avoid failing memory operations. Additionally, there are failing alignment assertions coming out of Fast-DDS somewhere - I expect that there was a special allocation function that got stubbed and re-routed to the real allocation function that wasn't imposing the same constraints.

I'm not really sure where to go from here. Maybe Mimick isn't suited to do what we want it to here.

@audrow audrow changed the base branch from main to rolling June 28, 2022 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants