
Understanding Linux virtual memory: valgrind's massif output shows major differences with and without --pages-as-heap


I'll try to write a short summary of what I learned while figuring out what's happening.
Note: this answer was made possible thanks to @Lawrence - appreciated!


Long story short

This has absolutely nothing to do with Linux/kernel (virtual) memory management, nor with std::string.
It's all about glibc's memory allocator - it simply reserves huge areas of virtual memory on the first dynamic allocation of each thread (and on later ones too, of course).


Details

MCVE

#include <thread>
#include <vector>
#include <chrono>
#include <memory>   // for std::make_unique (missing in the original snippet)
#include <cstdlib>  // for rand (missing in the original snippet)

int main() {
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < 192; ++i)
        workers.emplace_back([]{
            const auto x = std::make_unique<int>(rand());
            while (true) std::this_thread::sleep_for(std::chrono::seconds(1));
        });
    workers.back().join();
}

Please ignore the crappy handling of the threads, I wanted this to be as short as possible.

Commands

Compile: g++ --std=c++14 -fno-inline -g3 -O0 -pthread test.cpp.
Profile: valgrind --tool=massif --pages-as-heap=[no|yes] ./a.out
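View: ms_print massif.out.<pid> (massif writes its raw data to a file named massif.out.<pid>; the allocation trees below are ms_print's rendering of that file).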

Memory usage

top shows 7'815'012 KiB of virtual memory.
pmap also shows 7'815'016 KiB of virtual memory.
A similar result is shown by massif with pages-as-heap=yes: 7'817'088 KiB, see below.
On the other hand, massif with pages-as-heap=no is drastically different - around 140 KiB in total, of which ~133 KiB is useful heap!

Massif output with pages-as-heap=yes

Memory usage before killing the program:

100.00% (8,004,698,112B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.78% (7,986,741,248B) 0x54E0679: mmap (mmap.c:34)
| ->46.11% (3,690,987,520B) 0x545C3CF: new_heap (arena.c:438)
| | ->46.11% (3,690,987,520B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| |   ->46.11% (3,690,987,520B) 0x5463248: malloc (malloc.c:2911)
| |     ->46.11% (3,690,987,520B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |       ->46.11% (3,690,987,520B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| |         ->46.11% (3,690,987,520B) 0x400EC5: main::{lambda()
| |           ->46.11% (3,690,987,520B) 0x40225C: void std::_Bind_simple<main::{lambda()
| |             ->46.11% (3,690,987,520B) 0x402194: std::_Bind_simple<main::{lambda()
| |               ->46.11% (3,690,987,520B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| |                 ->46.11% (3,690,987,520B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |                   ->46.11% (3,690,987,520B) 0x51C96B8: start_thread (pthread_create.c:333)
| |                     ->46.11% (3,690,987,520B) 0x54E63DB: clone (clone.S:109)
| |
| ->33.53% (2,684,354,560B) 0x545C35B: new_heap (arena.c:427)
| | ->33.53% (2,684,354,560B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| |   ->33.53% (2,684,354,560B) 0x5463248: malloc (malloc.c:2911)
| |     ->33.53% (2,684,354,560B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |       ->33.53% (2,684,354,560B) 0x4026D0: std::_MakeUniq<int>::__single_object std::make_unique<int, int>(int&&) (unique_ptr.h:765)
| |         ->33.53% (2,684,354,560B) 0x400EC5: main::{lambda()
| |           ->33.53% (2,684,354,560B) 0x40225C: void std::_Bind_simple<main::{lambda()
| |             ->33.53% (2,684,354,560B) 0x402194: std::_Bind_simple<main::{lambda()
| |               ->33.53% (2,684,354,560B) 0x402102: std::thread::_Impl<std::_Bind_simple<main::{lambda()
| |                 ->33.53% (2,684,354,560B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |                   ->33.53% (2,684,354,560B) 0x51C96B8: start_thread (pthread_create.c:333)
| |                     ->33.53% (2,684,354,560B) 0x54E63DB: clone (clone.S:109)
| |
| ->20.13% (1,611,399,168B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
|   ->20.13% (1,611,399,168B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|     ->20.13% (1,611,399,168B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|       ->20.13% (1,611,399,168B) 0x40139A: std::thread::thread<main::{lambda()
|         ->20.13% (1,611,399,168B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
|           ->20.13% (1,611,399,168B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
|             ->19.19% (1,535,864,832B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|             | ->19.19% (1,535,864,832B) 0x400F47: main (test.cpp:10)
|             |
|             ->00.94% (75,534,336B) in 1+ places, all below ms_print's threshold (01.00%)
|
->00.22% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)

Massif output with pages-as-heap=no

Memory usage before killing the program:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 68      2,793,125          143,280          136,676         6,604            0

95.39% (136,676B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->50.74% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->50.74% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
|   ->50.74% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
|     ->50.74% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|
->36.58% (52,416B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->36.58% (52,416B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
|   ->36.58% (52,416B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|     ->36.58% (52,416B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
|       ->36.58% (52,416B) 0x40139A: std::thread::thread<main::{lambda()
|         ->36.58% (52,416B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
|           ->36.58% (52,416B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
|             ->34.77% (49,824B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|             | ->34.77% (49,824B) 0x400F47: main (test.cpp:10)
|             |
|             ->01.81% (2,592B) 0x4010FF: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
|               ->01.81% (2,592B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|                 ->01.81% (2,592B) 0x400F47: main (test.cpp:10)
|
->06.13% (8,784B) 0x401B4B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
| ->06.13% (8,784B) 0x401A60: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|   ->06.13% (8,784B) 0x40194D: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|     ->06.13% (8,784B) 0x401894: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|       ->06.13% (8,784B) 0x40183A: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|         ->06.13% (8,784B) 0x4017C7: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|           ->06.13% (8,784B) 0x4016AB: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|             ->06.13% (8,784B) 0x40155E: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<main::{lambda()
|               ->06.13% (8,784B) 0x401374: std::thread::thread<main::{lambda()
|                 ->06.13% (8,784B) 0x4012AE: _ZN9__gnu_cxx13new_allocatorISt6threadE9constructIS1_IZ4mainEUlvE_EEEvPT_DpOT0_ (new_allocator.h:120)
|                   ->06.13% (8,784B) 0x401075: _ZNSt16allocator_traitsISaISt6threadEE9constructIS0_IZ4mainEUlvE_EEEvRS1_PT_DpOT0_ (alloc_traits.h:527)
|                     ->05.83% (8,352B) 0x401009: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|                     | ->05.83% (8,352B) 0x400F47: main (test.cpp:10)
|                     |
|                     ->00.30% (432B) in 1+ places, all below ms_print's threshold (01.00%)
|
->01.43% (2,048B) 0x403432: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->01.43% (2,048B) 0x4032CF: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
|   ->01.43% (2,048B) 0x4030B8: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
|     ->01.43% (2,048B) 0x4010B6: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<main::{lambda()
|       ->01.43% (2,048B) 0x40103D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<main::{lambda()
|         ->01.43% (2,048B) 0x400F47: main (test.cpp:10)
|
->00.51% (724B) in 1+ places, all below ms_print's threshold (01.00%)

What the heck happens?

pages-as-heap=no

With pages-as-heap=no things look reasonable, so let's not dwell on it. As expected, everything ends up in malloc/new/new[], and the memory usage is small enough not to worry us - these are just the high-level allocations.

pages-as-heap=yes

But look at pages-as-heap=yes - ~8 GiB of virtual memory for this simple code?

Let's inspect the stack traces.

pthread_create

Let's start with the easier one: the stack that ends in pthread_create.

massif reports 1,611,399,168 bytes of allocated memory - this is exactly 192 * 8'196 KiB (192 * 8'196 * 1'024 B = 1,611,399,168 B), i.e. 192 threads * ~8 MiB, which is the default maximum stack size of a thread on Linux.

Note that 8'196 KiB is not exactly 8 MiB (8'192 KiB). The extra 4 KiB per thread most likely corresponds to the guard page mapped next to each thread stack (visible as the 4K ----- entries next to the 8192K stacks in the pmap output further below), but the difference is not significant at the moment.
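(Side note, not from the original investigation: to check the default on your own system, run ulimit -s in a shell, or query pthreads directly. A minimal sketch, assuming glibc >= 2.18 for the GNU extension pthread_getattr_default_np:)

#include <pthread.h>
#include <cstdio>

int main() {
    pthread_attr_t attr;
    size_t stack_size = 0;
    pthread_getattr_default_np(&attr);             // default attributes used for new threads
    pthread_attr_getstacksize(&attr, &stack_size); // 8192 KiB on a typical Linux setup
    std::printf("default stack size: %zu KiB\n", stack_size / 1024);
    pthread_attr_destroy(&attr);
}

Compile with -pthread, as before.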

std::make_unique<int>

OK, let's now look at the other two stacks... wait, are they exactly the same? Almost: they differ only in the source line inside new_heap (arena.c:438 vs arena.c:427), i.e. two call sites of the same function; massif's documentation explains why such near-identical trees are reported separately, and the distinction is not significant here. They show essentially the same call stack, so let's combine the two results and examine them together.

The memory usage from these two stacks combined is 6'375'342'080 bytes, and all of it is caused by our simple std::make_unique<int>!

Let's take a step back: if we run the same experiment but with a single worker thread, we will see that this int allocation causes 67'108'864 bytes of memory to be allocated, which is exactly 64 MiB. What is going on??
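For completeness, the single-thread experiment can be reproduced with something like this (my reconstruction, not code from the original post):

#include <thread>
#include <memory>

int main() {
    std::thread t([]{
        // The very first allocation on this thread triggers a new arena;
        // massif --pages-as-heap=yes attributes ~64 MiB of mmap'd space to it.
        auto x = std::make_unique<int>(42);
    });
    t.join();
}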

It all comes down to the implementation of malloc (new/new[] is, by default, implemented internally in terms of malloc).

Internally, glibc's malloc uses an allocator called ptmalloc2 - the default memory allocator on Linux - which supports multiple threads.

Simply put, this allocator deals with the following terms:

  • per-thread arena: a huge area of memory, usually one per thread, for performance reasons (less lock contention); not every software thread gets its own arena - the number of arenas is capped, by default in proportion to the number of cores (8 × cores on 64-bit systems);
  • heap: each arena consists of one or more heaps;
  • chunks: each heap is divided into smaller areas of memory, called chunks - the blocks that individual malloc calls hand out.

There are a lot of details behind these terms - I will post some interesting links a bit later - but this should be enough for the reader to do their own research; these are really low-level, deep parts of C/C++ memory management. If the arena behaviour is a problem for your workload, it can be capped - see the sketch right below.
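(A practical aside, beyond what I actually tested: glibc lets you cap the number of arenas, which should make these huge per-thread reservations disappear. A minimal sketch, assuming glibc >= 2.10 for mallopt's M_ARENA_MAX:)

#include <malloc.h>   // mallopt, M_ARENA_MAX (glibc-specific)

int main() {
    // Must run before the worker threads start allocating.
    // The same effect is available without recompiling via the environment
    // variable MALLOC_ARENA_MAX, e.g.: MALLOC_ARENA_MAX=1 ./a.out
    mallopt(M_ARENA_MAX, 1);
    // ... spawn the 192 workers as in the MCVE above ...
}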

So, let's go back to our test with a single thread - 64 MiB allocated for a single int?? Let's look at the stack trace again and concentrate on its end:

mmap (mmap.c:34)
new_heap (arena.c:438)
arena_get2.part.3 (arena.c:646)
malloc (malloc.c:2911)

Surprise, surprise: malloc calls arena_get2, which calls new_heap, which leads us to mmap (mmap and brk are the low-level system calls used for memory allocation on Linux). And this call is reported to reserve exactly 64 MiB of memory.
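(If you want to watch these reservations happen - my suggestion, not part of the original analysis - tracing the mmap system calls of the single-threaded variant shows the 67'108'864-byte requests:

strace -f -e trace=mmap ./a.out

The exact sizes may vary slightly, because the allocator sometimes over-reserves and trims to obtain the alignment it wants.)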

OK, let's now go back to our original example with the 192 threads and our huge number 6'375'342'080 - this is exactly 95 * 64 MiB (95 * 67'108'864 B = 6'375'342'080 B)!

Why exactly 95? I can't really say - I stopped digging at this point - but the fact that this big number is evenly divisible by 64 MiB was good enough for me.

You can dig a lot deeper, if necessary.

Useful links

Really cool explanatory article: Understanding glibc malloc, by sploitfun

A more formal/official documentation: The GNU allocator

A cool stack exchange question: How does glibc malloc work

If some of these links are broken by the time you read this post, it should be fairly easy to find similar articles. This topic is well covered, if you know what to look for and how.

Thanks

I hope these observations give a good high-level picture of the whole thing and provide enough food for further, deeper research.

Feel free to comment / (suggest) edit / correct / extend / etc.


massif with --pages-as-heap=yes and the virtual-memory column you are observing in top both measure the virtual memory used by a process. This virtual memory includes all space mmap'd by the implementation of malloc and during the creation of threads. For example, the default stack size for a thread is 8192 KiB, which is reflected in the creation of each thread and contributes to the virtual memory footprint.

The specific allocation scheme depends on the implementation, but the first heap allocation on a new thread appears to mmap a 64-megabyte region: in the pmap output below, each 132K committed (rw---) area paired with a 65404K reserved, inaccessible (-----) area adds up to exactly 65536K = 64 MiB. This can be seen by looking at the pmap output for the process.
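For example (using the pid from the excerpt below):

pmap 75170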

Excerpt from a very similar program to the example:

75170:   ./a.out
0000000000400000     24K r-x-- a.out
0000000000605000      4K r---- a.out
0000000000606000      4K rw--- a.out
0000000001b6a000    200K rw---   [ anon ]
00007f669dfa4000      4K -----   [ anon ]
00007f669dfa5000   8192K rw---   [ anon ]
00007f669e7a5000      4K -----   [ anon ]
00007f669e7a6000   8192K rw---   [ anon ]
00007f669efa6000      4K -----   [ anon ]
00007f669efa7000   8192K rw---   [ anon ]
...
00007f66cb800000   8192K rw---   [ anon ]
00007f66cc000000    132K rw---   [ anon ]
00007f66cc021000  65404K -----   [ anon ]
00007f66d0000000    132K rw---   [ anon ]
00007f66d0021000  65404K -----   [ anon ]
00007f66d4000000    132K rw---   [ anon ]
00007f66d4021000  65404K -----   [ anon ]
...
00007f6880586000   8192K rw---   [ anon ]
00007f6880d86000   1056K r-x-- libm-2.23.so
00007f6880e8e000   2044K ----- libm-2.23.so
...
00007f6881c08000      4K r---- libpthread-2.23.so
00007f6881c09000      4K rw--- libpthread-2.23.so
00007f6881c0a000     16K rw---   [ anon ]
00007f6881c0e000    152K r-x-- ld-2.23.so
00007f6881e09000     24K rw---   [ anon ]
00007f6881e33000      4K r---- ld-2.23.so
00007f6881e34000      4K rw--- ld-2.23.so
00007f6881e35000      4K rw---   [ anon ]
00007ffe9d75b000    132K rw---   [ stack ]
00007ffe9d7f8000     12K r----   [ anon ]
00007ffe9d7fb000      8K r-x--   [ anon ]
ffffffffff600000      4K r-x--   [ anon ]
 total          7815008K

It seems that malloc becomes more conservative as you approach some threshold of virtual memory per process. Also, my earlier comment about libraries being mapped separately was misguided - library mappings are shared between processes.


This is only a "kind of" answer (from the Valgrind perspective). The problem of memory pools, in particular with C++ strings, has been known for some time. The Valgrind manual has a section on leaks in C++ strings, suggesting that you try setting the GLIBCXX_FORCE_NEW environment variable.
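For example, one might run it like this (illustration only - the variable affects libstdc++'s pooled allocators, so whether it changes anything depends on the program and the libstdc++ version):

GLIBCXX_FORCE_NEW=1 valgrind --tool=massif ./a.out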

Additionally, for GCC6 and later, Valgrind has added hooks to cleanup still reachable memory in libstdc++. The Valgrind bugzilla entry is here and the GCC one is here.

That said, I don't see why such small allocations blow up to so many gigabytes (over 12 GB for a 64-bit executable; CentOS 6.6, GCC 6.2).