Does multithreading emphasize memory fragmentation? Does multithreading emphasize memory fragmentation? multithreading multithreading

Does multithreading emphasize memory fragmentation?


Ok, picked up the bait.

This is on a system with

Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz4x5666.59 bogomipsLinux meerkat 2.6.35-28-generic-pae #50-Ubuntu SMP Fri Mar 18 20:43:15 UTC 2011 i686 GNU/Linuxgcc version 4.4.5             total       used       free     shared    buffers     cachedMem:       8127172    4220560    3906612          0     374328    2748796-/+ buffers/cache:    1097436    7029736Swap:            0          0          0

Naive run

I just ran it

time ./ompmemtest Id 0 about to release all memory: 258.144 MBId 0 done, total alloc'ed -1572.7MB Id 3 about to release all memory: 257.854 MBId 3 done, total alloc'ed -1569.6MB Id 1 about to release all memory: 257.339 MBId 2 about to release all memory: 257.043 MBId 1 done, total alloc'ed -1570.42MB Id 2 done, total alloc'ed -1569.96MB real    0m13.429suser    0m44.619ssys 0m6.000s

Nothing spectacular. Here is the simultaneous output of vmstat -S M 1

Vmstat raw data

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- 0  0      0   3892    364   2669    0    0    24     0  701 1487  2  1 97  0 4  0      0   3421    364   2669    0    0     0     0 1317 1953 53  7 40  0 4  0      0   2858    364   2669    0    0     0     0 2715 5030 79 16  5  0 4  0      0   2861    364   2669    0    0     0     0 6164 12637 76 15  9  0 4  0      0   2853    364   2669    0    0     0     0 4845 8617 77 13 10  0 4  0      0   2848    364   2669    0    0     0     0 3782 7084 79 13  8  0 5  0      0   2842    364   2669    0    0     0     0 3723 6120 81 12  7  0 4  0      0   2835    364   2669    0    0     0     0 3477 4943 84  9  7  0 4  0      0   2834    364   2669    0    0     0     0 3273 4950 81 10  9  0 5  0      0   2828    364   2669    0    0     0     0 3226 4812 84 11  6  0 4  0      0   2823    364   2669    0    0     0     0 3250 4889 83 10  7  0 4  0      0   2826    364   2669    0    0     0     0 3023 4353 85 10  6  0 4  0      0   2817    364   2669    0    0     0     0 3176 4284 83 10  7  0 4  0      0   2823    364   2669    0    0     0     0 3008 4063 84 10  6  0 0  0      0   3893    364   2669    0    0     0     0 4023 4228 64 10 26  0

Does that information mean anything to you?

Google Thread Caching Malloc

Now for real fun, add a little spice

time LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtest Id 1 about to release all memory: 257.339 MBId 1 done, total alloc'ed -1570.42MB Id 3 about to release all memory: 257.854 MBId 3 done, total alloc'ed -1569.6MB Id 2 about to release all memory: 257.043 MBId 2 done, total alloc'ed -1569.96MB Id 0 about to release all memory: 258.144 MBId 0 done, total alloc'ed -1572.7MB real    0m11.663suser    0m44.255ssys 0m1.028s

Looks faster, not?

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- 4  0      0   3562    364   2684    0    0     0     0 1041 1676 28  7 64  0 4  2      0   2806    364   2684    0    0     0   172 1641 1843 84 14  1  0 4  0      0   2758    364   2685    0    0     0     0 1520 1009 98  2  1  0 4  0      0   2747    364   2685    0    0     0     0 1504  859 98  2  0  0 5  0      0   2745    364   2685    0    0     0     0 1575 1073 98  2  0  0 5  0      0   2739    364   2685    0    0     0     0 1415  743 99  1  0  0 4  0      0   2738    364   2685    0    0     0     0 1526  981 99  2  0  0 4  0      0   2731    364   2685    0    0     0   684 1536  927 98  2  0  0 4  0      0   2730    364   2685    0    0     0     0 1584 1010 99  1  0  0 5  0      0   2730    364   2685    0    0     0     0 1461  917 99  2  0  0 4  0      0   2729    364   2685    0    0     0     0 1561 1036 99  1  0  0 4  0      0   2729    364   2685    0    0     0     0 1406  756 100  1  0  0 0  0      0   3819    364   2685    0    0     0     4 1159 1476 26  3 71  0

In case you wanted to compare vmstat outputs

Valgrind --tool massif

This is the head of output from ms_print after valgrind --tool=massif ./ompmemtest (default malloc):

--------------------------------------------------------------------------------Command:            ./ompmemtestMassif arguments:   (none)ms_print arguments: massif.out.beforetcmalloc--------------------------------------------------------------------------------    GB1.009^                                                                     :       |       ##::::@@:::::::@@::::::@@::::@@::@::::@::::@:::::::::@::::::@:::      |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::      |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::      |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::      |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@:::      |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |   ::::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::   0 +----------------------------------------------------------------------->Gi     0                                                                   264.0Number of snapshots: 63 Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]

Google HEAPPROFILE

Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftools

gcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtesttime HEAPPROFILE=/tmp/heapprofile ./ompmemtestStarting tracking the heapDumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use)Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use)Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use)Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use)Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use)Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use)Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use)Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use)Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use)Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use)Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use)Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use)Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use)Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use)Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use)Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use)Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use)Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use)Id 0 about to release all memory: 258.144 MBId 0 done, total alloc'ed -1572.7MB Id 2 about to release all memory: 257.043 MBId 2 done, total alloc'ed -1569.96MB Id 3 about to release all memory: 257.854 MBId 3 done, total alloc'ed -1569.6MB Id 1 about to release all memory: 257.339 MBId 1 done, total alloc'ed -1570.42MB Dumping heap profile to /tmp/heapprofile.0019.heap (Exiting)real    0m11.981suser    0m44.455ssys 0m1.124s

Contact me for full logs/details

Update

To the comments: I updated the program

--- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200+++ q/openMpMemtest_Linux.cpp   2011-05-04 13:42:47.371726000 +0200@@ -13,8 +13,8 @@ void runParallelAllocTest() {    // constants-   const int  NUM_ALLOCATIONS = 5000; // alloc's per thread-   const int  NUM_THREADS = 4;       // how many threads?+   const int  NUM_ALLOCATIONS = 55000; // alloc's per thread+   const int  NUM_THREADS = 8;        // how many threads?    const int  NUM_ITERS = NUM_THREADS;// how many overall repetions    const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)

It ran for over 5m3s. Close to the end, a screenshot of htop teaches that indeed, the reserved set is slightly higher, going towards 2.3g:

  1  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Tasks: 125 total, 2 running  2  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Load average: 8.09 5.24 2.37   3  [||||||||||||||||||||||||||||||||||||||||||||||||||97.4%]     Uptime: 01:54:22  4  [||||||||||||||||||||||||||||||||||||||||||||||||||96.1%]  Mem[|||||||||||||||||||||||||||||||             3055/7936MB]  Swp[                                                  0/0MB]  PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command 4330 sehe        8  20   0 2635M 2286M   908 R 368. 28.8 15:35.01 ./ompmemtest

Comparing results with a tcmalloc run: 4m12s, similar top stats has minor differences; the big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process?). The RES set is quite similar, if you ask me. The more important thing to note is parallellism is increased; all cores are now maxed out. This is obviously due to reduced need to lock for heap operations when using tcmalloc:

If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the applications.

  1  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 172 total, 2 running  2  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Load average: 7.39 2.92 1.11   3  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Uptime: 11:12:25  4  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]  Mem[||||||||||||||||||||||||||||||||||||||||||||              3278/7936MB]  Swp[                                                                0/0MB]  PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command14391 sehe        8  20   0 2251M 2179M  1148 R 379. 27.5  8:08.92 ./ompmemtest


When linking the test program with google's tcmalloc library, the executable doesn't only run ~10% faster, but shows greatly reduced or insignificant memory fragmentation as well:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND13441 byron     20   0  379m 334m 1220 R  187  8.4   0:02.63 ompmemtestgoogle                                                                        13441 byron     20   0 1085m 1.0g 1220 R  194 26.2   0:08.52 ompmemtestgoogle                                                                        13441 byron     20   0 1111m 1.0g 1220 R  195 26.9   0:14.42 ompmemtestgoogle                                                                        13441 byron     20   0 1131m 1.1g 1220 R  195 27.4   0:20.30 ompmemtestgoogle                                                                        13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:26.19 ompmemtestgoogle                                                                        13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:32.05 ompmemtestgoogle                                                                        13441 byron     20   0 1149m 1.1g 1220 R  191 27.9   0:37.81 ompmemtestgoogle                                                                        13441 byron     20   0 1149m 1.1g 1220 R  194 27.9   0:43.66 ompmemtestgoogle                                                                        13441 byron     20   0 1161m 1.1g 1220 R  188 28.2   0:49.32 ompmemtestgoogle                                                                        13441 byron     20   0 1161m 1.1g 1220 R  194 28.2   0:55.15 ompmemtestgoogle                                                                        13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:00.90 ompmemtestgoogle                                                                        13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:06.64 ompmemtestgoogle                                                                        13441 byron     20   0 1161m 1.1g 1356 R  192 28.2   1:12.42 ompmemtestgoogle

From the data I have, the answer appears to be:

Multithreaded access to the heap can emphasize fragmentation if the employed heap library does not deal well with concurrent access and if the processor fails to execute the threads truly concurrently.

The tcmalloc library shows no significant memory fragmentation running the same program that previously caused ~400MB to be lost in fragmentation.

But why does that happen ?

The best idea I have to offer here is some sort of locking artifact within the heap.

The test program will allocate randomly sized blocks of memory, freeing up blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory which is in a heap block on the 'left', it might actually be halted as another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even read that heap block on the 'left' side to check for free memory as it is currently being changed. Hence it might end up using a new heap block unnecessarily from the 'right'.

This process could look like a heap-block-shifting, where the the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.

Lets restate that this fragmentation issue only occurs for me if I use 4 or more threads on a dual core system which can only handle two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap will be held short enough not to block the other thread who wants to allocate memory.

Also, as a disclaimer, I didn't check the actual code of the glibc heap implementation, nor am I anything more than novice in the field of memory allocators - all I wrote is just how it appears to me which makes it pure speculation.

Another interesting read might be the tcmalloc documentation, which states common problems with heaps and multi-threaded access, some of which may have played their role in the test program too.

Its worth noting that it will never return memory to the system (see Caveats paragraph in tcmalloc documentation)


Yes the default malloc (Depending on linux version) does some crazy stuff which fails massively in some multi threaded applications. Specifically it keeps almost per thread heaps (arenas) to avoid locking. This is much faster than a single heap for all threads, but massively memory inefficient (sometimes). You can tune this by using code like this which turns off the multiple arenas (this kills performance so don't do this if you have lots of small allocations!)

rv = mallopt(-7, 1);  // M_ARENA_TESTrv = mallopt(-8, 1);  // M_ARENA_MAX

Or as others suggested using various replacements for malloc.

Basically it's impossible for a general purpose malloc to always be efficient as it doesn't know how it's going to be used.

ChrisP.