Commit | Line | Data |
---|---|---|
f0ed8226 MSO |
1 | nedalloc v1.05 15th June 2008: |
2 | -=-=-=-=-=-=-=-=-=-=-=-=-=-=-= | |
3 | ||
4 | by Niall Douglas (http://www.nedprod.com/programs/portable/nedmalloc/) | |
5 | ||
6 | Enclosed is nedalloc, an alternative malloc implementation for multiple | |
7 | threads without lock contention based on dlmalloc v2.8.4. It is more | |
8 | or less a newer implementation of ptmalloc2, the standard allocator in | |
9 | Linux (which is based on dlmalloc v2.7.0) but also contains a per-thread | |
10 | cache for maximum CPU scalability. | |
11 | ||
12 | It is licensed under the Boost Software License which basically means | |
13 | you can do anything you like with it. This does not apply to the malloc.c.h | |
14 | file which remains copyright to others. | |
15 | ||
16 | It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64) | |
17 | and Apple Mac OS X (x86). It works well on all of these platforms and is | |
18 | significantly faster than the system allocator on each of them. | |
19 | ||
20 | By literally dropping in this allocator as a replacement for your system | |
21 | allocator, you can see real world speed improvements of up to three times in | |
22 | normal code! | |
23 | ||
24 | To use: | |
25 | -=-=-=- | |
26 | Drop nedmalloc.h, nedmalloc.c and malloc.c.h into your project. | |
27 | Configure using the instructions in nedmalloc.h. Run and enjoy. | |
28 | ||
29 | To test, compile test.c. It will run a comparison between your system | |
30 | allocator and nedalloc and tell you how much faster nedalloc is. It also | |
31 | serves as an example of usage. | |
32 | ||
33 | Notes: | |
34 | -=-=-= | |
35 | If you want the very latest version of this allocator, get it from the | |
36 | TnFOX SVN repository at svn://svn.berlios.de/viewcvs/tnfox/trunk/src/nedmalloc | |
37 | ||
38 | Because of how nedalloc allocates an mspace per thread, it can cause | |
39 | severe bloating of memory usage under certain allocation patterns. | |
40 | You can substantially reduce this wastage by setting MAXTHREADSINPOOL | |
41 | or the threads parameter to nedcreatepool() to a fraction of the number of | |
42 | threads which would normally be in a pool at once. This will reduce | |
43 | bloating at the cost of an increase in lock contention. If an allocation's | |
44 | size is less than THREADCACHEMAX, locking is avoided 90-99% of the time, so | |
45 | if most of your allocations are below this value, you can safely set | |
46 | MAXTHREADSINPOOL to one. | |
47 | ||
48 | You will suffer memory leakage unless you call neddisablethreadcache() | |
49 | per pool for every thread which exits. This is because nedalloc cannot | |
50 | portably know when a thread exits and thus when its thread cache can | |
51 | be returned for use by other code. Don't forget pool zero, the system pool. | |
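
One way to arrange the per-thread cleanup described above, on POSIX systems only, is a thread-specific-data destructor, which pthreads runs when a thread exits. The following is a minimal sketch under that assumption, not part of nedalloc itself: release_thread_caches() is a hypothetical placeholder that only counts calls, and in a real build its body would call neddisablethreadcache() once for each pool the thread used, including the system pool. All other names here are illustrative as well.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_key_t exit_key;
static pthread_once_t exit_key_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int caches_released = 0;   /* demo counter only */

/* Hypothetical placeholder: a real build would call neddisablethreadcache()
   here for every pool this thread touched, including the system pool. */
static void release_thread_caches(void)
{
    pthread_mutex_lock(&counter_lock);
    ++caches_released;
    pthread_mutex_unlock(&counter_lock);
}

/* Run by pthreads on the exiting thread itself. */
static void on_thread_exit(void *unused)
{
    (void)unused;
    release_thread_caches();
}

static void make_exit_key(void)
{
    int rc = pthread_key_create(&exit_key, on_thread_exit);
    assert(rc == 0);
    (void)rc;
}

/* Call once near the top of every thread function. */
void register_thread_exit_hook(void)
{
    pthread_once(&exit_key_once, make_exit_key);
    /* Destructors only run for keys holding a non-NULL value. */
    pthread_setspecific(exit_key, (void *)1);
}

static void *worker(void *arg)
{
    (void)arg;
    register_thread_exit_hook();
    /* ... allocate and free memory here ... */
    return NULL;
}

/* Runs n workers to completion and returns how many exit hooks have fired. */
int run_workers(int n)
{
    pthread_t threads[16];
    int i;
    assert(n >= 0 && n <= 16);
    for (i = 0; i < n; ++i) {
        int rc = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < n; ++i)
        pthread_join(threads[i], NULL);
    return caches_released;
}
```

The main thread does not run key destructors when the process simply returns from main(), so it would still need an explicit call before exit. Other platforms need their own thread-exit notification mechanism.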
52 | ||
53 | For C++ type allocation patterns (where the same sizes of memory are | |
54 | regularly allocated and deallocated as objects are created and destroyed), | |
55 | the threadcache always benefits performance. If however your allocation | |
56 | patterns are different, searching the threadcache may significantly slow | |
57 | down your code - as a rule of thumb, if cache utilisation is below 80% | |
58 | (see the source for neddisablethreadcache() for how to enable debug | |
59 | printing in release mode) then you should disable the thread cache for | |
60 | that thread. You can compile out the threadcache code by setting | |
61 | THREADCACHEMAX to zero. | |
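
The 80% rule of thumb can be written down as a small check. This sketch assumes that cache utilisation means the fraction of allocation requests that were satisfied from the thread cache. The two counters are hypothetical inputs that you would take from the debug printing mentioned above, not an existing nedalloc interface.

```c
#include <assert.h>

/* Returns 1 if the thread cache looks like a net loss for this thread,
   i.e. fewer than 80% of requests were served from the cache.
   cache_hits and requests are hypothetical counters. */
int threadcache_worth_disabling(unsigned long cache_hits, unsigned long requests)
{
    assert(cache_hits <= requests);
    if (requests == 0)
        return 0;   /* no data yet, so leave the cache enabled */
    return (double)cache_hits / (double)requests < 0.80;
}
```

A thread for which this returns 1 would then call neddisablethreadcache() for the affected pool.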
62 | ||
63 | Speed comparisons: | |
64 | -=-=-=-=-=-=-=-=-= | |
65 | See Benchmarks.xls for details. | |
66 | ||
67 | The enclosed test.c can do two things: it can be a torture test or a speed | |
68 | test. The speed test is designed to be a representative synthetic | |
69 | memory allocator test. It works by randomly mixing allocations with frees, | |
70 | with half of the allocation sizes being a power of two less than | |
71 | 512 bytes (to mimic C++ stack instantiated objects) and the other half | |
72 | being a simple random value less than 16KB. | |
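
The workload described above can be approximated as follows. This is an illustrative sketch and not the code from test.c: it reads the small sizes as powers of two below 512 bytes, takes 16384 bytes as the upper bound for the other half, and uses a xorshift generator and a 256-entry slot table, all of which are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOTS 256   /* number of live-block slots; arbitrary for this sketch */

/* Small xorshift PRNG so runs are repeatable; state must be non-zero. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* small != 0: a power of two from 1 to 256 (below 512 bytes).
   small == 0: a uniform value from 1 to 16383 (below 16384 bytes). */
size_t pick_alloc_size(uint32_t *state, int small)
{
    uint32_t r = next_random(state);
    if (small)
        return (size_t)1 << (r % 9);
    return (size_t)(r % 16383) + 1;
}

/* Randomly mixes allocations and frees over a table of slots, alternating
   between the two size classes. Frees everything before returning and
   reports how many allocations succeeded. */
size_t run_mixed_workload(uint32_t seed, size_t steps)
{
    void *slots[SLOTS] = {0};
    uint32_t state = seed ? seed : 1u;
    size_t allocations = 0;
    size_t i;

    for (i = 0; i < steps; ++i) {
        size_t slot = next_random(&state) % SLOTS;
        if (slots[slot]) {
            free(slots[slot]);          /* occupied slot: free it */
            slots[slot] = NULL;
        } else {                        /* empty slot: allocate into it */
            slots[slot] = malloc(pick_alloc_size(&state, (int)(i & 1)));
            if (slots[slot])
                ++allocations;
        }
    }
    for (i = 0; i < SLOTS; ++i)
        free(slots[i]);                 /* free(NULL) is a no-op */
    return allocations;
}
```

Timing run_mixed_workload() once with malloc()/free() and once with the allocator under test swapped in gives a rough comparison in the spirit of the speed test.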
73 | ||
74 | The real world code results are from Tn's TestIO benchmark. This is a | |
75 | heavily multithreaded and memory intensive benchmark with a lot of branching | |
76 | and other stuff modern processors don't like so much. As you'll note, the | |
77 | test doesn't show the benefits of the threadcache mostly due to the saturation | |
78 | of the memory bus being the limiting factor. | |
79 | ||
80 | ChangeLog: | |
81 | -=-=-=-=-= | |
82 | v1.05 15th June 2008: | |
83 | * { 1042 } Added error check for TLSSET() and TLSFREE() macros. Thanks to | |
84 | Markus Elfring for reporting this. | |
85 | * { 1043 } Fixed a segfault when freeing memory allocated using | |
86 | nedindependent_comalloc(). Thanks to Pavel Vozenilek for reporting this. | |
87 | ||
88 | v1.04 14th July 2007: | |
89 | * Fixed a bug with the new optimised implementation that failed to lock | |
90 | on a realloc under certain conditions. | |
91 | * Fixed lack of thread synchronisation in InitPool() causing pool corruption | |
92 | * Fixed a memory leak of thread cache contents on disabling. Thanks to Earl | |
93 | Chew for reporting this. | |
94 | * Added a sanity check for freed blocks being valid. | |
95 | * Reworked test.c into being a torture test. | |
96 | * Fixed GCC assembler optimisation misspecification | |
97 | ||
98 | v1.04alpha_svn915 7th October 2006: | |
99 | * Fixed failure to unlock thread cache list if allocating a new list failed. | |
749f763d | 100 | Thanks to Dmitry Chichkov for reporting this. Further thanks to Aleksey Sanin. |
f0ed8226 MSO |
101 | * Fixed realloc(0, <size>) segfaulting. Thanks to Dmitry Chichkov for |
102 | reporting this. | |
8d8136c3 | 103 | * Made config defines #ifndef so they can be overridden by the build system. |
f0ed8226 MSO |
104 | Thanks to Aleksey Sanin for suggesting this. |
105 | * Fixed deadlock in nedprealloc() due to unnecessary locking of preferred | |
106 | thread mspace when mspace_realloc() always uses the original block's mspace | |
107 | anyway. Thanks to Aleksey Sanin for reporting this. | |
108 | * Made some speed improvements by hacking mspace_malloc() to no longer lock | |
109 | its mspace, thus allowing the recursive mutex implementation to be removed | |
110 | with an associated speed increase. Thanks to Aleksey Sanin for suggesting this. | |
111 | * Fixed a bug where allocating mspaces overran its max limit. Thanks to | |
112 | Aleksey Sanin for reporting this. | |
113 | ||
114 | v1.03 10th July 2006: | |
115 | * Fixed memory corruption bug in threadcache code which only appeared with >4 | |
116 | threads and in heavy use of the threadcache. | |
117 | ||
118 | v1.02 15th May 2006: | |
119 | * Integrated dlmalloc v2.8.4, fixing the win32 memory release problem and | |
120 | improving performance still further. Speed is now up to twice the speed of v1.01 | |
121 | (average is 67% faster). | |
122 | * Fixed win32 critical section implementation. Thanks to Pavel Kuznetsov | |
123 | for reporting this. | |
124 | * Wasn't locking mspace if all mspaces were locked. Thanks to Pavel Kuznetsov | |
125 | for reporting this. | |
126 | * Added Apple Mac OS X support. | |
127 | ||
128 | v1.01 24th February 2006: | |
129 | * Fixed multiprocessor scaling problems by removing sources of cache sloshing | |
130 | * Earl Chew <earl_chew <at> agilent <dot> com> sent patches for the following: | |
131 | 1. size2binidx() wasn't working for default code path (non x86) | |
132 | 2. Fixed failure to release mspace lock under certain circumstances which | |
133 | caused a deadlock | |
134 | ||
135 | v1.00 1st January 2006: | |
136 | * First release |