nedalloc v1.05 15th June 2008:
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Niall Douglas (http://www.nedprod.com/programs/portable/nedmalloc/)

Enclosed is nedalloc, an alternative malloc implementation designed for
multithreaded use without lock contention. It is based on dlmalloc v2.8.4
and is more or less a newer implementation of ptmalloc2, the standard
allocator in Linux (which is based on dlmalloc v2.7.0), but it also contains
a per-thread cache for maximum CPU scalability.

It is licensed under the Boost Software License which basically means
you can do anything you like with it. This does not apply to the malloc.c.h
file which remains copyright to others.

It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64)
and Apple Mac OS X (x86). It works very well on all of these platforms and is
significantly faster than the system allocator on each of them.

By literally dropping in this allocator as a replacement for your system
allocator, you can see real world speed improvements of up to three times in
normal code!

To use:
-=-=-=-
Drop nedmalloc.h, nedmalloc.c and malloc.c.h into your project.
Configure using the instructions in nedmalloc.h. Run and enjoy.
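
As a rough illustration of the drop-in step, the sketch below routes
allocations through a pair of wrapper macros. USE_NEDMALLOC, app_malloc,
app_free and app_strdup are hypothetical names invented for this example;
only nedmalloc() and nedfree() are assumed to come from nedmalloc.h. Without
the flag the code falls back to the system allocator, so it builds on its own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* USE_NEDMALLOC is a hypothetical build flag, not something nedmalloc.h
   defines. Set it once nedmalloc.c is part of the build. */
#ifdef USE_NEDMALLOC
#include "nedmalloc.h"
#define app_malloc(size) nedmalloc(size)
#define app_free(ptr)    nedfree(ptr)
#else
#define app_malloc(size) malloc(size)
#define app_free(ptr)    free(ptr)
#endif

/* Duplicate a string through whichever allocator was selected above.
   Returns NULL if the allocation fails; release the result with app_free(). */
static char *app_strdup(const char *text)
{
    size_t len = strlen(text) + 1;
    char *copy = (char *)app_malloc(len);
    if (copy != NULL)
        memcpy(copy, text, len);
    return copy;
}
```

Memory obtained through app_malloc() must be released through app_free(), so
that a block is never handed to a different allocator than the one that
created it.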

To test, compile test.c. It will run a comparison between your system
allocator and nedalloc and tell you how much faster nedalloc is. It also
serves as an example of usage.

Notes:
-=-=-=
If you want the very latest version of this allocator, get it from the
TnFOX SVN repository at svn://svn.berlios.de/viewcvs/tnfox/trunk/src/nedmalloc

Because of how nedalloc allocates an mspace per thread, it can cause
severe bloating of memory usage under certain allocation patterns.
You can substantially reduce this wastage by setting MAXTHREADSINPOOL
or the threads parameter to nedcreatepool() to a fraction of the number of
threads which would normally be in a pool at once. This will reduce
bloating at the cost of an increase in lock contention. If the allocated
size is less than THREADCACHEMAX, locking is avoided 90-99% of the time, so
if most of your allocations are below this value you can safely set
MAXTHREADSINPOOL to one.
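
One way to pick "a fraction of the number of threads" is a small helper like
the one below. pool_threads_for() and its divisor argument are hypothetical;
the result is meant to be used as the threads parameter of nedcreatepool(),
or as a guide when choosing MAXTHREADSINPOOL at build time.

```c
#include <assert.h>

/* Hypothetical helper: pick a pool thread count that is a fraction of the
   threads expected to use the pool at once. A divisor of 1 keeps one slot
   per thread; larger divisors accept more lock contention in exchange for
   less memory bloat. The result is never less than one. */
static int pool_threads_for(int expected_threads, int divisor)
{
    int threads;
    if (expected_threads < 1 || divisor < 1)
        return 1;
    threads = expected_threads / divisor;
    return threads < 1 ? 1 : threads;
}
```

For example, pool_threads_for(16, 4) yields 4, while any case where the
division would round down to zero yields 1.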

You will suffer memory leakage unless you call neddisablethreadcache()
per pool for every thread which exits. This is because nedalloc cannot
portably know when a thread exits and thus when its thread cache can
be returned for use by other code. Don't forget pool zero, the system pool.
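
On POSIX systems, one possible way to make sure the cleanup runs for every
exiting thread is a thread-specific key with a destructor. This is a sketch
under that assumption, not something nedalloc provides: all names below are
hypothetical, and the counter stands in for the real neddisablethreadcache()
calls so that the example builds without the library.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_key_t cleanup_key;
static pthread_once_t cleanup_once = PTHREAD_ONCE_INIT;
static atomic_int cleanup_calls;  /* counts destructor runs, for illustration */

/* Runs in each exiting thread that registered itself below. */
static void on_thread_exit(void *unused)
{
    (void)unused;
    /* With nedalloc linked in, the per-pool calls would go here, for
       example neddisablethreadcache(0) for the system pool. */
    atomic_fetch_add(&cleanup_calls, 1);
}

static void make_key(void)
{
    pthread_key_create(&cleanup_key, on_thread_exit);
}

/* Call once near the start of every thread that allocates. */
static void register_thread_cleanup(void)
{
    pthread_once(&cleanup_once, make_key);
    /* The destructor only fires for a non-NULL value. */
    pthread_setspecific(cleanup_key, &cleanup_key);
}

static void *worker(void *arg)
{
    register_thread_cleanup();
    /* ... allocate and free memory here ... */
    return arg;
}

/* Start and join up to 16 workers; return how many cleanups have run. */
static int run_workers(int count)
{
    pthread_t threads[16];
    int started = 0, i;
    if (count > 16)
        count = 16;
    for (i = 0; i < count; i++)
        if (pthread_create(&threads[started], NULL, worker, NULL) == 0)
            started++;
    for (i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&cleanup_calls);
}
```

The destructor approach also covers threads that leave through
pthread_exit(), which a call placed at the end of the thread function would
miss.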

For C++ type allocation patterns (where the same sizes of memory are
regularly allocated and deallocated as objects are created and destroyed),
the threadcache always benefits performance. If however your allocation
patterns are different, searching the threadcache may significantly slow
down your code - as a rule of thumb, if cache utilisation is below 80%
(see the source for neddisablethreadcache() for how to enable debug
printing in release mode) then you should disable the thread cache for
that thread. You can compile out the threadcache code by setting
THREADCACHEMAX to zero.
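
The 80% rule of thumb can be written down as a simple check. The sketch
assumes that utilisation means the share of allocation requests satisfied
from the thread cache; the two counters are hypothetical and would have to be
taken from the debug output mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counters: `requests` is the number of allocation requests
   seen by the thread cache, `hits` how many of them it satisfied.
   Returns true when utilisation is below 80%. Integer arithmetic keeps the
   boundary case exact. With no data yet, the cache is left enabled. */
static bool should_disable_thread_cache(unsigned long hits,
                                        unsigned long requests)
{
    if (requests == 0)
        return false;
    return hits * 100 < requests * 80;
}
```

A thread for which this returns true is a candidate for
neddisablethreadcache().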

Speed comparisons:
-=-=-=-=-=-=-=-=-=
See Benchmarks.xls for details.

The enclosed test.c can do two things: it can be a torture test or a speed
test. The speed test is designed to be a representative synthetic
memory allocator test. It works by randomly mixing allocations with frees,
with half of the allocation sizes being a power of two less than 512 bytes
(to mimic C++ stack instantiated objects) and the other half being a simple
random value less than 16KB.
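
The size mix described above might look roughly like the following. This is
an independent sketch written against the system allocator, not the code in
test.c; the generator, the slot table and all function names are assumptions
made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Small deterministic generator so that runs are repeatable. */
static unsigned next_random(unsigned *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Half the sizes are powers of two below 512 bytes, the other half are
   random values below 16 KB. */
static size_t pick_size(unsigned *state)
{
    if (next_random(state) & 1u)
        return (size_t)1 << (next_random(state) % 9u);  /* 1, 2, ..., 256 */
    return 1u + next_random(state) % 16383u;            /* 1 .. 16383 */
}

/* Randomly mix allocations and frees over a fixed table of slots.
   Returns the number of successful allocations. */
static unsigned run_mix(unsigned operations, unsigned seed)
{
    enum { SLOTS = 64 };
    void *slots[SLOTS] = { NULL };
    unsigned state = seed, allocations = 0, i;

    for (i = 0; i < operations; i++) {
        unsigned slot = next_random(&state) % SLOTS;
        if (slots[slot] != NULL) {       /* occupied: free it */
            free(slots[slot]);
            slots[slot] = NULL;
        } else {                         /* empty: allocate into it */
            slots[slot] = malloc(pick_size(&state));
            if (slots[slot] != NULL)
                allocations++;
        }
    }
    for (i = 0; i < SLOTS; i++)          /* release whatever is left */
        free(slots[i]);
    return allocations;
}
```

Timing run_mix() once with the system allocator and once with nedalloc
substituted for malloc() and free() would give a comparison in the spirit of
the speed test.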

The real world code results are from Tn's TestIO benchmark. This is a
heavily multithreaded and memory intensive benchmark with a lot of branching
and other stuff modern processors don't like so much. As you'll note, the
test doesn't show the benefits of the threadcache, mostly because saturation
of the memory bus is the limiting factor.

ChangeLog:
-=-=-=-=-=
v1.05 15th June 2008:
 * { 1042 } Added error check for TLSSET() and TLSFREE() macros. Thanks to
Markus Elfring for reporting this.
 * { 1043 } Fixed a segfault when freeing memory allocated using
nedindependent_comalloc(). Thanks to Pavel Vozenilek for reporting this.

v1.04 14th July 2007:
 * Fixed a bug with the new optimised implementation that failed to lock
on a realloc under certain conditions.
 * Fixed lack of thread synchronisation in InitPool() causing pool corruption.
 * Fixed a memory leak of thread cache contents on disabling. Thanks to Earl
Chew for reporting this.
 * Added a sanity check for freed blocks being valid.
 * Reworked test.c into being a torture test.
 * Fixed GCC assembler optimisation misspecification.

v1.04alpha_svn915 7th October 2006:
 * Fixed failure to unlock thread cache list if allocating a new list failed.
Thanks to Dmitry Chichkov for reporting this. Further thanks to Aleksey Sanin.
 * Fixed realloc(0, <size>) segfaulting. Thanks to Dmitry Chichkov for
reporting this.
 * Made config defines #ifndef so they can be overridden by the build system.
Thanks to Aleksey Sanin for suggesting this.
 * Fixed deadlock in nedprealloc() due to unnecessary locking of preferred
thread mspace when mspace_realloc() always uses the original block's mspace
anyway. Thanks to Aleksey Sanin for reporting this.
 * Made some speed improvements by hacking mspace_malloc() to no longer lock
its mspace, thus allowing the recursive mutex implementation to be removed
with an associated speed increase. Thanks to Aleksey Sanin for suggesting this.
 * Fixed a bug where allocating mspaces overran its max limit. Thanks to
Aleksey Sanin for reporting this.

v1.03 10th July 2006:
 * Fixed memory corruption bug in threadcache code which only appeared with >4
threads and in heavy use of the threadcache.

v1.02 15th May 2006:
 * Integrated dlmalloc v2.8.4, fixing the win32 memory release problem and
improving performance still further. Speed is now up to twice the speed of v1.01
(average is 67% faster).
 * Fixed win32 critical section implementation. Thanks to Pavel Kuznetsov
for reporting this.
 * Fixed failure to lock an mspace if all mspaces were locked. Thanks to
Pavel Kuznetsov for reporting this.
 * Added Apple Mac OS X support.

v1.01 24th February 2006:
 * Fixed multiprocessor scaling problems by removing sources of cache sloshing.
 * Earl Chew <earl_chew <at> agilent <dot> com> sent patches for the following:
   1. size2binidx() wasn't working for default code path (non x86)
   2. Fixed failure to release mspace lock under certain circumstances which
      caused a deadlock

v1.00 1st January 2006:
 * First release