linux-2.6
16 years agoIPC/shared memory: introduce shmctl_down
Pierre Peiffer [Tue, 29 Apr 2008 08:00:47 +0000 (01:00 -0700)] 
IPC/shared memory: introduce shmctl_down

Currently, the way the different commands are handled in sys_shmctl introduces
some duplicated code.

This patch introduces the shmctl_down function to handle all the commands
requiring the rwmutex to be taken in write mode (ie IPC_SET and IPC_RMID for
now).  It is the equivalent function of semctl_down for shared memory.

This removes some duplicated code for handling these both commands and
harmonizes the way they are handled among all IPCs.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoIPC/semaphores: code factorisation
Pierre Peiffer [Tue, 29 Apr 2008 08:00:46 +0000 (01:00 -0700)] 
IPC/semaphores: code factorisation

Trivial patch which adds some small locking functions and makes use of them to
factorize some part of the code and to make it cleaner.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoipc: re-enable msgmni automatic recomputing msgmni if set to negative
Nadia Derbey [Tue, 29 Apr 2008 08:00:45 +0000 (01:00 -0700)] 
ipc: re-enable msgmni automatic recomputing msgmni if set to negative

The enhancement as asked for by Yasunori: if msgmni is set to a negative
value, register it back into the ipcns notifier chain.

A new interface has been added to the notification mechanism:
notifier_chain_cond_register() registers a notifier block only if not already
registered.  With that new interface we avoid taking care of the states
changes in procfs.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoipc: do not recompute msgmni anymore if explicitly set by user
Nadia Derbey [Tue, 29 Apr 2008 08:00:44 +0000 (01:00 -0700)] 
ipc: do not recompute msgmni anymore if explicitly set by user

Make msgmni not recomputed anymore upon ipc namespace creation / removal or
memory add/remove, as soon as it has been set from userland.

As soon as msgmni is explicitly set via procfs or sysctl(), the associated
callback routine is unregistered from the ipc namespace notifier chain.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoipc: recompute msgmni on ipc namespace creation/removal
Nadia Derbey [Tue, 29 Apr 2008 08:00:44 +0000 (01:00 -0700)] 
ipc: recompute msgmni on ipc namespace creation/removal

Introduce a notification mechanism that aims at recomputing msgmni each time
an ipc namespace is created or removed.

The ipc namespace notifier chain already defined for memory hotplug management
is used for that purpose too.

Each time a new ipc namespace is allocated or an existing ipc namespace is
removed, the ipcns notifier chain is notified.  The callback routine for each
registered ipc namespace is then activated in order to recompute msgmni for
that namespace.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoipc: invoke the ipcns notifier chain as a work item
Nadia Derbey [Tue, 29 Apr 2008 08:00:43 +0000 (01:00 -0700)] 
ipc: invoke the ipcns notifier chain as a work item

Make the memory hotplug chain's mutex held for a shorter time: when memory is
offlined or onlined a work item is added to the global workqueue.  When the
work item is run, it notifies the ipcns notifier chain with the
IPCNS_MEMCHANGED event.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoipc: recompute msgmni on memory add / remove
Nadia Derbey [Tue, 29 Apr 2008 08:00:42 +0000 (01:00 -0700)] 
ipc: recompute msgmni on memory add / remove

Introduce the registration of a callback routine that recomputes msg_ctlmni
upon memory add / remove.

A single notifier block is registered in the hotplug memory chain for all the
ipc namespaces.

Since the ipc namespaces are not linked together, they have their own
notification chain: one notifier_block is defined per ipc namespace.

Each time an ipc namespace is created (removed) it registers (unregisters) its
notifier block in (from) the ipcns chain.  The callback routine registered in
the memory chain invokes the ipcns notifier chain with the IPCNS_LOWMEM event.
 Each callback routine registered in the ipcns namespace, in turn, recomputes
msgmni for the owning namespace.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoipc: define the slab_memory_callback priority as a constant
Nadia Derbey [Tue, 29 Apr 2008 08:00:41 +0000 (01:00 -0700)] 
ipc: define the slab_memory_callback priority as a constant

This is a trivial patch that defines the priority of slab_memory_callback in
the callback chain as a constant.  This is to prepare for next patch in the
series.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoipc: scale msgmni to the number of ipc namespaces
Nadia Derbey [Tue, 29 Apr 2008 08:00:40 +0000 (01:00 -0700)] 
ipc: scale msgmni to the number of ipc namespaces

Since all the namespaces see the same amount of memory (the total one) this
patch introduces a new variable that counts the ipc namespaces and divides
msg_ctlmni by this counter.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoipc: scale msgmni to the amount of lowmem
Nadia Derbey [Tue, 29 Apr 2008 08:00:39 +0000 (01:00 -0700)] 
ipc: scale msgmni to the amount of lowmem

On large systems we'd like to allow a larger number of message queues.  In
some cases up to 32K.  However simply setting MSGMNI to a larger value may
cause problems for smaller systems.

The first patch of this series introduces a default maximum number of message
queue ids that scales with the amount of lowmem.

Since msgmni is per namespace and there is no amount of memory dedicated to
each namespace so far, the second patch of this series scales msgmni to the
number of ipc namespaces too.

Since msgmni depends on the amount of memory, it becomes necessary to
recompute it upon memory add/remove.  In the 4th patch, memory hotplug
management is added: a notifier block is registered into the memory hotplug
notifier chain for the ipc subsystem.  Since the ipc namespaces are not linked
together, they have their own notification chain: one notifier_block is
defined per ipc namespace.  Each time an ipc namespace is created (removed) it
registers (unregisters) its notifier block in (from) the ipcns chain.  The
callback routine registered in the memory chain invokes the ipcns notifier
chain with the IPCNS_MEMCHANGE event.  Each callback routine registered in the
ipcns namespace, in turn, recomputes msgmni for the owning namespace.

The 5th patch makes it possible to keep the memory hotplug notifier chain's
lock for a lesser amount of time: instead of directly notifying the ipcns
notifier chain upon memory add/remove, a work item is added to the global
workqueue.  When activated, this work item is the one who notifies the ipcns
notifier chain.

Since msgmni depends on the number of ipc namespaces, it becomes necessary to
recompute it upon ipc namespace creation / removal.  The 6th patch uses the
ipc namespace notifier chain for that purpose: that chain is notified each
time an ipc namespace is created or removed.  This makes it possible to
recompute msgmni for all the namespaces each time one of them is created or
removed.

When msgmni is explicitely set from userspace, we should avoid recomputing it
upon memory add/remove or ipcns creation/removal.  This is what the 7th patch
does: it simply unregisters the ipcns callback routine as soon as msgmni has
been changed from procfs or sysctl().

Even if msgmni is set by hand, it should be possible to make it back
automatically recomputed upon memory add/remove or ipcns creation/removal.
This what is achieved in patch 8: if set to a negative value, msgmni is added
back to the ipcns notifier chain, making it automatically recomputed again.

This patch:

Compute msg_ctlmni to make it scale with the amount of lowmem.  msg_ctlmni is
now set to make the message queues occupy 1/32 of the available lowmem.

Some cleaning has also been done for the MSGPOOL constant: the msgctl man page
says it's not used, but it also defines it as a size in bytes (the code
expresses it in Kbytes).

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoIPC: use ipc_buildid() directly from ipc_addid()
Pierre Peiffer [Tue, 29 Apr 2008 08:00:35 +0000 (01:00 -0700)] 
IPC: use ipc_buildid() directly from ipc_addid()

By continuing to consolidate a little the IPC code, each id can be built
directly in ipc_addid() instead of having it built from each callers of
ipc_addid()

And I also remove shm_addid() in order to have, as much as possible, the
same code for shm/sem/msg.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agodoc: fix DMA-API function parameters
Randy Dunlap [Tue, 29 Apr 2008 08:00:35 +0000 (01:00 -0700)] 
doc: fix DMA-API function parameters

Fix kernel bugzilla #10388.

DMA-API.txt has wrong argument type for some functions.  It uses struct device
but should use struct pci_dev.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoIB: expand ib_umem_get() prototype
Arthur Kepner [Tue, 29 Apr 2008 08:00:34 +0000 (01:00 -0700)] 
IB: expand ib_umem_get() prototype

Add a new parameter, dmasync, to the ib_umem_get() prototype.  Use dmasync = 1
when mapping user-allocated CQs with ib_umem_get().

Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agodma/ia64: update ia64 machvecs, swiotlb.c
Arthur Kepner [Tue, 29 Apr 2008 08:00:32 +0000 (01:00 -0700)] 
dma/ia64: update ia64 machvecs, swiotlb.c

Change all ia64 machvecs to use the new dma_*map*_attrs() interfaces.
Implement the old dma_*map_*() interfaces in terms of the corresponding new
interfaces.  For ia64/sn, make use of one dma attribute,
DMA_ATTR_WRITE_BARRIER.  Introduce swiotlb_*map*_attrs() functions.

Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agodma: document dma_*map*_attrs() interfaces
Arthur Kepner [Tue, 29 Apr 2008 08:00:31 +0000 (01:00 -0700)] 
dma: document dma_*map*_attrs() interfaces

Document the new dma_*map*_attrs() functions.

[markn@au1.ibm.com: fix up for dma-add-dma_map_attrs-interfaces and update docs]
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agodma: add dma_*map*_attrs() interfaces
Arthur Kepner [Tue, 29 Apr 2008 08:00:30 +0000 (01:00 -0700)] 
dma: add dma_*map*_attrs() interfaces

Introduce new interfaces, dma_*map*_attrs(), for passing architecture-specific
attributes when memory is mapped and unmapped for DMA.  Give the interfaces
default implementations which ignore attributes.  Also introduce the
dma_{set|get}_attr() interfaces for setting and retrieving individual
attributes.  Define one attribute, DMA_ATTR_WRITE_BARRIER, in anticipation of
its use by ia64/sn.  Select whether architectures implement arch-specific
versions of the dma_*map*_attrs() interfaces via HAVE_DMA_ATTRS in Kconfig.

[markn@au1.ibm.com: dma_{set,get}_attr() have to be static inline]
Signed-off-by: Arthur Kepner <akepner@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Roland Dreier <rdreier@cisco.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agosimplify cpu_hotplug_begin()/put_online_cpus()
Oleg Nesterov [Tue, 29 Apr 2008 08:00:29 +0000 (01:00 -0700)] 
simplify cpu_hotplug_begin()/put_online_cpus()

cpu_hotplug_begin() must be always called under cpu_add_remove_lock, this
means that only one process can be cpu_hotplug.active_writer.  So we don't
need the cpu_hotplug.writer_queue, we can wake up the ->active_writer
directly.

Also, fix the comment.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Acked-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocleanup_workqueue_thread: remove the unneeded "cpu" parameter
Oleg Nesterov [Tue, 29 Apr 2008 08:00:28 +0000 (01:00 -0700)] 
cleanup_workqueue_thread: remove the unneeded "cpu" parameter

cleanup_workqueue_thread() doesn't need the second argument, remove it.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoworkqueues: shrink cpu_populated_map when CPU dies
Oleg Nesterov [Tue, 29 Apr 2008 08:00:27 +0000 (01:00 -0700)] 
workqueues: shrink cpu_populated_map when CPU dies

When cpu_populated_map was introduced, it was supposed that cwq->thread can
survive after CPU_DEAD, that is why we never shrink cpu_populated_map.

This is not very nice, we can safely remove the already dead CPU from the map.
 The only required change is that destroy_workqueue() must hold the hotplug
lock until it destroys all cwq->thread's, to protect the cpu_populated_map.
We could make the local copy of cpu mask and drop the lock, but
sizeof(cpumask_t) may be very large.

Also, fix the comment near queue_work().  Unless _cpu_down() happens we do
guarantee the cpu-affinity of the work_struct, and we have users which rely on
this.

[akpm@linux-foundation.org: repair comment]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCpuset hardwall flag: add a mem_hardwall flag to cpusets
Paul Menage [Tue, 29 Apr 2008 08:00:26 +0000 (01:00 -0700)] 
Cpuset hardwall flag: add a mem_hardwall flag to cpusets

This flag provides the hardwalling properties of mem_exclusive, without
enforcing the exclusivity.  Either mem_hardwall or mem_exclusive is sufficient
to prevent GFP_KERNEL allocations from passing outside the cpuset's assigned
nodes.

Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCpuset hardwall flag: switch cpusets to use the bulk cgroup_add_files() API
Paul Menage [Tue, 29 Apr 2008 08:00:26 +0000 (01:00 -0700)] 
Cpuset hardwall flag: switch cpusets to use the bulk cgroup_add_files() API

Currently the cpusets mem_exclusive flag is overloaded to mean both
"no-overlapping" and "no GFP_KERNEL allocations outside this cpuset".

These patches add a new mem_hardwall flag with just the allocation restriction
part of the mem_exclusive semantics, without breaking backwards-compatibility
for those who continue to use just mem_exclusive.  Additionally, the cgroup
control file registration for cpusets is cleaned up to reduce boilerplate.

This patch:

This change tidies up the cpusets control file definitions, and reduces the
amount of boilerplate required to add/change control files in the future.

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agokernel/cpuset.c: make 3 functions static
Adrian Bunk [Tue, 29 Apr 2008 08:00:25 +0000 (01:00 -0700)] 
kernel/cpuset.c: make 3 functions static

Make the following needlessly global functions static:

- cpuset_test_cpumask()
- cpuset_change_cpumask()
- cpuset_do_move_task()

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomemcg: remove redundant initialization in mem_cgroup_create()
Li Zefan [Tue, 29 Apr 2008 08:00:24 +0000 (01:00 -0700)] 
memcg: remove redundant initialization in mem_cgroup_create()

*mem has been zeroed, that means mem->info has already been filled with 0.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomemcgroup: use vmalloc for mem_cgroup allocation
KAMEZAWA Hiroyuki [Tue, 29 Apr 2008 08:00:24 +0000 (01:00 -0700)] 
memcgroup: use vmalloc for mem_cgroup allocation

On ia64, this kmalloc() requires order-4 pages.  But this is not necessary to
be physically contiguous.  For big mem_cgroup, vmalloc is better.  For small
ones, kmalloc is used.

[akpm@linux-foundation.org: simplification]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomemcgroup: make the memory controller more desktop responsive
Balbir Singh [Tue, 29 Apr 2008 08:00:23 +0000 (01:00 -0700)] 
memcgroup: make the memory controller more desktop responsive

This patch makes the memory controller more responsive on my desktop.

1. Set all cached pages as inactive.  We were by default marking all pages
   as active, thus forcing us to go through two passes for reclaiming pages

2. Remove congestion_wait(), since we already have that logic in
   do_try_to_free_pages()

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomemcg: remove redundant function calls
KAMEZAWA Hiroyuki [Tue, 29 Apr 2008 08:00:22 +0000 (01:00 -0700)] 
memcg: remove redundant function calls

remove_list/add_list uses page_cgroup_zoneinfo() in it.

So, it's called twice before and after lock.

mz = page_cgroup_zoneinfo();
lock();
mz = page_cgroup_zoneinfo();
....
unlock();

And address of mz never changes.

This is not good. This patch fixes this behavior.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomemcgroup: implement failcounter reset
Pavel Emelyanov [Tue, 29 Apr 2008 08:00:21 +0000 (01:00 -0700)] 
memcgroup: implement failcounter reset

This is a very common requirement from people using the resource accounting
facilities (not only memcgroup but also OpenVZ beancounters).  They want to
put the cgroup in an initial state without re-creating it.

For example after re-configuring a group people want to observe how this new
configuration fits the group needs without saving the previous failcnt value.

Merge two resets into one mem_cgroup_reset() function to demonstrate how
multiplexing work.

Besides, I have plans to move the files, that correspond to res_counter to the
res_counter.c file and somehow "import" them into controller.  I don't know
how to make it gracefully yet, but merging resets of max_usage and failcnt in
one function will be there for sure.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomemcgroup: use triggers in force_empty and max_usage files
Pavel Emelyanov [Tue, 29 Apr 2008 08:00:20 +0000 (01:00 -0700)] 
memcgroup: use triggers in force_empty and max_usage files

These two files are essentially event callbacks.  They do not care about the
contents of the string, but only about the fact of the write itself.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomemcgroup: move memory controller allocations to their own slabs
Balbir Singh [Tue, 29 Apr 2008 08:00:19 +0000 (01:00 -0700)] 
memcgroup: move memory controller allocations to their own slabs

Move the memory controller data structure page_cgroup to its own slab cache.
It saves space on the system, allocations are not necessarily pushed to order
of 2 and should provide performance benefits.  Users who disable the memory
controller can also double check that the memory controller is not allocating
page_cgroup's.

NOTE: Hugh Dickins brought up the issue of whether we want to mark page_cgroup
as __GFP_MOVABLE or __GFP_RECLAIMABLE.  I don't think there is an easy answer
at the moment.  page_cgroup's are associated with user pages, they can be
reclaimed once the user page has been reclaimed, so it might make sense to
mark them as __GFP_RECLAIMABLE.  For now, I am leaving the marking to default
values that the slab allocator uses.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomemcgroups: add a document describing the resource counter abstraction
Pavel Emelyanov [Tue, 29 Apr 2008 08:00:18 +0000 (01:00 -0700)] 
memcgroups: add a document describing the resource counter abstraction

The resource counter is supposed to facilitate the resource accounting of
arbitrary resource (and it already does this for memory controller).

However, it is about to be used in other resources controllers (swap, kernel
memory, networking, etc), so provide a doc describing how to work with it.
This will eliminate all the possible future duplications in the appropriate
controllers' docs.

Fixed errors pointed out by Randy.

[akpm@linux-foundation.org: fix documentation tpyo]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomemcgroup: add the max_usage member on the res_counter
Pavel Emelyanov [Tue, 29 Apr 2008 08:00:17 +0000 (01:00 -0700)] 
memcgroup: add the max_usage member on the res_counter

This field is the maximal value of the usage one since the counter creation
(or since the latest reset).

To reset this to the usage value simply write anything to the appropriate
cgroup file.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroups: add an owner to the mm_struct
Balbir Singh [Tue, 29 Apr 2008 08:00:16 +0000 (01:00 -0700)] 
cgroups: add an owner to the mm_struct

Remove the mem_cgroup member from mm_struct and instead adds an owner.

This approach was suggested by Paul Menage.  The advantage of this approach
is that, once the mm->owner is known, using the subsystem id, the cgroup
can be determined.  It also allows several control groups that are
virtually grouped by mm_struct, to exist independent of the memory
controller i.e., without adding mem_cgroup's for each controller, to
mm_struct.

A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.

This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes.  The mm_cgroup_changed callback is called with the task_lock() of
the new task held and is called just prior to changing the mm->owner.

I am indebted to Paul Menage for the several reviews of this patchset and
helping me make it lighter and simpler.

This patch was tested on a powerpc box, it was compiled with both the
MM_OWNER config turned on and off.

After the thread group leader exits, it's moved to init_css_state by
cgroup_exit(), thus all future charges from runnings threads would be
redirected to the init_css_set's subsystem.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>,
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroups: introduce cft->read_seq()
Serge E. Hallyn [Tue, 29 Apr 2008 08:00:14 +0000 (01:00 -0700)] 
cgroups: introduce cft->read_seq()

Introduce a read_seq() helper in cftype, which uses seq_file to print out
lists.  Use it in the devices cgroup.  Also split devices.allow into two
files, so now devices.deny and devices.allow are the ones to use to manipulate
the whitelist, while devices.list outputs the cgroup's current whitelist.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroups: remove the css_set linked-list
Li Zefan [Tue, 29 Apr 2008 08:00:13 +0000 (01:00 -0700)] 
cgroups: remove the css_set linked-list

Now we can run through the hash table instead of running through the
linked-list.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroups: simplify init_subsys()
Li Zefan [Tue, 29 Apr 2008 08:00:13 +0000 (01:00 -0700)] 
cgroups: simplify init_subsys()

We are at system boot and there is only 1 cgroup group (i,e, init_css_set), so
we don't need to run through the css_set linked list.  Neither do we need to
run through the task list, since no processes have been created yet.

Also referring to a comment in cgroup.h:

struct css_set
{
...
/*
 * Set of subsystem states, one for each subsystem. This array
 * is immutable after creation apart from the init_css_set
 * during subsystem registration (at boot time).
 */
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
}

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroups: use a hash table for css_set finding
Li Zefan [Tue, 29 Apr 2008 08:00:11 +0000 (01:00 -0700)] 
cgroups: use a hash table for css_set finding

When we attach a process to a different cgroup, the css_set linked-list will
be run through to find a suitable existing css_set to use.  This patch
implements a hash table for better performance.

The following benchmarks have been tested:

For N in 1, 5, 10, 50, 100, 500, 1000, create N cgroups with one sleeping
task in each, and then move an additional task through each cgroup in
turn.

Here is a test result:

N Loop orig - Time(s) hash - Time(s)
----------------------------------------------
1 10000 1.201231728 1.196311177
5 2000 1.065743872 1.040566424
10 1000 0.991054735 0.986876440
50 200 0.976554203 0.969608733
100 100 0.998504680 0.969218270
500 20 1.157347764 0.962602963
1000 10 1.619521852 1.085140172

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroups: implement device whitelist
Serge E. Hallyn [Tue, 29 Apr 2008 08:00:10 +0000 (01:00 -0700)] 
cgroups: implement device whitelist

Implement a cgroup to track and enforce open and mknod restrictions on device
files.  A device cgroup associates a device access whitelist with each cgroup.
 A whitelist entry has 4 fields.  'type' is a (all), c (char), or b (block).
'all' means it applies to all types and all major and minor numbers.  Major
and minor are either an integer or * for all.  Access is a composition of r
(read), w (write), and m (mknod).

The root device cgroup starts with rwm to 'all'.  A child devcg gets a copy of
the parent.  Admins can then remove devices from the whitelist or add new
entries.  A child cgroup can never receive a device access which is denied its
parent.  However when a device access is removed from a parent it will not
also be removed from the child(ren).

An entry is added using devices.allow, and removed using
devices.deny.  For instance

echo 'c 1:3 mr' > /cgroups/1/devices.allow

allows cgroup 1 to read and mknod the device usually known as
/dev/null.  Doing

echo a > /cgroups/1/devices.deny

will remove the default 'a *:* mrw' entry.

CAP_SYS_ADMIN is needed to change permissions or move another task to a new
cgroup.  A cgroup may not be granted more permissions than the cgroup's parent
has.  Any task can move itself between cgroups.  This won't be sufficient, but
we can decide the best way to adequately restrict movement later.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix may-be-used-uninitialized warning]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: James Morris <jmorris@namei.org>
Looks-good-to: Pavel Emelyanov <xemul@openvz.org>
Cc: Daniel Hokka Zakrisson <daniel@hozac.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroups: add the trigger callback to struct cftype
Pavel Emelyanov [Tue, 29 Apr 2008 08:00:08 +0000 (01:00 -0700)] 
cgroups: add the trigger callback to struct cftype

Trigger callback can be used to receive a kick-up from the user space.  The
string written is ignored.

The cftype->private is used for multiplexing events.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Paul Menage <menage@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroup: switch to proc_create()
Li Zefan [Tue, 29 Apr 2008 08:00:08 +0000 (01:00 -0700)] 
cgroup: switch to proc_create()

There is a race between create_proc_entry() and the assignment of file ops.
proc_create() is invented to fix it.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroup: annotate cgroup_init_subsys with __init
Li Zefan [Tue, 29 Apr 2008 08:00:07 +0000 (01:00 -0700)] 
cgroup: annotate cgroup_init_subsys with __init

It is called by cgroup_init() and cgroup_init_early() only, which are
annotated with __init.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroups _s64 files: use read_s64/write_s64 in CFS cgroup for rt_runtime file
Paul Menage [Tue, 29 Apr 2008 08:00:06 +0000 (01:00 -0700)] 
CGroups _s64 files: use read_s64/write_s64 in CFS cgroup for rt_runtime file

This removes some filesystem boilerplate from the CFS cgroup subsystem.

Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroups _s64 files: add cgroups read_s64/write_s64 file methods
Paul Menage [Tue, 29 Apr 2008 08:00:06 +0000 (01:00 -0700)] 
CGroups _s64 files: add cgroups read_s64/write_s64 file methods

These patches add cgroups read_s64 and write_s64 control file methods (the
signed equivalent of read_u64/write_u64) and use them to implement the
cpu.rt_runtime_us control file in the CFS cgroup subsystem.

This patch:

These are the signed equivalents of the read_u64/write_u64 methods

Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: make CGROUP_DEBUG default to off
Paul Menage [Tue, 29 Apr 2008 08:00:05 +0000 (01:00 -0700)] 
CGroup API files: make CGROUP_DEBUG default to off

The cgroup debug subsystem isn't generally useful for users.  It should
default to "n".

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: move "releasable" to cgroup_debug subsystem
Paul Menage [Tue, 29 Apr 2008 08:00:04 +0000 (01:00 -0700)] 
CGroup API files: move "releasable" to cgroup_debug subsystem

The "releasable" control file provided by the cgroup framework exports the
state of a per-cgroup flag that's related to the notify-on-release feature.
This isn't really generally useful, unless you're trying to debug this
particular feature of cgroups.

This patch moves the "releasable" file to the cgroup_debug subsystem.

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: drop mem_cgroup_force_empty()
Paul Menage [Tue, 29 Apr 2008 08:00:03 +0000 (01:00 -0700)] 
CGroup API files: drop mem_cgroup_force_empty()

This function isn't needed - a NULL pointer in the cftype read function will
result in the same EINVAL response to userspace.

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: use cgroup map for memcontrol stats file
Paul Menage [Tue, 29 Apr 2008 08:00:02 +0000 (01:00 -0700)] 
CGroup API files: use cgroup map for memcontrol stats file

Remove the seq_file boilerplate used to construct the memcontrol stats map,
and instead use the new map representation for cgroup control files

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: add cgroup map data type
Paul Menage [Tue, 29 Apr 2008 08:00:01 +0000 (01:00 -0700)] 
CGroup API files: add cgroup map data type

Adds a new type of supported control file representation, a map from strings
to u64 values.

Each map entry is printed as a line in a similar format to /proc/vmstat, i.e.
"$key $value\n"

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: update cpusets to use cgroup structured file API
Paul Menage [Tue, 29 Apr 2008 08:00:00 +0000 (01:00 -0700)] 
CGroup API files: update cpusets to use cgroup structured file API

Many of the cpusets control files are simple integer values, which don't
require the overhead of memory allocations for reads and writes.

Move the handlers for these control files into cpuset_read_u64() and
cpuset_write_u64().

[akpm@linux-foundation.org: ad dmissing `break']
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: strip all trailing whitespace in cgroup_write_u64
Paul Menage [Tue, 29 Apr 2008 07:59:59 +0000 (00:59 -0700)] 
CGroup API files: strip all trailing whitespace in cgroup_write_u64

This removes the need for people to remember to pass the -n flag to echo when
writing values to cgroup control files.

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: use read_u64 in memory controller
Paul Menage [Tue, 29 Apr 2008 07:59:58 +0000 (00:59 -0700)] 
CGroup API files: use read_u64 in memory controller

Update the memory controller to use read_u64 for its limit/usage/failcnt
control files, calling the new res_counter_read_u64() function.

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: add res_counter_read_u64()
Paul Menage [Tue, 29 Apr 2008 07:59:58 +0000 (00:59 -0700)] 
CGroup API files: add res_counter_read_u64()

Adds a function for returning the value of a resource counter member, in a
form suitable for use in a cgroup read_u64 control file method.

Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoCGroup API files: rename read/write_uint methods to read_write_u64
Paul Menage [Tue, 29 Apr 2008 07:59:56 +0000 (00:59 -0700)] 
CGroup API files: rename read/write_uint methods to read_write_u64

Several people have justifiably complained that the "_uint" suffix is
inappropriate for functions that handle u64 values, so this patch just renames
all these functions and their users to have the suffic _u64.

[peterz@infradead.org: build fix]
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Li Zefan" <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "YAMAMOTO Takashi" <yamamoto@valinux.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroups: kernel/ns_cgroup.c should #include <linux/nsproxy.h>
Adrian Bunk [Tue, 29 Apr 2008 07:59:55 +0000 (00:59 -0700)] 
cgroups: kernel/ns_cgroup.c should #include <linux/nsproxy.h>

Every file should include the headers containing the externs its global
functions (in this case for ns_cgroup_clone()).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocgroup: fix sparse warning of shadow symbol in cgroup.c
Paul Jackson [Tue, 29 Apr 2008 07:59:55 +0000 (00:59 -0700)] 
cgroup: fix sparse warning of shadow symbol in cgroup.c

Fix a code warning: symbol 'p' shadows an earlier one

This is a reincarnation of Harvey Harrison's patch:
cpuset: sparse warnings in cpuset.c

Independently, Cliff Wickman moved the affected code,
from kernel/cpuset.c to kernel/cgroup.c, in his patch:
cpusets: update_cpumask revision

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Cliff Wickman <cpw@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agomake cgroup_enable_task_cg_lists() static
Adrian Bunk [Tue, 29 Apr 2008 07:59:54 +0000 (00:59 -0700)] 
make cgroup_enable_task_cg_lists() static

Make the needlessly global cgroup_enable_task_cg_lists() static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agox86: olpc: add One Laptop Per Child architecture support
Andres Salomon [Tue, 29 Apr 2008 07:59:53 +0000 (00:59 -0700)] 
x86: olpc: add One Laptop Per Child architecture support

This adds support for OLPC XO hardware.  Open Firmware on XOs don't contain
the VSA, so it is necessary to emulate the PCI BARs in the kernel.  This also
adds functionality for running EC commands, and a CONFIG_OLPC.

A number of OLPC drivers depend upon CONFIG_OLPC.

olpc_ec_timeout is a hack to work around Embedded Controller bugs.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: geode_has_vsa build fix]
[akpm@linux-foundation.org: olpc_register_battery_callback doesn't exist]
Signed-off-by: Andres Salomon <dilinger@debian.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@suse.de>
Cc: Jordan Crouse <jordan.crouse@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoeCryptfs: protect crypt_stat->flags in ecryptfs_open()
Michael Halcrow [Tue, 29 Apr 2008 07:59:52 +0000 (00:59 -0700)] 
eCryptfs: protect crypt_stat->flags in ecryptfs_open()

Make sure crypt_stat->flags is protected with a lock in ecryptfs_open().

Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoeCryptfs: make key module subsystem respect namespaces
Michael Halcrow [Tue, 29 Apr 2008 07:59:52 +0000 (00:59 -0700)] 
eCryptfs: make key module subsystem respect namespaces

Make eCryptfs key module subsystem respect namespaces.

Since I will be removing the netlink interface in a future patch, I just made
changes to the netlink.c code so that it will not break the build.  With my
recent patches, the kernel module currently defaults to the device handle
interface rather than the netlink interface.

[akpm@linux-foundation.org: export free_user_ns()]
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoeCryptfs: integrate eCryptfs device handle into the module.
Michael Halcrow [Tue, 29 Apr 2008 07:59:51 +0000 (00:59 -0700)] 
eCryptfs: integrate eCryptfs device handle into the module.

Update the versioning information.  Make the message types generic.  Add an
outgoing message queue to the daemon struct.  Make the functions to parse
and write the packet lengths available to the rest of the module.  Add
functions to create and destroy the daemon structs.  Clean up some of the
comments and make the code a little more consistent with itself.

[akpm@linux-foundation.org: printk fixes]
Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoeCryptfs: introduce device handle for userspace daemon communications
Michael Halcrow [Tue, 29 Apr 2008 07:59:50 +0000 (00:59 -0700)] 
eCryptfs: introduce device handle for userspace daemon communications

A regular device file was my real preference from the get-go, but I went with
netlink at the time because I thought it would be less complex for managing
send queues (i.e., just do a unicast and move on).  It turns out that we do
not really get that much complexity reduction with netlink, and netlink is
more heavyweight than a device handle.

In addition, the netlink interface to eCryptfs has been broken since 2.6.24.
I am assuming this is a bug in how eCryptfs uses netlink, since the other
in-kernel users of netlink do not seem to be having any problems.  I have had
one report of a user successfully using eCryptfs with netlink on 2.6.24, but
for my own systems, when starting the userspace daemon, the initial helo
message sent to the eCryptfs kernel module results in an oops right off the
bat.  I spent some time looking at it, but I have not yet found the cause.
The netlink interface breaking gave me the motivation to just finish my patch
to migrate to a regular device handle.  If I cannot find out soon why the
netlink interface in eCryptfs broke, I am likely to just send a patch to
disable it in 2.6.24 and 2.6.25.  I would like the device handle to be the
preferred means of communicating with the userspace daemon from 2.6.26 on
forward.

This patch:

Functions to facilitate reading and writing to the eCryptfs miscellaneous
device handle.  This will replace the netlink interface as the preferred
mechanism for communicating with the userspace eCryptfs daemon.

Each user has his own daemon, which registers itself by opening the eCryptfs
device handle.  Only one daemon per euid may be registered at any given time.
The eCryptfs module sends a message to a daemon by adding its message to the
daemon's outgoing message queue.  The daemon reads the device handle to get
the oldest message off the queue.

Incoming messages from the userspace daemon are immediately handled.  If the
message is a response, then the corresponding process that is blocked waiting
for the response is awakened.

Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoecryptfs: add missing lock around notify_change
Miklos Szeredi [Tue, 29 Apr 2008 07:59:48 +0000 (00:59 -0700)] 
ecryptfs: add missing lock around notify_change

Callers of notify_change() need to hold i_mutex.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoecryptfs: replace remaining __FUNCTION__ occurrences
Harvey Harrison [Tue, 29 Apr 2008 07:59:48 +0000 (00:59 -0700)] 
ecryptfs: replace remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoremove ecryptfs_header_cache_0
Adrian Bunk [Tue, 29 Apr 2008 07:59:47 +0000 (00:59 -0700)] 
remove ecryptfs_header_cache_0

Remove the no longer used ecryptfs_header_cache_0.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoxen: make blkif_getgeo static
Harvey Harrison [Tue, 29 Apr 2008 07:59:47 +0000 (00:59 -0700)] 
xen: make blkif_getgeo static

Introduced between 2.6.25-rc2 and -rc3
drivers/block/xen-blkfront.c:139:5: warning: symbol 'blkif_getgeo' was not declared. Should it be static?

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agovt: fix background color on line feed
Jan Engelhardt [Tue, 29 Apr 2008 07:59:46 +0000 (00:59 -0700)] 
vt: fix background color on line feed

A command that causes a line feed while a background color is active,
such as

perl -e 'print "x" x 60, "\e[44m", "x" x 40, "\e[0m\n"'
and
perl -e 'print "x" x 40, "\e[44m\n", "x" x 40, "\e[0m\n"'

causes the line that was started as a result of the line feed to be completely
filled with the currently active background color instead of the default
color.

When scrolling, part of the current screen is memcpy'd/memmove'd to the new
region, and the new line(s) that will appear as a result are cleared using
memset.  However, the lines are cleared with vc->vc_video_erase_char, causing
them to be colored with the currently active background color.  This is
different from X11 terminal emulators which always paint the new lines with
the default background color (e.g.  `xterm -bg black`).

The clear operation (\e[1J and \e[2J) also use vc_video_erase_char, so a new
vc->vc_scrl_erase_char is introduced with contains the erase character used
for scrolling, which is built from vc->vc_def_color instead of vc->vc_color.

Signed-off-by: Jan Engelhardt <jengelh@computergmbh.de>
Cc: "Antonino A. Daplas" <adaplas@pol.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agodirectly use kmalloc() and kfree() in init/initramfs.c
Thomas Petazzoni [Tue, 29 Apr 2008 07:59:43 +0000 (00:59 -0700)] 
directly use kmalloc() and kfree() in init/initramfs.c

Instead of using the malloc() and free() wrappers needed by the
lib/inflate.c code for allocations, simply use kmalloc() and kfree() in the
initramfs code.  This is needed for a further lib/inflate.c-related cleanup
patch that will remove the malloc() and free() functions.

Take that opportunity to remove the useless kmalloc() return value
cast.

Based on work done by Matt Mackall.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoisolate ratelimit from printk.c for other use
Dave Young [Tue, 29 Apr 2008 07:59:43 +0000 (00:59 -0700)] 
isolate ratelimit from printk.c for other use

Due to the rcupreempt.h WARN_ON trigged, I got 2G syslog file.  For some
serious complaining of kernel, we need repeat the warnings, so here I isolate
the ratelimit part of printk.c to a standalone file.

Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agovfs: fix unconditional write_super() call in file_fsync()
OGAWA Hirofumi [Tue, 29 Apr 2008 07:59:42 +0000 (00:59 -0700)] 
vfs: fix unconditional write_super() call in file_fsync()

We need to check ->s_dirt before calling write_super().  It became the cause
of an unneeded write.

This bug was noticed by Sudhanshu Saxena.

Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoxattr: add missing consts to function arguments
David Howells [Tue, 29 Apr 2008 07:59:41 +0000 (00:59 -0700)] 
xattr: add missing consts to function arguments

Add missing consts to xattr function arguments.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agovfs: remove lives_below_in_same_fs()
Jan Blunck [Tue, 29 Apr 2008 07:59:40 +0000 (00:59 -0700)] 
vfs: remove lives_below_in_same_fs()

Remove lives_below_in_same_fs() since is_subdir() from fs/dcache.c is
providing the same functionality.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agofs/inode.c: use hlist_for_each_entry()
Matthias Kaehlcke [Tue, 29 Apr 2008 07:59:40 +0000 (00:59 -0700)] 
fs/inode.c: use hlist_for_each_entry()

fs/inode.c: use hlist_for_each_entry() in find_inode() and find_inode_fast()

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Matthias Kaehlcke <matthias@kaehlcke.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agovfs: skip inodes without pages to free in drop_pagecache_sb()
Jan Kara [Tue, 29 Apr 2008 07:59:39 +0000 (00:59 -0700)] 
vfs: skip inodes without pages to free in drop_pagecache_sb()

Many inodes have no pagecache, so we can avoid lots of lock-takings.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agovfs: fix lock inversion in drop_pagecache_sb()
Jan Kara [Tue, 29 Apr 2008 07:59:37 +0000 (00:59 -0700)] 
vfs: fix lock inversion in drop_pagecache_sb()

Fix longstanding lock inversion in drop_pagecache_sb by dropping inode_lock
before calling __invalidate_mapping_pages().  We just have to make sure inode
won't go away from under us by keeping reference to it and putting the
reference only after we have safely resumed the scan of the inode list.  A bit
tricky but not too bad...

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: David Chinner <dgc@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoswiotlb: use iommu_is_span_boundary helper function
FUJITA Tomonori [Tue, 29 Apr 2008 07:59:36 +0000 (00:59 -0700)] 
swiotlb: use iommu_is_span_boundary helper function

iommu_is_span_boundary in lib/iommu-helper.c was exported for PARISC IOMMUs
(commit 3715863aa142c4f4c5208f5f3e5e9bac06006d2f).  SWIOTLB can use it instead
of the homegrown function.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agolib/swiotlb.c: cleanups
Andrew Morton [Tue, 29 Apr 2008 07:59:36 +0000 (00:59 -0700)] 
lib/swiotlb.c: cleanups

There's a pointlessly braced block of code in there.  Remove the braces and
save a tabstop.

Cc: Andi Kleen <ak@suse.de>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agofirmware loader: printk when requesting firmware
Ciaran McCreesh [Tue, 29 Apr 2008 07:59:35 +0000 (00:59 -0700)] 
firmware loader: printk when requesting firmware

Before requesting firmware, printk a message saying what we're requesting. This
makes it easier to see what's going on, and provides an explanation for the
huge silent delay that one would otherwise get after accidentally building
ipw2200 as a non-module.

Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoMAINTAINERS: clarify status of MN10300 mailing list as moderated
Robert P. J. Day [Tue, 29 Apr 2008 07:59:34 +0000 (00:59 -0700)] 
MAINTAINERS: clarify status of MN10300 mailing list as moderated

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agofdpic: check that the size returned by kernel_read() is what we asked for
David Howells [Tue, 29 Apr 2008 07:59:34 +0000 (00:59 -0700)] 
fdpic: check that the size returned by kernel_read() is what we asked for

Check that the size of the read returned by kernel_read() is what we asked
for.  If it isn't, then reject the binary as being a badly formatted.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agosmb.h: uses struct timespec but didn't include linux/time.h
Ilpo Järvinen [Tue, 29 Apr 2008 07:59:33 +0000 (00:59 -0700)] 
smb.h: uses struct timespec but didn't include linux/time.h

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoupdate checkpatch.pl to version 0.18
Andy Whitcroft [Tue, 29 Apr 2008 07:59:33 +0000 (00:59 -0700)] 
update checkpatch.pl to version 0.18

This version brings a few fixes for the extern checks, and a couple of
new checks.

Of note:
 - false is now recognised as a 0 assignment in static/external
   assignments,
 - printf format strings including %L are reported,
 - a number of fixes for the extern in .c file detector which had
   temporarily lost its ability to detect variables; undetected due to
   the loss of its test.

Andy Whitcroft (8):
      Version: 0.18
      false should trip 0 assignment checks
      tests: reinstate missing tests
      tests: allow specification of the file extension for a test
      fix extern checks for variables
      check for and report %Lu, %Ld, and %Li
      ensure we only start a statement on lines with some content
      extern spacing

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoupdate checkpatch.pl to version 0.17
Andy Whitcroft [Tue, 29 Apr 2008 07:59:32 +0000 (00:59 -0700)] 
update checkpatch.pl to version 0.17

This version brings improvements to external declaration detection, fixes to
quote tracking, fixes to unary tracking, some clarification of wording, and
the usual slew of fixes for false positives.

Of note:
 - much better unary tracking across preprocessor directives
 - UTF8 checks highlight the character at fault
 - widening of mutex detection

Andy Whitcroft (17):
      Version: 0.17
      values: __attribute__ carries through the previous type
      quotes: should only follow "positive" lines
      clarify the indent tabs over spaces wording
      loosen NR_CPUS check for array range initialisers
      detect external function declarations without an extern prefix
      function declaration arguments should be with the identifier
      DEFINE_MUTEX should report in line with struct mutex
      NR_CPUS is valid in preprocessor statements
      comment detection should not start on the @@ line
      types: add support for #undef
      tighten mutex/completion reports to usage
      allow export of function pointers
      values: preprocessor #define is out of line maintain values
      values: #define does not always have parentheses
      unary '*' may be const
      utf8 checks should report location of the invalid character

Wolfram Sang (1):
      make checkpatch.pl really skip <asm/irq.h>

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoscripts/Lindent: support gnu indent v2.2.10
Joe Perches [Tue, 29 Apr 2008 07:59:31 +0000 (00:59 -0700)] 
scripts/Lindent: support gnu indent v2.2.10

The new version of indent supports positioning labels in column 1
using "-il0"

http://www.nabble.com/Release-2.2.10-of-GNU-Indent-td15990700.html

Add "-il0" if indent version >= 2.2.10

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agodrivers/misc: elide a non-zero test on a result that is never 0
Julia Lawall [Tue, 29 Apr 2008 07:59:30 +0000 (00:59 -0700)] 
drivers/misc: elide a non-zero test on a result that is never 0

The function thermal_cooling_device_register always returns either a valid
pointer or a value made with ERR_PTR, so a test for non-zero on the result
will always succeed.

The problem was found using the following semantic match.
(http://www.emn.fr/x-info/coccinelle/)

//<smpl>
@a@
expression E, E1;
statement S,S1;
position p;
@@

E = thermal_cooling_device_register(...)
... when != E = E1
if@p (E) S else S1

@n@
position a.p;
expression E,E1;
statement S,S1;
@@

E = NULL
... when != E = E1
if@p (E) S else S1

@depends on !n@
expression E;
statement S,S1;
position a.p;
@@

* if@p (E)
  S else S1
//</smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Thomas Sujith <sujith.thomas@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agolists: add "const" qualifier to first arg of list_splice() operations
Robert P. J. Day [Tue, 29 Apr 2008 07:59:29 +0000 (00:59 -0700)] 
lists: add "const" qualifier to first arg of list_splice() operations

Since neither the list_splice() nor __list_splice() routines modify their
first argument, might as well declare them "const".

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agokbuild: move files that don't check __KERNEL__
Robert P. J. Day [Tue, 29 Apr 2008 07:59:28 +0000 (00:59 -0700)] 
kbuild: move files that don't check __KERNEL__

Move files that don't check __KERNEL__ from unifdef-y to header-y.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agokbuild: remove duplicate, conflicting entry for oom.h
Robert P. J. Day [Tue, 29 Apr 2008 07:59:28 +0000 (00:59 -0700)] 
kbuild: remove duplicate, conflicting entry for oom.h

oom.h is already tagged for unifdef'ing, so its entry as a simple exportable
header should be deleted.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agofs/binfmt_aout.c: use printk_ratelimit()
S.Caglar Onur [Tue, 29 Apr 2008 07:59:26 +0000 (00:59 -0700)] 
fs/binfmt_aout.c: use printk_ratelimit()

Use printk_ratelimit() instead of jiffies based arithmetic, suggested by Geert
Uytterhoeven

Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoRemove superfluous include of string.h from percpu.h
Robert P. J. Day [Tue, 29 Apr 2008 07:59:25 +0000 (00:59 -0700)] 
Remove superfluous include of string.h from percpu.h

There's nothing in percpu.h that requires an explicit inclusion of
string.h.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agokernel: explicitly include required header files under kernel/
Robert P. J. Day [Tue, 29 Apr 2008 07:59:25 +0000 (00:59 -0700)] 
kernel: explicitly include required header files under kernel/

Following an experimental deletion of the unnecessary directive

 #include <linux/slab.h>

from the header file <linux/percpu.h>, these files under kernel/ were exposed
as needing to include one of <linux/slab.h> or <linux/gfp.h>, so explicit
includes were added where necessary.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agobinfmt_misc.c: avoid potential kernel stack overflow
Pavel Emelyanov [Tue, 29 Apr 2008 07:59:24 +0000 (00:59 -0700)] 
binfmt_misc.c: avoid potential kernel stack overflow

This can be triggered with root help only, but...

Register the ":text:E::txt::/root/cat.txt:' rule in binfmt_misc (by root) and
try launching the cat.txt file (by anyone) :) The result is - the endless
recursion in the load_misc_binary -> open_exec -> load_misc_binary chain and
stack overflow.

There's a similar problem with binfmt_script, and there's a sh_bang memner on
linux_binprm structure to handle this, but simply raising this in binfmt_misc
may break some setups when the interpreter of some misc binaries is a script.

So the proposal is to turn sh_bang into a bit, add a new one (the misc_bang)
and raise it in load_misc_binary.  After this, even if we set up the misc ->
script -> misc loop for binfmts one of them will step on its own bang and
exit.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agokthread: call wake_up_process() without the lock being held
Dmitry Adamushko [Tue, 29 Apr 2008 07:59:23 +0000 (00:59 -0700)] 
kthread: call wake_up_process() without the lock being held

From the POV of synchronization, there should be no need to call
wake_up_process() with the 'kthread_create_lock' being held.

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agocodafs: fix build warning
Andrew Morton [Tue, 29 Apr 2008 07:59:22 +0000 (00:59 -0700)] 
codafs: fix build warning

powerpc:

fs/coda/coda_linux.c: In function 'coda_iattr_to_vattr':
fs/coda/coda_linux.c:137: warning: large integer implicitly truncated to unsigned type

Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agodrivers/block/floppy.c: replace init_module&cleanup_module with module_init&module_exit
Jon Schindler [Tue, 29 Apr 2008 07:59:21 +0000 (00:59 -0700)] 
drivers/block/floppy.c: replace init_module&cleanup_module with module_init&module_exit

Replace init_module and cleanup_module with static functions and
module_init/module_exit.

Signed-off-by: Jon Schindler <jkschind@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agosysrq: add show-backtrace-on-all-cpus function
Rik van Riel [Tue, 29 Apr 2008 07:59:21 +0000 (00:59 -0700)] 
sysrq: add show-backtrace-on-all-cpus function

SysRQ-P is not always useful on SMP systems, since it usually ends up showing
the backtrace of a CPU that is doing just fine, instead of the backtrace of
the CPU that is having problems.

This patch adds SysRQ show-all-cpus(L), which shows the backtrace of every
active CPU in the system.  It skips idle CPUs because some SMP systems are
just too large and we already know what the backtrace of the idle task looks
like.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Rik van Riel <riel@redhat.com>
Randy Dunlap <randy.dunlap@oracle.com>
Cc: <lwoodman@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agodrivers/misc: replace remaining __FUNCTION__ occurrences
Harvey Harrison [Tue, 29 Apr 2008 07:59:20 +0000 (00:59 -0700)] 
drivers/misc: replace remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agofirmware: replace remaining __FUNCTION__ occurrences
Harvey Harrison [Tue, 29 Apr 2008 07:59:19 +0000 (00:59 -0700)] 
firmware: replace remaining __FUNCTION__ occurrences

__FUNCTION__ is gcc-specific, use __func__

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Doug Warzecha <Douglas_Warzecha@dell.com>
Cc: Matt Domsch <Matt_Domsch@dell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoproper extern for late_time_init
Adrian Bunk [Tue, 29 Apr 2008 07:59:18 +0000 (00:59 -0700)] 
proper extern for late_time_init

Add a proper extern for late_time_init in include/linux/init.h

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoexec: remove argv_len from struct linux_binprm
Tetsuo Handa [Tue, 29 Apr 2008 07:59:17 +0000 (00:59 -0700)] 
exec: remove argv_len from struct linux_binprm

I noticed that 2.6.24.2 calculates bprm->argv_len at do_execve().  But it
doesn't update bprm->argv_len after "remove_arg_zero() +
copy_strings_kernel()" at load_script() etc.

audit_bprm() is called from search_binary_handler() and
search_binary_handler() is called from load_script() etc.  Thus, I think the
condition check

  if (bprm->argv_len > (audit_argv_kb << 10))
          return -E2BIG;

in audit_bprm() might return wrong result when strlen(removed_arg) !=
strlen(spliced_args).  Why not update bprm->argv_len at load_script() etc.  ?

By the way, 2.6.25-rc3 seems to not doing the condition check.  Is the field
bprm->argv_len no longer needed?

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Ollie Wild <aaw@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoRemove the macro get_personality
WANG Cong [Tue, 29 Apr 2008 07:59:15 +0000 (00:59 -0700)] 
Remove the macro get_personality

Remove the macro get_personality, use ->personality instead.

Cc: Christoph Hellwig <hch@infradead.org
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Bryan Wu <bryan.wu@analog.com>
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years agoMisc: phantom, consistent whitespace
jan sonnek [Tue, 29 Apr 2008 07:59:15 +0000 (00:59 -0700)] 
Misc: phantom, consistent whitespace

Make it consistent with the rest of the header.

Signed-off-by: jan sonnek <xsonnek@gmail.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>