linux-2.6
15 years agocpuset: avoid changing cpuset's mems when errno returned
Li Zefan [Thu, 2 Apr 2009 23:57:52 +0000 (16:57 -0700)] 
cpuset: avoid changing cpuset's mems when errno returned

When writing to cpuset.mems, cpuset has to update its mems_allowed before
calling update_tasks_nodemask(), but this function might return -ENOMEM.

To avoid this rare case, we allocate the memory before changing
mems_allowed, and then pass to update_tasks_nodemask().  Similar to what
update_cpumask() does.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocpuset: rewrite update_tasks_nodemask()
Li Zefan [Thu, 2 Apr 2009 23:57:51 +0000 (16:57 -0700)] 
cpuset: rewrite update_tasks_nodemask()

This patch uses cgroup_scan_tasks() to rebind tasks' vmas to new cpuset's
mems_allowed.

Not only simplify the code largely, but also avoid allocating an array to
hold mm pointers of all the tasks in the cpuset.  This array can be big
(size > PAGESIZE) if we have lots of tasks in that cpuset, thus has a
chance to fail the allocation when under memory stress.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: add 'data' field to struct cgroup_scanner
Li Zefan [Thu, 2 Apr 2009 23:57:50 +0000 (16:57 -0700)] 
cgroups: add 'data' field to struct cgroup_scanner

We need to pass some data to test_task() or process_task() in some cases.
Will be used later.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocpuset: fix possible races in cpu/memory hotplug
Li Zefan [Thu, 2 Apr 2009 23:57:49 +0000 (16:57 -0700)] 
cpuset: fix possible races in cpu/memory hotplug

Change to cpuset->cpus_allowed and cpuset->mems_allowed should be protected
by callback_mutex, otherwise the reader may read wrong cpus/mems. This is
cpuset's lock rule.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: cleanup cache_charge
Daisuke Nishimura [Thu, 2 Apr 2009 23:57:48 +0000 (16:57 -0700)] 
memcg: cleanup cache_charge

Current mem_cgroup_cache_charge is a bit complicated especially
in the case of shmem's swap-in.

This patch cleans it up by using try_charge_swapin and commit_charge_swapin.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: remove redundant message at swapon
KAMEZAWA Hiroyuki [Thu, 2 Apr 2009 23:57:47 +0000 (16:57 -0700)] 
memcg: remove redundant message at swapon

It's pointed out that swap_cgroup's message at swapon() is nonsense.
Because

  * It can be calculated very easily if all necessary information is
    written in Kconfig.

  * It's not necessary to annoying people at every swapon().

In other view, now, memory usage per swp_entry is reduced to 2bytes from
8bytes(64bit) and I think it's reasonably small.

Reported-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: use css id in swap cgroup for saving memory v5
KAMEZAWA Hiroyuki [Thu, 2 Apr 2009 23:57:45 +0000 (16:57 -0700)] 
cgroups: use css id in swap cgroup for saving memory v5

Try to use CSS ID for records in swap_cgroup.  By this, on 64bit machine,
size of swap_cgroup goes down to 2 bytes from 8bytes.

This means, when 2GB of swap is equipped, (assume the page size is 4096bytes)

From size of swap_cgroup = 2G/4k * 8 = 4Mbytes.
To   size of swap_cgroup = 2G/4k * 2 = 1Mbytes.

Reduction is large.  Of course, there are trade-offs.  This CSS ID will
add overhead to swap-in/swap-out/swap-free.

But in general,
  - swap is a resource which the user tend to avoid use.
  - If swap is never used, swap_cgroup area is not used.
  - Reading traditional manuals, size of swap should be proportional to
    size of memory. Memory size of machine is increasing now.

I think reducing size of swap_cgroup makes sense.

Note:
  - ID->CSS lookup routine has no locks, it's under RCU-Read-Side.
  - memcg can be obsolete at rmdir() but not freed while refcnt from
    swap_cgroup is available.

Changelog v4->v5:
 - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
Changlog ->v4:
 - fixed not configured case.
 - deleted unnecessary comments.
 - fixed NULL pointer bug.
 - fixed message in dmesg.

[nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: charge swapcache to proper memcg
Daisuke Nishimura [Thu, 2 Apr 2009 23:57:43 +0000 (16:57 -0700)] 
memcg: charge swapcache to proper memcg

memcg_test.txt says at 4.1:

This swap-in is one of the most complicated work. In do_swap_page(),
following events occur when pte is unchanged.

(1) the page (SwapCache) is looked up.
(2) lock_page()
(3) try_charge_swapin()
(4) reuse_swap_page() (may call delete_swap_cache())
(5) commit_charge_swapin()
(6) swap_free().

Considering following situation for example.

(A) The page has not been charged before (2) and reuse_swap_page()
    doesn't call delete_from_swap_cache().
(B) The page has not been charged before (2) and reuse_swap_page()
    calls delete_from_swap_cache().
(C) The page has been charged before (2) and reuse_swap_page() doesn't
    call delete_from_swap_cache().
(D) The page has been charged before (2) and reuse_swap_page() calls
    delete_from_swap_cache().

    memory.usage/memsw.usage changes to this page/swp_entry will be
 Case          (A)      (B)       (C)     (D)
         Event
       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
          ===========================================
          (3)        +1/+1    +1/+1     +1/+1   +1/+1
          (4)          -       0/ 0       -     -1/ 0
          (5)         0/-1     0/ 0     -1/-1    0/ 0
          (6)          -       0/-1       -      0/-1
          ===========================================
       Result         1/ 1     1/ 1      1/ 1    1/ 1

       In any cases, charges to this page should be 1/ 1.

In case of (D), mem_cgroup_try_get_from_swapcache() returns NULL
(because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
charges to the memcg("foo") to which the "current" belongs.
OTOH, "-1/0" at (4) and "0/-1" at (6) means uncharges from the memcg("baa")
to which the page has been charged.

So, if the "foo" and "baa" is different(for example because of task move),
this charge will be moved from "baa" to "foo".

I think this is an unexpected behavior.

This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
to return the memcg to which the swapcache has been charged if PCG_USED bit
is set.
IIUC, checking PCG_USED bit of swapcache is safe under page lock.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: remove mem_cgroup_reclaim_imbalance() remnants
KOSAKI Motohiro [Thu, 2 Apr 2009 23:57:41 +0000 (16:57 -0700)] 
memcg: remove mem_cgroup_reclaim_imbalance() remnants

commit 4f98a2fee8acdb4ac84545df98cccecfd130f8db (vmscan: split LRU lists
into anon & file sets) removed mem_cgroup_reclaim_imbalance(), but there
are some leftovers in memcontrol.h.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: remove mem_cgroup_calc_mapped_ratio()
KOSAKI Motohiro [Thu, 2 Apr 2009 23:57:40 +0000 (16:57 -0700)] 
memcg: remove mem_cgroup_calc_mapped_ratio()

Currently, mem_cgroup_calc_mapped_ratio() is unused at all.  it can be
removed and KAMEZAWA-san suggested it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: show memcg information during OOM
Balbir Singh [Thu, 2 Apr 2009 23:57:39 +0000 (16:57 -0700)] 
memcg: show memcg information during OOM

Add RSS and swap to OOM output from memcg

Display memcg values like failcnt, usage and limit when an OOM occurs due
to memcg.

Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
Daisuke Nishimura and KOSAKI Motohiro for review.

Sample output
-------------

Task in /a/x killed as a result of limit of /a
memory: usage 1048576kB, limit 1048576kB, failcnt 4183
memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0

[akpm@linux-foundation.org: compilation fix]
[akpm@linux-foundation.org: fix kerneldoc and whitespace]
[akpm@linux-foundation.org: add printk facility level]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: fix OOM killer under memcg
KAMEZAWA Hiroyuki [Thu, 2 Apr 2009 23:57:38 +0000 (16:57 -0700)] 
memcg: fix OOM killer under memcg

This patch tries to fix OOM Killer problems caused by hierarchy.
Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to
kill a task in memcg.

But, when hierarchy is used, it's broken and correct task cannot
be killed. For example, in following cgroup

/groupA/ hierarchy=1, limit=1G,
01 nolimit
02 nolimit
All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to
groupA's 1Gbytes but OOM Killer just kills tasks in groupA.

This patch provides makes the bad process be selected from all tasks
under hierarchy. BTW, currently, oom_jiffies is updated against groupA
in above case. oom_jiffies of tree should be updated.

To see how oom_jiffies is used, please check mem_cgroup_oom_called()
callers.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: const fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: fix shrinking memory to return -EBUSY by fixing retry algorithm
KAMEZAWA Hiroyuki [Thu, 2 Apr 2009 23:57:36 +0000 (16:57 -0700)] 
memcg: fix shrinking memory to return -EBUSY by fixing retry algorithm

As pointed out, shrinking memcg's limit should return -EBUSY after
reasonable retries.  This patch tries to fix the current behavior of
shrink_usage.

Before looking into "shrink should return -EBUSY" problem, we should fix
hierarchical reclaim code.  It compares current usage and current limit,
but it only makes sense when the kernel reclaims memory because hit
limits.  This is also a problem.

What this patch does are.

  1. add new argument "shrink" to hierarchical reclaim. If "shrink==true",
     hierarchical reclaim returns immediately and the caller checks the kernel
     should shrink more or not.
     (At shrinking memory, usage is always smaller than limit. So check for
      usage < limit is useless.)

  2. For adjusting to above change, 2 changes in "shrink"'s retry path.
     2-a. retry_count depends on # of children because the kernel visits
  the children under hierarchy one by one.
     2-b. rather than checking return value of hierarchical_reclaim's progress,
  compares usage-before-shrink and usage-after-shrink.
  If usage-before-shrink <= usage-after-shrink, retry_count is
  decremented.

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: hierarchical stat
KAMEZAWA Hiroyuki [Thu, 2 Apr 2009 23:57:35 +0000 (16:57 -0700)] 
memcg: hierarchical stat

Clean up memory.stat file routine and show "total" hierarchical stat.

This patch does
  - renamed get_all_zonestat to be get_local_zonestat.
  - remove old mem_cgroup_stat_desc, which is only for per-cpu stat.
  - add mcs_stat to cover both of per-cpu/per-lru stat.
  - add "total" stat of hierarchy (*)
  - add a callback system to scan all memcg under a root.
== "total" is added.
[kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
cache 0
rss 0
pgpgin 0
pgpgout 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 50331648
hierarchical_memsw_limit 9223372036854775807
total_cache 65536
total_rss 192512
total_pgpgin 218
total_pgpgout 155
total_inactive_anon 0
total_active_anon 135168
total_inactive_file 61440
total_active_file 4096
total_unevictable 0
==
(*) maybe the user can do calc hierarchical stat by his own program
   in userland but if it can be written in clean way, it's worth to be
   shown, I think.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: use CSS ID
KAMEZAWA Hiroyuki [Thu, 2 Apr 2009 23:57:33 +0000 (16:57 -0700)] 
memcg: use CSS ID

Assigning CSS ID for each memcg and use css_get_next() for scanning hierarchy.

Assume folloing tree.

group_A (ID=3)
/01 (ID=4)
   /0A (ID=7)
/02 (ID=10)
group_B (ID=5)
and task in group_A/01/0A hits limit at group_A.

reclaim will be done in following order (round-robin).
group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
-> group_A -> .....

Round robin by ID. The last visited cgroup is recorded and restart
from it when it start reclaim again.
(More smart algorithm can be implemented..)

No cgroup_mutex or hierarchy_mutex is required.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodevcgroup: avoid using cgroup_lock
Li Zefan [Thu, 2 Apr 2009 23:57:32 +0000 (16:57 -0700)] 
devcgroup: avoid using cgroup_lock

There is nothing special that has to be protected by cgroup_lock,
so introduce devcgroup_mtuex for it's own use.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodebug cgroup: remove unneeded cgroup_lock
Li Zefan [Thu, 2 Apr 2009 23:57:31 +0000 (16:57 -0700)] 
debug cgroup: remove unneeded cgroup_lock

Since we are in cgroup write handler, so the cgrp is valid, so we don't
have to hold cgroup_mutex when calling cgroup_task_count().  One similar
example is in cgroup_tasks_open().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: don't change release_agent when remount failed
Li Zefan [Thu, 2 Apr 2009 23:57:30 +0000 (16:57 -0700)] 
cgroups: don't change release_agent when remount failed

Remount can fail in either case:
  - wrong mount options is specified, or option 'noprefix' is changed.
  - a to-be-added subsys is already mounted/active.

When using remount to change 'release_agent', for the above former failure
case, remount will return errno with release_agent unchanged, but for the
latter case, remount will return EBUSY with relase_agent changed, which is
unexpected I think:

 # mount -t cgroup -o cpu xxx /cgrp1
 # mount -t cgroup -o cpuset,release_agent=agent1 yyy /cgrp2
 # cat /cgrp2/release_agent
 agent1
 # mount -t cgroup -o remount,cpuset,noprefix,release_agent=agent2 yyy /cgrp2
 mount: /cgrp2 not mounted already, or bad option
 # cat /cgrp2/release_agent
 agent1     <-- ok
 # mount -t cgroup -o remount,cpu,cpuset,release_agent=agent2 yyy /cgrp2
 mount: /cgrp2 is busy
 # cat /cgrp2/release_agent
 agent2     <-- unexpected!

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: show correct file mode
Li Zefan [Thu, 2 Apr 2009 23:57:29 +0000 (16:57 -0700)] 
cgroups: show correct file mode

We have some read-only files and write-only files, but currently they are
all set to 0644, which is counter-intuitive and cause trouble for some
cgroup tools like libcgroup.

This patch adds 'mode' to struct cftype to allow cgroup subsys to set it's
own files' file mode, and for the most cases cft->mode can be default to 0
and cgroup will figure out proper mode.

Acked-by: Paul Menage <menage@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: more documentation for remount and release_agent
Li Zefan [Thu, 2 Apr 2009 23:57:28 +0000 (16:57 -0700)] 
cgroups: more documentation for remount and release_agent

This won't remove cpuacct from the mounted hierachy:
 # mount -t cgroup -o cpu,cpuacct xxx /mnt
 # mount -o remount,cpu /mnt

Because for this usage mount(8) will append the new options to the original
options.

And this will get you right:
 # mount [-t cgroup] -o remount,cpu xxx /mnt

Also document how to specify or change release_agent.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agokernel/cgroup.c: kfree(NULL) is legal
Jesper Juhl [Thu, 2 Apr 2009 23:57:27 +0000 (16:57 -0700)] 
kernel/cgroup.c: kfree(NULL) is legal

Reduces object file size a bit:

Before:
$ size kernel/cgroup.o
   text    data     bss     dec     hex filename
  21593    7804    4924   34321    8611 kernel/cgroup.o
After:
$ size kernel/cgroup.o
   text    data     bss     dec     hex filename
  21537    7744    4924   34205    859d kernel/cgroup.o

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroup: fix frequent -EBUSY at rmdir
KAMEZAWA Hiroyuki [Thu, 2 Apr 2009 23:57:26 +0000 (16:57 -0700)] 
cgroup: fix frequent -EBUSY at rmdir

In following situation, with memory subsystem,

/groupA use_hierarchy==1
/01 some tasks
/02 some tasks
/03 some tasks
/04 empty

When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim
is triggered and the kernel walks tree under groupA. In this case,
rmdir /groupA/04 fails with -EBUSY frequently because of temporal
refcnt from the kernel.

In general. cgroup can be rmdir'd if there are no children groups and
no tasks. Frequent fails of rmdir() is not useful to users.
(And the reason for -EBUSY is unknown to users.....in most cases)

This patch tries to modify above behavior, by
- retries if css_refcnt is got by someone.
- add "return value" to pre_destroy() and allows subsystem to
  say "we're really busy!"

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroup: CSS ID support
KAMEZAWA Hiroyuki [Thu, 2 Apr 2009 23:57:25 +0000 (16:57 -0700)] 
cgroup: CSS ID support

Patch for Per-CSS(Cgroup Subsys State) ID and private hierarchy code.

This patch attaches unique ID to each css and provides following.

 - css_lookup(subsys, id)
   returns pointer to struct cgroup_subysys_state of id.
 - css_get_next(subsys, id, rootid, depth, foundid)
   returns the next css under "root" by scanning

When cgroup_subsys->use_id is set, an id for css is maintained.

The cgroup framework only parepares
- css_id of root css for subsys
- id is automatically attached at creation of css.
- id is *not* freed automatically. Because the cgroup framework
  don't know lifetime of cgroup_subsys_state.
  free_css_id() function is provided. This must be called by subsys.

There are several reasons to develop this.
- Saving space .... For example, memcg's swap_cgroup is array of
  pointers to cgroup. But it is not necessary to be very fast.
  By replacing pointers(8bytes per ent) to ID (2byes per ent), we can
  reduce much amount of memory usage.

- Scanning without lock.
  CSS_ID provides "scan id under this ROOT" function. By this, scanning
  css under root can be written without locks.
  ex)
  do {
rcu_read_lock();
next = cgroup_get_next(subsys, id, root, &found);
/* check sanity of next here */
css_tryget();
rcu_read_unlock();
id = found + 1
 } while(...)

Characteristics:
- Each css has unique ID under subsys.
- Lifetime of ID is controlled by subsys.
- css ID contains "ID" and "Depth in hierarchy" and stack of hierarchy
- Allowed ID is 1-65535, ID 0 is UNUSED ID.

Design Choices:
- scan-by-ID v.s. scan-by-tree-walk.
  As /proc's pid scan does, scan-by-ID is robust when scanning is done
  by following kind of routine.
  scan -> rest a while(release a lock) -> conitunue from interrupted
  memcg's hierarchical reclaim does this.

- When subsys->use_id is set, # of css in the system is limited to
  65535.

[bharata@linux.vnet.ibm.com: remove rcu_read_lock() from css_get_next()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: relax ns_can_attach checks to allow attaching to grandchild cgroups
Grzegorz Nosek [Thu, 2 Apr 2009 23:57:23 +0000 (16:57 -0700)] 
cgroups: relax ns_can_attach checks to allow attaching to grandchild cgroups

The ns_proxy cgroup allows moving processes to child cgroups only one
level deep at a time.  This commit relaxes this restriction and makes it
possible to attach tasks directly to grandchild cgroups, e.g.:

($pid is in the root cgroup)
echo $pid > /cgroup/CG1/CG2/tasks

Previously this operation would fail with -EPERM and would have to be
performed as two steps:
echo $pid > /cgroup/CG1/tasks
echo $pid > /cgroup/CG1/CG2/tasks

Also, the target cgroup no longer needs to be empty to move a task there.

Signed-off-by: Grzegorz Nosek <root@localdomain.pl>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: fix cgroup.h comments
Paul Menage [Thu, 2 Apr 2009 23:57:22 +0000 (16:57 -0700)] 
cgroups: fix cgroup.h comments

Fix the style of some multi-line comments in cgroup.h to match
Documentation/CodingStyle

Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodocumentation: fix unix_dgram_qlen description
Li Xiaodong [Thu, 2 Apr 2009 23:57:21 +0000 (16:57 -0700)] 
documentation: fix unix_dgram_qlen description

Previous description about system parameter in /proc/sys/net/unix/ is
wrong (or missed).  Simply add a new description about unix_dgram_qlen
according to latest kernel.

Signed-off-by: Li Xiaodong <lixd@cn.fujitsu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodocumentation: update Documentation/filesystem/proc.txt and Documentation/sysctls
Shen Feng [Thu, 2 Apr 2009 23:57:20 +0000 (16:57 -0700)] 
documentation: update Documentation/filesystem/proc.txt and Documentation/sysctls

Now /proc/sys is described in many places and much information is
redundant.  This patch updates the proc.txt and move the /proc/sys
desciption out to the files in Documentation/sysctls.

Details are:

merge
-  2.1  /proc/sys/fs - File system data
-  2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
-  2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface
with Documentation/sysctls/fs.txt.

remove
-  2.2  /proc/sys/fs/binfmt_misc - Miscellaneous binary formats
since it's not better then the Documentation/binfmt_misc.txt.

merge
-  2.3  /proc/sys/kernel - general kernel parameters
with Documentation/sysctls/kernel.txt

remove
-  2.5  /proc/sys/dev - Device specific parameters
since it's obsolete the sysfs is used now.

remove
-  2.6  /proc/sys/sunrpc - Remote procedure calls
since it's not better then the Documentation/sysctls/sunrpc.txt

move
-  2.7  /proc/sys/net - Networking stuff
-  2.9  Appletalk
-  2.10 IPX
to newly created Documentation/sysctls/net.txt.

remove
-  2.8  /proc/sys/net/ipv4 - IPV4 settings
since it's not better then the Documentation/networking/ip-sysctl.txt.

add
- Chapter 3 Per-Process Parameters
to descibe /proc/<pid>/xxx parameters.

Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodocumentation: ignore byproducts from latex
Henrik Austad [Thu, 2 Apr 2009 23:57:18 +0000 (16:57 -0700)] 
documentation: ignore byproducts from latex

When using 'make pdfdocs', auto-generated files should be ignored

Signed-off-by: Henrik Austad <henrik@austad.us>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agohppfs: hppfs_read_file() may return -ERROR
Roel Kluin [Thu, 2 Apr 2009 23:57:18 +0000 (16:57 -0700)] 
hppfs: hppfs_read_file() may return -ERROR

hppfs_read_file() may return (ssize_t) -ENOMEM, or -EFAULT.  When stored
in size_t 'count', these errors will not be noticed, a large value will be
added to *ppos.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext3: avoid false EIO errors
Jan Kara [Thu, 2 Apr 2009 23:57:17 +0000 (16:57 -0700)] 
ext3: avoid false EIO errors

Sometimes block_write_begin() can map buffers in a page but later we
fail to copy data into those buffers (because the source page has been
paged out in the mean time).  We then end up with !uptodate mapped
buffers.  To add a bit more to the confusion, block_write_end() does
not commit any data (and thus does not any mark buffers as uptodate) if
we didn't succeed with copying all the data.

Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new
aops) missed these cases and thus we were inserting non-uptodate
buffers to transaction's list which confuses JBD code and it reports IO
errors, aborts a transaction and generally makes users afraid about
their data ;-P.

This patch fixes the problem by reorganizing ext3_..._write_end() code
to first call block_write_end() to mark buffers with valid data
uptodate and after that we file only uptodate buffers to transaction's
lists.

We also fix a problem where we could leave blocks allocated beyond i_size
(i_disksize in fact) because of failed write. We now add inode to orphan
list when write fails (to be safe in case we crash) and then truncate blocks
beyond i_size in a separate transaction.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext3: return -EIO not -ESTALE on directory traversal through deleted inode
Bryan Donlan [Thu, 2 Apr 2009 23:57:15 +0000 (16:57 -0700)] 
ext3: return -EIO not -ESTALE on directory traversal through deleted inode

ext3_iget() returns -ESTALE if invoked on a deleted inode, in order to
report errors to NFS properly.  However, in ext[234]_lookup(), this
-ESTALE can be propagated to userspace if the filesystem is corrupted such
that a directory entry references a deleted inode.  This leads to a
misleading error message - "Stale NFS file handle" - and confusion on the
part of the admin.

The bug can be easily reproduced by creating a new filesystem, making a
link to an unused inode using debugfs, then mounting and attempting to ls
-l said link.

This patch thus changes ext3_lookup to return -EIO if it receives -ESTALE
from ext3_iget(), as ext3 does for other filesystem metadata corruption;
and also invokes the appropriate ext*_error functions when this case is
detected.

Signed-off-by: Bryan Donlan <bdonlan@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext3: use unsigned instead of int for type of blocksize in fs/ext3/namei.c
Wei Yongjun [Thu, 2 Apr 2009 23:57:14 +0000 (16:57 -0700)] 
ext3: use unsigned instead of int for type of blocksize in fs/ext3/namei.c

Use unsigned instead of int for the parameter which carries a blocksize.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agojbd: fix oops in jbd_journal_init_inode() on corrupted fs
Jan Kara [Thu, 2 Apr 2009 23:57:13 +0000 (16:57 -0700)] 
jbd: fix oops in jbd_journal_init_inode() on corrupted fs

On 32-bit system with CONFIG_LBD getblk can fail because provided block
number is too big. Make JBD gracefully handle that.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <dmaciejak@fortinet.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoext3: remove the BKL in ext3/ioctl.c
Cyrus Massoumi [Thu, 2 Apr 2009 23:57:12 +0000 (16:57 -0700)] 
ext3: remove the BKL in ext3/ioctl.c

Reformat ext3/ioctl.c to make it look more like ext4/ioctl.c and remove
the BKL around ext3_ioctl().

Signed-off-by: Cyrus Massoumi <cyrusm@gmx.net>
Cc: <linux-ext4@vger.kernel.org>
Acked-by: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agopnpbios: propagate kthread_run() error
Erik Ekman [Thu, 2 Apr 2009 23:57:09 +0000 (16:57 -0700)] 
pnpbios: propagate kthread_run() error

- Error code from kthread_run() is now returned in pnpbios_thread_init()

- Remove variable which always was 0.

Signed-off-by: Erik Ekman <erik@kryo.se>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agopnpbios: fix warning if CONFIG_HOTPLUG=n
Erik Ekman [Thu, 2 Apr 2009 23:57:08 +0000 (16:57 -0700)] 
pnpbios: fix warning if CONFIG_HOTPLUG=n

drivers/pnp/pnpbios/core.c: In function 'pnpbios_thread_init':
drivers/pnp/pnpbios/core.c:578: warning: unused variable 'task'

Signed-off-by: Erik Ekman <erik@kryo.se>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agospi-gpio: allow operation without CS signal
Michael Buesch [Thu, 2 Apr 2009 23:57:07 +0000 (16:57 -0700)] 
spi-gpio: allow operation without CS signal

Change spi-gpio so that it is possible to drive SPI communications over
GPIO without the need for a chipselect signal.

This is useful in very small setups where there's only one slave device
on the bus.

This patch does not affect existing setups.

I use this for a tiny communication channel between an embedded device and
a microcontroller.  There are not enough GPIOs available for chipselect
and it's not needed anyway in this case.

Signed-off-by: Michael Buesch <mb@bu3sch.de>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agogpio: gpio_{request,free}() now required (feature removal)
David Brownell [Thu, 2 Apr 2009 23:57:06 +0000 (16:57 -0700)] 
gpio: gpio_{request,free}() now required (feature removal)

We want to phase out the GPIO "autorequest" mechanism in gpiolib and
require all callers to use gpio_request().

 - Update feature-removal-schedule
 - Update the documentation now
 - Convert the relevant pr_warning() in gpiolib to a WARN()
   so folk using this mechanism get a noisy stack dump

Some drivers and board init code will probably need to change.
Implementations not using gpiolib will still be fine; they are already
required to implement gpio_{request,free}() stubs.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agogpiolib: allow GPIOs to be named
Daniel Silverstone [Thu, 2 Apr 2009 23:57:05 +0000 (16:57 -0700)] 
gpiolib: allow GPIOs to be named

Allow GPIOs in GPIOLIB chips to be named.  This name is then used when the
GPIO is exported to sysfs, although it could be used elsewhere if deemed
useful.

Signed-off-by: Daniel Silverstone <dsilvers@simtec.co.uk>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agortc: add m41t62 support to rtc-m41t80 driver
Daniel Glockner [Thu, 2 Apr 2009 23:57:03 +0000 (16:57 -0700)] 
rtc: add m41t62 support to rtc-m41t80 driver

Compared to the other supported chips, the m41t62 uses a different
register to set the square wave frequency.

Signed-off-by: Daniel Glockner <dg@emlix.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Brownell <david-b@pacbell.net>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agortc-v3020: add ability to access v3020 chip with GPIOs
Mike Rapoport [Thu, 2 Apr 2009 23:57:01 +0000 (16:57 -0700)] 
rtc-v3020: add ability to access v3020 chip with GPIOs

The v3020 RTC can be connected to GPIOs as well as to memory-like
interface.  Add ability to use GPIO bit-bang for v3020 read-write access.

[akpm@linux-foundation.org: fix off-by-one in error path]
Signed-off-by: Mike Rapoport <mike@compulab.co.il>
Acked-by: Alessandro Zummo <a.zummo@towertech.it>
Cc: David Brownell <david-b@pacbell.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoinitramfs: prevent initramfs printk message being split by messages from other code.
Simon Kitching [Thu, 2 Apr 2009 23:57:00 +0000 (16:57 -0700)] 
initramfs: prevent initramfs printk message being split by messages from other code.

initramfs uses printk without a linefeed, then does some work, then uses
printk to finish the message off.  However if some other code does a
printk in between, then the messages get mixed together.  Better for each
message to be an independent line...

Example of problem that this fixes:

    checking if image is initramfs...<7>Switched to high resolution mode on CPU 1
    Switched to high resolution mode on CPU 0
    it is

Signed-off-by: Simon Kitching <skitching@apache.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoSimplify copy_thread()
Alexey Dobriyan [Thu, 2 Apr 2009 23:56:59 +0000 (16:56 -0700)] 
Simplify copy_thread()

First argument unused since 2.3.11.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemory_accessor: implement the new memory_accessor interfaces for SPI EEPROMs
David Brownell [Thu, 2 Apr 2009 23:56:58 +0000 (16:56 -0700)] 
memory_accessor: implement the new memory_accessor interfaces for SPI EEPROMs

- Define new setup() hook to export the accessor
 - Implement accessor methods

Moves some error checking out of the sysfs interface code into the layer
below it, which is now shared by both sysfs and memory access code.

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemory_accessor: implement the new memory_accessor interface for I2C EEPROM
Kevin Hilman [Thu, 2 Apr 2009 23:56:57 +0000 (16:56 -0700)] 
memory_accessor: implement the new memory_accessor interface for I2C EEPROM

In the case of at24, the platform code registers a 'setup' callback with
the at24_platform_data.  When the at24 driver detects an EEPROM, it fills
out the read and write functions of the memory_accessor and calls the
setup callback passing the memory_accessor struct.  The platform code can
then use the read/write functions in the memory_accessor struct for
reading and writing the EEPROM.

Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
Cc: David Brownell <dbrownell@users.sourceforge.net>
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemory_accessor: new interface for reading/writing persistent memory
Kevin Hilman [Thu, 2 Apr 2009 23:56:56 +0000 (16:56 -0700)] 
memory_accessor: new interface for reading/writing persistent memory

Add an interface by which other kernel code can read/write persistent
memory such as I2C or SPI EEPROMs, or devices which provide NVRAM.  Use
cases include storage of board-specific configuration data like Ethernet
addresses and sensor calibrations.

Original idea, review and improvement suggestions by David Brownell.

Acked-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
Cc: Jean Delvare <khali@linux-fr.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoworkqueue: add to_delayed_work() helper function
Jean Delvare [Thu, 2 Apr 2009 23:56:54 +0000 (16:56 -0700)] 
workqueue: add to_delayed_work() helper function

It is a fairly common operation to have a pointer to a work and to need a
pointer to the delayed work it is contained in.  In particular, all
delayed works which want to rearm themselves will have to do that.  So it
would seem fair to offer a helper function for this operation.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Greg KH <greg@kroah.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agouml: fix warnings in kernel_execve
Miklos Szeredi [Thu, 2 Apr 2009 23:56:53 +0000 (16:56 -0700)] 
uml: fix warnings in kernel_execve

Fix the following warnings:

arch/um/kernel/syscall.c: In function 'kernel_execve':
arch/um/kernel/syscall.c:130: warning: passing argument 1 of 'um_execve' discards qualifiers from pointer target type
arch/um/kernel/syscall.c:130: warning: passing argument 2 of 'um_execve' discards qualifiers from pointer target type
arch/um/kernel/syscall.c:130: warning: passing argument 3 of 'um_execve' discards qualifiers from pointer target type

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agouml: fix link error from prefixing of i386 syscalls with ptregs_
Miklos Szeredi [Thu, 2 Apr 2009 23:56:51 +0000 (16:56 -0700)] 
uml: fix link error from prefixing of i386 syscalls with ptregs_

Fix the following link error:

arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x11c): undefined reference to `ptregs_fork'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x140): undefined reference to `ptregs_execve'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x2cc): undefined reference to `ptregs_iopl'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x2d8): undefined reference to `ptregs_vm86old'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x2f0): undefined reference to `ptregs_sigreturn'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x2f4): undefined reference to `ptregs_clone'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x3ac): undefined reference to `ptregs_vm86'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x3c8): undefined reference to `ptregs_rt_sigreturn'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x3fc): undefined reference to `ptregs_sigaltstack'
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x40c): undefined reference to `ptregs_vfork'

This was introduced by commit 253f29a4, "x86: pass in pt_regs pointer
for syscalls that need it"

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jeff Dike <jdike@addtoit.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agouml: fix compile error from net_device_ops conversion
Miklos Szeredi [Thu, 2 Apr 2009 23:56:49 +0000 (16:56 -0700)] 
uml: fix compile error from net_device_ops conversion

Fix the following compile error:

arch/um/drivers/net_kern.c: In function 'uml_inetaddr_event':
arch/um/drivers/net_kern.c:760: error: 'struct net_device' has no member named 'open'

This was introduced by commit 8bb95b39, "uml: convert network device
to netdevice ops".

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agofloppy: provide a PNP device table in the module.
Scott James Remnant [Thu, 2 Apr 2009 23:56:47 +0000 (16:56 -0700)] 
floppy: provide a PNP device table in the module.

The missing device table means that the floppy module is not auto-loaded,
even when the appropriate PNP device (0700) is found.

We don't actually use the table in the module, since the device doesn't
have a struct pnp_driver, but it's sufficient to cause an alias in the
module that udev/modprobe will use.

Signed-off-by: Scott James Remnant <scott@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Philippe De Muyter <phdm@macqel.be>
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agovfs: check bh->b_blocknr only if BH_Mapped is set
Nikanth Karthikesan [Thu, 2 Apr 2009 23:56:46 +0000 (16:56 -0700)] 
vfs: check bh->b_blocknr only if BH_Mapped is set

Check bh->b_blocknr only if BH_Mapped is set.

akpm: I doubt if b_blocknr is ever uninitialised here, but it could
conceivably cause a problem if we're doing a lookup for block zero.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomm: define a UNIQUE value for AS_UNEVICTABLE flag
Lee Schermerhorn [Thu, 2 Apr 2009 23:56:45 +0000 (16:56 -0700)] 
mm: define a UNIQUE value for AS_UNEVICTABLE flag

A new "address_space flag"--AS_MM_ALL_LOCKS--was defined to use the next
available AS flag while the Unevictable LRU was under development.  The
Unevictable LRU was using the same flag and "no one" noticed.  Current
mainline, since 2.6.28, has same value for two symbolic flag names.

So, define a unique flag value for AS_UNEVICTABLE--up close to the other
flags, [at the cost of an additional #ifdef] so we'll notice next time.
Note that #ifdef is not actually required, if we don't mind having the
unused flag value defined.

Replace #defines with an enum.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: <stable@kernel.org> [2.6.28.x, 2.6.29.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoadd fiemap.h to header-y
Eric Sandeen [Thu, 2 Apr 2009 23:56:44 +0000 (16:56 -0700)] 
add fiemap.h to header-y

Include fiemap.h in header-y; it defines the interface for the
FS_IOC_FIEMAP file mapping ioctl.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoMAINTAINERS: add hvc_console
Michael Ellerman [Thu, 2 Apr 2009 23:56:43 +0000 (16:56 -0700)] 
MAINTAINERS: add hvc_console

Add a MAINTAINERS entry for the hypervisor virtual console driver.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomm: do_xip_mapping_read: fix length calculation
Martin Schwidefsky [Thu, 2 Apr 2009 23:56:42 +0000 (16:56 -0700)] 
mm: do_xip_mapping_read: fix length calculation

The calculation of the value nr in do_xip_mapping_read is incorrect.  If
the copy required more than one iteration in the do while loop the copies
variable will be non-zero.  The maximum length that may be passed to the
call to copy_to_user(buf+copied, xip_mem+offset, nr) is len-copied but the
check only compares against (nr > len).

This bug is the cause for the heap corruption Carsten has been chasing
for so long:

*** glibc detected *** /bin/bash: free(): invalid next size (normal): 0x00000000800e39f0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x200000b9b44]
/lib64/libc.so.6(cfree+0x8e)[0x200000bdade]
/bin/bash(free_buffered_stream+0x32)[0x80050e4e]
/bin/bash(close_buffered_stream+0x1c)[0x80050ea4]
/bin/bash(unset_bash_input+0x2a)[0x8001c366]
/bin/bash(make_child+0x1d4)[0x8004115c]
/bin/bash[0x8002fc3c]
/bin/bash(execute_command_internal+0x656)[0x8003048e]
/bin/bash(execute_command+0x5e)[0x80031e1e]
/bin/bash(execute_command_internal+0x79a)[0x800305d2]
/bin/bash(execute_command+0x5e)[0x80031e1e]
/bin/bash(reader_loop+0x270)[0x8001efe0]
/bin/bash(main+0x1328)[0x8001e960]
/lib64/libc.so.6(__libc_start_main+0x100)[0x200000592a8]
/bin/bash(clearerr+0x5e)[0x8001c092]

With this bug fix the commit 0e4a9b59282914fe057ab17027f55123964bc2e2
"ext2/xip: refuse to change xip flag during remount with busy inodes" can
be removed again.

Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agorandom: align rekey_work's timer
Anton Blanchard [Thu, 2 Apr 2009 23:56:39 +0000 (16:56 -0700)] 
random: align rekey_work's timer

Align rekey_work. Even though it's infrequent, we may as well line it up.

Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomm: align vmstat_work's timer
Anton Blanchard [Thu, 2 Apr 2009 23:56:39 +0000 (16:56 -0700)] 
mm: align vmstat_work's timer

Even though vmstat_work is marked deferrable, there are still benefits to
aligning it.  For certain applications we want to keep OS jitter as low as
possible and aligning timers and work so they occur together can reduce
their overall impact.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agowriteback: guard against jiffies wraparound on inode->dirtied_when checks (try #3)
Jeff Layton [Thu, 2 Apr 2009 23:56:37 +0000 (16:56 -0700)] 
writeback: guard against jiffies wraparound on inode->dirtied_when checks (try #3)

The dirtied_when value on an inode is supposed to represent the first time
that an inode has one of its pages dirtied.  This value is in units of
jiffies.  It's used in several places in the writeback code to determine
when to write out an inode.

The problem is that these checks assume that dirtied_when is updated
periodically.  If an inode is continuously being used for I/O it can be
persistently marked as dirty and will continue to age.  Once the time
compared to is greater than or equal to half the maximum of the jiffies
type, the logic of the time_*() macros inverts and the opposite of what is
needed is returned.  On 32-bit architectures that's just under 25 days
(assuming HZ == 1000).

As the least-recently dirtied inode, it'll end up being the first one that
pdflush will try to write out.  sync_sb_inodes does this check:

/* Was this inode dirtied after sync_sb_inodes was called? */
  if (time_after(inode->dirtied_when, start))
  break;

...but now dirtied_when appears to be in the future.  sync_sb_inodes bails
out without attempting to write any dirty inodes.  When this occurs,
pdflush will stop writing out inodes for this superblock.  Nothing can
unwedge it until jiffies moves out of the problematic window.

This patch fixes this problem by changing the checks against dirtied_when
to also check whether it appears to be in the future.  If it does, then we
consider the value to be far in the past.

This should shrink the problematic window of time to such a small period
(30s) as not to matter.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years ago__tty_open(): use the correct type for saved_flags
Andrew Morton [Thu, 2 Apr 2009 23:56:36 +0000 (16:56 -0700)] 
__tty_open(): use the correct type for saved_flags

filp->f_flags is unsigned, so use that type for the local copy.

Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agovfs: skip I_CLEAR state inodes
Wu Fengguang [Thu, 2 Apr 2009 23:56:34 +0000 (16:56 -0700)] 
vfs: skip I_CLEAR state inodes

clear_inode() will switch inode state from I_FREEING to I_CLEAR, and do so
_outside_ of inode_lock.  So any I_FREEING testing is incomplete without a
coupled testing of I_CLEAR.

So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and
add_dquot_ref().

Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara
reminds fixing the other two cases.

Masayoshi MIZUMA has a nice panic flow:

=====================================================================
            [process A]               |        [process B]
 |                                    |
 |    prune_icache()                  | drop_pagecache()
 |      spin_lock(&inode_lock)        |   drop_pagecache_sb()
 |      inode->i_state |= I_FREEING;  |       |
 |      spin_unlock(&inode_lock)      |       V
 |          |                         |     spin_lock(&inode_lock)
 |          V                         |         |
 |      dispose_list()                |         |
 |        list_del()                  |         |
 |        clear_inode()               |         |
 |          inode->i_state = I_CLEAR  |         |
 |            |                       |         V
 |            |                       |      if (inode->i_state & (I_FREEING|I_WILL_FREE))
 |            |                       |              continue;           <==== NOT MATCH
 |            |                       |
 |            |                       | (DANGER from here on! Accessing disposing inode!)
 |            |                       |
 |            |                       |      __iget()
 |            |                       |        list_move() <===== PANIC on poisoned list !!
 V            V                       |
(time)
=====================================================================

Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agonommu: fix a number of issues with the per-MM VMA patch
David Howells [Thu, 2 Apr 2009 23:56:32 +0000 (16:56 -0700)] 
nommu: fix a number of issues with the per-MM VMA patch

Fix a number of issues with the per-MM VMA patch:

 (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
     a NOMMU system with more than 2G pages.  Makes no difference on a 32-bit
     system.

 (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
     lest it overflow.

 (3) Move the allocation of the vm_area_struct slab back for fork.c.

 (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.

 (5) Use BUG_ON() rather than if () BUG().

 (6) Make the default validate_nommu_regions() a static inline rather than a
     #define.

 (7) Make free_page_series()'s objection to pages with a refcount != 1 more
     informative.

 (8) Adjust the __put_nommu_region() banner comment to indicate that the
     semaphore must be held for writing.

 (9) Limit the number of warnings about munmaps of non-mmapped regions.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agofb: nvidiafb recognizes geforcego 7300 chip as mobile
Sergey Senozhatsky [Thu, 2 Apr 2009 23:56:30 +0000 (16:56 -0700)] 
fb: nvidiafb recognizes geforcego 7300 chip as mobile

nvidiafb recognizes geforcego 7300 chip as mobile

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@mail.by>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agogeneric debug pagealloc: build fix
Akinobu Mita [Thu, 2 Apr 2009 23:56:30 +0000 (16:56 -0700)] 
generic debug pagealloc: build fix

This fixes a build failure with generic debug pagealloc:

  mm/debug-pagealloc.c: In function 'set_page_poison':
  mm/debug-pagealloc.c:8: error: 'struct page' has no member named 'debug_flags'
  mm/debug-pagealloc.c: In function 'clear_page_poison':
  mm/debug-pagealloc.c:13: error: 'struct page' has no member named 'debug_flags'
  mm/debug-pagealloc.c: In function 'page_poison':
  mm/debug-pagealloc.c:18: error: 'struct page' has no member named 'debug_flags'
  mm/debug-pagealloc.c: At top level:
  mm/debug-pagealloc.c:120: error: redefinition of 'kernel_map_pages'
  include/linux/mm.h:1278: error: previous definition of 'kernel_map_pages' was here
  mm/debug-pagealloc.c: In function 'kernel_map_pages':
  mm/debug-pagealloc.c:122: error: 'debug_pagealloc_enabled' undeclared (first use in this function)

by fixing

 - debug_flags should be in struct page
 - define DEBUG_PAGEALLOC config option for all architectures

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoglge: remove unused #include <version.h>
Huang Weiyi [Thu, 2 Apr 2009 05:33:55 +0000 (05:33 +0000)] 
glge: remove unused #include <version.h>

Remove unused #include <version.h> in drivers/net/qlge/qlge_ethtool.

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agodnet: remove unused #include <version.h>
Huang Weiyi [Thu, 2 Apr 2009 05:33:50 +0000 (05:33 +0000)] 
dnet: remove unused #include <version.h>

Remove unused #include <version.h> in drivers/net/dnet.c.

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agotcp: miscounts due to tcp_fragment pcount reset
Ilpo Järvinen [Wed, 1 Apr 2009 23:18:20 +0000 (23:18 +0000)] 
tcp: miscounts due to tcp_fragment pcount reset

It seems that trivial reset of pcount to one was not sufficient
in tcp_retransmit_skb. Multiple counters experience a positive
miscount when skb's pcount gets lowered without the necessary
adjustments (depending on skb's sacked bits which exactly), at
worst a packets_out miscount can crash at RTO if the write queue
is empty!

Triggering this requires mss change, so bidir tcp or mtu probe or
like.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Uwe Bugla <uwe.bugla@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agotcp: add helper for counter tweaking due mid-wq change
Ilpo Järvinen [Wed, 1 Apr 2009 23:15:17 +0000 (23:15 +0000)] 
tcp: add helper for counter tweaking due mid-wq change

We need full-scale adjustment to fix a TCP miscount in the next
patch, so just move it into a helper and call for that from the
other places.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agohso: fix for the 'invalid frame length' messages
Jan Dumon [Wed, 1 Apr 2009 22:59:07 +0000 (22:59 +0000)] 
hso: fix for the 'invalid frame length' messages

Some devices cannot send very short usb transfers. To get around this the
firmware adds a known pattern and flags the driver that it should check for
this pattern on short transfers. This flag was not taken into account by
the driver.

Signed-off-by: Jan Dumon <j.dumon@option.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agohso: fix for crash when unplugging the device
Jan Dumon [Wed, 1 Apr 2009 22:57:20 +0000 (22:57 +0000)] 
hso: fix for crash when unplugging the device

Changed the order in which things are freed. This fixes an oops when
unplugging the device while network traffic is ongoing.

Signed-off-by: Jan Dumon <j.dumon@option.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agofsl_pq_mdio: Fix compile failure
Segher Boessenkool [Thu, 2 Apr 2009 20:57:30 +0000 (13:57 -0700)] 
fsl_pq_mdio: Fix compile failure

Add EXPORT_SYMBOL_GPL(fsl_pq_mdio_bus_name) for module builds

Signed-off-by: Segher Boessenkool <segher@kernel.crashing.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agofsl_pq_mdio: Revive UCC MDIO support
Anton Vorontsov [Tue, 31 Mar 2009 08:33:52 +0000 (08:33 +0000)] 
fsl_pq_mdio: Revive UCC MDIO support

commit 1577ecef766650a57fceb171acee2b13cbfaf1d3 ("netdev: Merge UCC
and gianfar MDIO bus drivers") introduced a regression so that UCC
MDIO buses no longer work.

This is because fsl_pq_mdio driver wrongly masks all non-TBI PHYs
for !fsl,gianfar-mdio buses, while it should do that only for
fsl,gianfar-tbi buses.

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoucc_geth: Pass proper device to DMA routines, otherwise oops happens
Anton Vorontsov [Thu, 2 Apr 2009 08:26:07 +0000 (01:26 -0700)] 
ucc_geth: Pass proper device to DMA routines, otherwise oops happens

The driver should pass a device that actually specifies internal DMA
ops, but currently it passes netdev's device, which is wrong and that
causes following oops:

Kernel BUG at c01c4df8 [verbose debug info unavailable]
Oops: Exception in kernel mode, sig: 5 [#1]
[...]
NIP [c01c4df8] get_new_skb+0x7c/0xf8
LR [c01c4da4] get_new_skb+0x28/0xf8
Call Trace:
[ef82be00] [c01c4da4] get_new_skb+0x28/0xf8 (unreliable)
[ef82be20] [c01c4eb8] rx_bd_buffer_set+0x44/0x98
[ef82be40] [c01c62bc] ucc_geth_startup+0x11b0/0x147c
[ef82be80] [c01c6674] ucc_geth_open+0xec/0x2a4
[ef82bea0] [c02288a4] dev_open+0xc0/0x11c
[...]

Fix this by passing of_device's device that specifies DMA ops in its
archdata.

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoi.MX31: Fixing cs89x0 network building to i.MX31ADS
Alan Carvalho de Assis [Tue, 31 Mar 2009 03:36:39 +0000 (03:36 +0000)] 
i.MX31: Fixing cs89x0 network building to i.MX31ADS

This is a fix to get cs89x0 network driver working on i.MX31ADS

Signed-off-by: Alan Carvalho de Assis <acassis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agotc35815: Fix build error if NAPI enabled
Atsushi Nemoto [Thu, 2 Apr 2009 08:17:36 +0000 (01:17 -0700)] 
tc35815: Fix build error if NAPI enabled

This driver contains experimental NAPI code disabled by default.
The commit bea3348ee ("[NET]: Make NAPI polling independent of struct
net_device objects.") converted the NAPI path of this driver but that
conversion was not complete.  This patch fixes a build error
introduced by the commit.

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agohso: add Vendor/Product ID's for new devices
Jan Dumon [Thu, 2 Apr 2009 08:16:44 +0000 (01:16 -0700)] 
hso: add Vendor/Product ID's for new devices

Add Vendor/Product ID's for new devices.
Removed duplicate product ID 0x7361.

Signed-off-by: Jan Dumon <j.dumon@option.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoucc_geth: Remove unused header
Kumar Gala [Tue, 31 Mar 2009 03:42:59 +0000 (03:42 +0000)] 
ucc_geth: Remove unused header

Now that the driver is exclusively an of_platform driver we no longer
use the structs and #defines in fsl_devices.h

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agogianfar: Remove unused header
Kumar Gala [Tue, 31 Mar 2009 03:42:58 +0000 (03:42 +0000)] 
gianfar: Remove unused header

Now that the driver is exclusively an of_platform driver we no longer
use the structs and #defines in fsl_devices.h

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agokaweth: Fix locking to be SMP-safe
Larry Finger [Thu, 2 Apr 2009 08:09:43 +0000 (01:09 -0700)] 
kaweth: Fix locking to be SMP-safe

On an SMP system, the following message is printed. The patch below gets
fixes the problem.

=================================
[ INFO: inconsistent lock state ]
2.6.29-Linus-05093-gc31f403 #57
---------------------------------
inconsistent {hardirq-on-W} -> {in-hardirq-W} usage.
bash/4105 [HC1[1]:SC0[0]:HE0:SE1] takes:
 (&kaweth->device_lock){+...}, at: [<ffffffffa01aa286>]
                 kaweth_usb_receive+0x77/0x1af [kaw eth]
{hardirq-on-W} state was registered at:
  [<ffffffff80260503>] __lock_acquire+0x753/0x1685
  [<ffffffff8026148a>] lock_acquire+0x55/0x71
  [<ffffffff80461ba6>] _spin_lock+0x31/0x3d
  [<ffffffffa01aaa0c>] kaweth_start_xmit+0x2b/0x1e1 [kaweth]
  [<ffffffff803eccd3>] dev_hard_start_xmit+0x22e/0x2ad
  [<ffffffff803fe120>] __qdisc_run+0xf2/0x203
  [<ffffffff803ed0cd>] dev_queue_xmit+0x263/0x39b
  [<ffffffffa03a47cb>] packet_sendmsg_spkt+0x1c4/0x20a [af_packet]
  [<ffffffff803de0c2>] sock_sendmsg+0xe4/0xfd
  [<ffffffff803dec8f>] sys_sendto+0xe4/0x10c
  [<ffffffff8020bccb>] system_call_fastpath+0x16/0x1b
  [<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 1280
hardirqs last  enabled at (1279): [<ffffffff80461a71>]
                  _spin_unlock_irqrestore+0x44/0x4c
hardirqs last disabled at (1280): [<ffffffff8020bad7>]
                  save_args+0x67/0x70
softirqs last  enabled at (660): [<ffffffff8024192c>]
                  __do_softirq+0x14d/0x15d
softirqs last disabled at (651): [<ffffffff8020ce9c>]
                  call_softirq+0x1c/0x28

Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agonet: allow multiple dev per napi with GRO
Stephen Hemminger [Wed, 1 Apr 2009 11:20:20 +0000 (11:20 +0000)] 
net: allow multiple dev per napi with GRO

GRO assumes that there is a one-to-one relationship between NAPI
structure and network device. Some devices like sky2 share multiple
devices on a single interrupt so only have one NAPI handler. Rather than
split GRO from NAPI, just have GRO assume if device changes that
it is a different flow.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agor8169: reset IntrStatus after chip reset
Karsten Wiese [Thu, 2 Apr 2009 08:06:01 +0000 (01:06 -0700)] 
r8169: reset IntrStatus after chip reset

Original comment (Karsten):
On a MSI MS-6702E mainboard, when in rtl8169_init_one() for the first time
after BIOS has run, IntrStatus reads 5 after chip has been reset.
IntrStatus should equal 0 there, so patch changes IntrStatus reset to happen
after chip reset instead of before.

Remark (Francois):
Assuming that the loglevel of the driver is increased above NETIF_MSG_INTR,
the bug reveals itself with a typical "interrupt 0025 in poll" message
at startup. In retrospect, the message should had been read as an hint of
an unexpected hardware state several months ago :o(

Fixes (at least part of) https://bugzilla.redhat.com/show_bug.cgi?id=460747

Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
Tested-by: Josep <josep.puigdemont@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoixgbe: Fix potential memory leak/driver panic issue while setting up Tx & Rx ring...
Mallikarjuna R Chilakala [Tue, 31 Mar 2009 21:35:24 +0000 (21:35 +0000)] 
ixgbe: Fix potential memory leak/driver panic issue while setting up Tx & Rx ring parameters

While setting up the ring parameters using ethtool the driver can
panic or leak memory as ixgbe_open tries to setup tx & rx resources.
The updated logic will use ixgbe_down/up after successful allocation of
tx & rx resources

Signed-off-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>
Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoixgbe: fix ethtool -A|a behavior
Don Skidmore [Tue, 31 Mar 2009 21:35:05 +0000 (21:35 +0000)] 
ixgbe: fix ethtool -A|a behavior

We were basicly ignoring ethtool users request for FC autoneg
and replying to queries with a "best guess".  This patch
enables the driver to store if we want to enable/disable
autoneg FC and do the correct behavior.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoixgbe: Patch to fix driver panic while freeing up tx & rx resources
Mallikarjuna R Chilakala [Tue, 31 Mar 2009 21:34:44 +0000 (21:34 +0000)] 
ixgbe: Patch to fix driver panic while freeing up tx & rx resources

When network interface is made active we were not handling the error
scenarios properly to clean up rx & tx resources which might result in
a driver panic.

Signed-off-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>
Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoixgbe: refactor tx buffer processing to use skb_dma_map/unmap
Alexander Duyck [Tue, 31 Mar 2009 21:34:23 +0000 (21:34 +0000)] 
ixgbe: refactor tx buffer processing to use skb_dma_map/unmap

This patch resolves an issue with map single being used to map a buffer and
then unmap page being used to unmap it.  In addition it handles any error
conditions that may be detected using skb_dma_map.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>
Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoixgbe: Fix 82598 MSI-X allocation on systems with more than 8 CPU cores
PJ Waskiewicz [Tue, 31 Mar 2009 21:34:05 +0000 (21:34 +0000)] 
ixgbe: Fix 82598 MSI-X allocation on systems with more than 8 CPU cores

MSI-X allocation broke after the 82599 merge on systems with more than 8
CPU cores.  82598 drops back into MSI mode, which isn't sufficient to run
full, efficient 10G line rate.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoixgbe: feature - driver to default with FC on.
Don Skidmore [Tue, 31 Mar 2009 21:33:44 +0000 (21:33 +0000)] 
ixgbe: feature - driver to default with FC on.

In the past flow control wasn't enabled by default under the
incorrect assumption that this opened up us to a denial of
service attack.  However since any switch that forwarded flow
control would be extremely msiconfigured and/or buggy, this
concern no longer out weighs the preformance gains from
having FC enabled.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Acked-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>
Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoixgbe: Fix DCB netlink layer for 82599 to enable Priority Flow Control
PJ Waskiewicz [Tue, 31 Mar 2009 21:33:25 +0000 (21:33 +0000)] 
ixgbe: Fix DCB netlink layer for 82599 to enable Priority Flow Control

The priority flow control settings from the netlink layer aren't taking
effect in the base driver.  The boolean pfc_mode_enable in the dcb_config
struct isn't being set, so the hardware configuration code is never
reached.

Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoixgbe: Fix ethtool output with advertised mode.
Don Skidmore [Tue, 31 Mar 2009 21:33:02 +0000 (21:33 +0000)] 
ixgbe: Fix ethtool output with advertised mode.

Ethtool tries to get advertised speed from phy.autoneg_advertised.
However for copper media this wasn't happening until later do to
an other fix which moved mac.ops.setup_link_speed placement in
ixgbe_link_config(). This patch will display the default advertised
speeds if it can't yet get this information from phy.autoneg_advertised.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Acked-by: Mallikarjuna R Chilakala <mallikarjuna.chilakala@intel.com>
Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoixgbe: fix build when DEBUG is defined
Alexander Duyck [Tue, 31 Mar 2009 21:32:42 +0000 (21:32 +0000)] 
ixgbe: fix build when DEBUG is defined

The ixgbe driver had issues when DEBUG was defined because the hw_dbg macro
was incomplete.  This patch completes the code based off of the code that
already existed in the igb module.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agonet/igb: Fix kexec with igb (rev. 3)
Rafael J. Wysocki [Tue, 31 Mar 2009 21:23:50 +0000 (21:23 +0000)] 
net/igb: Fix kexec with igb (rev. 3)

Impact: Fix

Yinghai Lu found one system with 82575EB where, in the kernel that is
kexeced, probe igb failed with -2, the reason being that the adapter
could not be brought back from D3 by the kexec kernel, most probably
due to quirky hardware (it looks like the same behavior happened on
forcedeth).

Prevent igb from putting the adapter into D3 during shutdown except
when we going to power off the system.  For this purpose, seperate
igb_shutdown() from igb_suspend() and use the appropriate PCI PM
callbacks in both of them.

Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoigb: cleanup igb loopback path
Alexander Duyck [Tue, 31 Mar 2009 20:38:56 +0000 (20:38 +0000)] 
igb: cleanup igb loopback path

The code path for setting up phy loopback testing was out of date and was
setting bits it didn't need to.  This change cleans up the code path and
removes some code that has no effect on teh driver.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoigb: increase delay for copper link setup
Alexander Duyck [Tue, 31 Mar 2009 20:38:38 +0000 (20:38 +0000)] 
igb: increase delay for copper link setup

Increase the delay for copper phy init from 15ms to 100ms.  This is to
address issues seen in which ethtool -t was failing in some cases on 82576
based adapters.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoigb: set num_rx/tx_queues to 0 when queues are freed
Alexander Duyck [Tue, 31 Mar 2009 20:38:19 +0000 (20:38 +0000)] 
igb: set num_rx/tx_queues to 0 when queues are freed

An issue was seen on suspend in which the system reported a page fault.  This
was due to the new reg_idx code being called after the queues were freed.

This update prevents any for loops from going through the queues by setting
the number of queues to 0 when they are freed.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoigb: add support for x2 link width configurations
Alexander Duyck [Tue, 31 Mar 2009 20:38:00 +0000 (20:38 +0000)] 
igb: add support for x2 link width configurations

When device is on PCIe link trained as x2 the driver is currently reporting
link width as "unknown".  The original patch provided by Myron adds the x2
link support and my changes are cosmetic to clean up the readability of the
conditional operators.

Based on work by: Myron Stowe <myron.stowe@hp.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agonet/fec_mpc52xx: Don't dereference phy_device if it is NULL
Grant Likely [Tue, 31 Mar 2009 20:17:03 +0000 (20:17 +0000)] 
net/fec_mpc52xx: Don't dereference phy_device if it is NULL

The FEC Ethernet device isn't always attached to a phy.  Be careful
not to dereference phy_device if it is NULL.  Also eliminates an
unnecessary extra function from the ioctl path.

Reported-by: Henk Stegeman <henk.stegeman@gmail.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agonet/fec_mpc52xx: Migrate to net_device_ops.
Henk Stegeman [Tue, 31 Mar 2009 20:16:57 +0000 (20:16 +0000)] 
net/fec_mpc52xx: Migrate to net_device_ops.

Since not using net_device_ops gets you shunned out the cool crowd,
this patch modifies the fec_mpc52xx Ethernet driver to provide the
management hooks via a struct net_device_ops.

Reported-by: Henk Stegeman <henk.stegeman@gmail.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agonet/fec_mpc52xx: fix BUG on missing dma_ops
Grant Likely [Tue, 31 Mar 2009 20:16:52 +0000 (20:16 +0000)] 
net/fec_mpc52xx: fix BUG on missing dma_ops

The driver triggers a BUG_ON() when allocating DMA buffers because the
arch/powerpc dma_ops aren't in the net_device's struct device.  This
patch fixes the problem by using the parent of_device which does have
the correct dma_ops set.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Reviewed-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agonetfilter: use rcu_read_bh() in ipt_do_table()
Eric Dumazet [Thu, 2 Apr 2009 07:53:49 +0000 (00:53 -0700)] 
netfilter: use rcu_read_bh() in ipt_do_table()

Commit 784544739a25c30637397ace5489eeb6e15d7d49
(netfilter: iptables: lock free counters) forgot to disable BH
in arpt_do_table(), ipt_do_table() and  ip6t_do_table()

Use rcu_read_lock_bh() instead of rcu_read_lock() cures the problem.

Reported-and-bisected-by: Roman Mindalev <r000n@r000n.net>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
15 years agoRDS: Use spinlock to protect 64b value update on 32b archs
Andy Grover [Wed, 1 Apr 2009 08:20:20 +0000 (08:20 +0000)] 
RDS: Use spinlock to protect 64b value update on 32b archs

We have a 64bit value that needs to be set atomically.
This is easy and quick on all 64bit archs, and can also be done
on x86/32 with set_64bit() (uses cmpxchg8b). However other
32b archs don't have this.

I actually changed this to the current state in preparation for
mainline because the old way (using a spinlock on 32b) resulted in
unsightly #ifdefs in the code. But obviously, being correct takes
precedence.

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>