linux-2.6
15 years agoide-atapi: assign expiry and timeout based on device type
Borislav Petkov [Fri, 2 Jan 2009 15:12:55 +0000 (16:12 +0100)] 
ide-atapi: assign expiry and timeout based on device type

There should be no functionality change resulting from this patch.

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-atapi: compute cmd_len based on device type in ide_transfer_pc
Borislav Petkov [Fri, 2 Jan 2009 15:12:54 +0000 (16:12 +0100)] 
ide-atapi: compute cmd_len based on device type in ide_transfer_pc

There should be no functionality change resulting from this patch.

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
[bart: move cmd_len check closer to ->output_data() call]
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide: remove the last ide-scsi remnants
Borislav Petkov [Fri, 2 Jan 2009 15:12:54 +0000 (16:12 +0100)] 
ide: remove the last ide-scsi remnants

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-atapi: remove ide-scsi remnants from ide_pc_intr()
Borislav Petkov [Fri, 2 Jan 2009 15:12:54 +0000 (16:12 +0100)] 
ide-atapi: remove ide-scsi remnants from ide_pc_intr()

As a result, remove now unused ide_scsi_get_timeout and ide_scsi_expiry.

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-atapi: remove ide-scsi remnants from ide_transfer_pc()
Borislav Petkov [Fri, 2 Jan 2009 15:12:53 +0000 (16:12 +0100)] 
ide-atapi: remove ide-scsi remnants from ide_transfer_pc()

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-atapi: remove ide-scsi remnants from ide_issue_pc
Borislav Petkov [Fri, 2 Jan 2009 15:12:53 +0000 (16:12 +0100)] 
ide-atapi: remove ide-scsi remnants from ide_issue_pc

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-cd: move cdrom_timer_expiry to ide-atapi.c
Borislav Petkov [Fri, 2 Jan 2009 15:12:53 +0000 (16:12 +0100)] 
ide-cd: move cdrom_timer_expiry to ide-atapi.c

- cdrom_timer_expiry -> ide_cd_expiry
- remove expiry-arg to ide_issue_pc as it is redundant now
- ide_debug_log -> debug_log

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-atapi: teach ide atapi about drive->waiting_for_dma
Borislav Petkov [Fri, 2 Jan 2009 15:12:53 +0000 (16:12 +0100)] 
ide-atapi: teach ide atapi about drive->waiting_for_dma

In addition, we wait for DRQ to be asserted by repeatedly polling
device status no matter what DRQ type each device implements.

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-atapi: accommodate transfer length calculation for ide-cd
Borislav Petkov [Fri, 2 Jan 2009 15:12:52 +0000 (16:12 +0100)] 
ide-atapi: accommodate transfer length calculation for ide-cd

... by factoring it out of ide_cd_do_request() into a helper, as suggested by
Bart.

There should be no functionality change resulting from this patch.

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
[bart: BLK_DEV_IDECD needs to select IDE_ATAPI now]
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-atapi: setup dma for ide-cd
Borislav Petkov [Fri, 2 Jan 2009 15:12:52 +0000 (16:12 +0100)] 
ide-atapi: setup dma for ide-cd

There should be no functional change resulting from this patch.

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-atapi: combine drive-specific assignments
Borislav Petkov [Fri, 2 Jan 2009 15:12:52 +0000 (16:12 +0100)] 
ide-atapi: combine drive-specific assignments

There should be no functionality change resulting from this patch.

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-atapi: add a dev_is_idecd-inline
Borislav Petkov [Fri, 2 Jan 2009 15:12:52 +0000 (16:12 +0100)] 
ide-atapi: add a dev_is_idecd-inline

There should be no functionality change resulting from this patch.

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoremove ide-scsi
FUJITA Tomonori [Fri, 2 Jan 2009 15:12:51 +0000 (16:12 +0100)] 
remove ide-scsi

As planned, this removes ide-scsi.

The 2.6 kernel supports direct writing to ide CD drives, which
eliminates the need for ide-scsi. ide-scsi has been unmaintained and
marked as deprecated.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: James.Bottomley@HansenPartnership.com
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-floppy: allocate only toplevel packet commands
Linus Torvalds [Fri, 2 Jan 2009 15:12:51 +0000 (16:12 +0100)] 
ide-floppy: allocate only toplevel packet commands

This makes the top-level function just allocate a single pc entry, and then
pass it down as a pointer to all the helper functions that also need one
of those "struct ide_atapi_pc" things. As far as I can tell, the use of
these things never overlaps each other, BUT I DID NOT CHECK VERY CLOSELY!

So I'm not guaranteeing this is correct, and I don't have the hardware.
Could somebody who knows the code better, and has the hardware, please
test this?

With this, ide-floppy still has fairly big stack usage, but instead of

idefloppy_ioctl [vmlinux]:              1208
ide_floppy_get_capacity [vmlinux]:      872
idefloppy_release [vmlinux]:            408
idefloppy_open [vmlinux]:               408

where those two first ones are at the very top of the list of stack users
for me, it's now

ide_floppy_get_capacity [vmlinux]:           404
ide_floppy_ioctl [vmlinux]:                  364

ie they are still high, but they are no longer at the top.

Borislav: Since ide_floppy_get_capacity is passed as a function pointer to other
parts of the kernel (e.g., block layer) we need that ide_atapi_pc to be created
on stack. Also, redid stack users numbers above. The two functions missing from
Linus' original 'make stackusage' output are due to ide being
rewritten/reorganized atm.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
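
The stack saving described in the ide-floppy commit above comes from a simple pattern: one large object lives in the top-level frame and helpers borrow it through a pointer. The following is a minimal user-space sketch of that pattern only. `struct toy_pc`, its field sizes, the helper names and the opcode values are all hypothetical stand-ins, not the layout of the real `struct ide_atapi_pc` or the driver's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a large packet-command object. */
struct toy_pc {
    unsigned char cdb[12];
    unsigned char buf[256];
    size_t buf_len;
};

static void toy_init_pc(struct toy_pc *pc, unsigned char opcode)
{
    assert(pc != NULL);
    memset(pc, 0, sizeof(*pc));
    pc->cdb[0] = opcode;
}

/* Helpers reuse the caller's object instead of declaring their own copy. */
static int toy_helper_first(struct toy_pc *pc)
{
    toy_init_pc(pc, 0x00);   /* illustrative opcode */
    return pc->cdb[0];
}

static int toy_helper_second(struct toy_pc *pc)
{
    toy_init_pc(pc, 0x23);   /* illustrative opcode */
    pc->buf_len = 12;
    return pc->cdb[0];
}

/* Only this frame pays sizeof(struct toy_pc) worth of stack. */
int toy_toplevel(void)
{
    struct toy_pc pc;
    int first = toy_helper_first(&pc);
    int second = toy_helper_second(&pc);

    return first + second;
}
```

The pattern is only safe when the uses never overlap, which is exactly the property the commit message says was not checked closely.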
15 years agoide: make IDE_AFLAG_.. numbering continuous again
Borislav Petkov [Fri, 2 Jan 2009 15:12:50 +0000 (16:12 +0100)] 
ide: make IDE_AFLAG_.. numbering continuous again

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide-cd: move debug defines into header
Borislav Petkov [Fri, 2 Jan 2009 15:12:50 +0000 (16:12 +0100)] 
ide-cd: move debug defines into header

While at it:
- disable compiling-in debug support by default

Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
[bart: fixup patch description]
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide: use per-device request queue locks (v2)
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:50 +0000 (16:12 +0100)] 
ide: use per-device request queue locks (v2)

* Move hack for flush requests from choose_drive() to do_ide_request().

* Add ide_plug_device() helper and convert core IDE code from using
  per-hwgroup lock as a request lock to use the ->queue_lock instead.

* Remove no longer needed:
  - choose_drive() function
  - WAKEUP() macro
  - 'sleeping' flag from ide_hwif_t
  - 'service_{start,time}' fields from ide_drive_t

This patch results in much simpler and more maintainable code
(besides being a scalability improvement).

v2:
* Fixes/improvements based on review from Elias:
  - take as many requests off the queue as possible
  - remove now redundant BUG_ON()

Cc: Elias Oltmanns <eo@nebensachen.de>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide: add ide_[un]lock_hwgroup() helpers
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:50 +0000 (16:12 +0100)] 
ide: add ide_[un]lock_hwgroup() helpers

Add ide_[un]lock_hwgroup() inline helpers for obtaining exclusive
access to the given hwgroup and update the core code accordingly.

[ This change besides making code saner results in more efficient
  use of ide_{get,release}_lock(). ]

Cc: Michael Schmitz <schmitz@biophys.uni-duesseldorf.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Elias Oltmanns <eo@nebensachen.de>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
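
As a rough illustration of what "obtaining exclusive access" helpers can look like, here is a single-threaded user-space sketch built around a busy flag. It assumes the helpers are a test-and-set on something like `hwgroup->busy`; the names `toy_hwgroup`, `toy_lock_hwgroup()` and `toy_unlock_hwgroup()` and the return convention are hypothetical, and the locking that real kernel callers would need around the flag is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hwgroup; only the flag matters here. */
struct toy_hwgroup {
    int busy;
};

/* Returns 0 if exclusive access was obtained, 1 if the group was busy. */
static inline int toy_lock_hwgroup(struct toy_hwgroup *hwgroup)
{
    assert(hwgroup != NULL);
    if (hwgroup->busy)
        return 1;
    hwgroup->busy = 1;
    return 0;
}

static inline void toy_unlock_hwgroup(struct toy_hwgroup *hwgroup)
{
    assert(hwgroup != NULL);
    hwgroup->busy = 0;
}
```

Wrapping the flag in a pair of helpers keeps every acquire and release in one place, which is the kind of cleanup the commit message describes.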
15 years agoide: remove "paranoia" checks for hwgroup->busy
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:49 +0000 (16:12 +0100)] 
ide: remove "paranoia" checks for hwgroup->busy

Remove "paranoia" checks for hwgroup->busy from ide_timer_expiry()
and ide_intr().  This is a preparation for future changes.

Cc: Michael Schmitz <schmitz@biophys.uni-duesseldorf.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Elias Oltmanns <eo@nebensachen.de>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide: remove IDE PM hack from do_ide_request()
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:49 +0000 (16:12 +0100)] 
ide: remove IDE PM hack from do_ide_request()

We now tell block layer that there is still work to do using
blk_plug_device() so hack for IDE Power Management can be removed
(it was buggy for hwgroups having more than 4 devices anyway).

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide: don't execute the next queued command from the hard-IRQ context (v2)
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:48 +0000 (16:12 +0100)] 
ide: don't execute the next queued command from the hard-IRQ context (v2)

* Tell the block layer that we are not done handling requests by using
  blk_plug_device() in ide_do_request() (request handling function)
  and ide_timer_expiry() (timeout handler) if the queue is not empty.

* Remove optimization which directly calls ide_do_request() for the next
  queued command from the ide_intr() (IRQ handler) and ide_timer_expiry().

* Remove no longer needed IRQ masking from ide_do_request() - in case of
  IDE ports needing serialization disable_irq_nosync()/enable_irq() was
  used for the (possibly shared) IRQ of the other IDE port.

* Put the misplaced comment in the right place in ide_do_request().

* Drop no longer needed 'int masked_irq' argument from ide_do_request().

* Merge ide_do_request() into do_ide_request().

* Remove no longer needed IDE_NO_IRQ define.

While at it:

* Don't use HWGROUP() macro in do_ide_request().

* Use __func__ in ide_intr().

This patch reduces IRQ handling latency for IDE and improves the system-wide
handling of shared IRQs (which should result in more timeout resistant and
stable IDE systems).  It also makes it possible to do some further changes
later (i.e. replace some busy-waiting delays with sleeping equivalents).

v2:
Changes per review from Elias Oltmanns:
- fix wrong goto statement in 'if (startstop == ide_stopped)' block
- use spin_unlock_irq()
- don't use obsolete HWIF() macro

Cc: Elias Oltmanns <eo@nebensachen.de>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide: move sysfs support to ide-sysfs.c
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:48 +0000 (16:12 +0100)] 
ide: move sysfs support to ide-sysfs.c

While at it:
- media_string() -> ide_media_string()

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide: factor out device type classifying from do_identify()
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:47 +0000 (16:12 +0100)] 
ide: factor out device type classifying from do_identify()

Factor out device type classifying from do_identify()
to ide_classify_ata_dev() and ide_classify_atapi_dev().

There should be no functional changes caused by this patch.

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide: small ide_register_port() cleanup
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:47 +0000 (16:12 +0100)] 
ide: small ide_register_port() cleanup

Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoide: remove chipset type fixup from ide_host_register()
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:47 +0000 (16:12 +0100)] 
ide: remove chipset type fixup from ide_host_register()

* Set chipset type explicitly in tx4938ide and tx4939ide host drivers
  (all other host drivers were updated already).

* Remove no longer used chipset type fixup from ide_host_register().

Acked-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agotx493x: fix indentation
Bartlomiej Zolnierkiewicz [Fri, 2 Jan 2009 15:12:46 +0000 (16:12 +0100)] 
tx493x: fix indentation

Trivial CodingStyle fixup for tx4938ide and tx4939ide drivers.

Acked-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Acked-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
15 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux...
David Vrabel [Fri, 2 Jan 2009 13:17:13 +0000 (13:17 +0000)] 
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into for-upstream

Conflicts:

drivers/uwb/wlp/eda.c

15 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
Linus Torvalds [Wed, 31 Dec 2008 23:57:56 +0000 (15:57 -0800)] 
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (34 commits)
  nfsd race fixes: jfs
  nfsd race fixes: reiserfs
  nfsd race fixes: ext4
  nfsd race fixes: ext3
  nfsd race fixes: ext2
  nfsd/create race fixes, infrastructure
  filesystem notification: create fs/notify to contain all fs notification
  fs/block_dev.c: __read_mostly improvement and sb_is_blkdev_sb utilization
  kill ->dir_notify()
  filp_cachep can be static in fs/file_table.c
  fix f_count description in Documentation/filesystems/files.txt
  make INIT_FS use the __RW_LOCK_UNLOCKED initialization
  take init_fs to saner place
  kill vfs_permission
  pass a struct path * to may_open
  kill walk_init_root
  remove incorrect comment in inode_permission
  expand some comments (d_path / seq_path)
  correct wrong function name of d_put in kernel document and source comment
  fix switch_names() breakage in short-to-short case
  ...

15 years agonfsd race fixes: jfs
Dave Kleikamp [Wed, 31 Dec 2008 04:08:37 +0000 (22:08 -0600)] 
nfsd race fixes: jfs

jfs version of Al Viro's nfsd race patches

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agonfsd race fixes: reiserfs
Al Viro [Tue, 30 Dec 2008 07:03:58 +0000 (02:03 -0500)] 
nfsd race fixes: reiserfs

... and the same for reiserfs.  The difference here is that we need
insert_inode_locked4() to match iget5_locked().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agonfsd race fixes: ext4
Al Viro [Tue, 30 Dec 2008 07:03:31 +0000 (02:03 -0500)] 
nfsd race fixes: ext4

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agonfsd race fixes: ext3
Al Viro [Tue, 30 Dec 2008 07:02:50 +0000 (02:02 -0500)] 
nfsd race fixes: ext3

ext3 analog of the previous patch

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agonfsd race fixes: ext2
Al Viro [Tue, 30 Dec 2008 06:52:35 +0000 (01:52 -0500)] 
nfsd race fixes: ext2

* make ext2_new_inode() put the inode into icache in locked state
* do not unlock until the inode is fully set up; otherwise nfsd
might pick it in half-baked state.
* make sure that ext2_new_inode() does *not* lead to two inodes with the
same inumber hashed at the same time; otherwise a bogus fhandle coming
from nfsd might race with inode creation:

nfsd: iget_locked() creates inode
nfsd: try to read from disk, block on that.
ext2_new_inode(): allocate inode with that inumber
ext2_new_inode(): insert it into icache, set it up and dirty
ext2_write_inode(): get the relevant part of inode table in cache,
set the entry for our inode (and start writing to disk)
nfsd: get CPU again, look into inode table, see nice and sane on-disk
inode, set the in-core inode from it

oops - we have two in-core inodes with the same inumber live in icache,
both used for IO.  Welcome to fs corruption...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agonfsd/create race fixes, infrastructure
Al Viro [Tue, 30 Dec 2008 06:48:21 +0000 (01:48 -0500)] 
nfsd/create race fixes, infrastructure

new helpers - insert_inode_locked() and insert_inode_locked4().
Hash new inode, making sure that there's no such inode in icache
already.  If there is and it does not end up unhashed (as would
happen if we have nfsd trying to resolve a bogus fhandle), fail.
Otherwise insert our inode into hash and succeed.

In either case have i_state set to new+locked; cleanup ends up
being simpler with such calling conventions.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
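
The calling convention of the new helpers (fail if a live inode with the same number is already hashed, otherwise hash ours, and leave it new+locked either way) can be modelled in user space. The sketch below is a toy under stated assumptions: every `toy_`/`TOY_` name is hypothetical, the cache is a tiny chained hash table, and the waiting and retrying that a real implementation would need when the existing inode is still being set up is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BUCKETS 8
#define TOY_NEW     0x1u
#define TOY_LOCKED  0x2u

struct toy_inode {
    unsigned long ino;
    unsigned int state;
    int hashed;                 /* 0 once the inode has been unhashed */
    struct toy_inode *next;
};

static struct toy_inode *toy_icache[TOY_BUCKETS];

/* Returns 0 on success, -1 if a hashed inode with the same number exists. */
int toy_insert_inode_locked(struct toy_inode *inode)
{
    struct toy_inode **head;
    struct toy_inode *old;

    assert(inode != NULL);
    head = &toy_icache[inode->ino % TOY_BUCKETS];

    /* In either case the caller gets the inode back as new+locked. */
    inode->state |= TOY_NEW | TOY_LOCKED;

    for (old = *head; old != NULL; old = old->next) {
        if (old->ino == inode->ino && old->hashed)
            return -1;
    }

    inode->hashed = 1;
    inode->next = *head;
    *head = inode;
    return 0;
}
```

Because the state is set on both paths, a caller can run the same cleanup whether the insert succeeded or failed, which matches the "cleanup ends up being simpler" remark above.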
15 years agofilesystem notification: create fs/notify to contain all fs notification
Eric Paris [Wed, 17 Dec 2008 18:59:41 +0000 (13:59 -0500)] 
filesystem notification: create fs/notify to contain all fs notification

Creating a generic filesystem notification interface, fsnotify, which will be
used by inotify, dnotify, and eventually fanotify is really starting to
clutter the fs directory.  This patch simply moves inotify and dnotify into
fs/notify/inotify and fs/notify/dnotify respectively to make both current fs/
and future notification tidier.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agofs/block_dev.c: __read_mostly improvement and sb_is_blkdev_sb utilization
Denis ChengRq [Mon, 1 Dec 2008 22:34:56 +0000 (14:34 -0800)] 
fs/block_dev.c: __read_mostly improvement and sb_is_blkdev_sb utilization

- iget5_locked in bdget really needs blockdev_superblock, instead of
  bd_mnt, so bd_mnt could be just a local variable;

- blockdev_superblock really needs __read_mostly, while the local variable
  bd_mnt does not;

- make use of sb_is_blkdev_sb in bd_forget, instead of direct reference
  to blockdev_superblock.

Signed-off-by: Denis ChengRq <crquan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agokill ->dir_notify()
Al Viro [Fri, 26 Dec 2008 05:57:40 +0000 (00:57 -0500)] 
kill ->dir_notify()

Remove the hopelessly misguided ->dir_notify().  The only instance (cifs)
has been broken by design from the very beginning; the objects it creates
are never destroyed, keep references to struct file they can outlive, nothing
that could possibly evict them exists on close(2) path *and* no locking
whatsoever is done to prevent races with close(), should the previous, er,
deficiencies someday be dealt with.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agofilp_cachep can be static in fs/file_table.c
Eric Dumazet [Wed, 10 Dec 2008 17:35:45 +0000 (09:35 -0800)] 
filp_cachep can be static in fs/file_table.c

Instead of creating the "filp" kmem_cache in vfs_caches_init(),
we can do it a little bit later in files_init(), so that filp_cachep
is static to fs/file_table.c.

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agofix f_count description in Documentation/filesystems/files.txt
Eric Dumazet [Wed, 10 Dec 2008 17:35:45 +0000 (09:35 -0800)] 
fix f_count description in Documentation/filesystems/files.txt

Documentation/filesystems/files.txt was not updated when
f_count became an atomic_long_t.
atomic_long_inc_not_zero() is now used instead of atomic_inc_not_zero()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
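
For readers unfamiliar with the "increment unless zero" idiom that the documentation fix above refers to, here is a user-space sketch using C11 atomics. It is not the kernel's `atomic_long_inc_not_zero()`; `toy_long_inc_not_zero()` is a hypothetical name and the compare-and-swap loop is just one common way to express the idea.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Take a reference only if the count has not already dropped to zero.
 * Returns true if the count was incremented.
 */
bool toy_long_inc_not_zero(atomic_long *v)
{
    long old;

    assert(v != NULL);
    old = atomic_load(v);
    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}
```

A lookup that finds an object whose count is already zero must treat it as gone, because another thread is in the middle of freeing it.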
15 years agomake INIT_FS use the __RW_LOCK_UNLOCKED initialization
Steven Rostedt [Wed, 10 Dec 2008 23:37:28 +0000 (18:37 -0500)] 
make INIT_FS use the __RW_LOCK_UNLOCKED initialization

[AV: rediffed on top of unification of init_fs]
Initialization of init_fs still uses the deprecated RW_LOCK_UNLOCKED macro.
This patch updates it to use the __RW_LOCK_UNLOCKED(lock) macro.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agotake init_fs to saner place
Al Viro [Fri, 26 Dec 2008 05:35:37 +0000 (00:35 -0500)] 
take init_fs to saner place

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agokill vfs_permission
Christoph Hellwig [Fri, 24 Oct 2008 07:59:29 +0000 (09:59 +0200)] 
kill vfs_permission

With all the nameidata removal there's no point anymore for this helper.
Of the three callers left two will go away with the next lookup series
anyway.

Also add proper kerneldoc to inode_permission as this is the main
permission check routine now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agopass a struct path * to may_open
Christoph Hellwig [Fri, 24 Oct 2008 07:58:10 +0000 (09:58 +0200)] 
pass a struct path * to may_open

No need for the nameidata in may_open - a struct path is enough.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agokill walk_init_root
Christoph Hellwig [Wed, 5 Nov 2008 14:07:21 +0000 (15:07 +0100)] 
kill walk_init_root

walk_init_root is a tiny helper that is marked __always_inline, has just
one caller and an unused argument.  Just merge it into the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agoremove incorrect comment in inode_permission
Christoph Hellwig [Wed, 5 Nov 2008 14:04:29 +0000 (15:04 +0100)] 
remove incorrect comment in inode_permission

We now pass on all MAY_ flags to the filesystems permission routines,
so remove the comment stating the contrary.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agoexpand some comments (d_path / seq_path)
Arjan van de Ven [Mon, 1 Dec 2008 22:35:00 +0000 (14:35 -0800)] 
expand some comments (d_path / seq_path)

Explain that you really need to use the return value of d_path rather than
the buffer you passed into it.

Also fix the comment for seq_path(), the function arguments changed
recently but the comment hadn't been updated in sync.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
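
Why the return value matters is easiest to see with a builder that fills its buffer from the end. The user-space sketch below is a toy, not d_path() itself: `toy_build_path()` and its signature are hypothetical, and it only assumes the general shape in which the result is a pointer somewhere inside the caller's buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Join path components into the END of buf and return a pointer to the
 * first character of the result, or NULL if the buffer is too small.
 */
char *toy_build_path(const char *const *parts, size_t n,
                     char *buf, size_t buflen)
{
    char *end;
    size_t i;

    assert(buf != NULL);
    if (buflen == 0)
        return NULL;

    end = buf + buflen;
    *--end = '\0';

    /* Walk from the last component towards the root, prepending each. */
    for (i = n; i-- > 0; ) {
        size_t len = strlen(parts[i]);

        if ((size_t)(end - buf) < len + 1)
            return NULL;
        end -= len;
        memcpy(end, parts[i], len);
        *--end = '/';
    }
    return end;
}
```

The start of `buf` is left untouched by the builder, so printing `buf` instead of the returned pointer shows whatever was there before rather than the path.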
15 years agocorrect wrong function name of d_put in kernel document and source comment
Zhaolei [Mon, 1 Dec 2008 22:34:58 +0000 (14:34 -0800)] 
correct wrong function name of d_put in kernel document and source comment

There is no function named d_put(); it should be dput().

Impact: fix document and comment, no functionality changed

Signed-off-by: Zhao Lei <zhaolei@cn.fuijtsu.com>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agofix switch_names() breakage in short-to-short case
Al Viro [Mon, 3 Nov 2008 20:03:50 +0000 (15:03 -0500)] 
fix switch_names() breakage in short-to-short case

We want ->name.len to match the resulting name on *both*
source and target

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agobefs: ensure fast symlinks are NUL-terminated
Duane Griffin [Fri, 19 Dec 2008 20:47:18 +0000 (20:47 +0000)] 
befs: ensure fast symlinks are NUL-terminated

Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.

Cc: Sergey S. Kostyliov <rathamahata@php4.ru>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agofreevxfs: ensure fast symlinks are NUL-terminated
Duane Griffin [Fri, 19 Dec 2008 20:47:17 +0000 (20:47 +0000)] 
freevxfs: ensure fast symlinks are NUL-terminated

Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.

Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agosysv: ensure fast symlinks are NUL-terminated
Duane Griffin [Fri, 19 Dec 2008 20:47:16 +0000 (20:47 +0000)] 
sysv: ensure fast symlinks are NUL-terminated

Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agoext4: ensure fast symlinks are NUL-terminated
Duane Griffin [Fri, 19 Dec 2008 20:47:15 +0000 (20:47 +0000)] 
ext4: ensure fast symlinks are NUL-terminated

Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: adilger@sun.com
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agoext3: ensure fast symlinks are NUL-terminated
Duane Griffin [Fri, 19 Dec 2008 20:47:14 +0000 (20:47 +0000)] 
ext3: ensure fast symlinks are NUL-terminated

Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agoext2: ensure fast symlinks are NUL-terminated
Duane Griffin [Fri, 19 Dec 2008 20:47:13 +0000 (20:47 +0000)] 
ext2: ensure fast symlinks are NUL-terminated

Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agovfs: ensure page symlinks are NUL-terminated
Duane Griffin [Fri, 19 Dec 2008 20:47:12 +0000 (20:47 +0000)] 
vfs: ensure page symlinks are NUL-terminated

On-disk data corruption could cause a page link to have its i_size set
to PAGE_SIZE (or a multiple thereof) and its contents all non-NUL.
NUL-terminate the link name to ensure this doesn't cause further
problems for the kernel.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agovfs: introduce helper function to safely NUL-terminate symlinks
Duane Griffin [Fri, 19 Dec 2008 20:47:11 +0000 (20:47 +0000)] 
vfs: introduce helper function to safely NUL-terminate symlinks

A number of filesystems were potentially triggering kernel bugs due to
corrupted symlink names on disk. This function helps safely terminate
the names.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
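
A helper of this kind can be very small: clamp the claimed length to the largest index the buffer can hold and write the terminator there. The sketch below shows that idea in user space. `toy_terminate_link()` and its parameters are hypothetical, and it assumes the caller guarantees that the buffer has at least `maxlen + 1` bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Terminate a link name whose on-disk length may be corrupted.
 * len is the length claimed by the (untrusted) inode; maxlen is the
 * largest index that is safe to write in name.
 */
void toy_terminate_link(char *name, size_t len, size_t maxlen)
{
    assert(name != NULL);
    name[len < maxlen ? len : maxlen] = '\0';
}
```

With a clamp like this, a bogus length can at worst truncate the name; it can no longer push the terminator outside the buffer or leave the string unterminated.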
15 years agoeCryptfs: check readlink result was not an error before using it
Duane Griffin [Fri, 19 Dec 2008 20:47:10 +0000 (20:47 +0000)] 
eCryptfs: check readlink result was not an error before using it

The result from readlink is being used to index into the link name
buffer without checking whether it is a valid length. If readlink
returns an error this will fault or cause memory corruption.

Cc: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: ecryptfs-devel@lists.launchpad.net
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Acked-by: Michael Halcrow <mhalcrow@us.ibm.com>
Acked-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
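
The bug class fixed by the eCryptfs commit above is using a possibly negative return value as an array index. The following user-space sketch shows the checked form. `toy_terminate_readlink()` is a hypothetical helper, not eCryptfs code, and the extra bound check against the buffer size is an addition made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * rc is the result of a readlink-style call: a byte count on success,
 * a negative error code on failure. Only index buf once rc is known
 * to be a valid length.
 */
int toy_terminate_readlink(char *buf, size_t buflen, long rc)
{
    assert(buf != NULL);
    if (rc < 0)
        return (int)rc;          /* propagate the error untouched */
    if ((size_t)rc >= buflen)
        return -1;               /* no room left for the terminator */
    buf[rc] = '\0';
    return 0;
}
```

Without the first check, an error return would be used as a negative index and the write would land before the start of the buffer.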
15 years agofs/namespace.c: drop code after return
Julia Lawall [Mon, 1 Dec 2008 22:34:51 +0000 (14:34 -0800)] 
fs/namespace.c: drop code after return

The extra semicolon serves no purpose.

Signed-off-by: Julia Lawall <julia@diku.dk>
Reviewed-by: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agoinclude: linux/fs.h: put declarations in __KERNEL__
Jan Engelhardt [Mon, 1 Dec 2008 22:34:50 +0000 (14:34 -0800)] 
include: linux/fs.h: put declarations in __KERNEL__

include/linux/fs.h contains externs for a bunch of variables.  That obviously
belongs under ifdef __KERNEL__.

Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agoshrink struct dentry
Nick Piggin [Mon, 1 Dec 2008 08:33:43 +0000 (09:33 +0100)] 
shrink struct dentry

struct dentry is one of the most critical structures in the kernel. So it's
sad to see it going neglected.

With CONFIG_PROFILING turned on (which is probably the common case at least
for distros and kernel developers), sizeof(struct dentry) == 208 here
(64-bit). This gives 19 objects per slab.

I packed d_mounted into a hole, and took another 4 bytes off the inline
name length to take the padding out from the end of the structure. This
shrinks it to 200 bytes. I could have gone the other way and increased the
length to 40, but I'm aiming for a magic number, read on...

I then got rid of the d_cookie pointer. This shrinks it to 192 bytes. Rant:
why was this ever a good idea? The cookie system should increase its hash
size or use a tree or something if lookups are a problem. Also the "fast
dcookie lookups" in oprofile should be moved into the dcookie code -- how
can oprofile possibly care about the dcookie_mutex? It gets dropped after
get_dcookie() returns so it can't be providing any sort of protection.

At 192 bytes, 21 objects fit into a 4K page, saving about 3MB on my system
with ~140 000 entries allocated. 192 is also a multiple of 64, so we get
nice cacheline alignment on 64 and 32 byte line systems -- any given dentry
will now require 3 cachelines to touch all fields whereas previously it
would require 4.

I know the inline name size was chosen quite carefully, however with the
reduction in cacheline footprint, it should actually be just about as fast
to do a name lookup for a 36 character name as it was before the patch (and
faster for other sizes). The memory footprint savings for names which are
<= 32 or > 36 bytes long should more than make up for the memory cost for
33-36 byte names.

Performance is a feature...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
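
The figures quoted in the dentry commit above can be reproduced with integer arithmetic. The sketch below is a back-of-the-envelope check only: the function names are hypothetical, and it assumes a 4096-byte page with no per-slab bookkeeping overhead, which a real slab allocator may not match exactly.

```c
#include <assert.h>
#include <stddef.h>

/* Whole objects that fit in one page, ignoring slab metadata. */
size_t toy_objs_per_page(size_t page_size, size_t obj_size)
{
    assert(obj_size > 0);
    return page_size / obj_size;
}

/* Cachelines needed to touch every byte of a line-aligned object. */
size_t toy_cachelines(size_t obj_size, size_t line_size)
{
    assert(line_size > 0);
    return (obj_size + line_size - 1) / line_size;
}

/* Pages needed to hold nr_objs objects at objs_per_page each. */
size_t toy_pages_for(size_t nr_objs, size_t objs_per_page)
{
    assert(objs_per_page > 0);
    return (nr_objs + objs_per_page - 1) / objs_per_page;
}
```

Under those assumptions, 208-byte objects pack 19 to a page and 192-byte objects pack 21. For 140 000 objects that is 7369 pages versus 6667, a difference of 702 pages or 2 875 392 bytes, in line with the "about 3MB" quoted above. With 64-byte lines the object spans 3 cachelines instead of 4.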
15 years agofs: reorder struct inotify_device on 64bits to remove padding
Richard Kennedy [Thu, 4 Dec 2008 11:17:47 +0000 (11:17 +0000)] 
fs: reorder struct inotify_device on 64bits to remove padding

Reorder struct inotify_device to remove 8 bytes of padding on 64bit
builds, reducing its size to 128 bytes. It is therefore allocated from a
smaller slab and uses one fewer cacheline.

Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
----
Hi,
patch against 2.6.28-rc7.
built & tested on AMDX2 desktop.

I've not been able to send this to the listed inotify maintainers, I
just get mail failures. So I guessed filesystem was the best home for
it, hope that's ok.

regards
Richard
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
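
Padding removal by reordering, as in the inotify_device commit above, can be demonstrated with any struct that interleaves small and pointer-sized members. The two structs below are hypothetical and unrelated to the real `struct inotify_device` layout. They only illustrate the effect, and the exact sizes depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>

/* 4-byte fields interleaved with pointers: each int drags padding along. */
struct toy_scattered {
    int   count;
    void *head;
    int   flags;
    void *tail;
};

/* Same members, pointers first, so the two ints share one slot. */
struct toy_grouped {
    void *head;
    void *tail;
    int   count;
    int   flags;
};

/* Reordering should never make the struct larger. */
static_assert(sizeof(struct toy_grouped) <= sizeof(struct toy_scattered),
              "grouping members should not add padding");

/* Bytes saved by the reordering on the ABI this is compiled for. */
size_t toy_padding_saved(void)
{
    return sizeof(struct toy_scattered) - sizeof(struct toy_grouped);
}
```

On a typical 64-bit ABI with 8-byte pointers the first layout carries 8 bytes of padding that the second does not; on a 32-bit ABI both are usually the same size and the saving is zero.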
15 years agointroduce new LSM hooks where vfsmount is available.
Kentaro Takeda [Wed, 17 Dec 2008 04:24:15 +0000 (13:24 +0900)] 
introduce new LSM hooks where vfsmount is available.

Add new LSM hooks for path-based checks.  Call them on directory-modifying
operations at the points where we still know the vfsmount involved.

Signed-off-by: Kentaro Takeda <takedakn@nttdata.co.jp>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Toshiharu Harada <haradats@nttdata.co.jp>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
15 years agoMerge branch 'irq-fixes-for-linus-4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Wed, 31 Dec 2008 17:00:59 +0000 (09:00 -0800)] 
Merge branch 'irq-fixes-for-linus-4' of git://git./linux/kernel/git/tip/linux-2.6-tip

* 'irq-fixes-for-linus-4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  sparseirq: move __weak symbols into separate compilation unit
  sparseirq: work around __weak alias bug
  sparseirq: fix hang with !SPARSE_IRQ
  sparseirq: set lock_class for legacy irq when sparse_irq is selected
  sparseirq: work around compiler optimizing away __weak functions
  sparseirq: fix desc->lock init
  sparseirq: do not printk when migrating IRQ descriptors
  sparseirq: remove duplicated arch_early_irq_init()
  irq: simplify for_each_irq_desc() usage
  proc: remove ifdef CONFIG_SPARSE_IRQ from stat.c
  irq: for_each_irq_desc() move to irqnr.h
  hrtimer: remove #include <linux/irq.h>

15 years agoKVM: MMU: handle large host sptes on invlpg/resync
Marcelo Tosatti [Mon, 22 Dec 2008 20:49:30 +0000 (18:49 -0200)] 
KVM: MMU: handle large host sptes on invlpg/resync

The invlpg and sync walkers lack knowledge of large host sptes,
descending to a nonexistent pagetable level.

Stop at the directory level in such a case.

Fixes SMP Windows XP with hugepages.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: Add locking to virtual i8259 interrupt controller
Avi Kivity [Sun, 21 Dec 2008 20:48:32 +0000 (22:48 +0200)] 
KVM: Add locking to virtual i8259 interrupt controller

While most accesses to the i8259 are with the kvm mutex taken, the call
to kvm_pic_read_irq() is not.  We can't easily take the kvm mutex there
since the function is called with interrupts disabled.

Fix by adding a spinlock to the virtual interrupt controller.  Since we
can't send an IPI under the spinlock (we also take the same spinlock in
an irq disabled context), we defer the IPI until the spinlock is released.
Similarly, we defer irq ack notifications until after spinlock release to
avoid lock recursion.
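
The deferral pattern described above can be sketched in userspace roughly as
follows. This is a model under stated assumptions, not the kernel code: the
struct, the function names and the C11 atomic_flag spinlock are all
hypothetical stand-ins, and the "IPI" is just a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of an interrupt controller guarded by a spinlock. */
struct pic_model {
    atomic_flag lock;
    bool wakeup_needed;     /* written only while the lock is held */
    unsigned pending_irqs;  /* written only while the lock is held */
    atomic_int kicks_sent;  /* stands in for the IPI sent to the vcpu */
};

static void pic_model_init(struct pic_model *s)
{
    atomic_flag_clear(&s->lock);
    s->wakeup_needed = false;
    s->pending_irqs = 0;
    atomic_init(&s->kicks_sent, 0);
}

static void pic_model_lock(struct pic_model *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ; /* spin until the holder releases the lock */
}

/* Drop the lock first, then perform the deferred wakeup outside it. */
static void pic_model_unlock(struct pic_model *s)
{
    bool wakeup = s->wakeup_needed;

    s->wakeup_needed = false;
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
    if (wakeup)
        atomic_fetch_add(&s->kicks_sent, 1);
}

/* Raise an interrupt line: under the lock, only record that a wakeup is due. */
static void pic_model_set_irq(struct pic_model *s, unsigned irq)
{
    assert(irq < 16);
    pic_model_lock(s);
    s->pending_irqs |= 1u << irq;
    s->wakeup_needed = true;
    pic_model_unlock(s);
}
```

The point of the design is that the notification never runs with the lock
held, so it cannot recurse into code that takes the same lock.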

Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: MMU: Don't treat a global pte as such if cr4.pge is cleared
Avi Kivity [Sun, 21 Dec 2008 16:31:10 +0000 (18:31 +0200)] 
KVM: MMU: Don't treat a global pte as such if cr4.pge is cleared

The pte.g bit is meaningless if global pages are disabled; deferring
mmu page synchronization on these ptes will lead to the guest using stale
shadow ptes.
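
The rule can be stated as a one-line predicate. The sketch below is
illustrative only: the macro and function names are made up, while the bit
positions (CR4.PGE is bit 7, the PTE G flag is bit 8) follow the x86
architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_CR4_PGE        (1u << 7)   /* CR4 page-global enable */
#define MODEL_PT_GLOBAL_MASK (1ull << 8) /* G bit in a page-table entry */

/* A pte counts as global only if its G bit is set AND CR4.PGE is enabled. */
static bool model_pte_is_global(uint64_t pte, uint32_t cr4)
{
    return (pte & MODEL_PT_GLOBAL_MASK) && (cr4 & MODEL_CR4_PGE);
}
```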

Fixes Vista x86 smp bootloader failure.

Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoMAINTAINERS: Maintainership changes for kvm/ia64
Xiantao Zhang [Wed, 17 Dec 2008 01:38:14 +0000 (09:38 +0800)] 
MAINTAINERS: Maintainership changes for kvm/ia64

Anthony Xu no longer works on kvm.

Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: ia64: Fix kvm_arch_vcpu_ioctl_[gs]et_regs()
Jes Sorensen [Tue, 16 Dec 2008 15:45:47 +0000 (16:45 +0100)] 
KVM: ia64: Fix kvm_arch_vcpu_ioctl_[gs]et_regs()

Fix kvm_arch_vcpu_ioctl_[gs]et_regs() to do something meaningful on
ia64. Old versions could never have worked since they required
pointers to be set in the ioctl payload which were never being set by
the ioctl handler for get_regs.

In addition reserve extra space for future extensions.

The change of layout of struct kvm_regs doesn't require adding a new
CAP since get/set regs never worked on ia64 until now.

This version doesn't support copying the KVM kernel stack in/out of
the kernel. This should be implemented in a separate ioctl call if
ever needed.

Signed-off-by: Jes Sorensen <jes@sgi.com>
Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: x86: Rework user space NMI injection as KVM_CAP_USER_NMI
Jan Kiszka [Thu, 11 Dec 2008 15:54:54 +0000 (16:54 +0100)] 
KVM: x86: Rework user space NMI injection as KVM_CAP_USER_NMI

There is no point in doing the ready_for_nmi_injection/
request_nmi_window dance with user space. First, we don't do this for
in-kernel irqchip anyway, while the code path is the same as for user
space irqchip mode. And second, there is nothing to lose if a pending
NMI is overwritten by another one (in contrast to IRQs where we have to
save the number). Actually, there is even the risk of raising spurious
NMIs this way because the reason for the held-back NMI might already be
handled while processing the first one.

Therefore this patch creates a simplified user space NMI injection
interface, exporting it under KVM_CAP_USER_NMI and dropping the old
KVM_CAP_NMI capability. And this time we also take care to provide the
interface only on archs supporting NMIs via KVM (right now only x86).

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: VMX: Fix pending NMI-vs.-IRQ race for user space irqchip
Jan Kiszka [Mon, 24 Nov 2008 11:26:19 +0000 (12:26 +0100)] 
KVM: VMX: Fix pending NMI-vs.-IRQ race for user space irqchip

As with the kernel irqchip, don't allow an NMI to stomp over an already
injected IRQ; instead wait for the IRQ injection to be completed.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: fix handling of ACK from shared guest IRQ
Mark McLoughlin [Tue, 2 Dec 2008 12:16:33 +0000 (12:16 +0000)] 
KVM: fix handling of ACK from shared guest IRQ

If an assigned device shares a guest irq with an emulated
device then we currently interpret an ack generated by the
emulated device as originating from the assigned device
leading to e.g. "Unbalanced enable for IRQ 4347" from the
enable_irq() in kvm_assigned_dev_ack_irq().

The fix is fairly simple - don't enable the physical device
irq unless it was previously disabled.

Of course, this can still lead to a situation where a
non-assigned device ACK can cause the physical device irq to
be reenabled before the device was serviced. However, being
level sensitive, the interrupt will merely be regenerated.
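
A minimal model of the fix, assuming a per-device flag that records whether
the host interrupt is currently disabled. The struct and function names are
hypothetical; the depth counter stands in for the kernel's disable nesting,
which must never go below zero.

```c
#include <assert.h>
#include <stdbool.h>

struct assigned_dev_model {
    bool host_irq_disabled; /* set when the host irq fires and is masked */
    int  disable_depth;     /* models the irq disable nesting count */
};

/* Host interrupt arrives: mask the line until the guest acknowledges it. */
static void model_host_irq(struct assigned_dev_model *d)
{
    d->disable_depth++;
    d->host_irq_disabled = true;
}

/* Guest ack: re-enable only if this device's line was actually disabled.
 * Returns true if the line was re-enabled. */
static bool model_guest_ack(struct assigned_dev_model *d)
{
    if (!d->host_irq_disabled)
        return false; /* ack came from a device sharing the guest irq */
    d->disable_depth--;
    d->host_irq_disabled = false;
    assert(d->disable_depth >= 0);
    return true;
}
```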

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: MMU: check for present pdptr shadow page in walk_shadow
Marcelo Tosatti [Tue, 9 Dec 2008 15:07:22 +0000 (16:07 +0100)] 
KVM: MMU: check for present pdptr shadow page in walk_shadow

walk_shadow assumes the caller verified validity of the pdptr pointer in
question, which is not the case for the invlpg handler.

Fixes oops during Solaris 10 install.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: Consolidate userspace memory capability reporting into common code
Avi Kivity [Mon, 8 Dec 2008 16:29:29 +0000 (18:29 +0200)] 
KVM: Consolidate userspace memory capability reporting into common code

Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: Advertise the bug in memory region destruction as fixed
Avi Kivity [Mon, 8 Dec 2008 16:25:27 +0000 (18:25 +0200)] 
KVM: Advertise the bug in memory region destruction as fixed

Userspace might need to act differently.

Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: use cpumask_var_t for cpus_hardware_enabled
Rusty Russell [Sun, 7 Dec 2008 10:55:45 +0000 (21:25 +1030)] 
KVM: use cpumask_var_t for cpus_hardware_enabled

This changes cpus_hardware_enabled from a cpumask_t to a cpumask_var_t:
equivalent for CONFIG_CPUMASKS_OFFSTACK=n, otherwise dynamically allocated.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: use modern cpumask primitives, no cpumask_t on stack
Rusty Russell [Mon, 8 Dec 2008 09:58:04 +0000 (20:28 +1030)] 
KVM: use modern cpumask primitives, no cpumask_t on stack

We're getting rid of on-stack cpumasks for large NR_CPUS.

1) Use cpumask_var_t/alloc_cpumask_var.
2) smp_call_function_mask -> smp_call_function_many
3) cpus_clear, cpus_empty, cpu_set -> cpumask_clear, cpumask_empty,
   cpumask_set_cpu.

This actually generates slightly smaller code than the old one with
CONFIG_CPUMASKS_OFFSTACK=n.  (gcc knows that cpus cannot be NULL in
that case, where cpumask_var_t is cpumask_t[1]).
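
The array-of-one trick mentioned above can be modelled in userspace as
follows. Every name here (model_mask, MODEL_OFFSTACK and so on) is
hypothetical; the sketch only mirrors the idea that the same calling code
compiles against either a heap pointer or a one-element array, and unlike
the kernel helper it zeroes the mask in both variants.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_NR_CPUS       256
#define MODEL_BITS_PER_WORD (CHAR_BIT * sizeof(unsigned long))
#define MODEL_MASK_WORDS \
    ((MODEL_NR_CPUS + MODEL_BITS_PER_WORD - 1) / MODEL_BITS_PER_WORD)

struct model_mask { unsigned long bits[MODEL_MASK_WORDS]; };

#ifdef MODEL_OFFSTACK
/* Large configurations: the mask lives on the heap. */
typedef struct model_mask *model_mask_var_t;
static bool model_alloc_mask(model_mask_var_t *m)
{
    *m = calloc(1, sizeof(**m));
    return *m != NULL;
}
static void model_free_mask(model_mask_var_t m) { free(m); }
#else
/* Small configurations: a one-element array, so 'm' still decays to a
 * pointer at call sites and can never be NULL. */
typedef struct model_mask model_mask_var_t[1];
static bool model_alloc_mask(model_mask_var_t *m)
{
    memset(*m, 0, sizeof(**m));
    return true;
}
static void model_free_mask(model_mask_var_t m) { (void)m; }
#endif

static void model_set_cpu(unsigned cpu, struct model_mask *m)
{
    assert(cpu < MODEL_NR_CPUS);
    m->bits[cpu / MODEL_BITS_PER_WORD] |= 1UL << (cpu % MODEL_BITS_PER_WORD);
}

static bool model_test_cpu(unsigned cpu, const struct model_mask *m)
{
    assert(cpu < MODEL_NR_CPUS);
    return (m->bits[cpu / MODEL_BITS_PER_WORD] >> (cpu % MODEL_BITS_PER_WORD)) & 1UL;
}
```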

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: Extract core of kvm_flush_remote_tlbs/kvm_reload_remote_mmus
Rusty Russell [Mon, 8 Dec 2008 09:56:24 +0000 (20:26 +1030)] 
KVM: Extract core of kvm_flush_remote_tlbs/kvm_reload_remote_mmus

Avi said:
> Wow, code duplication from Rusty. Things must be bad.

Something about glass houses comes to mind.  But instead, a patch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: set owner of cpu and vm file operations
Christian Borntraeger [Tue, 2 Dec 2008 10:17:32 +0000 (11:17 +0100)] 
KVM: set owner of cpu and vm file operations

There is a race between a "close of the file descriptors" and module
unload in the kvm module.

You can easily trigger this problem by applying this debug patch:
>--- kvm.orig/virt/kvm/kvm_main.c
>+++ kvm/virt/kvm/kvm_main.c
>@@ -648,10 +648,14 @@ void kvm_free_physmem(struct kvm *kvm)
>                kvm_free_physmem_slot(&kvm->memslots[i], NULL);
> }
>
>+#include <linux/delay.h>
> static void kvm_destroy_vm(struct kvm *kvm)
> {
>        struct mm_struct *mm = kvm->mm;
>
>+       printk("off1\n");
>+       msleep(5000);
>+       printk("off2\n");
>        spin_lock(&kvm_lock);
>        list_del(&kvm->vm_list);
>        spin_unlock(&kvm_lock);

and killing the userspace, followed by an rmmod.

The problem is that kvm_destroy_vm can run while the module count
is 0. That means, you can remove the module while kvm_destroy_vm
is running. But kvm_destroy_vm is part of the module text. This
causes a kernel oops. The race exists without the msleep but is much
harder to trigger.

This patch requires the fix for anon_inodes (anon_inodes: use fops->owner
for module refcount).
With this patch, we can set the owner of all anonymous KVM inodes file
operations. The VFS will then control the KVM module refcount as long as there
is an open file. kvm_destroy_vm will be called by the release function of the
last closed file - before the VFS drops the module refcount.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoanon_inodes: use fops->owner for module refcount
Christian Borntraeger [Tue, 2 Dec 2008 10:16:03 +0000 (11:16 +0100)] 
anon_inodes: use fops->owner for module refcount

There is an imbalance for anonymous inodes. If the fops->owner field is set,
the module reference count of owner is decreased on release.
("filp_close" --> "__fput" ---> "fops_put")

On the other hand, anon_inode_getfd does not increase the module reference
count of owner. This causes two problems:

- if owner is set, the module refcount goes negative
- if owner is not set, the module can be unloaded while code is running

This patch changes anon_inode_getfd to be symmetric regarding fops->owner
handling.
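
The symmetry can be illustrated with a small userspace model. The types and
function names below are hypothetical stand-ins for the module, the file
operations table, the getfd path and the release path; they are not the
kernel API.

```c
#include <assert.h>
#include <stddef.h>

struct model_module { int refcount; };
struct model_fops   { struct model_module *owner; };

/* Opening side: take a reference on the owner, if there is one. */
static void model_getfd(const struct model_fops *fops)
{
    if (fops->owner)
        fops->owner->refcount++;
}

/* Release side: drop the reference taken at open time. */
static void model_release(const struct model_fops *fops)
{
    if (fops->owner) {
        fops->owner->refcount--;
        assert(fops->owner->refcount >= 0); /* would fire if getfd skipped the get */
    }
}
```

While the descriptor is open the count is at least 1, which is what prevents
the module from being unloaded underneath running code.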

I have checked all existing users of anon_inode_getfd. No one sets fops->owner,
which is why nobody has seen the module refcount go negative. The refcounting
was tested with a patched and an unpatched KVM module (see patch 2/2). I also
did an epoll_open/close test.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agox86: KVM guest: kvm_get_tsc_khz: return khz, not lpj
Eduardo Habkost [Fri, 5 Dec 2008 20:36:45 +0000 (18:36 -0200)] 
x86: KVM guest: kvm_get_tsc_khz: return khz, not lpj

kvm_get_tsc_khz() currently returns the previously-calculated preset_lpj
value, but it is in loops-per-jiffy, not kHz. The current code works
correctly only when HZ=1000.
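
The unit mismatch is easy to see numerically. The sketch below assumes the
relationship lpj = khz * 1000 / HZ (one delay-loop iteration per TSC cycle);
the function names are illustrative and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Loops per jiffy for a given frequency in kHz and tick rate HZ. */
static uint64_t model_lpj_from_khz(uint64_t khz, unsigned hz)
{
    assert(hz > 0);
    return khz * 1000 / hz;
}

/* Inverse conversion: recover kHz from loops per jiffy. */
static uint64_t model_khz_from_lpj(uint64_t lpj, unsigned hz)
{
    return lpj * hz / 1000;
}
```

For a 2 GHz TSC (2000000 kHz), lpj equals the kHz value only at HZ=1000; at
HZ=250 it is 8000000, four times too large if misread as kHz.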

Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: MMU: prepopulate the shadow on invlpg
Marcelo Tosatti [Tue, 2 Dec 2008 00:32:05 +0000 (22:32 -0200)] 
KVM: MMU: prepopulate the shadow on invlpg

If the guest executes invlpg, peek into the pagetable and attempt to
prepopulate the shadow entry.

Also stop dirty fault updates from interfering with the fork detector.

2% improvement on RHEL3/AIM7.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: MMU: skip global pgtables on sync due to cr3 switch
Marcelo Tosatti [Tue, 2 Dec 2008 00:32:04 +0000 (22:32 -0200)] 
KVM: MMU: skip global pgtables on sync due to cr3 switch

Skip syncing global pages on cr3 switch (but not on cr4/cr0). This is
important for Linux 32-bit guests with PAE, where the kmap page is
marked as global.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: MMU: collapse remote TLB flushes on root sync
Marcelo Tosatti [Tue, 2 Dec 2008 00:32:03 +0000 (22:32 -0200)] 
KVM: MMU: collapse remote TLB flushes on root sync

Collapse remote TLB flushes on root sync.

kernbench is 2.7% faster on 4-way guest. Improvements have been seen
with other loads such as AIM7.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: MMU: use page array in unsync walk
Marcelo Tosatti [Tue, 2 Dec 2008 00:32:02 +0000 (22:32 -0200)] 
KVM: MMU: use page array in unsync walk

Instead of invoking the handler directly collect pages into
an array so the caller can work with it.

Simplifies TLB flush collapsing.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: x86 emulator: Fix handling of VMMCALL instruction
Amit Shah [Thu, 4 Dec 2008 11:11:40 +0000 (11:11 +0000)] 
KVM: x86 emulator: Fix handling of VMMCALL instruction

The VMMCALL instruction doesn't get recognised and isn't processed
by the emulator.

This is seen on an Intel host that tries to execute the VMMCALL
instruction after a guest live migrates from an AMD host.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: x86 emulator: add the emulation of shld and shrd instructions
Guillaume Thouvenin [Thu, 4 Dec 2008 13:30:13 +0000 (14:30 +0100)] 
KVM: x86 emulator: add the emulation of shld and shrd instructions

Add emulation of shld and shrd instructions
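
For reference, the data result of the 32-bit forms of these instructions can
be sketched in plain C as below. This is not the emulator's code: the
function names are made up, flag updates are not modelled, and only the
32-bit operand size (where the count is taken modulo 32) is shown.

```c
#include <assert.h>
#include <stdint.h>

/* shld: shift dst left, filling the vacated low bits from the top of src. */
static uint32_t model_shld32(uint32_t dst, uint32_t src, unsigned count)
{
    count &= 31;
    if (count == 0)
        return dst; /* also avoids an undefined 32-bit shift below */
    return (dst << count) | (src >> (32 - count));
}

/* shrd: shift dst right, filling the vacated high bits from the bottom of src. */
static uint32_t model_shrd32(uint32_t dst, uint32_t src, unsigned count)
{
    count &= 31;
    if (count == 0)
        return dst;
    return (dst >> count) | (src << (32 - count));
}
```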

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: x86 emulator: add the assembler code for three operands
Guillaume Thouvenin [Thu, 4 Dec 2008 13:29:00 +0000 (14:29 +0100)] 
KVM: x86 emulator: add the assembler code for three operands

Add the assembler code for instructions with three operands where one
operand is stored in the ECX register.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: x86 emulator: add a new "implied 1" Src decode type
Guillaume Thouvenin [Thu, 4 Dec 2008 13:27:38 +0000 (14:27 +0100)] 
KVM: x86 emulator: add a new "implied 1" Src decode type

Add the SrcOne operand type for cases where we need to decode an implied '1',
as with the regular shift instructions.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: x86 emulator: add Src2 decode set
Guillaume Thouvenin [Thu, 4 Dec 2008 13:26:42 +0000 (14:26 +0100)] 
KVM: x86 emulator: add Src2 decode set

Instructions like shld have three operands, so we need to add a Src2
decode set. We start with Src2None, Src2CL, Src2ImmByte, and Src2One to
support shld/shrd and we will expand it later.
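
A decode set of this kind is typically a small bit field inside the opcode
descriptor. The sketch below is an assumption about the general shape, with
made-up names and an illustrative bit position chosen above bit 15 so that
it would not fit in a 16-bit descriptor.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative encoding: a 3-bit field at the top of a 32-bit descriptor. */
#define MODEL_SRC2_SHIFT   29
#define MODEL_SRC2_NONE    (0u << MODEL_SRC2_SHIFT)
#define MODEL_SRC2_CL      (1u << MODEL_SRC2_SHIFT)
#define MODEL_SRC2_IMMBYTE (2u << MODEL_SRC2_SHIFT)
#define MODEL_SRC2_ONE     (3u << MODEL_SRC2_SHIFT)
#define MODEL_SRC2_MASK    (7u << MODEL_SRC2_SHIFT)

enum model_src2 { SRC2_NONE, SRC2_CL, SRC2_IMMBYTE, SRC2_ONE, SRC2_UNKNOWN };

/* Extract the second-source operand kind from a descriptor word. */
static enum model_src2 model_decode_src2(uint32_t desc)
{
    switch (desc & MODEL_SRC2_MASK) {
    case MODEL_SRC2_NONE:    return SRC2_NONE;
    case MODEL_SRC2_CL:      return SRC2_CL;       /* count taken from CL */
    case MODEL_SRC2_IMMBYTE: return SRC2_IMMBYTE;  /* count is an imm8 */
    case MODEL_SRC2_ONE:     return SRC2_ONE;      /* implied constant 1 */
    default:                 return SRC2_UNKNOWN;
    }
}
```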

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: x86 emulator: Extend the opcode descriptor
Guillaume Thouvenin [Thu, 4 Dec 2008 13:25:38 +0000 (14:25 +0100)] 
KVM: x86 emulator: Extend the opcode descriptor

Extend the opcode descriptor to 32 bits. This is needed by the
introduction of a new Src2 operand type.

Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: Really remove a slot when a user ask us so
Glauber Costa [Wed, 3 Dec 2008 15:40:51 +0000 (13:40 -0200)] 
KVM: Really remove a slot when a user ask us so

Right now, KVM does not remove a slot when we do a
register ioctl for size 0 (which would be the expected behaviour).

Instead, we only mark it as empty, but keep all bitmaps
and allocated data structures present. It completely
nullifies our chances of reusing that same slot again
for mapping a different piece of memory.

In this patch, we destroy rmaps, and vfree() the
pointers that used to hold the dirty bitmap, rmap
and lpage_info structures.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: ppc: mostly cosmetic updates to the exit timing accounting code
Hollis Blanchard [Tue, 2 Dec 2008 21:51:58 +0000 (15:51 -0600)] 
KVM: ppc: mostly cosmetic updates to the exit timing accounting code

The only significant changes were to kvmppc_exit_timing_write() and
kvmppc_exit_timing_show(), both of which were dramatically simplified.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: ppc: Implement in-kernel exit timing statistics
Hollis Blanchard [Tue, 2 Dec 2008 21:51:57 +0000 (15:51 -0600)] 
KVM: ppc: Implement in-kernel exit timing statistics

Existing KVM statistics are either just counters (kvm_stat) reported for
KVM generally or trace-based approaches like kvm_trace.
For KVM on powerpc we needed to track the timings of the different exit
types. While this could be achieved by parsing data created with a kvm_trace
extension, this adds too much overhead (at least on embedded PowerPC), slowing
down the workloads we wanted to measure.

Therefore this patch adds an in-kernel exit timing statistic to the powerpc kvm
code. This statistic is available per vm&vcpu under the kvm debugfs directory.
As this statistic adds low, but still some, overhead it can be enabled via a
.config entry and should be off by default.

Since this patch touched all powerpc kvm_stat code anyway this code is now
merged and simplified together with the exit timing statistic code (still
working with exit timing disabled in .config).

Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: ppc: save and restore guest mappings on context switch
Hollis Blanchard [Tue, 2 Dec 2008 21:51:56 +0000 (15:51 -0600)] 
KVM: ppc: save and restore guest mappings on context switch

Store shadow TLB entries in memory, but only use it on host context switch
(instead of every guest entry). This improves performance for most workloads on
440 by reducing the guest TLB miss rate.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: ppc: directly insert shadow mappings into the hardware TLB
Hollis Blanchard [Tue, 2 Dec 2008 21:51:55 +0000 (15:51 -0600)] 
KVM: ppc: directly insert shadow mappings into the hardware TLB

Previously, we maintained a per-vcpu shadow TLB and on every entry to the
guest would load this array into the hardware TLB. This consumed 1280 bytes of
memory (64 entries of 16 bytes plus a struct page pointer each), and also
required some assembly to loop over the array on every entry.
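
The 1280-byte figure follows from simple arithmetic, sketched below with an
illustrative helper. It assumes 4-byte pointers, since the 440 is a 32-bit
core.

```c
#include <assert.h>
#include <stddef.h>

/* Memory used by a shadow TLB array: each slot holds the TLB entry itself
 * plus one pointer to the backing page. */
static size_t model_shadow_tlb_bytes(size_t entries, size_t entry_bytes,
                                     size_t ptr_bytes)
{
    assert(entries > 0);
    return entries * (entry_bytes + ptr_bytes);
}
```

With 64 entries of 16 bytes and a 4-byte pointer each, this yields
64 * (16 + 4) = 1280 bytes.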

Instead of saving a copy in memory, we can just store shadow mappings directly
into the hardware TLB, accepting that the host kernel will clobber these as
part of the normal 440 TLB round robin. When we do that we need less than half
the memory, and we have decreased the exit handling time for all guest exits,
at the cost of increased number of TLB misses because the host overwrites some
guest entries.

These savings will be increased on processors with larger TLBs or which
implement intelligent flush instructions like tlbivax (which will avoid the
need to walk arrays in software).

In addition to that and to the code simplification, we have a greater chance of
leaving other host userspace mappings in the TLB, instead of forcing all
subsequent tasks to re-fault all their mappings.

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agopowerpc/44x: declare tlb_44x_index for use in C code
Hollis Blanchard [Tue, 2 Dec 2008 21:51:54 +0000 (15:51 -0600)] 
powerpc/44x: declare tlb_44x_index for use in C code

KVM currently ignores the host's round robin TLB eviction selection, instead
maintaining its own TLB state and its own round robin index. However, by
participating in the normal 44x TLB selection, we can drop the alternate TLB
processing in KVM. This results in a significant performance improvement,
since that processing currently must be done on *every* guest exit.

Accordingly, KVM needs to be able to access and increment tlb_44x_index.
(KVM on 440 cannot be a module, so there is no need to export this symbol.)
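
Participating in a shared round-robin index can be sketched as follows. The
struct and function are hypothetical; the wrap-at-a-high-watermark behaviour
is an assumption about how such an index is usually advanced, with entries
above the watermark treated as pinned and never chosen.

```c
#include <assert.h>

struct model_tlb_rr {
    unsigned index;  /* last slot handed out */
    unsigned hwater; /* highest slot available for replacement */
};

/* Advance the shared index and return the next victim slot. */
static unsigned model_next_victim(struct model_tlb_rr *rr)
{
    unsigned victim = rr->index + 1;

    assert(rr->index <= rr->hwater);
    if (victim > rr->hwater)
        victim = 0; /* wrap around, skipping pinned slots above hwater */
    rr->index = victim;
    return victim;
}
```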

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Acked-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: ppc: support large host pages
Hollis Blanchard [Tue, 2 Dec 2008 21:51:53 +0000 (15:51 -0600)] 
KVM: ppc: support large host pages

KVM on 440 has always been able to handle large guest mappings with 4K host
pages -- we must, since the guest kernel uses 256MB mappings.

This patch makes KVM work when the host has large pages too (tested with 64K).

Signed-off-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: split out kvm_free_assigned_irq()
Mark McLoughlin [Mon, 1 Dec 2008 13:57:49 +0000 (13:57 +0000)] 
KVM: split out kvm_free_assigned_irq()

Split out the logic corresponding to undoing assign_irq() and
clean it up a bit.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: add KVM_USERSPACE_IRQ_SOURCE_ID assertions
Mark McLoughlin [Mon, 1 Dec 2008 13:57:48 +0000 (13:57 +0000)] 
KVM: add KVM_USERSPACE_IRQ_SOURCE_ID assertions

Make sure kvm_request_irq_source_id() never returns
KVM_USERSPACE_IRQ_SOURCE_ID.

Likewise, check that kvm_free_irq_source_id() never accepts
KVM_USERSPACE_IRQ_SOURCE_ID.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
15 years agoKVM: don't free an unallocated irq source id
Mark McLoughlin [Mon, 1 Dec 2008 13:57:47 +0000 (13:57 +0000)] 
KVM: don't free an unallocated irq source id

Set assigned_dev->irq_source_id to -1 so that we can avoid freeing
a source ID which we never allocated.
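
The sentinel approach can be modelled as below. All names are hypothetical,
the 32-entry bitmap is an arbitrary choice for the sketch, and id 0 is
assumed to be the reserved userspace source.

```c
#include <assert.h>

#define MODEL_USERSPACE_SOURCE_ID 0
#define MODEL_NO_SOURCE_ID        (-1)
#define MODEL_MAX_SOURCE_IDS      32

struct model_irq_sources  { unsigned long bitmap; };
struct model_assigned_dev { int irq_source_id; };

static void model_sources_init(struct model_irq_sources *s)
{
    s->bitmap = 1UL << MODEL_USERSPACE_SOURCE_ID; /* reserved, never handed out */
}

/* Allocate the lowest free id, or return the sentinel if none is left. */
static int model_request_source_id(struct model_irq_sources *s)
{
    for (int id = 0; id < MODEL_MAX_SOURCE_IDS; id++) {
        if (!(s->bitmap & (1UL << id))) {
            s->bitmap |= 1UL << id;
            assert(id != MODEL_USERSPACE_SOURCE_ID);
            return id;
        }
    }
    return MODEL_NO_SOURCE_ID;
}

static void model_free_source_id(struct model_irq_sources *s, int id)
{
    assert(id >= 0 && id < MODEL_MAX_SOURCE_IDS);
    assert(id != MODEL_USERSPACE_SOURCE_ID);
    s->bitmap &= ~(1UL << id);
}

/* Teardown: free the id only if one was ever allocated for this device. */
static void model_release_dev(struct model_irq_sources *s,
                              struct model_assigned_dev *dev)
{
    if (dev->irq_source_id != MODEL_NO_SOURCE_ID)
        model_free_source_id(s, dev->irq_source_id);
    dev->irq_source_id = MODEL_NO_SOURCE_ID;
}
```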

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>