Skip to content
  1. Mar 27, 2013
  2. Mar 26, 2013
    • Al Viro's avatar
      Nest rename_lock inside vfsmount_lock · 7ea600b5
      Al Viro authored
      
      
      ... lest we get livelocks between path_is_under() and d_path() and friends.
      
      The thing is, wrt fairness lglocks are more similar to rwsems than to rwlocks;
      it is possible to have thread B spin on attempt to take lock shared while thread
      A is already holding it shared, if B is on lower-numbered CPU than A and there's
      a thread C spinning on attempt to take the same lock exclusive.
      
      As the result, we need consistent ordering between vfsmount_lock (lglock) and
      rename_lock (seq_lock), even though everything that takes both is going to take
      vfsmount_lock only shared.
      
      Spotted-by: default avatarBrad Spengler <spender@grsecurity.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      7ea600b5
  3. Mar 22, 2013
    • Kent Overstreet's avatar
      nfsd: fix bad offset use · e49dbbf3
      Kent Overstreet authored
      
      
      vfs_writev() updates the offset argument - but the code then passes the
      offset to vfs_fsync_range(). Since offset now points to the offset after
      what was just written, this is probably not what was intended
      
      Introduced by face1502 "nfsd: use
      vfs_fsync_range(), not O_SYNC, for stable writes".
      
      Signed-off-by: default avatarKent Overstreet <koverstreet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: default avatarZach Brown <zab@redhat.com>
      Signed-off-by: default avatarJ. Bruce Fields <bfields@redhat.com>
      e49dbbf3
    • Linus Torvalds's avatar
      vfs,proc: guarantee unique inodes in /proc · 51f0885e
      Linus Torvalds authored
      
      
      Dave Jones found another /proc issue with his Trinity tool: thanks to
      the namespace model, we can have multiple /proc dentries that point to
      the same inode, aliasing directories in /proc/<pid>/net/ for example.
      
      This ends up being a total disaster, because it acts like hardlinked
      directories, and causes locking problems.  We rely on the topological
      sort of the inodes pointed to by dentries, and if we have aliased
      directories, that odering becomes unreliable.
      
      In short: don't do this.  Multiple dentries with the same (directory)
      inode is just a bad idea, and the namespace code should never have
      exposed things this way.  But we're kind of stuck with it.
      
      This solves things by just always allocating a new inode during /proc
      dentry lookup, instead of using "iget_locked()" to look up existing
      inodes by superblock and number.  That actually simplies the code a bit,
      at the cost of potentially doing more inode [de]allocations.
      
      That said, the inode lookup wasn't free either (and did a lot of locking
      of inodes), so it is probably not that noticeable.  We could easily keep
      the old lookup model for non-directory entries, but rather than try to
      be excessively clever this just implements the minimal and simplest
      workaround for the problem.
      
      Reported-and-tested-by: default avatarDave Jones <davej@redhat.com>
      Analyzed-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      51f0885e
  4. Mar 21, 2013
  5. Mar 20, 2013
    • Trond Myklebust's avatar
      NFSv4: Fix the string length returned by the idmapper · cf4ab538
      Trond Myklebust authored
      
      
      Functions like nfs_map_uid_to_name() and nfs_map_gid_to_group() are
      expected to return a string without any terminating NUL character.
      Regression introduced by commit 57e62324
      (NFS: Store the legacy idmapper result in the keyring).
      
      Reported-by: default avatarDave Chiluk <dave.chiluk@canonical.com>
      Signed-off-by: default avatarTrond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Bryan Schumaker <bjschuma@netapp.com>
      Cc: stable@vger.kernel.org [>=3.4]
      cf4ab538
    • Theodore Ts'o's avatar
      ext4: fix data=journal fast mount/umount hang · 2b405bfa
      Theodore Ts'o authored
      
      
      In data=journal mode, if we unmount the file system before a
      transaction has a chance to complete, when the journal inode is being
      evicted, we can end up calling into jbd2_log_wait_commit() for the
      last transaction, after the journalling machinery has been shut down.
      
      Arguably we should adjust ext4_should_journal_data() to return FALSE
      for the journal inode, but the only place it matters is
      ext4_evict_inode(), and so to save a bit of CPU time, and to make the
      patch much more obviously correct by inspection(tm), we'll fix it by
      explicitly not trying to waiting for a journal commit when we are
      evicting the journal inode, since it's guaranteed to never succeed in
      this case.
      
      This can be easily replicated via: 
      
           mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb
      
      ------------[ cut here ]------------
      WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd()
      Hardware name: Bochs
      JBD2: bad log_start_commit: 3005630206 3005630206 0 0
      Modules linked in:
      Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020
      Call Trace:
       [<c015c0ef>] warn_slowpath_common+0x68/0x7d
       [<c02b7e7d>] ? __jbd2_log_start_commit+0xba/0xcd
       [<c015c177>] warn_slowpath_fmt+0x2b/0x2f
       [<c02b7e7d>] __jbd2_log_start_commit+0xba/0xcd
       [<c02b8075>] jbd2_log_start_commit+0x24/0x34
       [<c0279ed5>] ext4_evict_inode+0x71/0x2e3
       [<c021f0ec>] evict+0x94/0x135
       [<c021f9aa>] iput+0x10a/0x110
       [<c02b7836>] jbd2_journal_destroy+0x190/0x1ce
       [<c0175284>] ? bit_waitqueue+0x50/0x50
       [<c028d23f>] ext4_put_super+0x52/0x294
       [<c020efe3>] generic_shutdown_super+0x48/0xb4
       [<c020f071>] kill_block_super+0x22/0x60
       [<c020f3e0>] deactivate_locked_super+0x22/0x49
       [<c020f5d6>] deactivate_super+0x30/0x33
       [<c0222795>] mntput_no_expire+0x107/0x10c
       [<c02233a7>] sys_umount+0x2cf/0x2e0
       [<c02233ca>] sys_oldumount+0x12/0x14
       [<c08096b8>] syscall_call+0x7/0xb
      ---[ end trace 6a954cc790501c1f ]---
      jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0
      
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      2b405bfa
    • Theodore Ts'o's avatar
      ext4: fix ext4_evict_inode() racing against workqueue processing code · 1ada47d9
      Theodore Ts'o authored
      
      
      Commit 84c17543 (ext4: move work from io_end to inode) triggered a
      regression when running xfstest #270 when the file system is mounted
      with dioread_nolock.
      
      The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
      this guarantees that last io_end structure has been freed, but it does
      not guarantee that the workqueue structure, which was moved into the
      inode by commit 84c17543, is actually finished.  Once
      ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
      will allow ext4_ioend_wait() to return on CPU #2, at which point the
      evict_inode() codepath can race against the workqueue code on CPU #1
      accessing EXT4_I(inode)->i_unwritten_work to find the next item of
      work to do.
      
      Fix this by calling cancel_work_sync() in ext4_ioend_wait(), which
      will be renamed ext4_ioend_shutdown(), since it is only used by
      ext4_evict_inode().  Also, move the call to ext4_ioend_shutdown()
      until after truncate_inode_pages() and filemap_write_and_wait() are
      called, to make sure all dirty pages have been written back and
      flushed from the page cache first.
      
      BUG: unable to handle kernel NULL pointer dereference at   (null)
      IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e
      *pdpt = 0000000030bc3001 *pde = 0000000000000000 
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in:
      Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs
      EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0
      EIP is at cwq_activate_delayed_work+0x3b/0x7e
      EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
      ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
       DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
      DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      DR6: ffff0ff0 DR7: 00000400
      Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
      Stack:
       f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
       f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
       f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
      Call Trace:
       [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb
       [<c01def1d>] process_one_work+0x5d8/0x637
       [<c040a10b>] ? ext4_end_bio+0x300/0x300
       [<c01e3105>] worker_thread+0x249/0x3ef
       [<c01ea317>] kthread+0xd8/0xeb
       [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb
       [<c023a370>] ? trace_hardirqs_on+0x27/0x37
       [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28
       [<c01ea23f>] ? __init_kthread_worker+0x71/0x71
      Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1
       6c c1 00 31 c9 83
      EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
      CR2: 0000000000000000
      ---[ end trace a1923229da53d8a4 ]---
      
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      1ada47d9
  6. Mar 18, 2013
  7. Mar 16, 2013
    • Liu Bo's avatar
      Btrfs: fix warning of free_extent_map · 3b277594
      Liu Bo authored
      
      
      Users report that an extent map's list is still linked when it's actually
      going to be freed from cache.
      
      The story is that
      
      a) when we're going to drop an extent map and may split this large one into
      smaller ems, and if this large one is flagged as EXTENT_FLAG_LOGGING which means
      that it's on the list to be logged, then the smaller ems split from it will also
      be flagged as EXTENT_FLAG_LOGGING, and this is _not_ expected.
      
      b) we'll keep ems from unlinking the list and freeing when they are flagged with
      EXTENT_FLAG_LOGGING, because the log code holds one reference.
      
      The end result is the warning, but the truth is that we set the flag
      EXTENT_FLAG_LOGGING only during fsync.
      
      So clear flag EXTENT_FLAG_LOGGING for extent maps split from a large one.
      
      Reported-by: default avatarJohannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
      Reported-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: default avatarChris Mason <chris.mason@fusionio.com>
      3b277594
  8. Mar 14, 2013
  9. Mar 13, 2013
  10. Mar 12, 2013
    • Mathieu Desnoyers's avatar
      Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys · 8aec0f5d
      Mathieu Desnoyers authored
      
      
      Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to
      compat_process_vm_rw() shows that the compatibility code requires an
      explicit "access_ok()" check before calling
      compat_rw_copy_check_uvector(). The same difference seems to appear when
      we compare fs/read_write.c:do_readv_writev() to
      fs/compat.c:compat_do_readv_writev().
      
      This subtle difference between the compat and non-compat requirements
      should probably be debated, as it seems to be error-prone. In fact,
      there are two others sites that use this function in the Linux kernel,
      and they both seem to get it wrong:
      
      Now shifting our attention to fs/aio.c, we see that aio_setup_iocb()
      also ends up calling compat_rw_copy_check_uvector() through
      aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to
      be missing. Same situation for
      security/keys/compat.c:compat_keyctl_instantiate_key_iov().
      
      I propose that we add the access_ok() check directly into
      compat_rw_copy_check_uvector(), so callers don't have to worry about it,
      and it therefore makes the compat call code similar to its non-compat
      counterpart. Place the access_ok() check in the same location where
      copy_from_user() can trigger a -EFAULT error in the non-compat code, so
      the ABI behaviors are alike on both compat and non-compat.
      
      While we are here, fix compat_do_readv_writev() so it checks for
      compat_rw_copy_check_uvector() negative return values.
      
      And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error
      handling.
      
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8aec0f5d
    • Lukas Czerner's avatar
      ext4: use s_extent_max_zeroout_kb value as number of kb · 4f42f80a
      Lukas Czerner authored
      
      
      Currently when converting extent to initialized, we have to decide
      whether to zeroout part/all of the uninitialized extent in order to
      avoid extent tree growing rapidly.
      
      The decision is made by comparing the size of the extent with the
      configurable value s_extent_max_zeroout_kb which is in kibibytes units.
      
      However when converting it to number of blocks we currently use it as it
      was in bytes. This is obviously bug and it will result in ext4 _never_
      zeroout extents, but rather always split and convert parts to
      initialized while leaving the rest uninitialized in default setting.
      
      Fix this by using s_extent_max_zeroout_kb as kibibytes.
      
      Signed-off-by: default avatarLukas Czerner <lczerner@redhat.com>
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      4f42f80a
    • Al Viro's avatar
      vfs: fix pipe counter breakage · a930d879
      Al Viro authored
      
      
      If you open a pipe for neither read nor write, the pipe code will not
      add any usage counters to the pipe, causing the 'struct pipe_inode_info"
      to be potentially released early.
      
      That doesn't normally matter, since you cannot actually use the pipe,
      but the pipe release code - particularly fasync handling - still expects
      the actual pipe infrastructure to all be there.  And rather than adding
      NULL pointer checks, let's just disallow this case, the same way we
      already do for the named pipe ("fifo") case.
      
      This is ancient going back to pre-2.4 days, and until trinity, nobody
      naver noticed.
      
      Reported-by: default avatarDave Jones <davej@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a930d879
    • Theodore Ts'o's avatar
      ext4: use atomic64_t for the per-flexbg free_clusters count · 90ba983f
      Theodore Ts'o authored
      A user who was using a 8TB+ file system and with a very large flexbg
      size (> 65536) could cause the atomic_t used in the struct flex_groups
      to overflow.  This was detected by PaX security patchset:
      
      http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551
      
      
      
      This bug was introduced in commit 9f24e420, so it's been around
      since 2.6.30.  :-(
      
      Fix this by using an atomic64_t for struct orlav_stats's
      free_clusters.
      
      Signed-off-by: default avatar"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: default avatarLukas Czerner <lczerner@redhat.com>
      Cc: stable@vger.kernel.org
      90ba983f
  11. Mar 11, 2013
Loading