Skip to content
  1. Jun 06, 2012
  2. Jun 01, 2012
  3. May 31, 2012
  4. May 30, 2012
  5. May 29, 2012
    • Andrea Arcangeli's avatar
      mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition · 26c19178
      Andrea Arcangeli authored
      
      
      When holding the mmap_sem for reading, pmd_offset_map_lock should only
      run on a pmd_t that has been read atomically from the pmdp pointer,
      otherwise we may read only half of it leading to this crash.
      
      PID: 11679  TASK: f06e8000  CPU: 3   COMMAND: "do_race_2_panic"
       #0 [f06a9dd8] crash_kexec at c049b5ec
       #1 [f06a9e2c] oops_end at c083d1c2
       #2 [f06a9e40] no_context at c0433ded
       #3 [f06a9e64] bad_area_nosemaphore at c043401a
       #4 [f06a9e6c] __do_page_fault at c0434493
       #5 [f06a9eec] do_page_fault at c083eb45
       #6 [f06a9f04] error_code (via page_fault) at c083c5d5
          EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP:
          00000000
          DS:  007b     ESI: 9e201000 ES:  007b     EDI: 01fb4700 GS:  00e0
          CS:  0060     EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246
       #7 [f06a9f38] _spin_lock at c083bc14
       #8 [f06a9f44] sys_mincore at c0507b7d
       #9 [f06a9fb0] system_call at c083becd
                               start           len
          EAX: ffffffda  EBX: 9e200000  ECX: 00001000  EDX: 6228537f
          DS:  007b      ESI: 00000000  ES:  007b      EDI: 003d0f00
          SS:  007b      ESP: 62285354  EBP: 62285388  GS:  0033
          CS:  0073      EIP: 00291416  ERR: 000000da  EFLAGS: 00000286
      
      This should be a longstanding bug affecting x86 32bit PAE without THP.
      Only archs with 64bit large pmd_t and 32bit unsigned long should be
      affected.
      
      With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad()
      would partly hide the bug when the pmd transition from none to stable,
      by forcing a re-read of the *pmd in pmd_offset_map_lock, but when THP is
      enabled a new set of problem arises by the fact could then transition
      freely in any of the none, pmd_trans_huge or pmd_trans_stable states.
      So making the barrier in pmd_none_or_trans_huge_or_clear_bad()
      unconditional isn't good idea and it would be a flakey solution.
      
      This should be fully fixed by introducing a pmd_read_atomic that reads
      the pmd in order with THP disabled, or by reading the pmd atomically
      with cmpxchg8b with THP enabled.
      
      Luckily this new race condition only triggers in the places that must
      already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix
      is localized there but this bug is not related to THP.
      
      NOTE: this can trigger on x86 32bit systems with PAE enabled with more
      than 4G of ram, otherwise the high part of the pmd will never risk to be
      truncated because it would be zero at all times, in turn so hiding the
      SMP race.
      
      This bug was discovered and fully debugged by Ulrich, quote:
      
      ----
      [..]
      pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and
      eax.
      
          496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t
          *pmd)
          497 {
          498         /* depend on compiler for an atomic pmd read */
          499         pmd_t pmdval = *pmd;
      
                                      // edi = pmd pointer
      0xc0507a74 <sys_mincore+548>:   mov    0x8(%esp),%edi
      ...
                                      // edx = PTE page table high address
      0xc0507a84 <sys_mincore+564>:   mov    0x4(%edi),%edx
      ...
                                      // eax = PTE page table low address
      0xc0507a8e <sys_mincore+574>:   mov    (%edi),%eax
      
      [..]
      
      Please note that the PMD is not read atomically. These are two "mov"
      instructions where the high order bits of the PMD entry are fetched
      first. Hence, the above machine code is prone to the following race.
      
      -  The PMD entry {high|low} is 0x0000000000000000.
         The "mov" at 0xc0507a84 loads 0x00000000 into edx.
      
      -  A page fault (on another CPU) sneaks in between the two "mov"
         instructions and instantiates the PMD.
      
      -  The PMD entry {high|low} is now 0x00000003fda38067.
         The "mov" at 0xc0507a8e loads 0xfda38067 into eax.
      ----
      
      Reported-by: default avatarUlrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      26c19178
  6. May 26, 2012
    • Linus Torvalds's avatar
      x86: use the new generic strnlen_user() function · 5723aa99
      Linus Torvalds authored
      
      
      This throws away the old x86-specific functions in favor of the generic
      optimized version.
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5723aa99
    • Linus Torvalds's avatar
      word-at-a-time: make the interfaces truly generic · 36126f8f
      Linus Torvalds authored
      
      
      This changes the interfaces in <asm/word-at-a-time.h> to be a bit more
      complicated, but a lot more generic.
      
      In particular, it allows us to really do the operations efficiently on
      both little-endian and big-endian machines, pretty much regardless of
      machine details.  For example, if you can rely on a fast population
      count instruction on your architecture, this will allow you to make your
      optimized <asm/word-at-a-time.h> file with that.
      
      NOTE! The "generic" version in include/asm-generic/word-at-a-time.h is
      not truly generic, it actually only works on big-endian.  Why? Because
      on little-endian the generic algorithms are wasteful, since you can
      inevitably do better. The x86 implementation is an example of that.
      
      (The only truly non-generic part of the asm-generic implementation is
      the "find_zero()" function, and you could make a little-endian version
      of it.  And if the Kbuild infrastructure allowed us to pick a particular
      header file, that would be lovely)
      
      The <asm/word-at-a-time.h> functions are as follows:
      
       - WORD_AT_A_TIME_CONSTANTS: specific constants that the algorithm
         uses.
      
       - has_zero(): take a word, and determine if it has a zero byte in it.
         It gets the word, the pointer to the constant pool, and a pointer to
         an intermediate "data" field it can set.
      
         This is the "quick-and-dirty" zero tester: it's what is run inside
         the hot loops.
      
       - "prep_zero_mask()": take the word, the data that has_zero() produced,
         and the constant pool, and generate an *exact* mask of which byte had
         the first zero.  This is run directly *outside* the loop, and allows
         the "has_zero()" function to answer the "is there a zero byte"
         question without necessarily getting exactly *which* byte is the
         first one to contain a zero.
      
         If you do multiple byte lookups concurrently (eg "hash_name()", which
         looks for both NUL and '/' bytes), after you've done the prep_zero_mask()
         phase, the result of those can be or'ed together to get the "either
         or" case.
      
       - The result from "prep_zero_mask()" can then be fed into "find_zero()"
         (to find the byte offset of the first byte that was zero) or into
         "zero_bytemask()" (to find the bytemask of the bytes preceding the
         zero byte).
      
         The existence of zero_bytemask() is optional, and is not necessary
         for the normal string routines.  But dentry name hashing needs it, so
         if you enable DENTRY_WORD_AT_A_TIME you need to expose it.
      
      This changes the generic strncpy_from_user() function and the dentry
      hashing functions to use these modified word-at-a-time interfaces.  This
      gets us back to the optimized state of the x86 strncpy that we lost in
      the previous commit when moving over to the generic version.
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      36126f8f
    • Linus Torvalds's avatar
      x86: use generic strncpy_from_user routine · 4ae73f2d
      Linus Torvalds authored
      
      
      The generic strncpy_from_user() is not really optimal, since it is
      designed to work on both little-endian and big-endian.  And on
      little-endian you can simplify much of the logic to find the first zero
      byte, since little-endian arithmetic doesn't have to worry about the
      carry bit propagating into earlier bytes (only later bytes, which we
      don't care about).
      
      But I have patches to make the generic routines use the architecture-
      specific <asm/word-at-a-time.h> infrastructure, so that we can regain
      the little-endian optimizations.  But before we do that, switch over to
      the generic routines to make the patches each do just one well-defined
      thing.
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ae73f2d
  7. May 24, 2012
  8. May 23, 2012
  9. May 22, 2012
  10. May 21, 2012
  11. May 18, 2012
  12. May 17, 2012
    • Paul Gortmaker's avatar
      MCA: delete all remaining traces of microchannel bus support. · bb8187d3
      Paul Gortmaker authored
      
      
      Hardware with MCA bus is limited to 386 and 486 class machines
      that are now 20+ years old and typically with less than 32MB
      of memory.  A quick search on the internet, and you see that
      even the MCA hobbyist/enthusiast community has lost interest
      in the early 2000 era and never really even moved ahead from
      the 2.4 kernels to the 2.6 series.
      
      This deletes anything remaining related to CONFIG_MCA from core
      kernel code and from the x86 architecture.  There is no point in
      carrying this any further into the future.
      
      One complication to watch for is inadvertently scooping up
      stuff relating to machine check, since there is overlap in
      the TLA name space (e.g. arch/x86/boot/mca.c).
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: x86@kernel.org
      Acked-by: default avatarIngo Molnar <mingo@elte.hu>
      Acked-by: default avatarH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      bb8187d3
  13. May 16, 2012
    • Suresh Siddha's avatar
      fork: move the real prepare_to_copy() users to arch_dup_task_struct() · 55ccf3fe
      Suresh Siddha authored
      
      
      Historical prepare_to_copy() is mostly a no-op, duplicated for majority of
      the architectures and the rest following the x86 model of flushing the extended
      register state like fpu there.
      
      Remove it and use the arch_dup_task_struct() instead.
      
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarSuresh Siddha <suresh.b.siddha@intel.com>
      Link: http://lkml.kernel.org/r/1336692811-30576-1-git-send-email-suresh.b.siddha@intel.com
      
      
      Acked-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      55ccf3fe
    • H. Peter Anvin's avatar
      x86, realmode: Change EFER to a single u64 field · 638d957b
      H. Peter Anvin authored
      
      
      Change EFER to be a single u64 field instead of two u32 fields; change
      the order to maintain alignment.  Note that on x86-64 cr4 is really
      also a 64-bit quantity, although we can only set the low 32 bits from
      the trampoline code since it is still executing in 32-bit mode at that
      point.
      
      Signed-off-by: default avatarH. Peter Anvin <hpa@linux.intel.com>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
      638d957b
    • Avi Kivity's avatar
      KVM: MMU: Don't use RCU for lockless shadow walking · c142786c
      Avi Kivity authored
      
      
      Using RCU for lockless shadow walking can increase the amount of memory
      in use by the system, since RCU grace periods are unpredictable.  We also
      have an unconditional write to a shared variable (reader_counter), which
      isn't good for scaling.
      
      Replace that with a scheme similar to x86's get_user_pages_fast(): disable
      interrupts during lockless shadow walk to force the freer
      (kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the
      processor with interrupts enabled.
      
      We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent
      kvm_flush_remote_tlbs() from avoiding the IPI.
      
      Signed-off-by: default avatarAvi Kivity <avi@redhat.com>
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      c142786c
  14. May 14, 2012
  15. May 12, 2012
  16. May 09, 2012
  17. May 08, 2012
Loading