  1. Jan 25, 2014
    • introduce __fcheck_files() to fix rcu_dereference_check_fdtable(), kill rcu_my_thread_group_empty() · a8d4b834
      Oleg Nesterov authored
      
      
      rcu_dereference_check_fdtable() looks very wrong:
      
      1. rcu_my_thread_group_empty() was added by 844b9a87 "vfs: fix
         RCU-lockdep false positive due to /proc" but it doesn't really
         fix the problem. A CLONE_THREAD (without CLONE_FILES) task can
         hit the same race with get_files_struct().
      
         And, on the other hand, rcu_my_thread_group_empty() can suppress
         a correct warning if the caller is a CLONE_FILES (without
         CLONE_THREAD) task.
      
      2. The files->count == 1 check is not really right either. Even if
         this files_struct is not shared, it is not safe to access it
         lockless unless the caller is the owner.

         On the other hand, this check is sub-optimal. files->count == 0
         always means it is safe to use it lockless even if
         files != current->files, but put_files_struct() has to take
         rcu_read_lock(). See the next patch.
      
      This patch removes the buggy checks and turns fcheck_files() into
      __fcheck_files(), which uses rcu_dereference_raw(); the "unshared"
      callers, fget_light() and fget_raw_light(), can use it to avoid
      the warning from RCU-lockdep.
      
      fcheck_files() is trivially reimplemented as rcu_lockdep_assert()
      plus __fcheck_files().
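
      With the fix, the pair looks roughly like this (a sketch of the
      fdtable.h helpers, not the verbatim hunk):

        static inline struct file *__fcheck_files(struct files_struct *files,
                                                  unsigned int fd)
        {
                struct fdtable *fdt = rcu_dereference_raw(files->fdt);

                if (fd < fdt->max_fds)
                        return rcu_dereference_raw(fdt->fd[fd]);
                return NULL;
        }

        static inline struct file *fcheck_files(struct files_struct *files,
                                                unsigned int fd)
        {
                rcu_lockdep_assert(rcu_read_lock_held() ||
                                   lockdep_is_held(&files->file_lock),
                                   "suspicious rcu_dereference_check() usage");
                return __fcheck_files(files, fd);
        }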
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. Jan 12, 2014
  3. Dec 21, 2013
    • PM / sleep: Fix memory leak in pm_vt_switch_unregister(). · c6068504
      Masami Ichikawa authored
      
      
      kmemleak reported a memory leak as below.
      
      unreferenced object 0xffff880118f14700 (size 32):
        comm "swapper/0", pid 1, jiffies 4294877401 (age 123.283s)
        hex dump (first 32 bytes):
          00 01 10 00 00 00 ad de 00 02 20 00 00 00 ad de  .......... .....
          00 d4 d2 18 01 88 ff ff 01 00 00 00 00 04 00 00  ................
        backtrace:
          [<ffffffff814edb1e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811889dc>] kmem_cache_alloc_trace+0x1ec/0x260
          [<ffffffff810aba66>] pm_vt_switch_required+0x76/0xb0
          [<ffffffff812f39f5>] register_framebuffer+0x195/0x320
          [<ffffffff8130af18>] efifb_probe+0x718/0x780
          [<ffffffff81391495>] platform_drv_probe+0x45/0xb0
          [<ffffffff8138f407>] driver_probe_device+0x87/0x3a0
          [<ffffffff8138f7f3>] __driver_attach+0x93/0xa0
          [<ffffffff8138d413>] bus_for_each_dev+0x63/0xa0
          [<ffffffff8138ee5e>] driver_attach+0x1e/0x20
          [<ffffffff8138ea40>] bus_add_driver+0x180/0x250
          [<ffffffff8138fe74>] driver_register+0x64/0xf0
          [<ffffffff813913ba>] __platform_driver_register+0x4a/0x50
          [<ffffffff8191e028>] efifb_driver_init+0x12/0x14
          [<ffffffff8100214a>] do_one_initcall+0xfa/0x1b0
          [<ffffffff818e40e0>] kernel_init_freeable+0x17b/0x201
      
      In pm_vt_switch_required(), the "entry" variable is allocated via
      kmalloc(). So pm_vt_switch_unregister() needs to call kfree() when
      the object is deleted from the list.
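
      With the fix applied, the unregister path frees the entry it unlinks;
      roughly (a sketch, assuming the list and mutex names used in
      kernel/power/console.c):

        void pm_vt_switch_unregister(struct device *dev)
        {
                struct pm_vt_switch *tmp;

                mutex_lock(&vt_switch_mutex);
                list_for_each_entry(tmp, &pm_vt_switch_list, head) {
                        if (tmp->dev == dev) {
                                list_del(&tmp->head);
                                kfree(tmp);     /* was leaked before */
                                break;
                        }
                }
                mutex_unlock(&vt_switch_mutex);
        }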
      
      Signed-off-by: Masami Ichikawa <masami256@gmail.com>
      Reviewed-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  4. Dec 20, 2013
  5. Dec 19, 2013
    • libata, freezer: avoid block device removal while system is frozen · 85fbd722
      Tejun Heo authored
      
      
      Freezable kthreads and workqueues are fundamentally problematic in
      that they effectively introduce a big kernel lock widely used in the
      kernel and have already been the culprit of several deadlock
      scenarios.  This is the latest occurrence.
      
      During resume, libata rescans all the ports and revalidates all
      pre-existing devices.  If it determines that a device has gone
      missing, the device is removed from the system which involves
      invalidating block device and flushing bdi while holding driver core
      layer locks.  Unfortunately, this can race with the rest of device
      resume.  Because freezable kthreads and workqueues are thawed after
      device resume is complete and block device removal depends on
      freezable workqueues and kthreads (e.g. bdi_wq, jbd2) to make
      progress, this can lead to deadlock - block device removal can't
      proceed because kthreads are frozen and kthreads can't be thawed
      because device resume is blocked behind block device removal.
      
      839a8e86 ("writeback: replace custom worker pool implementation
      with unbound workqueue") made this particular deadlock scenario more
      visible but the underlying problem has always been there - the
      original forker task and jbd2 are freezable too.  In fact, this is
      highly likely just one of many possible deadlock scenarios given that
      freezer behaves as a big kernel lock and we don't have any debug
      mechanism around it.
      
      I believe the right thing to do is getting rid of freezable kthreads
      and workqueues.  This is something fundamentally broken.  For now,
      implement a funny workaround in libata - just avoid doing block device
      hot[un]plug while the system is frozen.  Kernel engineering at its
      finest.  :(
      
      v2: Add EXPORT_SYMBOL_GPL(pm_freezing) for cases where libata is built
          as a module.
      
      v3: Comment updated and polling interval changed to 10ms as suggested
          by Rafael.
      
      v4: Add #ifdef CONFIG_FREEZER around the hack as pm_freezing is not
          defined when FREEZER is not configured thus breaking build.
          Reported by kbuild test robot.
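
      The shape of the workaround is roughly the following guard at the top
      of libata's hotplug worker (a sketch, not the exact hunk; the 10ms
      requeue follows the v3 note above):

        #ifdef CONFIG_FREEZER
                /*
                 * XXX - UGLY HACK: block device removal can deadlock with
                 * frozen kthreads/workqueues; requeue and retry until the
                 * system is thawed.
                 */
                if (pm_freezing) {
                        schedule_delayed_work(&ap->hotplug_task,
                                              msecs_to_jiffies(10));
                        return;
                }
        #endif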
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Tomaž Šolc <tomaz.solc@tablix.org>
      Reviewed-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=62801
      Link: http://lkml.kernel.org/r/20131213174932.GA27070@htj.dyndns.org
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      Cc: kbuild test robot <fengguang.wu@intel.com>
    • mm: fix TLB flush race between migration, and change_protection_range · 20841405
      Rik van Riel authored
      
      
      There are a few subtle races, between change_protection_range (used by
      mprotect and change_prot_numa) on one side, and NUMA page migration and
      compaction on the other side.
      
      The basic race is that there is a time window between when the PTE gets
      made non-present (PROT_NONE or NUMA), and the TLB is flushed.
      
      During that time, a CPU may continue writing to the page.
      
      This is fine most of the time, however compaction or the NUMA migration
      code may come in, and migrate the page away.
      
      When that happens, the CPU may continue writing, through the cached
      translation, to what is no longer the current memory location of the
      process.
      
      This only affects x86, which has a somewhat optimistic pte_accessible.
      All other architectures appear to be safe, and will either always flush,
      or flush whenever there is a valid mapping, even with no permissions
      (SPARC).
      
      The basic race looks like this:
      
      CPU A			CPU B			CPU C
      
      						load TLB entry
      make entry PTE/PMD_NUMA
      			fault on entry
      						read/write old page
      			start migrating page
      			change PTE/PMD to new page
      						read/write old page [*]
      flush TLB
      						reload TLB from new entry
      						read/write new page
      						lose data
      
      [*] the old page may belong to a new user at this point!
      
      The obvious fix is to flush remote TLB entries, by making sure that
      pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA
      memory may still be accessible if there is a TLB flush pending for
      the mm.
      
      This should fix both NUMA migration and compaction.
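
      On x86 this amounts to something like the following (a sketch;
      mm_tlb_flush_pending() is the helper this patch introduces):

        static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
        {
                if (pte_flags(a) & _PAGE_PRESENT)
                        return true;

                if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
                    mm_tlb_flush_pending(mm))
                        return true;

                return false;
        }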
      
      [mgorman@suse.de: fix build]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sched: numa: skip inaccessible VMAs · 3c67f474
      Mel Gorman authored
      
      
      Inaccessible VMAs should not be trapping NUMA hint faults. Skip them.
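
      The added check is roughly the following (a sketch; the hunk lives in
      task_numa_work()):

        /*
         * Skip VMAs the task cannot access; there is no point in
         * trapping NUMA hint faults on them.
         */
        if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
                continue;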
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kexec: migrate to reboot cpu · c97102ba
      Vivek Goyal authored
      
      
      Commit 1b3a5d02 ("reboot: move arch/x86 reboot= handling to generic
      kernel") moved reboot= handling to generic code.  In the process it
      also removed the code in native_machine_shutdown() which moved the
      reboot process to reboot_cpu/cpu0.
      
      I guess the thought must have been that all reboot paths call
      migrate_to_reboot_cpu(), so we don't need this special handling.  But
      the kexec reboot path (kernel_kexec()) does not call
      migrate_to_reboot_cpu(), so the above change broke kexec.  Now reboot
      can happen on a non-boot cpu, and when INIT is sent in the second
      kernel to bring up the BP, it brings down the machine.
      
      So start calling migrate_to_reboot_cpu() in kexec reboot path to avoid
      this problem.
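
      A sketch of the reboot branch of kernel_kexec() after the fix (not the
      verbatim hunk):

        kernel_restart_prepare(NULL);
        migrate_to_reboot_cpu();        /* mirror what kernel_restart() does */
        printk(KERN_EMERG "Starting new kernel\n");
        machine_shutdown();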
      
      Bisected by WANG Chao.
      
      Reported-by: Matthew Whitehead <mwhitehe@redhat.com>
      Reported-by: Dave Young <dyoung@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Tested-by: Baoquan He <bhe@redhat.com>
      Tested-by: WANG Chao <chaowang@redhat.com>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. Dec 17, 2013
    • sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities · 757dfcaa
      Kirill Tkhai authored
      
      
      This patch touches the RT group scheduling case.
      
      Functions inc_rt_prio_smp() and dec_rt_prio_smp() change the (global)
      rq's priority, while the rt_rq passed to them may not be the top-level
      rt_rq.  This is wrong, because changing the priority at a child level
      does not guarantee that it is the highest over the whole rq. So this
      leak makes RT balancing unusable.
      
      A short example: the task with the highest priority among all of the
      rq's RT tasks (no other task has the same priority) wakes up on a
      throttled rt_rq.  The rq's cpupri is set to the task's priority
      equivalent, but the real rq->rt.highest_prio.curr is lower.
      
      The patch below fixes the problem.
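
      The fix is an early return in the SMP helpers when the passed rt_rq is
      not the top-level one; a sketch for the inc side:

        static void
        inc_rt_prio_smp(struct rt_rq *rt_rq, int prio, int prev_prio)
        {
                struct rq *rq = rq_of_rt_rq(rt_rq);

        #ifdef CONFIG_RT_GROUP_SCHED
                /*
                 * Change rq's cpupri only if rt_rq is the top queue.
                 */
                if (&rq->rt != rt_rq)
                        return;
        #endif
                if (rq->online && prio < prev_prio)
                        cpupri_set(&rq->rd->cpupri, rq->cpu, prio);
        }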
      
      Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      CC: Steven Rostedt <rostedt@goodmis.org>
      CC: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ru
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Assign correct scheduling domain to 'sd_llc' · 5d4cf996
      Mel Gorman authored
      
      
      Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
      dereference on sd_busy but the fix also altered what scheduling domain it
      used for the 'sd_llc' percpu variable.
      
      One impact of this is that a task selecting a runqueue may consider
      idle CPUs that are not cache siblings as candidates for running.
      Tasks are then running on CPUs that are not cache hot.
      
      This was found through bisection, where ebizzy threads were not seeing
      equal performance and it looked like a scheduling fairness issue. This
      patch mitigates but does not completely fix the problem on all
      machines tested, implying there may be an additional bug or a common
      root cause. Here is the average range of performance seen by
      individual ebizzy threads. It was tested on top of candidate patches
      related to x86 TLB range flushing.
      
      	4-core machine
      			    3.13.0-rc3            3.13.0-rc3
      			       vanilla            fixsd-v3r3
      	Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
      	Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
      	Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
      	Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
      	Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
      	Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
      	Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)
      
      	8-core machine
      	Mean   1         0.00 (  0.00%)        0.00 (  0.00%)
      	Mean   2         0.40 (  0.00%)        0.21 ( 47.50%)
      	Mean   3        23.73 (  0.00%)        0.89 ( 96.25%)
      	Mean   4        12.79 (  0.00%)        1.04 ( 91.87%)
      	Mean   5        13.08 (  0.00%)        2.42 ( 81.50%)
      	Mean   6        23.21 (  0.00%)       69.46 (-199.27%)
      	Mean   7        15.85 (  0.00%)      101.72 (-541.77%)
      	Mean   8       109.37 (  0.00%)       19.13 ( 82.51%)
      	Mean   12      124.84 (  0.00%)       28.62 ( 77.07%)
      	Mean   16      113.50 (  0.00%)       24.16 ( 78.71%)
      
      It's eliminated for one machine and reduced for another.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: H Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.de
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Disable all pmus on unthrottling and rescheduling · 44377277
      Alexander Shishkin authored
      
      
      Currently, only one PMU in a context gets disabled during unthrottling
      and event_sched_{out,in}(); however, events in one context may belong
      to different pmus, which results in PMUs being reprogrammed while they
      are still enabled.
      
      This means that mixed PMU use [which is rare in itself] resulted in
      potentially completely unreliable results: corrupted events, bogus
      results, etc.
      
      This patch temporarily disables PMUs that correspond to
      each event in the context while these events are being modified.
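
      Conceptually, the change brackets each event's reprogramming with its
      own PMU's disable/enable instead of disabling only ctx->pmu; a rough
      sketch, not the exact hunk:

        list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
                if (event->state != PERF_EVENT_STATE_ACTIVE)
                        continue;

                perf_pmu_disable(event->pmu);   /* may differ from ctx->pmu */

                /* unthrottle and/or reprogram the sampling period here */

                perf_pmu_enable(event->pmu);
        }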
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Link: http://lkml.kernel.org/r/1387196256-8030-1-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cgroup: don't recycle cgroup id until all csses' have been destroyed · c1a71504
      Li Zefan authored
      
      
      Hugh reported this bug:
      
      > CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
      >
      > mkdir -p /tmp/tmpfs /tmp/memcg
      > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
      > mount -t cgroup -o memory memcg /tmp/memcg
      > mkdir /tmp/memcg/old
      > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
      > echo $$ >/tmp/memcg/old/tasks
      > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
      > echo $$ >/tmp/memcg/tasks
      > rmdir /tmp/memcg/old
      > sleep 1	# let rmdir work complete
      > mkdir /tmp/memcg/new
      > umount /tmp/tmpfs
      > dmesg | grep WARNING
      > rmdir /tmp/memcg/new
      > umount /tmp/memcg
      >
      > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
      >                            res_counter_uncharge_locked+0x1f/0x2f()
      >
      > Breakage comes from 34c00c31 ("memcg: convert to use cgroup id").
      >
      > The lifetime of a cgroup id is different from the lifetime of the
      > css id it replaced: memsw's css_get()s do nothing to hold on to the
      > old cgroup id, it soon gets recycled to a new cgroup, which then
      > mysteriously inherits the old's swap, without any charge for it.
      
      Instead of removing the cgroup id right after all the csses have been
      offlined, we should do that after the csses have been destroyed.
      
      To make sure an invalid css pointer won't be returned after the css
      is destroyed, make sure css_from_id() returns NULL in this case.
      
      tj: Updated comment to note planned changes for cgrp->id.
      
      Reported-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  7. Dec 16, 2013
    • ftrace: Initialize the ftrace profiler for each possible cpu · c4602c1c
      Miao Xie authored
      Ftrace currently initializes only the online CPUs. This implementation has
      two problems:
      - If we online a CPU after we enable the function profile, and then run the
        test, we will lose the trace information on that CPU.
        Steps to reproduce:
        # echo 0 > /sys/devices/system/cpu/cpu1/online
        # cd <debugfs>/tracing/
        # echo <some function name> >> set_ftrace_filter
        # echo 1 > function_profile_enabled
        # echo 1 > /sys/devices/system/cpu/cpu1/online
        # run test
      - If we offline a CPU before we enable the function profile, we will not
        clear the trace information when we enable the function profile. This
        will confuse users.
        Steps to reproduce:
        # cd <debugfs>/tracing/
        # echo <some function name> >> set_ftrace_filter
        # echo 1 > function_profile_enabled
        # run test
        # cat trace_stat/function*
        # echo 0 > /sys/devices/system/cpu/cpu1/online
        # echo 0 > function_profile_enabled
        # echo 1 > function_profile_enabled
        # cat trace_stat/function*
        # run test
        # cat trace_stat/function*
      
      So it is better to initialize the ftrace profiler for each possible cpu
      every time we enable the function profile, instead of just the online ones.
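
      The core of the fix is iterating possible CPUs instead of online ones
      in ftrace_profile_init(); roughly:

        int ftrace_profile_init(void)
        {
                int cpu;
                int ret = 0;

                for_each_possible_cpu(cpu) {    /* was: for_each_online_cpu() */
                        ret = ftrace_profile_init_cpu(cpu);
                        if (ret)
                                break;
                }

                return ret;
        }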
      
      Link: http://lkml.kernel.org/r/1387178401-10619-1-git-send-email-miaox@cn.fujitsu.com
      
      
      
      Cc: stable@vger.kernel.org # 2.6.31+
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  8. Dec 13, 2013
    • KEYS: fix uninitialized persistent_keyring_register_sem · 6bd364d8
      Xiao Guangrong authored
      
      
      We ran into this bug:
      [ 2736.063245] Unable to handle kernel paging request for data at address 0x00000000
      [ 2736.063293] Faulting instruction address: 0xc00000000037efb0
      [ 2736.063300] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 2736.063303] SMP NR_CPUS=2048 NUMA pSeries
      [ 2736.063310] Modules linked in: sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT iptable_nat nf_nat_ipv4 iptable_mangle iptable_security iptable_raw ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ebtable_filter ebtables ip6table_filter iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6_tables ibmveth pseries_rng nx_crypto nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc xfs libcrc32c dm_service_time sd_mod crc_t10dif crct10dif_common ibmvfc scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_multipath dm_mod
      [ 2736.063383] CPU: 1 PID: 7128 Comm: ssh Not tainted 3.10.0-48.el7.ppc64 #1
      [ 2736.063389] task: c000000131930120 ti: c0000001319a0000 task.ti: c0000001319a0000
      [ 2736.063394] NIP: c00000000037efb0 LR: c0000000006c40f8 CTR: 0000000000000000
      [ 2736.063399] REGS: c0000001319a3870 TRAP: 0300   Not tainted  (3.10.0-48.el7.ppc64)
      [ 2736.063403] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 28824242  XER: 20000000
      [ 2736.063415] SOFTE: 0
      [ 2736.063418] CFAR: c00000000000908c
      [ 2736.063421] DAR: 0000000000000000, DSISR: 40000000
      [ 2736.063425]
      GPR00: c0000000006c40f8 c0000001319a3af0 c000000001074788 c0000001319a3bf0
      GPR04: 0000000000000000 0000000000000000 0000000000000020 000000000000000a
      GPR08: fffffffe00000002 00000000ffff0000 0000000080000001 c000000000924888
      GPR12: 0000000028824248 c000000007e00400 00001fffffa0f998 0000000000000000
      GPR16: 0000000000000022 00001fffffa0f998 0000010022e92470 0000000000000000
      GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      GPR24: 0000000000000000 c000000000f4a828 00003ffffe527108 0000000000000000
      GPR28: c000000000f4a730 c000000000f4a828 0000000000000000 c0000001319a3bf0
      [ 2736.063498] NIP [c00000000037efb0] .__list_add+0x30/0x110
      [ 2736.063504] LR [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
      [ 2736.063508] PACATMSCRATCH [800000000280f032]
      [ 2736.063511] Call Trace:
      [ 2736.063516] [c0000001319a3af0] [c0000001319a3b80] 0xc0000001319a3b80 (unreliable)
      [ 2736.063523] [c0000001319a3b80] [c0000000006c40f8] .rwsem_down_write_failed+0x78/0x264
      [ 2736.063530] [c0000001319a3c50] [c0000000006c1bb0] .down_write+0x70/0x78
      [ 2736.063536] [c0000001319a3cd0] [c0000000002e5ffc] .keyctl_get_persistent+0x20c/0x320
      [ 2736.063542] [c0000001319a3dc0] [c0000000002e2388] .SyS_keyctl+0x238/0x260
      [ 2736.063548] [c0000001319a3e30] [c000000000009e7c] syscall_exit+0x0/0x7c
      [ 2736.063553] Instruction dump:
      [ 2736.063556] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7cbd2b78 7c9e2378 7c7f1b78 f8010010
      [ 2736.063566] f821ff71 e8a50008 7fa52040 40de00c0 <e8be0000> 7fbd2840 40de0094 7fbff040
      [ 2736.063579] ---[ end trace 2708241785538296 ]---
      
      It's caused by uninitialized persistent_keyring_register_sem.
      
      The bug was introduced by commit f36f8c75; there are two typos in that
      commit: CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS
      and krb_cache_register_sem should be persistent_keyring_register_sem.
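
      The fix is just spelling both correctly, so that the semaphore that
      gets declared (and therefore initialized) is the one actually taken;
      a sketch of the corrected declaration:

        #ifdef CONFIG_PERSISTENT_KEYRINGS       /* was: CONFIG_KEYS_KERBEROS_CACHE */
        static DECLARE_RWSEM(persistent_keyring_register_sem);
        /* was: krb_cache_register_sem */
        #endif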
      
      Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
    • KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y · f46a3cbb
      Kirill Tkhai authored
      
      
      Always remove generated SYSTEM_TRUSTED_KEYRING files while doing make mrproper.
      
      Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: David Howells <dhowells@redhat.com>
    • X.509: Fix certificate gathering · d7ec435f
      David Howells authored
      
      
      Fix the gathering of certificates from both the source tree and the build tree
      to correctly calculate the pathnames of all the certificates.
      
      The problem was that if the default generated cert, signing_key.x509,
      didn't exist then it would not have a path attached, whereas if it
      did exist, it would have a path attached.
      
      This means that the contents of kernel/.x509.list would change between the
      first compilation in a directory and the second.  After the second it would
      remain stable because the signing_key.x509 file exists.
      
      The consequence was that the kernel would get relinked unconditionally on the
      second recompilation.  The second recompilation would also show something like
      this:
      
         X.509 certificate list changed
           CERTS   kernel/x509_certificate_list
           - Including cert /home/torvalds/v2.6/linux/signing_key.x509
           AS      kernel/system_certificates.o
           LD      kernel/built-in.o
      
      which is why the relink would happen.
      
      
      Unfortunately, it isn't a simple matter of just sticking a path on the front
      of the filename of the certificate in the build directory as make can't then
      work out how to build it.
      
      So the path has to be prepended to the name for sorting and duplicate
      elimination and then removed for the make rule if it is in the build tree.
      
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
  9. Dec 12, 2013
    • futex: move user address verification up to common code · 5cdec2d8
      Linus Torvalds authored
      
      
      When debugging the read-only hugepage case, I was confused by the fact
      that get_futex_key() did an access_ok() only for the non-shared futex
      case, since the user address checking really isn't in any way specific
      to the private key handling.
      
      Now, it turns out that the shared key handling does effectively do the
      equivalent checks inside get_user_pages_fast() (it doesn't actually
      check the address range on x86, but does check the page protections for
      being a user page).  So it wasn't actually a bug, but the fact that we
      treat the address differently for private and shared futexes threw me
      for a loop.
      
      Just move the check up, so that it gets done for both cases.  Also, use
      the 'rw' parameter for the type, even if it doesn't actually matter any
      more (it's a historical artifact of the old racy i386 "page faults from
      kernel space don't check write protections").
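
      The hoisted check, above the private/shared split in get_futex_key(),
      is roughly:

        /* verify the user address once, for private and shared alike */
        if (unlikely(!access_ok(rw, uaddr, sizeof(u32))))
                return -EFAULT;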
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • futex: fix handling of read-only-mapped hugepages · f12d5bfc
      Linus Torvalds authored
      
      
      The hugepage code had the exact same bug that regular pages had in
      commit 7485d0d3 ("futexes: Remove rw parameter from
      get_futex_key()").
      
      The regular page case was fixed by commit 9ea71503 ("futex: Fix
      regression with read only mappings"), but the transparent hugepage
      case (added in a5b338f2: "thp: update futex compound knowledge")
      remained broken.
      
      Found by Dave Jones and his trinity tool.
      
      Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
      Cc: stable@kernel.org # v2.6.38+
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. Dec 11, 2013
    • sched/fair: Rework sched_fair time accounting · 9dbdb155
      Peter Zijlstra authored
      
      
      Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
      results in him occasionally seeing time going backwards - which
      crashes the scheduler ...
      
      Most of our time accounting can actually handle that except the most
      common one; the tick time update of sched_fair.
      
      There is a further problem with that code; previously we assumed that
      because we get a tick every TICK_NSEC our time delta could never
      exceed 32bits and math was simpler.
      
      However, ever since Frederic managed to get NO_HZ_FULL merged, this is
      no longer the case, since now a task can run for a long time indeed
      without getting a tick. It only takes about ~4.2 seconds to overflow
      our u32 in nanoseconds.
      
      This means we not only need to better deal with time going backwards;
      but also means we need to be able to deal with large deltas.
      
      This patch reworks the entire code and uses mul_u64_u32_shr() as
      proposed by Andy a long while ago.
      
      We express our virtual time scale factor in a u32 multiplier and shift
      right and the 32bit mul_u64_u32_shr() implementation reduces to a
      single 32x32->64 multiply if the time delta is still short (common
      case).
      
      For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.
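
      The generic 32-bit fallback from this series reduces to a single
      32x32->64 multiply when the delta still fits in 32 bits; roughly:

        #ifndef mul_u64_u32_shr
        static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
        {
                u32 ah, al;
                u64 ret;

                al = a;
                ah = a >> 32;

                ret = ((u64)al * mul) >> shift;     /* common, short-delta case */
                if (ah)
                        ret += ((u64)ah * mul) << (32 - shift);

                return ret;
        }
        #endif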
      
      Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: fweisbec@gmail.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Initialize power_orig for overlapping groups · 8e8339a3
      Peter Zijlstra authored
      
      
      Yinghai reported that he saw a /0 in sg_capacity on his EX parts.
      Make sure to always initialize power_orig now that we actually use it.
      
      Ideally build_sched_domains() -> init_sched_groups_power() would also
      initialize this; but for some yet unexplained reason some setups seem
      to miss updates there.
      
      Reported-by: Yinghai Lu <yinghai@kernel.org>
      Tested-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. Dec 10, 2013
    • KEYS: correct alignment of system_certificate_list content in assembly file · 62226983
      Hendrik Brueckner authored
      
      
      Apart from data-type specific alignment constraints, there are also
      architecture-specific alignment requirements.
      For example, on s390 symbols must be on even addresses implying a 2-byte
      alignment.  If the system_certificate_list_end symbol is on an odd address
      and if this address is loaded, the least-significant bit is ignored.  As a
      result, the load_system_certificate_list() fails to load the certificates
      because of a wrong certificate length calculation.
      
      To be safe, align system_certificate_list on an 8-byte boundary.  Also improve
      the length calculation of the system_certificate_list content.  Introduce a
      system_certificate_list_size (8-byte aligned because of unsigned long) variable
      that stores the length.  Let the linker calculate this size by introducing
      a start and end label for the certificate content.
      
      Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
    • Ignore generated file kernel/x509_certificate_list · 7cfe5b33
      Rusty Russell authored
      
      
      $ git status
      # On branch pending-rebases
      # Untracked files:
      #   (use "git add <file>..." to include in what will be committed)
      #
      #	kernel/x509_certificate_list
      nothing added to commit but untracked files present (use "git add" to track)
      $
      
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: David Howells <dhowells@redhat.com>
  12. Dec 07, 2013
  13. Dec 06, 2013
    • cgroup: fix cgroup_create() error handling path · 266ccd50
      Tejun Heo authored
      
      
      ae7f164a ("cgroup: move cgroup->subsys[] assignment to
      online_css()") moved cgroup->subsys[] assignments later in
      cgroup_create() but didn't update the error handling path
      accordingly, leading to the following oops and leaking later css's
      after an online_css() failure.  The oops is from the cgroup
      destruction path being invoked on the partially constructed cgroup,
      which is not ready to handle empty slots in the cgrp->subsys[] array.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        PGD a780a067 PUD aadbe067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
        Hardware name:
        task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
        RIP: 0010:[<ffffffff810eeaa8>]  [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        RSP: 0018:ffff8800a781bd98  EFLAGS: 00010282
        RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
        RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
        RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
        R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
        R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
        FS:  00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
        Stack:
         ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
         ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
         ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
        Call Trace:
         [<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
         [<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
         [<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
         [<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
         [<ffffffff8169e569>] system_call_fastpath+0x16/0x1b
      
      This patch moves reference bumping inside the online_css() loop,
      clears css_ar[] as css's are brought online successfully, and updates
      the err_destroy path so that either a css is fully online and
      destroyed by cgroup_destroy_locked() or the error path frees it.
      This creates duplicate css-free logic in the error path but it will
      be cleaned up soon.
      
      v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
          invoked with a cgroup which doesn't have all css's populated.
          Update cgroup_destroy_locked() so that it skips NULL css's.
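
      A sketch of the tolerant destruction loop (per the v2 note; not the
      verbatim hunk):

        for_each_root_subsys(cgrp->root, ss) {
                struct cgroup_subsys_state *css = cgroup_css(cgrp, ss);

                if (css)        /* may be NULL on the create-error path */
                        kill_css(css);
        }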
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Reported-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: stable@vger.kernel.org # v3.12+
  14. Dec 05, 2013
  15. Nov 29, 2013
  16. Nov 28, 2013
  17. Nov 27, 2013
    • cgroup: fix cgroup_subsys_state leak for seq_files · e605b365
      Tejun Heo authored
      
      
      If a cgroup file implements either read_map() or read_seq_string(),
      such a file is served using seq_file by overriding file->f_op to
      cgroup_seqfile_operations, which also overrides the release method to
      single_release() from cgroup_file_release().
      
      Because cgroup_file_open() didn't use to acquire any resources, this
      used to be fine, but since f7d58818 ("cgroup: pin
      cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
      pins the css (cgroup_subsys_state) which is put by
      cgroup_file_release().  The patch forgot to update the release path
      for seq_files and each open/release cycle leaks a css reference.
      
      Fix it by updating cgroup_file_release() to also handle seq_files and
      using it for seq_file release path too.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.12
    • cpuset: Fix memory allocator deadlock · 0fc0287c
      Peter Zijlstra authored
      
      
      Juri hit the below lockdep report:
      
      [    4.303391] ======================================================
      [    4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
      [    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
      [    4.303395] ------------------------------------------------------
      [    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      [    4.303399]  (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290
      [    4.303417]
      [    4.303417] and this task is already holding:
      [    4.303418]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100
      [    4.303431] which would create a new lock dependency:
      [    4.303432]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
      [    4.303436]
      
      [    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
      [    4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
      [    4.303922]    HARDIRQ-ON-W at:
      [    4.303923]                     [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0
      [    4.303926]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303929]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303931]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303933]    SOFTIRQ-ON-W at:
      [    4.303933]                     [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0
      [    4.303935]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303940]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303955]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303959]    INITIAL USE at:
      [    4.303960]                    [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0
      [    4.303963]                    [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303966]                    [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303969]                    [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303972]  }
      
      Which reports that we take mems_allowed_seq with interrupts enabled. A
      little digging found that this can only be from
      cpuset_change_task_nodemask().
      
      This is an actual deadlock because an interrupt doing an allocation will
      hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
      forever waiting for the write side to complete.
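
      The fix is to disable interrupts across the write side, so an
      allocating interrupt can never observe a write in progress; roughly,
      in cpuset_change_task_nodemask():

        task_lock(tsk);
        local_irq_disable();
        write_seqcount_begin(&tsk->mems_allowed_seq);
        tsk->mems_allowed = *newmems;
        write_seqcount_end(&tsk->mems_allowed_seq);
        local_irq_enable();
        task_unlock(tsk);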
      
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reported-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Juri Lelli <juri.lelli@gmail.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
    • sched: Expose preempt_schedule_irq() · 32e475d7
      Thomas Gleixner authored
      
      
      Tony reported that aa0d5326 ("ia64: Use preempt_schedule_irq")
      broke PREEMPT=n builds on ia64.
      
      Ok, wrapped my brain around it. I tripped over the magic asm foo which
      has a single need_resched check and schedule point for both sys call
      return and interrupt return.
      
      So you need preempt_schedule_irq() for kernel preemption from
      interrupt return, while on a normal syscall preemption a schedule
      would be sufficient. But using preempt_schedule_irq() is not harmful
      here in any way. It just sets the PREEMPT_ACTIVE bit also in cases
      where it would not be required.

      Even on PREEMPT=n kernels setting the PREEMPT_ACTIVE bit is completely
      harmless. So instead of having an extra function, moving the existing
      one out of the ifdef PREEMPT looks like the sanest thing to do.
      
      It would also allow getting rid of various other sti/schedule/cli asm
      magic in other archs.
      
      Reported-and-Tested-by: Tony Luck <tony.luck@gmail.com>
      Fixes: aa0d5326 ("ia64: Use preempt_schedule_irq")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      [slightly edited Changelog]
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311211230030.30673@ionos.tec.linutronix.de
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) · 1f7f4dde
      Eric W. Biederman authored
      
      
      Serge Hallyn <serge.hallyn@ubuntu.com> writes:
      > Hi Oleg,
      >
      > commit 40a0d32d :
      > "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks"
      > breaks lxc-attach in 3.12.  That code forks a child which does
      > setns() and then does a clone(CLONE_PARENT).  That way the
      > grandchild can be in the right namespaces (which the child was
      > not) and be a child of the original task, which is the monitor.
      >
      > lxc-attach in 3.11 was working fine with no side effects that I
      > could see.  Is there a real danger in allowing CLONE_PARENT
      > when current->nsproxy->pidns_for_children is not our pidns,
      > or was this done out of an "over-abundance of caution"?  Can we
      > safely revert that new extra check?
      
      The two fundamental things I know we can not allow are:
      - A shared signal queue aka CLONE_THREAD.  Because we compute the pid
        and uid of the signal when we place it in the queue.
      
      - Changing the pid, and by extension the pid_namespace, of an
        existing process.
      
      From a parent's perspective there is nothing special about the pid
      namespace to deny CLONE_PARENT, because the parent simply won't know
      or care.

      From the child's perspective all that is really special is the shared
      signal queue.
      
      User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks
      in different pid namespaces is almost certainly going to break because
      it is complicated.  But shared signal handlers can look at per thread
      information to know which pid namespace a process is in, so I don't know
      of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads
      at the kernel level.  It would be absolutely stupid to implement but
      that is a different thing.
      
      So hmm.
      
      Because it can do no harm, and because it is a regression let's remove
      the CLONE_PARENT check and send it stable.
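
      The resulting check in copy_process() boils down to (a sketch):

        /*
         * Only the shared signal queue case is fundamentally incompatible
         * with a foreign pid namespace.
         */
        if (clone_flags & CLONE_THREAD) {       /* was: CLONE_THREAD|CLONE_PARENT */
                if (task_active_pid_ns(current) !=
                    current->nsproxy->pid_ns_for_children)
                        return ERR_PTR(-EINVAL);
        }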
      
      Cc: stable@vger.kernel.org
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  18. Nov 26, 2013
    • ftrace: Fix function graph with loading of modules · 8a56d776
      Steven Rostedt (Red Hat) authored
      
      
      Commit 8c4f3c3f "ftrace: Check module functions being traced on reload"
      fixed module loading and unloading with respect to function tracing, but
      it missed the function graph tracer. If you perform the following
      
       # cd /sys/kernel/debug/tracing
       # echo function_graph > current_tracer
       # modprobe nfsd
       # echo nop > current_tracer
      
      You'll get the following oops message:
      
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 2910 at /linux.git/kernel/trace/ftrace.c:1640 __ftrace_hash_rec_update.part.35+0x168/0x1b9()
       Modules linked in: nfsd exportfs nfs_acl lockd ipt_MASQUERADE sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt
       CPU: 2 PID: 2910 Comm: bash Not tainted 3.13.0-rc1-test #7
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
        0000000000000668 ffff8800787efcf8 ffffffff814fe193 ffff88007d500000
        0000000000000000 ffff8800787efd38 ffffffff8103b80a 0000000000000668
        ffffffff810b2b9a ffffffff81a48370 0000000000000001 ffff880037aea000
       Call Trace:
        [<ffffffff814fe193>] dump_stack+0x4f/0x7c
        [<ffffffff8103b80a>] warn_slowpath_common+0x81/0x9b
        [<ffffffff810b2b9a>] ? __ftrace_hash_rec_update.part.35+0x168/0x1b9
        [<ffffffff8103b83e>] warn_slowpath_null+0x1a/0x1c
        [<ffffffff810b2b9a>] __ftrace_hash_rec_update.part.35+0x168/0x1b9
        [<ffffffff81502f89>] ? __mutex_lock_slowpath+0x364/0x364
        [<ffffffff810b2cc2>] ftrace_shutdown+0xd7/0x12b
        [<ffffffff810b47f0>] unregister_ftrace_graph+0x49/0x78
        [<ffffffff810c4b30>] graph_trace_reset+0xe/0x10
        [<ffffffff810bf393>] tracing_set_tracer+0xa7/0x26a
        [<ffffffff810bf5e1>] tracing_set_trace_write+0x8b/0xbd
        [<ffffffff810c501c>] ? ftrace_return_to_handler+0xb2/0xde
        [<ffffffff811240a8>] ? __sb_end_write+0x5e/0x5e
        [<ffffffff81122aed>] vfs_write+0xab/0xf6
        [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
        [<ffffffff81122dbd>] SyS_write+0x59/0x82
        [<ffffffff8150a185>] ftrace_graph_caller+0x85/0x85
        [<ffffffff8150a2d2>] system_call_fastpath+0x16/0x1b
       ---[ end trace 940358030751eafb ]---
      
      The above-mentioned commit didn't go far enough. Well, it covered the
      function tracer by adding checks in __register_ftrace_function(). The
      problem is that the function graph tracer circumvents that (for a
      slight efficiency gain when the function graph tracer is running with
      a function tracer; the gain was not worth this).
      
      The problem came with ftrace_startup() which should always be called after
      __register_ftrace_function(), if you want this bug to be completely fixed.
      
      Anyway, this solution moves __register_ftrace_function() inside of
      ftrace_startup() and removes the need to call them both.
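
      After the change, registration happens first thing inside
      ftrace_startup(), roughly:

        ret = __register_ftrace_function(ops);  /* was a separate call site */
        if (ret)
                return ret;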
      
      Reported-by: Dave Wysochanski <dwysocha@redhat.com>
      Fixes: ed926f9b ("ftrace: Use counters to enable functions to trace")
      Cc: stable@vger.kernel.org # 3.0+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  19. Nov 25, 2013
  20. Nov 22, 2013
    • workqueue: fix pool ID allocation leakage and remove BUILD_BUG_ON() in init_workqueues · 4e8b22bd
      Li Bin authored
      
      
      When a work item starts execution, the high bits of the work's data
      contain the pool ID. It can represent a maximum of WORK_OFFQ_POOL_NONE.
      The pool ID is assigned WORK_OFFQ_POOL_NONE when the work is being
      initialized, indicating that no pool is associated, and
      get_work_pool() uses it to look up the associated pool. So if
      worker_pool_assign_id() assigns an ID greater than or equal to
      WORK_OFFQ_POOL_NONE to a pool, it triggers leakage, and it may break
      the non-reentrance guarantee.
      
      This patch fixes the issue by making worker_pool_assign_id() call
      idr_alloc() with the @end param set to WORK_OFFQ_POOL_NONE.
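
      A sketch of worker_pool_assign_id() after the change:

        static int worker_pool_assign_id(struct worker_pool *pool)
        {
                int ret;

                lockdep_assert_held(&wq_pool_mutex);

                ret = idr_alloc(&worker_pool_idr, pool, 0, WORK_OFFQ_POOL_NONE,
                                GFP_KERNEL);
                if (ret >= 0) {
                        pool->id = ret;
                        return 0;
                }
                return ret;
        }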
      
      Furthermore, in the current implementation, the BUILD_BUG_ON() in
      init_workqueues makes no sense. The number of worker pools needed
      cannot be determined at compile time, because the number of backing
      pools for UNBOUND workqueues is dynamic based on the assigned custom
      attributes. So remove it.
      
      tj: Minor comment and indentation updates.
      
      Signed-off-by: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: fix comment typo for __queue_work() · 9ef28a73
      Li Bin authored
      
      
      It seems the "dying" should be "draining" here.
      
      Signed-off-by: Li Bin <huawei.libin@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: fix ordered workqueues in NUMA setups · 8a2b7538
      Tejun Heo authored
      
      
      An ordered workqueue implements execution ordering by using a single
      pool_workqueue with max_active == 1.  On a given pool_workqueue, work
      items are processed in FIFO order, and limiting max_active to 1
      forces the queued work items to be processed one by one.
      
      Unfortunately, 4c16bd32 ("workqueue: implement NUMA affinity for
      unbound workqueues") accidentally broke this guarantee by applying
      NUMA affinity to ordered workqueues too.  On NUMA setups, an ordered
      workqueue would end up with separate pool_workqueues for different
      nodes.  Each pool_workqueue still limits max_active to 1 but multiple
      work items may be executed concurrently and out of order depending on
      which node they are queued to.
      
      Fix it by using dedicated ordered_wq_attrs[] when creating ordered
      workqueues.  The new attrs match the unbound ones except that no_numa
      is always set thus forcing all NUMA nodes to share the default
      pool_workqueue.
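
      Initialization of the dedicated attrs is roughly (next to the existing
      unbound_std_wq_attrs setup in init_workqueues()):

        for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
                struct workqueue_attrs *attrs;

                BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
                attrs->nice = std_nice[i];
                /* force all NUMA nodes onto the default pool_workqueue */
                attrs->no_numa = true;
                ordered_wq_attrs[i] = attrs;
        }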
      
      While at it, add a sanity check in the workqueue creation path which
      verifies that an ordered workqueue has only the default
      pool_workqueue.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Libin <huawei.libin@huawei.com>
      Cc: stable@vger.kernel.org
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
    • workqueue: swap set_cpus_allowed_ptr() and PF_NO_SETAFFINITY · 91151228
      Oleg Nesterov authored
      
      
      Move the setting of PF_NO_SETAFFINITY up before set_cpus_allowed()
      in create_worker(). Otherwise userland can change ->cpus_allowed
      in between.
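
      So the order in create_worker() becomes, roughly:

        /* set the flag first so userland can no longer race us */
        worker->task->flags |= PF_NO_SETAFFINITY;

        set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);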
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: use a dedicated workqueue for cgroup destruction · e5fca243
      Tejun Heo authored
      
      
      Since be445626 ("cgroup: remove synchronize_rcu() from
      cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
      freeing is performed from a work item from that point on and a later
      commit, ea15f8cc ("cgroup: split cgroup destruction into two
      steps"), moves css offlining to workqueue too.
      
      As cgroup destruction isn't depended upon for memory reclaim, the
      destruction work items were put on the system_wq; unfortunately, some
      controllers may block in the destruction path for a considerable
      duration while holding cgroup_mutex.  As a large part of the
      destruction path is synchronized through cgroup_mutex, when combined
      with a high rate of cgroup removals this has the potential to fill up
      system_wq's max_active of 256.
      
      Also, it turns out that memcg's css destruction path ends up queueing
      and waiting for work items on system_wq through work_on_cpu().  If
      such operation happens while system_wq is fully occupied by cgroup
      destruction work items, work_on_cpu() can't make forward progress
      because system_wq is full and other destruction work items on
      system_wq can't make forward progress because the work item waiting
      for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
      
      This can be fixed by queueing destruction work items on a separate
      workqueue.  This patch creates a dedicated workqueue -
      cgroup_destroy_wq - for this purpose.  As these work items shouldn't
      have inter-dependencies and mostly serialized by cgroup_mutex anyway,
      giving high concurrency level doesn't buy anything and the workqueue's
      @max_active is set to 1 so that destruction work items are executed
      one by one on each CPU.
      
      Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
      cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
      separate core_initcall().  In the future, we probably want to reorder
      so that workqueue init happens before cgroup_init().
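
      The result is roughly:

        static struct workqueue_struct *cgroup_destroy_wq;

        static int __init cgroup_wq_init(void)
        {
                /*
                 * @max_active == 1: destruction work items are mostly
                 * serialized by cgroup_mutex, so one in flight per CPU
                 * is plenty.
                 */
                cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
                BUG_ON(!cgroup_destroy_wq);
                return 0;
        }
        core_initcall(cgroup_wq_init);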
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
      Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
      Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
      Cc: stable@vger.kernel.org # v3.9+