- May 23, 2011
-
-
Mandeep Singh Baines authored
Before the conversion of the NMI watchdog to perf event, the watchdog timeout was 5 seconds. Now it is 60 seconds. For my particular application, netbooks, 5 seconds was a better timeout. With a short timeout, we catch faults earlier and are able to send back a panic. With a 60 second timeout, the user is unlikely to wait and will instead hit the power button, causing us to lose the panic info. This change configures the NMI period to watchdog_thresh and sets the softlockup_thresh to watchdog_thresh * 2. In addition, watchdog_thresh was reduced to 10 seconds as suggested by Ingo Molnar. Signed-off-by:
Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1306127423-3347-4-git-send-email-msb@chromium.org Signed-off-by:
Ingo Molnar <mingo@elte.hu> LKML-Reference: <20110517071642.GF22305@elte.hu>
-
Mandeep Singh Baines authored
This restores the previous behavior of softlock_thresh. Currently, setting watchdog_thresh to zero causes the watchdog kthreads to consume a lot of CPU. In addition, the logic of proc_dowatchdog_thresh and proc_dowatchdog_enabled has been factored into proc_dowatchdog. Signed-off-by:
Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1306127423-3347-3-git-send-email-msb@chromium.org Signed-off-by:
Ingo Molnar <mingo@elte.hu> LKML-Reference: <20110517071018.GE22305@elte.hu>
-
Mandeep Singh Baines authored
Don't take any action on an unsuccessful write to /proc. Signed-off-by:
Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1306127423-3347-2-git-send-email-msb@chromium.org Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Mandeep Singh Baines authored
In get_sample_period(), softlockup_thresh is integer divided by 5 before the multiplication by NSEC_PER_SEC. This results in softlockup_thresh being rounded down to the nearest integer multiple of 5. For example, a softlockup_thresh of 4 rounds down to 0. Signed-off-by:
Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1306127423-3347-1-git-send-email-msb@chromium.org Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Apr 28, 2011
-
-
Hillf Danton authored
In corner cases where softlockup watchdog is not setup successfully, the relevant nmi perf event for hardlockup watchdog could be disabled, then the status of the underlying hardware remains unchanged. Also, if the kthread doesn't start then the hrtimer won't run and the hardlockup detector will falsely fire. Signed-off-by:
Hillf Danton <dhillf@gmail.com> Signed-off-by:
Don Zickus <dzickus@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Mar 23, 2011
-
-
Don Zickus authored
This patch addresses a couple of problems. One was the case when the hardlockup failed to start, it also failed to start the softlockup. There were valid cases when the hardlockup shouldn't start and that shouldn't block the softlockup (no lapic, bios controls perf counters). The second problem was when the hardlockup failed to start on boxes (from a no lapic or bios controlled perf counter case), it reported failure to the cpu notifier chain. This blocked the notifier from continuing to start other more critical pieces of cpu bring-up (in our case based on a 2.6.32 fork, it was the mce). As a result, during soft cpu online/offline testing, the system would panic when a cpu was offlined because the cpu notifier would succeed in processing a watchdog disable cpu event and would panic in the mce case as a result of un-initialized variables from a never executed cpu up event. I realized the hardlockup/softlockup cases are really just debugging aids and should never impede the progress of a cpu up/down event. Therefore I modified the code to always return NOTIFY_OK and instead rely on printks to inform the user of problems. Signed-off-by:
Don Zickus <dzickus@redhat.com> Acked-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by:
WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Don Zickus authored
When a cpu is considered stuck, instead of limping along and just printing a warning, it is sometimes preferred to just panic, let kdump capture the vmcore and reboot. This gets the machine back into a stable state quickly while saving the info that got it into a stuck state to begin with. Add a Kconfig option to allow users to set the hardlockup to panic by default. Also add in a 'nmi_watchdog=nopanic' to override this. [akpm@linux-foundation.org: fix strncmp length] Signed-off-by:
Don Zickus <dzickus@redhat.com> Acked-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by:
WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Feb 10, 2011
-
-
Don Zickus authored
During boot if the hardlockup detector fails to initialize, it complains very loudly. Some failures should be expected under certain situations, ie no lapics, or resource in-use. Tone those error messages down a bit. Keep the rest at a high level. Reported-by:
Paul Bolle <pebolle@tiscali.nl> Tested-by:
Paul Bolle <pebolle@tiscali.nl> Signed-off-by:
Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1297278153-21111-1-git-send-email-dzickus@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Jan 31, 2011
-
-
Marcin Slusarz authored
Signed-off-by:
Marcin Slusarz <marcin.slusarz@gmail.com> [ add {}'s to fix a warning ] Signed-off-by:
Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-3-git-send-email-dzickus@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Marcin Slusarz authored
If it was not possible to enable watchdog for any cpu, switch watchdog_enabled back to 0, because it's visible via kernel.watchdog sysctl. Signed-off-by:
Marcin Slusarz <marcin.slusarz@gmail.com> Signed-off-by:
Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-2-git-send-email-dzickus@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Marcin Slusarz authored
Passing nowatchdog to kernel disables 2 things: creation of watchdog threads AND initialization of percpu watchdog_hrtimer. As hrtimers are initialized only at boot it's not possible to enable watchdog later - for me all watchdog threads started to eat 100% of CPU time, but they could just crash. Additionally, even if these threads would start properly, watchdog_disable_all_cpus was guarded by no_watchdog check, so you couldn't disable watchdog. To fix this, remove no_watchdog variable and use already existing watchdog_enabled variable. Signed-off-by:
Marcin Slusarz <marcin.slusarz@gmail.com> [ removed another no_watchdog instance ] Signed-off-by:
Don Zickus <dzickus@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-1-git-send-email-dzickus@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Jan 03, 2011
-
-
Ben Hutchings authored
The error message 'NMI watchdog failed to create perf event...' does not make it clear that this is a fatal error for the watchdog. It also currently prints the error value as a pointer, rather than extracting the error code with PTR_ERR(). Fix that. Add a note to the description of the 'nowatchdog' kernel parameter to associate it with this message. Reported-by:
Cesare Leonardi <celeonar@gmail.com> Signed-off-by:
Ben Hutchings <ben@decadent.org.uk> Cc: 599368@bugs.debian.org Cc: 608138@bugs.debian.org Cc: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> # .37.x and later LKML-Reference: <1294009362.3167.126.camel@localhost> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Dec 17, 2010
-
-
Christoph Lameter authored
__get_cpu_var() can be replaced with this_cpu_read and will then use a single read instruction with implied address calculation to access the correct per cpu instance. However, the address of a per cpu variable passed to __this_cpu_read() cannot be determined (since it's an implied address conversion through segment prefixes). Therefore apply this only to uses of __get_cpu_var where the address of the variable is not used. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hughd@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by:
H. Peter Anvin <hpa@zytor.com> Signed-off-by:
Christoph Lameter <cl@linux.com> Signed-off-by:
Tejun Heo <tj@kernel.org>
-
- Dec 09, 2010
-
-
Don Zickus authored
Originally adapted from Huang Ying's patch which moved the unknown_nmi_panic to the traps.c file. Because the old nmi watchdog was deleted before this change happened, the unknown_nmi_panic sysctl was lost. This re-adds it. Also, the nmi_watchdog sysctl was re-implemented and its documentation updated accordingly. Patch-inspired-by:
Huang Ying <ying.huang@intel.com> Signed-off-by:
Don Zickus <dzickus@redhat.com> Reviewed-by:
Cyrill Gorcunov <gorcunov@gmail.com> Acked-by:
Yinghai Lu <yinghai@kernel.org> Cc: fweisbec@gmail.com LKML-Reference: <1291068437-5331-3-git-send-email-dzickus@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Nov 26, 2010
-
-
Peter Zijlstra authored
The perf hardware pmu got initialized at various points in the boot, some before early_initcall() some after (notably arch_initcall). The problem is that the NMI lockup detector is ran from early_initcall() and expects the hardware pmu to be present. Sanitize this by moving all architecture hardware pmu implementations to initialize at early_initcall() and move the lockup detector to an explicit initcall right after that. Cc: paulus <paulus@samba.org> Cc: davem <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Acked-by:
Paul Mundt <lethal@linux-sh.org> Acked-by:
Will Deacon <will.deacon@arm.com> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290707759.2145.119.camel@laptop> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Nov 06, 2010
-
-
David Daney authored
Commit d9ca07a0 ("watchdog: Avoid kernel crash when disabling watchdog") introduces a section mismatch. Now that we reference no_watchdog from non-__init code it can no longer be __initdata. Signed-off-by:
David Daney <ddaney@caviumnetworks.com> Cc: Stephane Eranian <eranian@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Oct 23, 2010
-
-
KOSAKI Motohiro authored
Andrew Morton pointed out almost all sched_setscheduler() callers are using fixed parameters and can be converted to static. It reduces runtime memory use a little. Signed-off-by:
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
James Morris <jmorris@namei.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Sep 15, 2010
-
-
Matt Helsley authored
The kernel perf event creation path shouldn't use find_task_by_vpid() because a vpid exists in a specific namespace. find_task_by_vpid() uses current's pid namespace which isn't always the correct namespace to use for the vpid in all the places perf_event_create_kernel_counter() (and thus find_get_context()) is called. The goal is to clean up pid namespace handling and prevent bugs like: https://bugzilla.kernel.org/show_bug.cgi?id=17281 Instead of using pids switch find_get_context() to use task struct pointers directly. The syscall is responsible for resolving the pid to a task struct. This moves the pid namespace resolution into the syscall much like every other syscall that takes pid parameters. Signed-off-by:
Matt Helsley <matthltc@us.ibm.com> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robin Green <greenrd@greenrd.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> LKML-Reference: <a134e5e392ab0204961fd1a62c84a222bf5874a9.1284407763.git.matthltc@us.ibm.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Stephane Eranian authored
In case you boot with the watchdog disabled, i.e., nowatchdog, then, if you try to disable it via /proc/sys/kernel/watchdog, you get a kernel crash. The reason is that you are trying to cancel a hrtimer which has never been initialized. This patch fixes this by skipping execution of watchdog_disable_all_cpus() when the watchdog is marked disabled from boot. Signed-off-by:
Stephane Eranian <eranian@google.com> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4c8f7a23.cae9d80a.2c11.0bb4@mx.google.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Sep 01, 2010
-
-
Don Zickus authored
During my rewrite, the semantics of touch_nmi_watchdog and touch_softlockup_watchdog changed enough to break some drivers (mostly over preemptable regions). These are cases where long delays on one CPU (due to print_delay for example) can cause long delays on other CPUs - so we must 'touch' the nmi_watchdog flag of those other CPUs as well. This change brings those touch_*_watchdog() functions back in line with to how they used to work. Signed-off-by:
Don Zickus <dzickus@redhat.com> Acked-by:
Cyrill Gorcunov <gorcunov@openvz.org> Cc: peterz@infradead.org Cc: fweisbec@gmail.com LKML-Reference: <1283310009-22168-2-git-send-email-dzickus@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Akinobu Mita authored
The panic notifer in lockup_detector just set did_panic to 1. But did_panic is not used anywhere so we can just remove it. Signed-off-by:
Akinobu Mita <akinobu.mita@gmail.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: fweisbec@gmail.com LKML-Reference: <1283310009-22168-4-git-send-email-dzickus@redhat.com> Signed-off-by:
Don Zickus <dzickus@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Akinobu Mita authored
By the commit e6bde73b ("cpu-hotplug: return better errno on cpu hotplug failure"), the cpu notifier can return encapsulate errno value, resulting in more meaningful error codes for CPU hotplug failures. This converts the cpu notifier to return encapsulate errno value for the lockup_detector as well. Signed-off-by:
Akinobu Mita <akinobu.mita@gmail.com> Cc: peterz@infradead.org Cc: gorcunov@gmail.com Cc: fweisbec@gmail.com LKML-Reference: <1283310009-22168-3-git-send-email-dzickus@redhat.com> Signed-off-by:
Don Zickus <dzickus@redhat.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Aug 23, 2010
-
-
Peter Zijlstra authored
Stephane reported that when the machine locks up, the regular ticks, which are responsible to resetting the throttle count, stop too. Hence the NMI watchdog can end up being throttled before it reports on the locked up state, and we end up being sad.. Cure this by having the watchdog overflow reset its own throttle count. Reported-by:
Stephane Eranian <eranian@google.com> Tested-by:
Stephane Eranian <eranian@google.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1282215916.1926.4696.camel@laptop> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Aug 20, 2010
-
-
Lin Ming authored
watchdog_overflow_callback() is only used in kernel/watchdog.c. Signed-off-by:
Lin Ming <ming.m.lin@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Don Zickus <dzickus@redhat.com> LKML-Reference: <1282273431.16443.32.camel@minggr.sh.intel.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Jul 07, 2010
-
-
Kulikov Vasiliy authored
Variable on the stack is not initialized to zero, do it explicitly. This bug was found by a compiler warning: kernel/watchdog.c:463: warning: 'result' may be used uninitialized in this function Signed-off-by:
Kulikov Vasiliy <segooon@gmail.com> Acked-by:
Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> LKML-Reference: <1278316854-28442-1-git-send-email-segooon@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- May 19, 2010
-
-
Don Zickus authored
Just a bunch of conversions as suggested by Frederic W. __get_cpu_var() provides preemption disabled checks. Plus it gives more readability as it makes it obvious we are dealing locally now with these vars. Signed-off-by:
Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <1274133966-18415-2-git-send-email-dzickus@redhat.com> Signed-off-by:
Frederic Weisbecker <fweisbec@gmail.com>
-
- May 16, 2010
-
-
Don Zickus authored
Combining the softlockup and hardlockup code causes watchdog.c to build even without the hardlockup detection support. So if an arch, that has the previous and the new nmi watchdog implementations cohabiting, wants to know if the generic one is in use, CONFIG_LOCKUP_DETECTOR is not a reliable check. We need to use CONFIG_HARDLOCKUP_DETECTOR instead. Fixes: kernel/built-in.o: In function `touch_nmi_watchdog': (.text+0x449bc): multiple definition of `touch_nmi_watchdog' arch/sparc/kernel/built-in.o:(.text+0x11b28): first defined here Signed-off-by:
Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <20100514151121.GR15159@redhat.com> [ use CONFIG_HARDLOCKUP_DETECTOR instead of CONFIG_PERF_EVENTS_NMI] Signed-off-by:
Frederic Weisbecker <fweisbec@gmail.com>
-
- May 15, 2010
-
-
Frederic Weisbecker authored
This new config is deemed to simplify even more the lockup detector dependencies and can make it easier to bring a smooth sorting between archs that support the new generic lockup detector and those that still have their own, especially for those that are in the middle of this migration. Instead of checking whether we have CONFIG_LOCKUP_DETECTOR + CONFIG_PERF_EVENTS_NMI each time an arch wants to know if it needs to build its own lockup detector, take a shortcut with this new config. It is enabled only if the hardlockup detection part of the whole lockup detector is on. Signed-off-by:
Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
-
- May 13, 2010
-
-
Ingo Molnar authored
There are modules that rely on it: ERROR: "touch_softlockup_watchdog" [drivers/video/nvidia/nvidiafb.ko] undefined! Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> LKML-Reference: <1273713674-8434-1-git-send-regression-fweisbec@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- May 12, 2010
-
-
Don Zickus authored
When I combined the nmi_watchdog (hardlockup) and softlockup code, I also combined the paths the touch_watchdog and touch_nmi_watchdog took. This may not be the best idea as pointed out by Frederic W., that the touch_watchdog case probably should not reset the hardlockup count. Therefore the patch below falls back to the previous idea of keeping the touch_nmi_watchdog a superset of the touch_watchdog case. Signed-off-by:
Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-9-git-send-email-dzickus@redhat.com> Signed-off-by:
Frederic Weisbecker <fweisbec@gmail.com>
-
Don Zickus authored
Just some code cleanup to make touch_softlockup clearer and remove the softlockup_tick function as it is no longer needed. Also remove the /proc softlockup_thres call as it has been changed to watchdog_thres. Signed-off-by:
Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-3-git-send-email-dzickus@redhat.com> Signed-off-by:
Frederic Weisbecker <fweisbec@gmail.com>
-
Don Zickus authored
The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file: kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions V4: - documentation fixes - convert per_cpu to __get_cpu_var - powerpc compile fixes V5: - split apart warn flags for hard and soft lockups TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period [fweisbec: merged conflict patch] Signed-off-by:
Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> LKML-Reference: <1273266711-18706-2-git-send-email-dzickus@redhat.com> Signed-off-by:
Frederic Weisbecker <fweisbec@gmail.com>
-