- Apr 28, 2008
-
-
Mel Gorman authored
On NUMA, zone_statistics() is used to record events like numa hit, miss and foreign. It assumes that the first zone in a zonelist is the preferred zone. When multiple zonelists are replaced by one that is filtered, this is no longer the case. This patch records what the preferred zone is rather than assuming the first zone in the zonelist is it. This simplifies the reading of later patches in this set. Signed-off-by:
Mel Gorman <mel@csn.ul.ie> Signed-off-by:
Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by:
Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
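
The accounting rule described above can be sketched in a few lines of userspace C. This is only an illustration: struct demo_zone, demo_zones and demo_zone_statistics() are hypothetical stand-ins, not the kernel's definitions, and each demo zone simply records which node it belongs to. The point it shows is that the preferred zone is passed in explicitly instead of being read from the head of a zonelist.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for struct zone (illustration only). */
struct demo_zone {
	int node;                   /* node this zone belongs to */
	unsigned long numa_hit;     /* allocated here, and this node was preferred */
	unsigned long numa_miss;    /* allocated here, but another node was preferred */
	unsigned long numa_foreign; /* preferred here, but allocated elsewhere */
};

/* Two demo zones, one on node 0 and one on node 1. */
static struct demo_zone demo_zones[2] = { { 0, 0, 0, 0 }, { 1, 0, 0, 0 } };

/*
 * Account one allocation that was satisfied from zone z while the caller
 * preferred zone "preferred". The preferred zone is an explicit argument,
 * so a filtered zonelist whose first entry is not the preferred zone
 * cannot skew the counters.
 */
static void demo_zone_statistics(struct demo_zone *preferred, struct demo_zone *z)
{
	assert(preferred && z);

	if (z->node == preferred->node) {
		z->numa_hit++;
	} else {
		z->numa_miss++;
		preferred->numa_foreign++;
	}
}
```

With node 0 preferred, an allocation from the node 0 zone counts as a hit there, while an allocation from the node 1 zone counts as a miss on node 1 and as foreign on node 0.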
-
Mel Gorman authored
Introduce a node_zonelist() helper function. It is used to look up the appropriate zonelist given a node and a GFP mask. The patch on its own is a cleanup, but it helps clarify parts of the two-zonelist-per-node patchset. If necessary, it can be merged with the next patch in this set without problems. Reviewed-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Mel Gorman <mel@csn.ul.ie> Signed-off-by:
Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
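
A lookup helper of this kind might look roughly like the following userspace model. Everything prefixed demo_ or DEMO_ is an assumption made for illustration (the flag values, the flag-to-zone mapping and the per-node data layout are not taken from the kernel); only the idea of hiding "node plus GFP mask to zonelist" behind one function comes from the commit message.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for kernel types (illustration only). */
enum demo_zone_type { DEMO_ZONE_DMA, DEMO_ZONE_NORMAL, DEMO_ZONE_HIGHMEM, DEMO_MAX_NR_ZONES };

typedef unsigned int demo_gfp_t;
#define DEMO_GFP_DMA     0x1u /* assumed flag value */
#define DEMO_GFP_HIGHMEM 0x2u /* assumed flag value */

struct demo_zonelist { int placeholder; };
struct demo_pglist_data { struct demo_zonelist node_zonelists[DEMO_MAX_NR_ZONES]; };

#define DEMO_MAX_NODES 2
static struct demo_pglist_data demo_node_data[DEMO_MAX_NODES];

/* Map a GFP mask to the highest zone type it may allocate from. */
static enum demo_zone_type demo_gfp_zone(demo_gfp_t flags)
{
	if (flags & DEMO_GFP_DMA)
		return DEMO_ZONE_DMA;
	if (flags & DEMO_GFP_HIGHMEM)
		return DEMO_ZONE_HIGHMEM;
	return DEMO_ZONE_NORMAL;
}

/* One place that knows how a (node, GFP mask) pair selects a zonelist. */
static struct demo_zonelist *demo_node_zonelist(int nid, demo_gfp_t flags)
{
	assert(nid >= 0 && nid < DEMO_MAX_NODES);
	return &demo_node_data[nid].node_zonelists[demo_gfp_zone(flags)];
}
```

Callers then never index the per-node array themselves, so a later change to how zonelists are stored only has to touch the helper.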
-
Mel Gorman authored
The following patches replace multiple zonelists per node with two zonelists that are filtered based on the GFP flags. The patches as a set fix a bug with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset, MPOL_BIND will apply to the two highest zones when the highest zone is ZONE_MOVABLE. This should be considered as an alternative fix for the MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters only custom zonelists.

The first patch cleans up an inconsistency where direct reclaim uses zonelist->zones where other places use zonelist.

The second patch introduces a helper function node_zonelist() for looking up the appropriate zonelist for a GFP mask, which simplifies patches later in the set.

The third patch defines/remembers the "preferred zone" for numa statistics, as it is no longer always the first zone in a zonelist.

The fourth patch replaces multiple zonelists with two zonelists that are filtered. The two zonelists are due to the fact that the memoryless patchset introduces a second set of zonelists for __GFP_THISNODE.

The fifth patch introduces helper macros for retrieving the zone and node indices of entries in a zonelist.

The final patch introduces filtering of the zonelists based on a nodemask. Two zonelists exist per node, one for normal allocations and one for __GFP_THISNODE.

Performance results varied depending on the machine configuration. In real workloads the gain/loss will depend on how much the userspace portion of the benchmark benefits from having more cache available due to reduced referencing of zonelists.

These are the range of performance losses/gains when running against 2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and ppc64, both NUMA and non-NUMA.

                                  loss   to   gain
    Total CPU time on Kernbench: -0.86%  to  1.13%
    Elapsed time on Kernbench:   -0.79%  to  0.76%
    page_test from aim9:         -4.37%  to  0.79%
    brk_test from aim9:          -0.71%  to  4.07%
    fork_test from aim9:         -1.84%  to  4.60%
    exec_test from aim9:         -0.71%  to  1.08%

This patch:

The allocator deals with zonelists which indicate the order in which zones should be targeted for an allocation. Similarly, direct reclaim of pages iterates over an array of zones. For consistency, this patch converts direct reclaim to use a zonelist. No functionality is changed by this patch. This simplifies zonelist iterators in the next patch. Signed-off-by:
Mel Gorman <mel@csn.ul.ie> Acked-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Nicholas Piggin authored
Nothing in the tree uses nopage any more. Remove support for it in the core mm code and documentation (and a few stray references to it in comments). Signed-off-by:
Nick Piggin <npiggin@suse.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Oleg Nesterov authored
The "if (!file || !vma_merge())" code is not easy to understand, so turn it into "if (file && vma_merge())". This makes it immediately obvious that the subsequent "if (file)" is superfluous. As Hugh Dickins pointed out, we can also factor out the ->i_writecount corrections, and add a small comment about that. Signed-off-by:
Oleg Nesterov <oleg@tv-sign.ru> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Hisashi Hifumi authored
DIO invalidates page cache through invalidate_inode_pages2_range(). invalidate_inode_pages2_range() sets ret=-EIO when invalidate_complete_page2() fails, but this ret is cleared if do_launder_page() succeeds on the page at the next index. In this case, DIO is carried out even if invalidate_complete_page2() fails on some pages. This can cause inconsistency between memory and the blocks on the HDD because the page cache still exists. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by:
Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Ken Chen <kenchen@google.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Chuck Lever <cel@citi.umich.edu> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
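
The failure mode described above, an error from one page being overwritten by a success on a later page, can be reproduced in a small userspace model. This is a sketch under assumptions: the per-page outcomes are passed in as plain arrays, and demo_invalidate_buggy() and demo_invalidate_sticky() are hypothetical names, not the kernel functions. The second function shows one way to make the error "sticky" by only ever overwriting the return value with a failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * launder_ok[i] and invalidate_ok[i] are hypothetical per-page outcomes:
 * non-zero means that step succeeded for page i.
 */

/* Models the reported behaviour: ret is reassigned on every page. */
static int demo_invalidate_buggy(const int *launder_ok, const int *invalidate_ok, size_t n)
{
	int ret = 0;

	assert(launder_ok && invalidate_ok);
	for (size_t i = 0; i < n; i++) {
		ret = launder_ok[i] ? 0 : -EIO; /* clears an earlier -EIO */
		if (ret == 0 && !invalidate_ok[i])
			ret = -EIO;
	}
	return ret;
}

/* Per-page result goes into ret2; ret is only updated on failure. */
static int demo_invalidate_sticky(const int *launder_ok, const int *invalidate_ok, size_t n)
{
	int ret = 0;

	assert(launder_ok && invalidate_ok);
	for (size_t i = 0; i < n; i++) {
		int ret2 = launder_ok[i] ? 0 : -EIO;

		if (ret2 == 0 && !invalidate_ok[i])
			ret2 = -EIO;
		if (ret2 < 0)
			ret = ret2;
	}
	return ret;
}
```

For two pages where invalidation fails on the first and everything succeeds on the second, the first variant reports success and the second reports -EIO.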
-
Jeremy Fitzhardinge authored
All architectures use an effectively identical definition of online_page(), so just make it common code. x86-64, ia64, powerpc and sh are actually identical; x86-32 is slightly different. x86-32's differences arise because it puts its hotplug pages in the highmem zone. We can handle this in the generic code by inspecting the page to see if it's in highmem, and updating the totalhigh_pages count appropriately. This leaves init_32.c:free_new_highpage with a single caller, so I folded it into add_one_highpage_init. I also removed an incorrect comment referring to the NUMA case; any NUMA details have already been dealt with by the time online_page() is called. [akpm@linux-foundation.org: fix indenting] Signed-off-by:
Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by:
Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by:
KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com> Tested-by:
KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Christoph Lameter <clameter@sgi.com> Acked-by:
Ingo Molnar <mingo@elte.hu> Acked-by:
Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Badari Pulavarty authored
Generic helper function to remove section mappings and sysfs entries for the section of the memory we are removing. offline_pages() correctly adjusted zone and marked the pages reserved. TODO: Yasunori Goto is working on patches to free up allocations from bootmem. Signed-off-by:
Badari Pulavarty <pbadari@us.ibm.com> Acked-by:
Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Johannes Weiner authored
After the loop in walk_pte_range() pte might point to the first address after the pmd it walks. The pte_unmap() is then applied to something bad. Spotted by Roel Kluin and Andreas Schwab. Signed-off-by:
Johannes Weiner <hannes@saeurebad.de> Cc: Roel Kluin <12o3l@tiscali.nl> Cc: Andreas Schwab <schwab@suse.de> Acked-by:
Matt Mackall <mpm@selenic.com> Acked-by:
Mikael Pettersson <mikpe@it.uu.se> Cc: <stable@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
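
The off-by-one described above comes from a loop that advances the entry pointer once more after the last entry has been handled. A minimal userspace sketch of a walk that avoids this follows; demo_walk_pte_range(), demo_touch() and demo_table are hypothetical names, the "page table" is just an array, and the loop shape is an assumption about how such a walk can be written rather than a copy of the kernel code.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096UL

typedef int (*demo_pte_fn)(unsigned long *pte, unsigned long addr);

/* A four-entry demo "page table". */
static unsigned long demo_table[4];

/* Example callback: count how often each entry is visited. */
static int demo_touch(unsigned long *pte, unsigned long addr)
{
	(void)addr;
	*pte += 1;
	return 0;
}

/*
 * Walk [addr, end) one page at a time, starting at table[0]. The pointer
 * is only advanced when another entry will actually be visited, so the
 * value returned (what a pte_unmap()-style call would receive) always
 * points at the last visited entry, never one past it.
 * Both addr and end must be page aligned, with addr < end.
 */
static unsigned long *demo_walk_pte_range(unsigned long *table, unsigned long addr,
					  unsigned long end, demo_pte_fn fn)
{
	unsigned long *pte = table;

	assert(addr < end);
	for (;;) {
		if (fn(pte, addr))
			break;
		addr += DEMO_PAGE_SIZE;
		if (addr == end)
			break;
		pte++;
	}
	return pte;
}
```

A conventional "for (...; addr != end; pte++, addr += PAGE_SIZE)" loop would leave pte pointing one entry past the range, which is the situation the commit message describes.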
-
- Apr 27, 2008
-
-
Christian Borntraeger authored
This patch changes the s390 memory management definitions to use the pgste field for dirty and reference bit tracking of host and guest code. Usually on s390, dirty and referenced are tracked in storage keys, which belong to the physical page. This changes with virtualization: The guest and host dirty/reference bits are defined to be the logical OR of the values for the mapping and the physical page. This patch implements the necessary changes in pgtable.h for s390.

There is a common code change in mm/rmap.c: the call to page_test_and_clear_young must be moved. This is a no-op for all architectures but s390. page_referenced checks the referenced bits for the physical page and for all mappings:
o The physical page is checked with page_test_and_clear_young.
o The mappings are checked with ptep_test_and_clear_young and friends.

Without pgstes (the current implementation on Linux s390) the physical page check is implemented but the mapping callbacks are no-ops because dirty and referenced are not tracked in the s390 page tables. The pgstes introduce guest and host dirty and reference bits for s390 in the host mapping. These mappings must be checked before page_test_and_clear_young resets the reference bit. Signed-off-by:
Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by:
Christian Borntraeger <borntraeger@de.ibm.com> Acked-by:
Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Carsten Otte <cotte@de.ibm.com> Signed-off-by:
Avi Kivity <avi@qumranet.com>
-
- Apr 26, 2008
-
-
Yinghai Lu authored
On big systems with lots of memory, don't print out too much during bootup, and make it easy to see whether the mapping is contiguous. A 256G, 8-socket system will print:

    [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
    [ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
    [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
    [ffffe20038700000-ffffe200387fffff] potential offnode page_structs
    [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
    [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
    [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
    [ffffe20054700000-ffffe200547fffff] potential offnode page_structs
    [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
    [ffffe20070700000-ffffe200707fffff] potential offnode page_structs
    [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
    [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
    [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
    [ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
    [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
    [ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
    [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
    [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
    [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
    [ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
    [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
    [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7

instead of a very long printout. Signed-off-by:
Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Thomas Gleixner <tglx@linutronix.de>
-
Yinghai Lu authored
Split reserve_bootmem_core() into two functions, one which checks for conflicts and one which sets the bits, and make reserve_bootmem() loop over bdata_list so that a reservation can cross nodes. Users could be crashkernel and ramdisk, in case the range provided by those externalities crosses nodes. Signed-off-by:
Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Yinghai Lu authored
Offset alignment is needed when node_boot_start's alignment is less than the alignment required. Use a local node_boot_start to match the alignment, so that no extra operation is added to the search loop. Signed-off-by:
Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Yinghai Lu authored
Make the nodes other than node 0 use bdata->last_success for fast search too. We need to use __alloc_bootmem_core() for vmemmap allocation for other nodes when numa and sparsemem/vmemmap are enabled. Also, make fail_block path increase i with incr only after ALIGN to avoid extra increase when size is larger than align. Signed-off-by:
Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Yinghai Lu authored
vmemmap allocation currently has this layout:

    [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
    [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001800000 on node 0
    [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001c00000 on node 0
    [ffffe20000600000-ffffe200007fffff] PMD ->ffff810002000000 on node 0
    [ffffe20000800000-ffffe200009fffff] PMD ->ffff810002400000 on node 0
    ...

Note that there is a 2M hole between them, which is not optimal. The root cause is that a usemap (24 bytes) is allocated after every 2M mem_map, which pushes the next vmemmap (2M) to the next 2M alignment.

Solution: try to allocate the mem_map contiguously. After the patch, we get:

    [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
    [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001600000 on node 0
    [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001800000 on node 0
    [ffffe20000600000-ffffe200007fffff] PMD ->ffff810001a00000 on node 0
    [ffffe20000800000-ffffe200009fffff] PMD ->ffff810001c00000 on node 0
    ...

which is the ideal layout. The usemaps will also share a page because they are allocated contiguously too:

    sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24
    sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24
    sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24
    sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24
    ...

So we make the bootmem allocation more compact and use less memory for usemap => mission accomplished ;-) Signed-off-by:
Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
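
The 2M hole follows from simple alignment arithmetic, which the following userspace model of a bump allocator illustrates. It is a sketch, not the bootmem allocator: demo_bump_alloc() and the stride helpers are hypothetical, and the 128-byte usemap alignment is an assumption inferred from the 0x80 spacing of the usemap addresses in the log above.

```c
#include <assert.h>

#define DEMO_ALIGN(x, a)  (((x) + (a) - 1) & ~((a) - 1))
#define DEMO_MEMMAP_SIZE  (2UL << 20) /* 2M mem_map block, 2M aligned */
#define DEMO_USEMAP_SIZE  24UL        /* usemap size from the log */
#define DEMO_USEMAP_ALIGN 128UL       /* assumed from the 0x80 spacing */

static unsigned long demo_cursor; /* next free offset in the demo region */

/* Minimal bump allocator: round the cursor up, then advance it. */
static unsigned long demo_bump_alloc(unsigned long size, unsigned long align)
{
	unsigned long start;

	assert(align && (align & (align - 1)) == 0); /* power of two */
	start = DEMO_ALIGN(demo_cursor, align);
	demo_cursor = start + size;
	return start;
}

/* Distance between two mem_maps when a usemap is allocated in between. */
static unsigned long demo_interleaved_stride(void)
{
	unsigned long first, second;

	demo_cursor = 0;
	first = demo_bump_alloc(DEMO_MEMMAP_SIZE, DEMO_MEMMAP_SIZE);
	demo_bump_alloc(DEMO_USEMAP_SIZE, DEMO_USEMAP_ALIGN);
	second = demo_bump_alloc(DEMO_MEMMAP_SIZE, DEMO_MEMMAP_SIZE);
	return second - first;
}

/* Distance between two mem_maps allocated back to back. */
static unsigned long demo_batched_stride(void)
{
	unsigned long first, second;

	demo_cursor = 0;
	first = demo_bump_alloc(DEMO_MEMMAP_SIZE, DEMO_MEMMAP_SIZE);
	second = demo_bump_alloc(DEMO_MEMMAP_SIZE, DEMO_MEMMAP_SIZE);
	return second - first;
}

/* Distance between two usemaps allocated back to back. */
static unsigned long demo_usemap_stride(void)
{
	unsigned long first, second;

	demo_cursor = 0;
	first = demo_bump_alloc(DEMO_USEMAP_SIZE, DEMO_USEMAP_ALIGN);
	second = demo_bump_alloc(DEMO_USEMAP_SIZE, DEMO_USEMAP_ALIGN);
	return second - first;
}
```

In this model the interleaved order yields a 4M stride (2M of data plus a 2M hole), while the batched order yields a 2M stride and packs the usemaps 0x80 bytes apart, matching the addresses shown in the two logs.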
-
- Apr 23, 2008
-
-
Christoph Lameter authored
Signed-off-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Apr 21, 2008
-
-
Pavel Machek authored
These are small cleanups all over the tree. Trivial style and comment changes to fs/select.c, kernel/signal.c, kernel/stop_machine.c & mm/pdflush.c Signed-off-by:
Pavel Machek <pavel@suse.cz> Signed-off-by:
Jesper Juhl <jesper.juhl@gmail.com>
-
- Apr 20, 2008
-
-
Daniel Walker authored
Signed-off-by:
Daniel Walker <dwalker@mvista.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Greg Kroah-Hartman <gregkh@suse.de>
-
- Apr 19, 2008
-
-
Mike Travis authored
* Use new node_to_cpumask_ptr. This creates a pointer to the cpumask for a given node. This definition is in mm patch: asm-generic-add-node_to_cpumask_ptr-macro.patch
* Use new set_cpus_allowed_ptr function.

Depends on:
    [mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
    [sched-devel]: sched: add new set_cpus_allowed_ptr function
    [x86/latest]: x86: add cpus_scnprintf function

Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: Greg Banks <gnb@melbourne.sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by:
Mike Travis <travis@sgi.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Mike Travis authored
* Modify cpuset_cpus_allowed to return the currently allowed cpuset via a pointer argument instead of as the function return value.
* Use new set_cpus_allowed_ptr function.
* Cleanup CPU_MASK_ALL and NODE_MASK_ALL uses.

Depends on:
    [sched-devel]: sched: add new set_cpus_allowed_ptr function

Signed-off-by:
Mike Travis <travis@sgi.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Mike Travis authored
* Replace usages of CPU_MASK_NONE, CPU_MASK_ALL, NODE_MASK_NONE, NODE_MASK_ALL to reduce stack requirements for large NR_CPUS and MAXNODES counts.
* In some cases, the cpumask variable was initialized but then overwritten with another value. This is the case for changes like this:

    - cpumask_t oldmask = CPU_MASK_ALL;
    + cpumask_t oldmask;

Signed-off-by:
Mike Travis <travis@sgi.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
- Apr 17, 2008
-
-
Jason Wessel authored
Fix two regressions dealing with the kgdb core.
1) kgdb_skipexception and kgdb_post_primary_code are optional functions that are only required on archs that need special exception fixups.
2) The kernel address space scope must be set on any probe_kernel_* function or archs such as ARCH=arm will not allow access to the kernel memory space. As an example, access to the full kernel address space is required when you use the kernel debugger to inspect a system call.

Signed-off-by:
Jason Wessel <jason.wessel@windriver.com> Signed-off-by:
Ingo Molnar <mingo@elte.hu>
-
Ingo Molnar authored
Add probe_kernel_read() and probe_kernel_write(). They are uninlined and restricted to kernel range memory only, as suggested by Linus. Signed-off-by:
Ingo Molnar <mingo@elte.hu> Reviewed-by:
Thomas Gleixner <tglx@linutronix.de>
-
- Apr 16, 2008
-
-
KOSAKI Motohiro authored
In a5d76b54 (memory unplug: page isolation by KAMEZAWA Hiroyuki), the "isolate" migratetype was added. Unfortunately, the /proc/pagetypeinfo display logic was not updated for it. This patch adds "Isolate" to the pagetype name field.

/proc/pagetypeinfo before:
----------------------------------------------------------------------------------------
Free pages count per migrate type at order     0   1   2   3   4   5   6   7   8   9  10
Node    0, zone      DMA, type    Unmovable    1   2   2   2   1   2   2   1   1   0   0
Node    0, zone      DMA, type  Reclaimable    0   0   0   0   0   0   0   0   0   0   0
Node    0, zone      DMA, type      Movable    2   3   3   1   3   3   2   0   0   0   0
Node    0, zone      DMA, type      Reserve    0   0   0   0   0   0   0   0   0   0   1
Node    0, zone      DMA, type       <NULL>    0   0   0   0   0   0   0   0   0   0   0
Node    0, zone   Normal, type    Unmovable    1   9   7   4   1   1   1   1   0   0   0
Node    0, zone   Normal, type  Reclaimable    5   2   0   0   1   1   0   0   0   1   0
Node    0, zone   Normal, type      Movable    0   1   1   0   0   0   1   0   0   1  60
Node    0, zone   Normal, type      Reserve    0   0   0   0   0   0   0   0   0   0   1
Node    0, zone   Normal, type       <NULL>    0   0   0   0   0   0   0   0   0   0   0
Node    0, zone  HighMem, type    Unmovable    0   0   1   1   1   0   1   1   2   2   0
Node    0, zone  HighMem, type  Reclaimable    0   0   0   0   0   0   0   0   0   0   0
Node    0, zone  HighMem, type      Movable  236  62   6   2   2   1   1   0   1   1  16
Node    0, zone  HighMem, type      Reserve    0   0   0   0   0   0   0   0   0   0   1
Node    0, zone  HighMem, type       <NULL>    0   0   0   0   0   0   0   0   0   0   0

Number of blocks type     Unmovable Reclaimable     Movable     Reserve      <NULL>
Node 0, zone      DMA             1           0           2           1           0
Node 0, zone   Normal            10          40         169           1           0
Node 0, zone  HighMem             2           0         283           1           0

after:
----------------------------------------------------------------------------------------
Free pages count per migrate type at order     0   1   2   3   4   5   6   7   8   9  10
Node    0, zone      DMA, type    Unmovable    1   2   2   2   1   2   2   1   1   0   0
Node    0, zone      DMA, type  Reclaimable    0   0   0   0   0   0   0   0   0   0   0
Node    0, zone      DMA, type      Movable    2   3   3   1   3   3   2   0   0   0   0
Node    0, zone      DMA, type      Reserve    0   0   0   0   0   0   0   0   0   0   1
Node    0, zone      DMA, type      Isolate    0   0   0   0   0   0   0   0   0   0   0
Node    0, zone   Normal, type    Unmovable    0   2   1   1   0   1   0   0   0   0   0
Node    0, zone   Normal, type  Reclaimable    1   1   1   1   1   0   1   1   1   0   0
Node    0, zone   Normal, type      Movable    0   1   1   1   0   1   0   1   0   0 196
Node    0, zone   Normal, type      Reserve    0   0   0   0   0   0   0   0   0   0   1
Node    0, zone   Normal, type      Isolate    0   0   0   0   0   0   0   0   0   0   0
Node    0, zone  HighMem, type    Unmovable    0   1   0   0   0   1   1   1   2   2   0
Node    0, zone  HighMem, type  Reclaimable    0   0   0   0   0   0   0   0   0   0   0
Node    0, zone  HighMem, type      Movable    1   0   1   1   0   0   0   0   1   0 200
Node    0, zone  HighMem, type      Reserve    0   0   0   0   0   0   0   0   0   0   1
Node    0, zone  HighMem, type      Isolate    0   0   0   0   0   0   0   0   0   0   0

Number of blocks type     Unmovable Reclaimable     Movable     Reserve     Isolate
Node 0, zone      DMA             1           0           2           1           0
Node 0, zone   Normal             8           4         207           1           0
Node 0, zone  HighMem             2           0         283           1           0

Signed-off-by:
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by:
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by:
Mel Gorman <mel@csn.ul.ie> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Li Zefan authored
When I used a test program to fork a large number of processes and immediately move them to a cgroup where the memory limit is low enough to trigger oom kill, I got an oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000808
    IP: [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18
    PGD 4c95f067 PUD 4406c067 PMD 0
    Oops: 0002 [1] SMP
    CPU 2
    Modules linked in:
    Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5
    RIP: 0010:[<ffffffff8045c47f>] [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18
    RSP: 0018:ffff8100448c7c30 EFLAGS: 00010002
    RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3
    RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808
    RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900
    R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140
    R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200
    FS: 00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0)
    Stack: ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0
     ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0
     ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0
    Call Trace:
     [<ffffffff8023ef5a>] ? force_sig_info+0x25/0xb9
     [<ffffffff8026cef9>] ? oom_kill_task+0x77/0xe2
     [<ffffffff8026d696>] ? mem_cgroup_out_of_memory+0x55/0x67
     [<ffffffff802910ad>] ? mem_cgroup_charge_common+0xec/0x202
     [<ffffffff8027997b>] ? handle_mm_fault+0x24e/0x77f
     [<ffffffff8022c4af>] ? default_wake_function+0x0/0xe
     [<ffffffff8027a17a>] ? get_user_pages+0x2ce/0x3af
     [<ffffffff80290fee>] ? mem_cgroup_charge_common+0x2d/0x202
     [<ffffffff8027a441>] ? make_pages_present+0x8e/0xa4
     [<ffffffff8027d1ab>] ? mmap_region+0x373/0x429
     [<ffffffff8027d7eb>] ? do_mmap_pgoff+0x2ff/0x364
     [<ffffffff80210471>] ? sys_mmap+0xe5/0x111
     [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1
    Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00
    RIP [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18
    RSP <ffff8100448c7c30>
    CR2: 0000000000000808
    ---[ end trace c3702fa668021ea4 ]---

It's reproducible on an x86_64 box, but doesn't happen on x86_32. This is because tsk->sighand is not guarded by RCU, so we have to hold tasklist_lock, just as out_of_memory() does. Signed-off-by:
Li Zefan <lizf@cn.fujitsu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by:
Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Rientjes <rientjes@cs.washington.edu> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Ingo Molnar authored
Fix memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS. That causes this loop to happily walk over the limit of the sparse memory section map:

    for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
            unsigned long section = pfn_to_section_nr(pfn);
            struct mem_section *ms;

            sparse_index_init(section, nid);
            set_section_nid(section, nid);

            ms = __nr_to_section(section);
            if (!ms->section_mem_map)
                    ms->section_mem_map = sparse_encode_early_nid(nid) |
                                                    SECTION_MARKED_PRESENT;

'ms' will be out of bounds and we'll corrupt a small amount of memory by encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it. The corruption might happen when encoding a non-zero node ID, or due to the SECTION_MARKED_PRESENT which is 0x1:

    mmzone.h:#define SECTION_MARKED_PRESENT (1UL<<0)

The fix is to sanity check anything the architecture passes to sparsemem. This bug seems to be rather old (as old as sparsemem support itself), but the exact incarnation depended on random details like configs, which made this bug more prominent in v2.6.25-to-be. An additional enhancement might be to print a warning about ignored or trimmed memory ranges. Signed-off-by:
Ingo Molnar <mingo@elte.hu> Tested-by:
Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Yinghai Lu <Yinghai.Lu@sun.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
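
A sanity check of the kind described above can be sketched as a clamp applied before the section loop runs. The following userspace model is an assumption-laden illustration: the DEMO_ constants (12-bit pages, 32 physical address bits, 26-bit sections) are example values, demo_present stands in for the section map, and demo_memory_present() is a hypothetical name. It is not the code from the patch.

```c
#include <assert.h>

/* Example values only; real kernels derive these per architecture. */
#define DEMO_PAGE_SHIFT        12
#define DEMO_MAX_PHYSMEM_BITS  32
#define DEMO_SECTION_SIZE_BITS 26
#define DEMO_PFN_SECTION_SHIFT (DEMO_SECTION_SIZE_BITS - DEMO_PAGE_SHIFT)
#define DEMO_PAGES_PER_SECTION (1UL << DEMO_PFN_SECTION_SHIFT)
#define DEMO_NR_SECTIONS       (1UL << (DEMO_MAX_PHYSMEM_BITS - DEMO_SECTION_SIZE_BITS))

/* Stand-in for the section map: one "present" byte per section. */
static unsigned char demo_present[DEMO_NR_SECTIONS];

/*
 * Mark the sections covering [start, end) as present and return how many
 * were newly marked. Ranges beyond what the section map can describe are
 * ignored or trimmed, so the loop can never index past demo_present[].
 */
static unsigned long demo_memory_present(unsigned long start, unsigned long end)
{
	const unsigned long max_arch_pfn =
		1UL << (DEMO_MAX_PHYSMEM_BITS - DEMO_PAGE_SHIFT);
	unsigned long pfn, marked = 0;

	assert(start <= end);
	if (start >= max_arch_pfn)
		return 0;               /* entirely out of range: ignore */
	if (end > max_arch_pfn)
		end = max_arch_pfn;     /* partially out of range: trim */

	start &= ~(DEMO_PAGES_PER_SECTION - 1);
	for (pfn = start; pfn < end; pfn += DEMO_PAGES_PER_SECTION) {
		unsigned long section = pfn >> DEMO_PFN_SECTION_SHIFT;

		if (!demo_present[section]) {
			demo_present[section] = 1;
			marked++;
		}
	}
	return marked;
}
```

With these example constants the map has 64 sections, so a range that starts two sections below the limit and ends well above it marks only sections 62 and 63.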
-
- Apr 14, 2008
-
-
Christoph Lameter authored
The per node counters are used mainly for showing data through the sysfs API. If that API is not compiled in then there is no point in keeping track of this data. Disable counters for the number of slabs and the number of total slabs if !SLUB_DEBUG. Incrementing the per node counters also accesses a potentially contended cacheline, so this could actually be a performance benefit to embedded systems. SLABINFO support is also affected. It now must depend on SLUB_DEBUG (which is on by default). The patch also avoids a check for a NULL kmem_cache_node pointer in new_slab() if the system is not compiled with NUMA support. [penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG] Signed-off-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Pekka Enberg <penberg@cs.helsinki.fi>
-
Christoph Lameter authored
__free_slab does some diagnostics. The resetting of mapcount etc in discard_slab() can interfere with debug processing. So move the reset immediately before the page is freed. Signed-off-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Pekka Enberg <penberg@cs.helsinki.fi>
-
Christoph Lameter authored
Only output per cpu stats if the kernel is built for SMP. Use a capital "C" as a leading character for the processor number (same as the numa statistics that also use a capital letter "N"). Signed-off-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Pekka Enberg <penberg@cs.helsinki.fi>
-
Christoph Lameter authored
count_partial() is used by both slabinfo and the sysfs proc support. Move the function directly before the beginning of the sysfs code so that it can be easily found. Rework the preprocessor conditional to take into account that slub sysfs support depends on CONFIG_SYSFS *and* CONFIG_SLUB_DEBUG. Make CONFIG_SLUB_STATS depend on CONFIG_SLUB_DEBUG and CONFIG_SYSFS. There is no point in keeping statistics if no one can retrieve them. Signed-off-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Pekka Enberg <penberg@cs.helsinki.fi>
-
Christoph Lameter authored
Move the definition of kmalloc_caches_dma() into a later #ifdef CONFIG_ZONE_DMA. This saves one #ifdef and leaves us with a total of two #ifdefs for dma slab support. Signed-off-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Pekka Enberg <penberg@cs.helsinki.fi>
-
Pekka Enberg authored
As spotted by kmemcheck, we need to initialize the per-CPU ->stat array before using it. [kmem_cache_cpu structures are usually allocated from arrays defined via DEFINE_PER_CPU that are zeroed so we have not noticed this so far --cl]. Reported-by:
Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Pekka Enberg <penberg@cs.helsinki.fi>
-
- Apr 09, 2008
-
-
KAMEZAWA Hiroyuki authored
This should be N_NORMAL_MEMORY. N_NORMAL_MEMORY is "true" if a node has memory for the kernel. N_HIGH_MEMORY is "true" if a node has memory for HIGHMEM. (If CONFIG_HIGHMEM=n, always "true".) This check is used for testing whether we can use kmalloc_node() on a node. If there is a node which only contains HIGHMEM, the system will then call kmalloc_node() on a node which doesn't contain memory for the kernel. If this happens under SLUB, the kernel will panic. I think this only happens on x86_32-numa. Signed-off-by:
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Apr 04, 2008
-
-
Balbir Singh authored
A boot option for the memory controller was discussed on lkml. It is a good idea to add it, since it saves memory for people who want to turn off the memory controller. By default the option is on for the following two reasons:
1. It provides compatibility with the current scheme where the memory controller turns on if the config option is enabled.
2. It allows for wider testing of the memory controller, once the config option is enabled.

We still allow the create, destroy callbacks to succeed, since they are not aware of boot options. We do not populate the directory with memory resource controller specific files. Signed-off-by:
Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Apr 01, 2008
-
-
Christoph Lameter authored
Small typo in the patch recently merged to avoid the unused symbol message for count_partial(). Discussion thread with confirmation of fix at http://marc.info/?t=120696854400001&r=1&w=2 Typo in the check if we need the count_partial function that was introduced by 53625b42 Signed-off-by:
Christoph Lameter <clameter@sgi.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Mar 30, 2008
-
-
Al Viro authored
Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Mar 28, 2008
-
-
Linus Torvalds authored
This reverts commit 3811dbf6. The masking was not at all useless, and it was sensible. We handle GFP_ZERO in the caller, and passing it down to any page allocator logic is buggy and wrong. Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Mar 26, 2008
-
-
Nishanth Aravamudan authored
Running the counters testcase from libhugetlbfs results in the following on 2.6.25-rc5 and 2.6.25-rc5-mm1:

    BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531]
    NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088
    REGS: c000005db12cb360 TRAP: 0901 Not tainted (2.6.25-rc5-autokern1)
    MSR: 8000000000009032 <EE,ME,IR,DR> CR: 48008448 XER: 20000000
    TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3
    GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004
    GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100
    GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000
    GPR12: 0000000028000442 c000000000770080
    NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c
    LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c
    Call Trace:
    [c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable)
    [c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354
    [c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214
    [c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c
    [c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0
    [c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c
    [c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8
    [c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194
    [c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4
    [c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0
    [c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8
    [c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0
    [c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134
    [c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40
    Instruction dump:
    ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25
    60000000 38800010 38a00000 7c601b78 <7fa3eb78> 2f800010 409d0008 38000010

This was tracked down to a potential livelock in return_unused_surplus_pages(). In the case where we have surplus pages on some node, but no free pages on the same node, we may never break out of the loop. To avoid this livelock, terminate the search if we iterate a number of times equal to the number of online nodes without freeing a page. Thanks to Andy Whitcroft and Adam Litke for helping with debugging and the patch. Signed-off-by:
Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
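
The termination rule described above (give up after one full pass over the online nodes without freeing anything) can be modelled in userspace as follows. This is a sketch under assumptions: per-node surplus and free counts are plain arrays, node selection is a simple round-robin, and demo_return_surplus() is a hypothetical name rather than the kernel function.

```c
#include <assert.h>

/*
 * Try to release up to nr_pages surplus pages, visiting nodes round-robin.
 * surplus[nid] and free_pages[nid] are hypothetical per-node counters.
 * A page can only be released from a node that has both a surplus page
 * and a free page. Returns the number of pages released.
 */
static unsigned long demo_return_surplus(unsigned long *surplus, unsigned long *free_pages,
					 int nodes, unsigned long nr_pages)
{
	unsigned long freed = 0;
	int remaining_iterations = nodes; /* budget: one full pass */
	int nid = 0;

	assert(nodes > 0);
	while (remaining_iterations-- && nr_pages) {
		nid = (nid + 1) % nodes;

		if (!surplus[nid])
			continue;

		if (free_pages[nid]) {
			free_pages[nid]--;
			surplus[nid]--;
			nr_pages--;
			freed++;
			/* Progress was made: allow another full pass. */
			remaining_iterations = nodes;
		}
	}
	return freed;
}
```

Without the iteration budget, a node with surplus pages but no free pages would keep the loop spinning for as long as nr_pages stays non-zero; with it, the search ends after visiting every node once without progress.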
-
Nishanth Aravamudan authored
Currently we show the surplus hugetlb pool state in /proc/meminfo, but not in the per-node meminfo files, even though we track the information on a per-node basis. Printing it there can help track down dynamic pool bugs including the one in the follow-on patch. Signed-off-by:
Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Daniel Yeisley authored
Commit 556a169d ("slab: fix bootstrap on memoryless node") introduced bootstrap-time cache_cache list3s for all nodes but forgot that initkmem_list3 needs to be accessed by [somevalue + node]. This patch fixes list_add() corruption in mm/slab.c seen on the ES7000. Cc: Mel Gorman <mel@csn.ul.ie> Cc: Olaf Hering <olaf@aepfle.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by:
Dan Yeisley <dan.yeisley@unisys.com> Signed-off-by:
Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by:
Christoph Lameter <clameter@sgi.com>
-