  1. Oct 16, 2008
  2. Oct 12, 2008
  3. Oct 09, 2008
  4. Oct 07, 2008
  5. Oct 02, 2008
    • mm: handle initialising compound pages at orders greater than MAX_ORDER · 6babc32c
      Andy Whitcroft authored
      
      
      When we initialise a compound page we initialise the page flags and head
      page pointer for all base pages spanned by that page.  When we initialise
      a gigantic page (a page of order greater than or equal to MAX_ORDER) we
      have to initialise more than MAX_ORDER_NR_PAGES pages.  Currently we
      assume that all elements of the mem_map in this page are contiguous in
      memory.  However this is only guaranteed out to MAX_ORDER_NR_PAGES pages,
      and with SPARSEMEM enabled they will not be contiguous.  This leads us to
      walk off the end of the first section and scribble on everything which
      follows, BAD.
      
      When we reach a MAX_ORDER_NR_PAGES boundary we must locate the next
      section of the mem_map.  As gigantic pages can only be maximally aligned
      we know this will occur at an exact multiple of MAX_ORDER_NR_PAGES pages from
      the start of the page.
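
A minimal userspace sketch of the boundary handling described above, not the kernel code itself. It assumes a toy mem_map stored as separate per-section arrays; SECTION_PAGES stands in for MAX_ORDER_NR_PAGES, and the struct page fields, the guard slot, and prep_gigantic() are hypothetical names chosen for illustration.

```c
#include <assert.h>

#define SECTION_PAGES 4UL	/* stand-in for MAX_ORDER_NR_PAGES */
#define NR_SECTIONS   4UL

struct page {
	int tail;			/* set on every non-head page */
	struct page *first_page;	/* points back at the head page */
};

/*
 * Each section's slice of the mem_map is its own row.  The extra slot at
 * the end of every row is a guard: a naive p++ walk past the end of a
 * section lands on it instead of on the next section's first page.
 */
struct page section_map[NR_SECTIONS][SECTION_PAGES + 1];

struct page *pfn_to_page(unsigned long pfn)
{
	assert(pfn < NR_SECTIONS * SECTION_PAGES);
	return &section_map[pfn / SECTION_PAGES][pfn % SECTION_PAGES];
}

/* Initialise nr_pages base pages starting at a section-aligned base_pfn. */
void prep_gigantic(unsigned long base_pfn, unsigned long nr_pages)
{
	struct page *head = pfn_to_page(base_pfn);
	struct page *p = head;
	unsigned long i;

	assert(base_pfn % SECTION_PAGES == 0);	/* maximally aligned */
	for (i = 1; i < nr_pages; i++) {
		if (i % SECTION_PAGES == 0)
			p = pfn_to_page(base_pfn + i);	/* re-locate at the boundary */
		else
			p++;				/* contiguous within a section */
		p->tail = 1;
		p->first_page = head;
	}
}
```

Calling prep_gigantic(4, 8) in this model touches pfns 5 to 11 and leaves the guard slots between sections untouched, which is the property the fix restores.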
      
      This is a bug fix for the gigantic page support in hugetlbfs.
      
      Credit to Mel Gorman for spotting the issue.
      
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6babc32c
    • mm: tiny-shmem nommu fix · 4b19de6d
      Nicholas Piggin authored
      
      
      The previous patch db203d53 ("mm: tiny-shmem fix lock ordering: mmap_sem
      vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared
      anonymous and IPC memory on NOMMU architectures because tiny-shmem was
      using the expanding truncate to signal ramfs to allocate physically
      contiguous RAM backing the inode (otherwise it is unusable for memory
      mapping to userspace).
      
      However do_truncate is what caused the lock ordering error, due to it
      taking i_mutex.  In this case, we can actually just call ramfs directly to
      allocate memory for the mapping, rather than go via truncate.
      
      Acked-by: David Howells <dhowells@redhat.com>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b19de6d
    • memory hotplug: missing zone->lock in test_pages_isolated() · 6c1b7f68
      Gerald Schaefer authored
      
      
      __test_page_isolated_in_pageblock() in mm/page_isolation.c has a comment
      saying that the caller must hold zone->lock. But the only caller of that
      function, test_pages_isolated(), does not hold zone->lock and the lock is
      also not acquired anywhere before. This patch adds the missing zone->lock
      to test_pages_isolated().
      
      We reproducibly run into BUG_ON(!PageBuddy(page)) in __offline_isolated_pages()
      during a memory hotplug stress test; see the trace below. This patch fixes
      that problem, and it would be good if we could have it in 2.6.27.
      
      kernel BUG at /home/autobuild/BUILD/linux-2.6.26-20080909/mm/page_alloc.c:4561!
      illegal operation: 0001 [#1] PREEMPT SMP
      Modules linked in: dm_multipath sunrpc bonding qeth_l3 dm_mod qeth ccwgroup vmur
      CPU: 1 Not tainted 2.6.26-29.x.20080909-s390default #1
      Process memory_loop_all (pid: 10025, task: 2f444028, ksp: 2b10dd28)
      Krnl PSW : 040c0000 801727ea (__offline_isolated_pages+0x18e/0x1c4)
       R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0
      Krnl GPRS: 00000000 7e27fc00 00000000 7e27fc00
       00000000 00000400 00014000 7e27fc01
       00606f00 7e27fc00 00013fe0 2b10dd28
       00000005 80172662 801727b2 2b10dd28
      Krnl Code: 801727de: 5810900c l %r1,12(%r9)
       801727e2: a7f4ffb3 brc 15,80172748
       801727e6: a7f40001 brc 15,801727e8
       >801727ea: a7f4ffbc brc 15,80172762
       801727ee: a7f40001 brc 15,801727f0
       801727f2: a7f4ffaf brc 15,80172750
       801727f6: 0707 bcr 0,%r7
       801727f8: 0017 unknown
      Call Trace:
      ([<0000000000172772>] __offline_isolated_pages+0x116/0x1c4)
       [<00000000001953a2>] offline_isolated_pages_cb+0x22/0x34
       [<000000000013164c>] walk_memory_resource+0xcc/0x11c
       [<000000000019520e>] offline_pages+0x36a/0x498
       [<00000000001004d6>] remove_memory+0x36/0x44
       [<000000000028fb06>] memory_block_change_state+0x112/0x150
       [<000000000028ffb8>] store_mem_state+0x90/0xe4
       [<0000000000289c00>] sysdev_store+0x34/0x40
       [<00000000001ee048>] sysfs_write_file+0xd0/0x178
       [<000000000019b1a8>] vfs_write+0x74/0x118
       [<000000000019b9ae>] sys_write+0x46/0x7c
       [<000000000011160e>] sysc_do_restart+0x12/0x16
       [<0000000077f3e8ca>] 0x77f3e8ca
      
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c1b7f68
  6. Sep 29, 2008
    • mm owner: fix race between swapoff and exit · 31a78f23
      Balbir Singh authored
      
      
      There's a race between mm->owner assignment and swapoff, more easily
      seen when task slab poisoning is turned on.  The condition occurs when
      try_to_unuse() runs in parallel with an exiting task.  A similar race
      can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
      or ptrace or page migration.
      
      CPU0                                    CPU1
                                              try_to_unuse
                                              looks at mm = task0->mm
                                              increments mm->mm_users
      task 0 exits
      mm->owner needs to be updated, but no
      new owner is found (mm_users > 1, but
      no other task has task->mm = task0->mm)
      mm_update_next_owner() leaves
                                              mmput(mm) decrements mm->mm_users
      task0 freed
                                              dereferencing mm->owner fails
      
      The fix is to notify the subsystem via the mm_owner_changed() callback
      if no new owner is found, by specifying the new task as NULL.
      
      Jiri Slaby:
      mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
      must be set after that, so as not to pass NULL as old owner causing oops.
      
      Daisuke Nishimura:
      mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
      and its callers need to take account of this situation to avoid oops.
      
      Hugh Dickins:
      Lockdep warning and hang below exec_mmap() when testing these patches.
      exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
      so exec_mmap() now needs to do the same.  And with that repositioning,
      there's now no point in mm_need_new_owner() allowing for NULL mm.
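
The ordering discussed in this message can be modelled in a few lines. This is a hedged userspace sketch, not the kernel implementation: struct task, struct mm, owner_changed(), update_next_owner() and owner_pid_or() are hypothetical stand-ins. It shows the notification being issued while mm->owner still names the old owner, a NULL new owner when nobody is left, and a reader that tolerates an ownerless mm.

```c
#include <stddef.h>

struct task { int pid; };
struct mm { struct task *owner; };

/* Record what the last notification carried so the ordering is observable. */
struct task *seen_old, *seen_new;
int notifications;

/* Stand-in for the subsystem callback; a NULL next means "no owner left". */
void owner_changed(struct task *old, struct task *next)
{
	seen_old = old;
	seen_new = next;
	notifications++;
}

/* next is NULL when no other task shares the mm. */
void update_next_owner(struct mm *mm, struct task *next)
{
	/* Notify first, while mm->owner still names the exiting task ... */
	owner_changed(mm->owner, next);
	/* ... and only then replace it, possibly with NULL. */
	mm->owner = next;
}

/* Readers must cope with an ownerless mm instead of dereferencing blindly. */
int owner_pid_or(const struct mm *mm, int fallback)
{
	return mm->owner ? mm->owner->pid : fallback;
}
```

Clearing mm->owner before the notification would hand NULL to the callback as the old owner, which is the oops Jiri Slaby's part of the patch avoids.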
      
      Reported-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31a78f23
  7. Sep 23, 2008
  8. Sep 15, 2008
  9. Sep 13, 2008
    • mm: mark the correct zone as full when scanning zonelists · 5bead2a0
      Mel Gorman authored
      
      
      The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when
      scanning zonelists to keep track of where in the zonelist it is.  The
      zoneref that is returned corresponds to the next zone that is to be
      scanned, not the current one.  It was intended to be treated as an opaque
      list.
      
      When the page allocator is scanning a zonelist, it marks elements in the
      zonelist corresponding to zones that are temporarily full.  As the
      zonelist is being updated, it uses the cursor here:
      
        if (NUMA_BUILD)
              zlc_mark_zone_full(zonelist, z);
      
      This is intended to prevent rescanning in the near future but the zoneref
      cursor does not correspond to the zone that has been found to be full.
      This is an easy misunderstanding to make so this patch corrects the
      problem by changing zoneref cursor to be the current zone being scanned
      instead of the next one.
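
A small model of the off-by-one, under stated assumptions: the zonelist is a plain array, the "full" cache is an int array indexed by zonelist position, and scan_zonelist() with its cursor_is_current switch is a hypothetical helper, not the kernel iterator.

```c
#define MAX_ZONES 8

struct zoneref {
	int zone_id;
	long free_pages;
};

/*
 * Scan zl[0..n) for a zone with at least `want` free pages and return its
 * id, or -1.  Zones that cannot satisfy the request are marked in full[].
 * With cursor_is_current == 0 the cursor seen by the loop body is one
 * entry ahead of the zone just examined (the old behaviour); with 1 it
 * names the zone just examined (the behaviour after the patch).
 */
int scan_zonelist(const struct zoneref *zl, int n, long want,
		  int cursor_is_current, int full[MAX_ZONES])
{
	for (int i = 0; i < n; i++) {
		int cursor = cursor_is_current ? i : i + 1;

		if (zl[i].free_pages >= want)
			return zl[i].zone_id;
		if (cursor < n)
			full[cursor] = 1;	/* models zlc_mark_zone_full(zonelist, z) */
	}
	return -1;
}
```

With two empty zones followed by one that has free pages, the old cursor marks the second and third entries (including the zone that actually satisfied the request) and never marks the first; the corrected cursor marks exactly the two empty zones.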
      
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.26.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5bead2a0
  10. Sep 04, 2008
  11. Sep 03, 2008
    • mm: size of quicklists shouldn't be proportional to the number of CPUs · b9541852
      KOSAKI Motohiro authored
      
      
      Quicklists store pages for each CPU as caches.  (Each CPU can cache
      node_free_pages/16 pages.)
      
      They are used as a page table cache.  exit() increases the cache size,
      while fork() consumes it.
      
      So, for example, if an apache-style application runs (one parent and many
      children), one CPU will fork() while the other CPUs process the
      middleware work and exit().
      
      At that time, the CPU on which the parent runs has no page table
      cache at all.  The others (on which the children run) have maximum caches.
      
      	QList_max = (#ofCPUs - 1) x Free / 16
      	=> QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)
      
      So, how much quicklist memory is used in the maximum case?
      
      It is proportional to the number of CPUs, because the per-CPU quicklist
      cache limit does not take the number of CPUs into account.
      
      The above calculation means:
      
      	 Number of CPUs per node            2    4    8   16
      	 ==============================  ====================
      	 QList_max / (Free + QList_max)   5.8%  16%  30%  48%
      
      Wow! Quicklists can consume about 50% of memory in the worst case.
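
The table follows directly from the formula QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1). A small sketch of that arithmetic (quicklist_worst_fraction() is a hypothetical helper name, not something from the patch):

```c
#include <assert.h>

/*
 * Worst-case share of (Free + QList_max) held in quicklists on one node,
 * assuming every CPU except the forking one has filled its cache to the
 * node_free_pages/16 limit.
 */
double quicklist_worst_fraction(unsigned int cpus_per_node)
{
	assert(cpus_per_node >= 1);
	double caching_cpus = cpus_per_node - 1;

	return caching_cpus / (16.0 + caching_cpus);
}
```

For 2, 4, 8 and 16 CPUs per node this evaluates to 1/17, 3/19, 7/23 and 15/31, which round to the 5.8%, 16%, 30% and 48% shown in the table.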
      
      My demonstration program is here
      --------------------------------------------------------------------------------
      #define _GNU_SOURCE
      
      #include <stdio.h>
      #include <errno.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sched.h>
      #include <unistd.h>
      #include <sys/mman.h>
      #include <sys/wait.h>
      
      #define BUFFSIZE 512
      
      int max_cpu(void)	/* get max number of logical cpus from /proc/cpuinfo */
      {
        FILE *fd;
        char *ret, buffer[BUFFSIZE];
        int cpu = 1;
      
        fd = fopen("/proc/cpuinfo", "r");
        if (fd == NULL) {
          perror("fopen(/proc/cpuinfo)");
          exit(EXIT_FAILURE);
        }
        while (1) {
          ret = fgets(buffer, BUFFSIZE, fd);
          if (ret == NULL)
            break;
          if (!strncmp(buffer, "processor", 9))
            cpu = atoi(strchr(buffer, ':') + 2);
        }
        fclose(fd);
        return cpu;
      }
      
      void cpu_bind(int cpu)	/* bind current process to one cpu */
      {
        cpu_set_t mask;
        int ret;
      
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        ret = sched_setaffinity(0, sizeof(mask), &mask);
        if (ret == -1) {
          perror("sched_setaffinity()");
          exit(EXIT_FAILURE);
        }
        sched_yield();	/* not necessary */
      }
      
      #define MMAP_SIZE (10 * 1024 * 1024)	/* 10 MB */
      #define FORK_INTERVAL 1	/* 1 second */
      
      int main(int argc, char *argv[])
      {
        int cpu_max, nextcpu;
        long pagesize;
        pid_t pid;
      
        /* set max number of logical cpu */
        if (argc > 1)
          cpu_max = atoi(argv[1]) - 1;
        else
          cpu_max = max_cpu();
      
        /* get the page size */
        pagesize = sysconf(_SC_PAGESIZE);
        if (pagesize == -1) {
          perror("sysconf(_SC_PAGESIZE)");
          exit(EXIT_FAILURE);
        }
      
        /* prepare parent process */
        cpu_bind(0);
        nextcpu = cpu_max;
      
      loop:
      
        /* select destination cpu for child process by round-robin rule */
        if (++nextcpu > cpu_max)
          nextcpu = 1;
      
        pid = fork();
      
        if (pid == 0) { /* child action */
      
          char *p;
          int i;
      
          /* consume page tables */
          p = mmap(NULL, MMAP_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p == MAP_FAILED) {
            perror("mmap()");
            exit(EXIT_FAILURE);
          }
          i = MMAP_SIZE / pagesize;
          while (i-- > 0) {
            *p = 1;
            p += pagesize;
          }
      
          /* move to other cpu */
          cpu_bind(nextcpu);
      /*
          printf("a child moved to cpu%d after mmap().\n", nextcpu);
          fflush(stdout);
       */
      
          /* back page tables to pgtable_quicklist */
          exit(0);
      
        } else if (pid > 0) { /* parent action */
      
          sleep(FORK_INTERVAL);
          waitpid(pid, NULL, WNOHANG);
      
        }
      
        goto loop;
      }
      ----------------------------------------
      
      When the above program, which performs task migration, runs, my 8GB box
      spends 800MB of memory on quicklists.  This is not a memory leak, but it
      doesn't seem good.
      
      % cat /proc/meminfo
      
      MemTotal:        7701568 kB
      MemFree:         4724672 kB
      (snip)
      Quicklists:       844800 kB
      
      because
      
      - My machine spec is
      	number of numa node: 2
      	number of cpus:      8 (4CPU x2 node)
              total mem:           8GB (4GB x2 node)
              free mem:            about 5GB
      
      - Then, 4.7GB x 16% ~= 880MB.
        So, Quicklist can use 800MB.
      
      So, if a machine with the following spec runs that program:
      
         CPUs: 64 (8cpu x 8node)
         Mem:  1TB (128GB x8node)
      
      Then, quicklists can waste 300GB (= 1TB x 30%).  That is too much.
      
      So, I don't like cache policies that are proportional to the number of CPUs.
      
      My patch changes the number of caches
      from:
         per-cpu-cache-amount = memory_on_node / 16
      to
         per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node.
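
A sketch of the two limits side by side, using only the formulas quoted above. The function names are hypothetical, and the real patch may apply further adjustments that are not described in this message.

```c
#include <assert.h>

/* Per-CPU cache limit before the patch: independent of the CPU count. */
unsigned long limit_before(unsigned long memory_on_node)
{
	return memory_on_node / 16;
}

/* Per-CPU cache limit after the patch: the node's share is split per CPU. */
unsigned long limit_after(unsigned long memory_on_node,
			  unsigned int cpus_on_node)
{
	assert(cpus_on_node >= 1);
	return memory_on_node / 16 / cpus_on_node;
}

/* Worst case from the message: every CPU but the forking one is full. */
unsigned long worst_case_total(unsigned long per_cpu_limit,
			       unsigned int cpus_on_node)
{
	return per_cpu_limit * (cpus_on_node - 1);
}
```

With the new limit the worst-case total on a node stays below memory_on_node / 16 regardless of how many CPUs the node has, whereas the old limit lets it grow linearly with the CPU count.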
      
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Tested-by: David Miller <davem@davemloft.net>
      Acked-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9541852
    • mm/bootmem: silence section mismatch warning - contig_page_data/bootmem_node_data · 52765583
      Marcin Slusarz authored
      
      
      WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
      The variable contig_page_data references
      the variable __initdata bootmem_node_data
      If the reference is valid then annotate the
      variable with __init* (see linux/init.h) or name the variable:
      *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
      
      Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Sean MacLennan <smaclennan@pikatech.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52765583
    • VFS: fix dio write returning EIO when try_to_release_page fails · 6ccfa806
      Hisashi Hifumi authored
      
      
      Dio write returns EIO when try_to_release_page fails because bh is
      still referenced.
      
      The patch
      
          commit 3f31fddf
          Author: Mingming Cao <cmm@us.ibm.com>
          Date:   Fri Jul 25 01:46:22 2008 -0700
      
              jbd: fix race between free buffer and commit transaction
      
      was merged into 2.6.27-rc1, but I noticed that this patch is not enough
      to fix the race.
      
      I ran fsstress heavily against 2.6.27-rc1 and found that dio writes still
      sometimes got EIO during this test.
      
      The patch above fixed the race between freeing a buffer (dio) and
      committing a transaction (jbd), but I discovered that there is another
      race, between freeing a buffer (dio) and ext3/4_ordered_writepage.
      
      : background_writeout()
           ->write_cache_pages()
             ->ext3_ordered_writepage()
           	   walk_page_buffers() -> take a bh ref
       	   block_write_full_page() -> unlock_page
      		: <- end_page_writeback
                      : <- race! (dio write->try_to_release_page fails)
            	   walk_page_buffers() ->release a bh ref
      
      ext3_ordered_writepage holds a bh ref and calls unlock_page while still
      holding it, so this causes the race and the failure of
      try_to_release_page.
      
      To fix this race, I used the approach of falling back to buffered
      writes if try_to_release_page() fails on a page.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mingming Cao <cmm@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ccfa806
    • mm: make setup_zone_migrate_reserve() aware of overlapping nodes · 344c790e
      Adam Litke authored
      
      
      I have gotten to the root cause of the hugetlb badness I reported back on
      August 15th.  My system has the following memory topology (note the
      overlapping node):
      
                  Node 0 Memory: 0x8000000-0x44000000
                  Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000
      
      setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking
      for a pageblock to move onto the MIGRATE_RESERVE list.  Finding no
      candidates, it happily continues the scan into 0x8000000-0x44000000.  When
      a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on
      the wrong zone.  Oops.
      
      setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.
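
The skip can be pictured with a toy model. This is a sketch under assumptions, not the kernel function: each array entry stands for one pageblock tagged with the node that owns it, and struct block, reserve_blocks() and the MIGRATE_* values here are illustrative only.

```c
#define MIGRATE_MOVABLE 0
#define MIGRATE_RESERVE 1

struct block {
	int nid;		/* node that really owns this pageblock */
	int migratetype;
};

/*
 * Walk the zone's span [start, end) and reserve up to `want` pageblocks
 * for the zone that lives on node zone_nid.  Blocks inside the span that
 * belong to an overlapping node are skipped rather than reserved.
 */
int reserve_blocks(struct block *blocks, int start, int end,
		   int zone_nid, int want)
{
	int reserved = 0;

	for (int i = start; i < end && reserved < want; i++) {
		if (blocks[i].nid != zone_nid)
			continue;	/* belongs to another node's zone */
		blocks[i].migratetype = MIGRATE_RESERVE;
		reserved++;
	}
	return reserved;
}
```

In a span laid out like the topology above (node 1, then node 0, then node 1 again), a scan on behalf of node 1 steps over the node 0 blocks in the middle and resumes reserving once it reaches node 1 memory again.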
      
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      344c790e
  12. Sep 02, 2008
  13. Aug 27, 2008
    • [ARM] Skip memory holes in FLATMEM when reading /proc/pagetypeinfo · e80d6a24
      Mel Gorman authored
      
      
      Ordinarily, memory holes in flatmem still have a valid memmap that is safe
      to use. However, an architecture (ARM) frees up the memmap backing memory
      holes on the assumption it is never used. /proc/pagetypeinfo reads the
      whole range of pages in a zone believing that the memmap is valid and that
      pfn_valid will return false if it is not. On ARM, freeing the memmap breaks
      the page->zone linkages even though pfn_valid() returns true and the kernel
      can oops shortly afterwards due to accessing a bogus struct zone *.
      
      This patch lets architectures say when FLATMEM can have holes in the
      memmap. Rather than an expensive check for valid memory, /proc/pagetypeinfo
      will confirm that the page linkages are still valid by checking page->zone
      is still the expected zone. The lookup of page_zone is safe as there is a
      limited range of memory that is accessed when calling page_zone.  Even if
      page_zone happens to return the correct zone, the impact is that the counters
      in /proc/pagetypeinfo are slightly off but fragmentation monitoring is
      unlikely to be relevant on an embedded system.
      
      Reported-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Tested-by: H Hartley Sweeten <hsweeten@visionengravers.com>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      e80d6a24
  14. Aug 20, 2008
  15. Aug 15, 2008
    • bootmem allocator: alloc_bootmem_core(): page-align the end offset · 627240aa
      Mikulas Patocka authored
      
      
      This is the minimal sequence that jams the allocator:
      
      void *p, *q, *r;
      p = alloc_bootmem(PAGE_SIZE);
      q = alloc_bootmem(64);
      free_bootmem(p, PAGE_SIZE);
      p = alloc_bootmem(PAGE_SIZE);
      r = alloc_bootmem(64);
      
      after this sequence (assuming that the allocator was empty or page-aligned
      before), pointer "q" will be equal to pointer "r".
      
      What's happening inside the allocator:
      p = alloc_bootmem(PAGE_SIZE);
      in allocator: last_end_off == PAGE_SIZE, bitmap contains bits 10000...
      q = alloc_bootmem(64);
      in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 11000...
      free_bootmem(p, PAGE_SIZE);
      in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 01000...
      p = alloc_bootmem(PAGE_SIZE);
      in allocator: last_end_off == PAGE_SIZE, bitmap contains 11000...
      r = alloc_bootmem(64);
      
      and now:
      
      it finds bit "2" as the place to allocate (sidx)
      
      it hits the condition
      
      if (bdata->last_end_off && PFN_DOWN(bdata->last_end_off) + 1 == sidx)
      start_off = ALIGN(bdata->last_end_off, align);
      
      - you can see that the condition is true, so it assigns start_off =
      ALIGN(bdata->last_end_off, align); (that is PAGE_SIZE) and allocates
      over the already allocated block.
      
      With the patch it tries to continue at the end of previous allocation only
      if the previous allocation ended in the middle of the page.
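
The sequence can be replayed in a userspace model. This is a simplified sketch, not alloc_bootmem_core() itself: one byte per page replaces the bitmap, goal/limit handling is omitted, and struct bootmem, bm_alloc(), bm_free(), replay() and the patched switch are hypothetical. The patched branch encodes the rule stated above, continuing at the previous end only when it lies in the middle of a page.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 4096UL
#define NPAGES    16UL
#define PFN_DOWN(x)    ((x) / PAGE_SIZE)
#define PFN_UP(x)      (((x) + PAGE_SIZE - 1) / PAGE_SIZE)
#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))

struct bootmem {
	unsigned char used[NPAGES];	/* one entry per page, not a bitmap */
	unsigned long last_end_off;
	int patched;			/* 0: old condition, 1: fixed condition */
};

/* Returns the byte offset of the new allocation. */
unsigned long bm_alloc(struct bootmem *b, unsigned long size,
		       unsigned long align)
{
	unsigned long sidx = 0, eidx, start_off, end_off, i;

	for (;;) {	/* find PFN_UP(size) consecutive free pages */
		while (sidx < NPAGES && b->used[sidx])
			sidx++;
		eidx = sidx + PFN_UP(size);
		assert(eidx <= NPAGES);
		for (i = sidx; i < eidx && !b->used[i]; i++)
			;
		if (i == eidx)
			break;
		sidx = i + 1;
	}

	int mid_page = (b->last_end_off & (PAGE_SIZE - 1)) != 0;
	int may_continue = b->patched ? mid_page : b->last_end_off != 0;

	if (may_continue && PFN_DOWN(b->last_end_off) + 1 == sidx)
		start_off = ALIGN_UP(b->last_end_off, align);
	else
		start_off = sidx * PAGE_SIZE;
	end_off = start_off + size;
	b->last_end_off = end_off;

	/* Reserve, skipping the first page if it is shared with the last one. */
	for (i = PFN_DOWN(start_off) + (PFN_DOWN(start_off) < sidx);
	     i < PFN_UP(end_off); i++)
		b->used[i] = 1;
	return start_off;
}

void bm_free(struct bootmem *b, unsigned long off, unsigned long size)
{
	for (unsigned long i = PFN_UP(off); i < PFN_DOWN(off + size); i++)
		b->used[i] = 0;
}

/* Replays the sequence from this message; returns 1 if q and r collide. */
int replay(int patched)
{
	struct bootmem b;
	unsigned long p, q, r;

	memset(&b, 0, sizeof(b));
	b.patched = patched;
	p = bm_alloc(&b, PAGE_SIZE, 64);
	q = bm_alloc(&b, 64, 64);
	bm_free(&b, p, PAGE_SIZE);
	p = bm_alloc(&b, PAGE_SIZE, 64);
	r = bm_alloc(&b, 64, 64);
	return q == r;
}
```

In this model replay(0) reports the collision (q and r both land at offset PAGE_SIZE), while replay(1) places r on the next free page.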
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Johannes Weiner <hannes@saeurebad.de>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      627240aa
    • x86, pat: avoid highmem cache attribute aliasing · 5843d9a4
      Nicholas Piggin authored
      
      
      Highmem code can leave ptes and tlb entries around for a given page even after
      kunmap, and after it has been freed.
      
      From what I can gather, the PAT code may change the cache attributes of
      arbitrary physical addresses (ie. including highmem pages), which would result
      in aliases in the case that it operates on one of these lazy tlb highmem
      pages.
      
      Flushing kmaps should solve the problem.
      
      I've also just added code for conditional flushing if we haven't got
      any dangling highmem aliases -- this should help performance if we
      change page attributes frequently or on systems that aren't using many
      highmem pages (eg. if < 4G RAM). Should be turned into 2 patches, but
      just for RFC...
      
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5843d9a4
  16. Aug 14, 2008
    • security: Fix setting of PF_SUPERPRIV by __capable() · 5cd9c58f
      David Howells authored
      
      
      Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags
      of the target process if that is not the current process and it is trying to
      change its own flags in a different way at the same time.
      
      __capable() is using neither atomic ops nor locking to protect t->flags.  This
      patch removes __capable() and introduces has_capability() that doesn't set
      PF_SUPERPRIV on the process being queried.
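
The difference between the two checks can be sketched as follows. This is an illustrative userspace model rather than the kernel API: struct task, the capability bitmask, current_task and the flag value are assumptions made for the example. The point is that the query on an arbitrary task has no side effects, while the check on the current task may record that privilege was used.

```c
#define PF_SUPERPRIV 0x00000100u	/* illustrative flag value */

struct task {
	unsigned int flags;
	unsigned int caps;	/* bit n set: task holds capability n */
};

struct task *current_task;	/* stand-in for the kernel's current */

/* Pure query: never writes to the task being asked about. */
int has_capability(const struct task *t, int cap)
{
	return (t->caps >> cap) & 1u;
}

/* Acts on the current task only, so touching its own flags is safe. */
int capable(int cap)
{
	if (has_capability(current_task, cap)) {
		current_task->flags |= PF_SUPERPRIV;
		return 1;
	}
	return 0;
}
```

Because has_capability() takes a const pointer, a caller inspecting another process cannot race with that process updating its own flags.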
      
      This patch further splits security_ptrace() in two:
      
       (1) security_ptrace_may_access().  This passes judgement on whether one
           process may access another only (PTRACE_MODE_ATTACH for ptrace() and
           PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
           current is the parent.
      
       (2) security_ptrace_traceme().  This passes judgement on PTRACE_TRACEME only,
           and takes only a pointer to the parent process.  current is the child.
      
           In Smack and commoncap, this uses has_capability() to determine whether
           the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
           This does not set PF_SUPERPRIV.
      
      Two of the instances of __capable() actually only act on current, and so have
      been changed to calls to capable().
      
      Of the places that were using __capable():
      
       (1) The OOM killer calls __capable() thrice when weighing the killability of a
           process.  All of these now use has_capability().
      
       (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
           whether the parent was allowed to trace any process.  As mentioned above,
           these have been split.  For PTRACE_ATTACH and /proc, capable() is now
           used, and for PTRACE_TRACEME, has_capability() is used.
      
       (3) cap_safe_nice() only ever saw current, so now uses capable().
      
       (4) smack_setprocattr() rejected accesses to tasks other than current just
           after calling __capable(), so the order of these two tests has been
           switched and capable() is used instead.
      
       (5) In smack_file_send_sigiotask(), we need to allow privileged processes to
           receive SIGIO on files they're manipulating.
      
       (6) In smack_task_wait(), we let a process wait for a privileged process,
           whether or not the process doing the waiting is privileged.
      
      I've tested this with the LTP SELinux and syscalls testscripts.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Acked-by: Casey Schaufler <casey@schaufler-ca.com>
      Acked-by: Andrew G. Morgan <morgan@kernel.org>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: James Morris <jmorris@namei.org>
      5cd9c58f
  17. Aug 12, 2008