Skip to content
  1. Nov 07, 2015
    • Davidlohr Bueso's avatar
      ipc,msg: drop dst nil validation in copy_msg · 5f2a2d5d
      Davidlohr Bueso authored
      
      
      d0edd852 ("ipc: convert invalid scenarios to use WARN_ON") relaxed the
      nil dst parameter check, originally being a full BUG_ON.  However, this
      check seems quite unnecessary when the only purpose is for
      ceckpoint/restore (MSG_COPY flag):
      
      o The copy variable is set initially to nil, apparently as a way of
        ensuring that prepare_copy is previously called.  Which is in fact done,
        unconditionally at the beginning of do_msgrcv.
      
      o There is no concurrency with 'copy' (stack allocated in do_msgrcv).
      
      Furthermore, any errors in 'copy' (and thus prepare_copy/copy_msg) should
      always handled by IS_ERR() family.  Therefore remove this check altogether
      as it can never occur with the current users.
      
      Signed-off-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5f2a2d5d
  2. Sep 30, 2015
    • Linus Torvalds's avatar
      Initialize msg/shm IPC objects before doing ipc_addid() · b9a53227
      Linus Torvalds authored
      
      
      As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before
      having initialized the IPC object state.  Yes, we initialize the IPC
      object in a locked state, but with all the lockless RCU lookup work,
      that IPC object lock no longer means that the state cannot be seen.
      
      We already did this for the IPC semaphore code (see commit e8577d1f:
      "ipc/sem.c: fully initialize sem_array before making it visible") but we
      clearly forgot about msg and shm.
      
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b9a53227
  3. Sep 10, 2015
    • Davidlohr Bueso's avatar
      ipc: convert invalid scenarios to use WARN_ON · d0edd852
      Davidlohr Bueso authored
      
      
      Considering Linus' past rants about the (ab)use of BUG in the kernel, I
      took a look at how we deal with such calls in ipc.  Given that any errors
      or corruption in ipc code are most likely contained within the set of
      processes participating in the broken mechanisms, there aren't really many
      strong fatal system failure scenarios that would require a BUG call.
      Also, if something is seriously wrong, ipc might not be the place for such
      a BUG either.
      
      1. For example, recently, a customer hit one of these BUG_ONs in shm
         after failing shm_lock().  A busted ID imho does not merit a BUG_ON,
         and WARN would have been better.
      
      2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore.
         I don't see how we can hit this anyway -- at least it should be IS_ERR.
          The 'copy' arg from do_msgrcv is always set by calling prepare_copy()
         first and foremost.  We could also probably drop this check altogether.
          Either way, it does not merit a BUG_ON.
      
      3. No ->fault() callback for the fs getting the corresponding page --
         seems selfish to make the system unusable.
      
      Signed-off-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0edd852
  4. Aug 14, 2015
    • Manfred Spraul's avatar
      ipc/sem.c: update/correct memory barriers · 3ed1f8a9
      Manfred Spraul authored
      
      
      sem_lock() did not properly pair memory barriers:
      
      !spin_is_locked() and spin_unlock_wait() are both only control barriers.
      The code needs an acquire barrier, otherwise the cpu might perform read
      operations before the lock test.
      
      As no primitive exists inside <include/spinlock.h> and since it seems
      noone wants another primitive, the code creates a local primitive within
      ipc/sem.c.
      
      With regards to -stable:
      
      The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
      nop (just a preprocessor redefinition to improve the readability).  The
      bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
      starting from 3.10).
      
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3ed1f8a9
    • Herton R. Krzesinski's avatar
      ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem() · a9795584
      Herton R. Krzesinski authored
      
      
      After we acquire the sma->sem_perm lock in exit_sem(), we are protected
      against a racing IPC_RMID operation.  Also at that point, we are the last
      user of sem_undo_list.  Therefore it isn't required that we acquire or use
      ulp->lock.
      
      Signed-off-by: default avatarHerton R. Krzesinski <herton@redhat.com>
      Acked-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rafael Aquini <aquini@redhat.com>
      CC: Aristeu Rozanski <aris@redhat.com>
      Cc: David Jeffery <djeffery@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9795584
    • Herton R. Krzesinski's avatar
      ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits · 602b8593
      Herton R. Krzesinski authored
      
      
      The current semaphore code allows a potential use after free: in
      exit_sem we may free the task's sem_undo_list while there is still
      another task looping through the same semaphore set and cleaning the
      sem_undo list at freeary function (the task called IPC_RMID for the same
      semaphore set).
      
      For example, with a test program [1] running which keeps forking a lot
      of processes (which then do a semop call with SEM_UNDO flag), and with
      the parent right after removing the semaphore set with IPC_RMID, and a
      kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and
      CONFIG_DEBUG_SPINLOCK, you can easily see something like the following
      in the kernel log:
      
         Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64
         000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b  kkkkkkkk.kkkkkkk
         010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  ....kkkk........
         Prev obj: start=ffff88003b45c180, len=64
         000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a  .....N......ZZZZ
         010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff  ...........7....
         Next obj: start=ffff88003b45c200, len=64
         000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a  .....N......ZZZZ
         010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff  ........h).<....
         BUG: spinlock wrong CPU on CPU#2, test/18028
         general protection fault: 0000 [#1] SMP
         Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
         CPU: 2 PID: 18028 Comm: test Not tainted 4.2.0-rc5+ #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
         RIP: spin_dump+0x53/0xc0
         Call Trace:
           spin_bug+0x30/0x40
           do_raw_spin_unlock+0x71/0xa0
           _raw_spin_unlock+0xe/0x10
           freeary+0x82/0x2a0
           ? _raw_spin_lock+0xe/0x10
           semctl_down.clone.0+0xce/0x160
           ? __do_page_fault+0x19a/0x430
           ? __audit_syscall_entry+0xa8/0x100
           SyS_semctl+0x236/0x2c0
           ? syscall_trace_leave+0xde/0x130
           entry_SYSCALL_64_fastpath+0x12/0x71
         Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2c a4 81 31 c0 65 8b 15 eb 40 f3 7e e8 08 31 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
         RIP  [<ffffffff810d6053>] spin_dump+0x53/0xc0
          RSP <ffff88003750fd68>
         ---[ end trace 783ebb76612867a0 ]---
         NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053]
         Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
         CPU: 3 PID: 18053 Comm: test Tainted: G      D         4.2.0-rc5+ #1
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
         RIP: native_read_tsc+0x0/0x20
         Call Trace:
           ? delay_tsc+0x40/0x70
           __delay+0xf/0x20
           do_raw_spin_lock+0x96/0x140
           _raw_spin_lock+0xe/0x10
           sem_lock_and_putref+0x11/0x70
           SYSC_semtimedop+0x7bf/0x960
           ? handle_mm_fault+0xbf6/0x1880
           ? dequeue_task_fair+0x79/0x4a0
           ? __do_page_fault+0x19a/0x430
           ? kfree_debugcheck+0x16/0x40
           ? __do_page_fault+0x19a/0x430
           ? __audit_syscall_entry+0xa8/0x100
           ? do_audit_syscall_entry+0x66/0x70
           ? syscall_trace_enter_phase1+0x139/0x160
           SyS_semtimedop+0xe/0x10
           SyS_semop+0x10/0x20
           entry_SYSCALL_64_fastpath+0x12/0x71
         Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 <55> 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9
         Kernel panic - not syncing: softlockup: hung tasks
      
      I wasn't able to trigger any badness on a recent kernel without the
      proper config debugs enabled, however I have softlockup reports on some
      kernel versions, in the semaphore code, which are similar as above (the
      scenario is seen on some servers running IBM DB2 which uses semaphore
      syscalls).
      
      The patch here fixes the race against freeary, by acquiring or waiting
      on the sem_undo_list lock as necessary (exit_sem can race with freeary,
      while freeary sets un->semid to -1 and removes the same sem_undo from
      list_proc or when it removes the last sem_undo).
      
      After the patch I'm unable to reproduce the problem using the test case
      [1].
      
      [1] Test case used below:
      
          #include <stdio.h>
          #include <sys/types.h>
          #include <sys/ipc.h>
          #include <sys/sem.h>
          #include <sys/wait.h>
          #include <stdlib.h>
          #include <time.h>
          #include <unistd.h>
          #include <errno.h>
      
          #define NSEM 1
          #define NSET 5
      
          int sid[NSET];
      
          void thread()
          {
                  struct sembuf op;
                  int s;
                  uid_t pid = getuid();
      
                  s = rand() % NSET;
                  op.sem_num = pid % NSEM;
                  op.sem_op = 1;
                  op.sem_flg = SEM_UNDO;
      
                  semop(sid[s], &op, 1);
                  exit(EXIT_SUCCESS);
          }
      
          void create_set()
          {
                  int i, j;
                  pid_t p;
                  union {
                          int val;
                          struct semid_ds *buf;
                          unsigned short int *array;
                          struct seminfo *__buf;
                  } un;
      
                  /* Create and initialize semaphore set */
                  for (i = 0; i < NSET; i++) {
                          sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT);
                          if (sid[i] < 0) {
                                  perror("semget");
                                  exit(EXIT_FAILURE);
                          }
                  }
                  un.val = 0;
                  for (i = 0; i < NSET; i++) {
                          for (j = 0; j < NSEM; j++) {
                                  if (semctl(sid[i], j, SETVAL, un) < 0)
                                          perror("semctl");
                          }
                  }
      
                  /* Launch threads that operate on semaphore set */
                  for (i = 0; i < NSEM * NSET * NSET; i++) {
                          p = fork();
                          if (p < 0)
                                  perror("fork");
                          if (p == 0)
                                  thread();
                  }
      
                  /* Free semaphore set */
                  for (i = 0; i < NSET; i++) {
                          if (semctl(sid[i], NSEM, IPC_RMID))
                                  perror("IPC_RMID");
                  }
      
                  /* Wait for forked processes to exit */
                  while (wait(NULL)) {
                          if (errno == ECHILD)
                                  break;
                  };
          }
      
          int main(int argc, char **argv)
          {
                  pid_t p;
      
                  srand(time(NULL));
      
                  while (1) {
                          p = fork();
                          if (p < 0) {
                                  perror("fork");
                                  exit(EXIT_FAILURE);
                          }
                          if (p == 0) {
                                  create_set();
                                  goto end;
                          }
      
                          /* Wait for forked processes to exit */
                          while (wait(NULL)) {
                                  if (errno == ECHILD)
                                          break;
                          };
                  }
          end:
                  return 0;
          }
      
      [akpm@linux-foundation.org: use normal comment layout]
      Signed-off-by: default avatarHerton R. Krzesinski <herton@redhat.com>
      Acked-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rafael Aquini <aquini@redhat.com>
      CC: Aristeu Rozanski <aris@redhat.com>
      Cc: David Jeffery <djeffery@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      602b8593
  5. Aug 07, 2015
    • Stephen Smalley's avatar
      ipc: use private shmem or hugetlbfs inodes for shm segments. · e1832f29
      Stephen Smalley authored
      
      
      The shm implementation internally uses shmem or hugetlbfs inodes for shm
      segments.  As these inodes are never directly exposed to userspace and
      only accessed through the shm operations which are already hooked by
      security modules, mark the inodes with the S_PRIVATE flag so that inode
      security initialization and permission checking is skipped.
      
      This was motivated by the following lockdep warning:
      
        ======================================================
         [ INFO: possible circular locking dependency detected ]
         4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G        W
        -------------------------------------------------------
         httpd/1597 is trying to acquire lock:
         (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
         but task is already holding lock:
         (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
         which lock already depends on the new lock.
         the existing dependency chain (in reverse order) is:
         -> #3 (&mm->mmap_sem){++++++}:
              lock_acquire+0xc7/0x270
              __might_fault+0x7a/0xa0
              filldir+0x9e/0x130
              xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
              xfs_readdir+0x1b4/0x330 [xfs]
              xfs_file_readdir+0x2b/0x30 [xfs]
              iterate_dir+0x97/0x130
              SyS_getdents+0x91/0x120
              entry_SYSCALL_64_fastpath+0x12/0x76
         -> #2 (&xfs_dir_ilock_class){++++.+}:
              lock_acquire+0xc7/0x270
              down_read_nested+0x57/0xa0
              xfs_ilock+0x167/0x350 [xfs]
              xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
              xfs_attr_get+0xbd/0x190 [xfs]
              xfs_xattr_get+0x3d/0x70 [xfs]
              generic_getxattr+0x4f/0x70
              inode_doinit_with_dentry+0x162/0x670
              sb_finish_set_opts+0xd9/0x230
              selinux_set_mnt_opts+0x35c/0x660
              superblock_doinit+0x77/0xf0
              delayed_superblock_init+0x10/0x20
              iterate_supers+0xb3/0x110
              selinux_complete_init+0x2f/0x40
              security_load_policy+0x103/0x600
              sel_write_load+0xc1/0x750
              __vfs_write+0x37/0x100
              vfs_write+0xa9/0x1a0
              SyS_write+0x58/0xd0
              entry_SYSCALL_64_fastpath+0x12/0x76
        ...
      
      Signed-off-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Reported-by: default avatarMorten Stevens <mstevens@fedoraproject.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarPaul Moore <paul@paul-moore.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e1832f29
    • Marcus Gelderie's avatar
      ipc: modify message queue accounting to not take kernel data structures into account · de54b9ac
      Marcus Gelderie authored
      A while back, the message queue implementation in the kernel was
      improved to use btrees to speed up retrieval of messages, in commit
      d6629859 ("ipc/mqueue: improve performance of send/recv").
      
      That patch introducing the improved kernel handling of message queues
      (using btrees) has, as a by-product, changed the meaning of the QSIZE
      field in the pseudo-file created for the queue.  Before, this field
      reflected the size of the user-data in the queue.  Since, it also takes
      kernel data structures into account.  For example, if 13 bytes of user
      data are in the queue, on my machine the file reports a size of 61
      bytes.
      
      There was some discussion on this topic before (for example
      https://lkml.org/lkml/2014/10/1/115).  Commenting on a th lkml, Michael
      Kerrisk gave the following background
      (https://lkml.org/lkml/2015/6/16/74
      
      ):
      
          The pseudofiles in the mqueue filesystem (usually mounted at
          /dev/mqueue) expose fields with metadata describing a message
          queue. One of these fields, QSIZE, as originally implemented,
          showed the total number of bytes of user data in all messages in
          the message queue, and this feature was documented from the
          beginning in the mq_overview(7) page. In 3.5, some other (useful)
          work happened to break the user-space API in a couple of places,
          including the value exposed via QSIZE, which now includes a measure
          of kernel overhead bytes for the queue, a figure that renders QSIZE
          useless for its original purpose, since there's no way to deduce
          the number of overhead bytes consumed by the implementation.
          (The other user-space breakage was subsequently fixed.)
      
      This patch removes the accounting of kernel data structures in the
      queue.  Reporting the size of these data-structures in the QSIZE field
      was a breaking change (see Michael's comment above).  Without the QSIZE
      field reporting the total size of user-data in the queue, there is no
      way to deduce this number.
      
      It should be noted that the resource limit RLIMIT_MSGQUEUE is counted
      against the worst-case size of the queue (in both the old and the new
      implementation).  Therefore, the kernel overhead accounting in QSIZE is
      not necessary to help the user understand the limitations RLIMIT imposes
      on the processes.
      
      Signed-off-by: default avatarMarcus Gelderie <redmnic@gmail.com>
      Acked-by: default avatarDoug Ledford <dledford@redhat.com>
      Acked-by: default avatarMichael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: John Duffy <jb_duffy@btinternet.com>
      Cc: Arto Bendiken <arto@bendiken.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de54b9ac
  6. Jul 01, 2015
  7. May 08, 2015
    • Davidlohr Bueso's avatar
      ipc/mqueue: Implement lockless pipelined wakeups · fa6004ad
      Davidlohr Bueso authored
      
      
      This patch moves the wakeup_process() invocation so it is not done under
      the info->lock by making use of a lockless wake_q. With this change, the
      waiter is woken up once it is STATE_READY and it does not need to loop
      on SMP if it is still in STATE_PENDING. In the timeout case we still need
      to grab the info->lock to verify the state.
      
      This change should also avoid the introduction of preempt_disable() in -rt
      which avoids a busy-loop which pools for the STATE_PENDING -> STATE_READY
      change if the waiter has a higher priority compared to the waker.
      
      Additionally, this patch micro-optimizes wq_sleep by using the cheaper
      cousin of set_current_state(TASK_INTERRUPTABLE) as we will block no
      matter what, thus get rid of the implied barrier.
      
      Signed-off-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: default avatarGeorge Spelvin <linux@horizon.com>
      Acked-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: dave@stgolabs.net
      Link: http://lkml.kernel.org/r/1430748166.1940.17.camel@stgolabs.net
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      fa6004ad
  8. Apr 15, 2015
  9. Feb 17, 2015
  10. Dec 13, 2014
  11. Dec 04, 2014
  12. Dec 03, 2014
  13. Nov 19, 2014
  14. Oct 14, 2014
  15. Sep 09, 2014
  16. Aug 08, 2014
    • Jack Miller's avatar
      shm: allow exit_shm in parallel if only marking orphans · 83293c0f
      Jack Miller authored
      
      
      If shm_rmid_force (the default state) is not set then the shmids are only
      marked as orphaned and does not require any add, delete, or locking of the
      tree structure.
      
      Seperate the sysctl on and off case, and only obtain the read lock.  The
      newly added list head can be deleted under the read lock because we are
      only called with current and will only change the semids allocated by this
      task and not manipulate the list.
      
      This commit assumes that up_read includes a sufficient memory barrier for
      the writes to be seen my others that later obtain a write lock.
      
      Signed-off-by: default avatarMilton Miller <miltonm@bga.com>
      Signed-off-by: default avatarJack Miller <millerjo@us.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      83293c0f
    • Jack Miller's avatar
      shm: make exit_shm work proportional to task activity · ab602f79
      Jack Miller authored
      This is small set of patches our team has had kicking around for a few
      versions internally that fixes tasks getting hung on shm_exit when there
      are many threads hammering it at once.
      
      Anton wrote a simple test to cause the issue:
      
        http://ozlabs.org/~anton/junkcode/bust_shm_exit.c
      
      
      
      Before applying this patchset, this test code will cause either hanging
      tracebacks or pthread out of memory errors.
      
      After this patchset, it will still produce output like:
      
        root@somehost:~# ./bust_shm_exit 1024 160
        ...
        INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
        INFO: Stall ended before state dump start
        ...
      
      But the task will continue to run along happily, so we consider this an
      improvement over hanging, even if it's a bit noisy.
      
      This patch (of 3):
      
      exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
      walks every shared memory segment in the namespace.  Thus the amount of
      work is related to the number of shm segments in the namespace not the
      number of segments that might need to be cleaned.
      
      In addition, this occurs after the task has been notified the thread has
      exited, so the number of tasks waiting for the ns shm rwsem can grow
      without bound until memory is exausted.
      
      Add a list to the task struct of all shmids allocated by this task.  Init
      the list head in copy_process.  Use the ns->rwsem for locking.  Add
      segments after id is added, remove before removing from id.
      
      On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
      to handling of semaphore undo.
      
      I chose a define for the init sequence since its a simple list init,
      otherwise it would require a function call to avoid include loops between
      the semaphore code and the task struct.  Converting the list_del to
      list_del_init for the unshare cases would remove the exit followed by
      init, but I left it blow up if not inited.
      
      Signed-off-by: default avatarMilton Miller <miltonm@bga.com>
      Signed-off-by: default avatarJack Miller <millerjo@us.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ab602f79
  17. Jul 30, 2014
    • Eric W. Biederman's avatar
      namespaces: Use task_lock and not rcu to protect nsproxy · 728dba3a
      Eric W. Biederman authored
      
      
      The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
      a sufficiently expensive system call that people have complained.
      
      Upon inspect nsproxy no longer needs rcu protection for remote reads.
      remote reads are rare.  So optimize for same process reads and write
      by switching using rask_lock instead.
      
      This yields a simpler to understand lock, and a faster setns system call.
      
      In particular this fixes a performance regression observed
      by Rafael David Tinoco <rafael.tinoco@canonical.com>.
      
      This is effectively a revert of Pavel Emelyanov's commit
      cf7b708c Make access to task's nsproxy lighter
      from 2007.  The race this originialy fixed no longer exists as
      do_notify_parent uses task_active_pid_ns(parent) instead of
      parent->nsproxy.
      
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      728dba3a
  18. Jun 06, 2014
    • Joe Perches's avatar
      ipc: convert use of typedef ctl_table to struct ctl_table · a5c5928b
      Joe Perches authored
      
      
      This typedef is unnecessary and should just be removed.
      
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a5c5928b
    • Manfred Spraul's avatar
      ipc/sem.c: add a printk_once for semctl(GETNCNT/GETZCNT) · 9b44ee2e
      Manfred Spraul authored
      
      
      The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT)
      always (since 0.99.10) reported a thread as sleeping on all semaphores
      that are listed in the semop() call.
      
      The documented behavior (both in the Linux man page and in the Single
      Unix Specification) is that a task should be reported on exactly one
      semaphore: The semaphore that caused the thread to got to sleep.
      
      This patch adds a pr_info_once() that is triggered if a thread hits the
      relevant case.
      
      The code triggers slightly too often, otherwise it would be necessary to
      replicate the old code.  As there are no known users of GETNCNT or
      GETZCNT, this is done to prevent unnecessary bloat.
      
      The task that triggered is reported with name (tsk->comm) and pid.
      
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Acked-by: default avatarDavidlohr Bueso <davidlohr@hp.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9b44ee2e
    • Manfred Spraul's avatar
      ipc/sem.c: make semctl(,,{GETNCNT,GETZCNT}) standard compliant · b220c57a
      Manfred Spraul authored
      
      
      SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task
      waits on exactly one semaphore: The semaphore from the first operation
      in the sop array that cannot proceed.
      
      The Linux implementation never followed the standard, it tried to count
      all semaphores that might be the reason why a task sleeps.
      
      This patch fixes that.
      
      Note:
      a) The implementation assumes that GETNCNT and GETZCNT are rare operations,
         therefore the code counts them only on demand.
         (If they wouldn't be rare, then the non-compliance would have
         been found earlier)
      
      b) compared to the initial version of the patch, the BUG_ONs were removed
         and it was clarified that the new behavior conforms to SUS.
      
      Back-compatibility concerns:
      
      Manfred:
      
      : - there is no application in Fedora that uses GETNCNT or GETZCNT.
      :
      : - application that use only single-sop semop() are also safe, the
      :   difference only affects complex apps.
      :
      : - portable application are also safe, the new behavior is standard
      :   compliant.
      :
      : But that's it.  The old behavior existed in Linux from 0.99.something
      : until now.
      
      Michael:
      
      : * These operations seem to be very little used.  Grepping the public
      :   source that is contained Fedora 20 source DVD, there appear to be no
      :   uses.  Of course, this says nothing about uses in private /
      :   non-mainstream FOSS code, but it seems likely that the same pattern
      :   is followed there.
      :
      : * The existing behavior is hard enough to understand that I suspect
      :   that no one understood it well enough to rely on it anyway
      :   (especially as that behavior contradicted both man page and POSIX).
      :
      : So, there's a chance of breakage, but I estimate that it's minute.
      
      Signed-off-by: default avatarManfred Spraul <manfred@colorfullife.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b220c57a
Loading