  1. Jan 14, 2011
  2. Jan 13, 2011
  3. Jan 07, 2011
    • fs: dcache reduce branches in lookup path · fb045adb
      Nicholas Piggin authored
      
      
      Reduce some branches and memory accesses in dcache lookup by adding dentry
      flags to indicate common d_ops are set, rather than having to check them.
      This saves a pointer memory access (dentry->d_op) in common path lookup
      situations, and saves another pointer load and branch in cases where we
      have d_op but not the particular operation.
      
      Patched with:
      
      git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
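A rough userspace sketch of the idea, not the kernel code: a setter such as d_set_d_op can record once which operations are present, so the hot path tests a flag bit instead of loading dentry->d_op and then the method pointer. The struct layouts, the MODEL_OP_* flag names and values, and the helpers model_revalidate() and always_stale() are assumptions made for illustration.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; field names and flag
 * values are illustrative assumptions, not the real definitions. */
struct dentry;

struct dentry_operations {
    int (*d_revalidate)(struct dentry *);
    int (*d_hash)(struct dentry *);
};

#define MODEL_OP_REVALIDATE 0x1u
#define MODEL_OP_HASH       0x2u

struct dentry {
    unsigned int d_flags;
    const struct dentry_operations *d_op;
};

/* Record which operations exist once, at assignment time. */
static void d_set_d_op(struct dentry *dentry,
                       const struct dentry_operations *op)
{
    dentry->d_op = op;
    if (!op)
        return;
    if (op->d_revalidate)
        dentry->d_flags |= MODEL_OP_REVALIDATE;
    if (op->d_hash)
        dentry->d_flags |= MODEL_OP_HASH;
}

/* Hot path: one flag test; d_op is only loaded when the bit is set. */
static int model_revalidate(struct dentry *dentry)
{
    if (!(dentry->d_flags & MODEL_OP_REVALIDATE))
        return 1; /* nothing to call, treat the dentry as valid */
    return dentry->d_op->d_revalidate(dentry);
}

/* Example operation for the sketch: always reports the dentry as stale. */
static int always_stale(struct dentry *dentry)
{
    (void)dentry;
    return 0;
}
```

A dentry whose ops table lacks d_revalidate never touches d_op in model_revalidate(); only dentries with the flag set pay for the two pointer loads.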
      
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      fb045adb
    • fs: dcache rationalise dget variants · dc0474be
      Nicholas Piggin authored
      
      
      dget_locked was a shortcut to avoid the lazy lru manipulation when we already
      held dcache_lock (lru manipulation was relatively cheap at that point).
      However, now that the lru lock is an innermost one, we never hold it at any
      caller, so the lock cost can now be avoided. We already have well working lazy
      dcache LRU, so it should be fine to defer LRU manipulations to scan time.
      
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      dc0474be
    • fs: dcache remove dcache_lock · b5c84bf6
      Nicholas Piggin authored
      
      
      dcache_lock no longer protects anything. remove it.
      
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      b5c84bf6
    • fs: dcache scale subdirs · 2fd6b7f5
      Nicholas Piggin authored
      
      
      Protect d_subdirs and d_child with d_lock, except in filesystems that aren't
      using dcache_lock for these anyway (eg. using i_mutex).
      
      Note: if we change the locking rule in future so that ->d_child protection is
      provided only with ->d_parent->d_lock, it may allow us to reduce some locking.
      But it would be an exception to an otherwise regular locking scheme, so we'd
      have to see some good results. Probably not worthwhile.
      
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      2fd6b7f5
    • fs: dcache scale dentry refcount · b7ab39f6
      Nicholas Piggin authored
      
      
      Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
      0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
      we start protecting many other dentry members with d_lock.
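A minimal userspace model of the scheme, assuming a tiny C11 spinlock as a stand-in for d_lock. All names (model_dentry, model_dget, model_dput, model_try_reclaim) are hypothetical. The point it illustrates is that a plain integer count which only changes under the lock cannot move away from 0 while the lock is held.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model: a tiny spinlock stands in for the kernel's d_lock. */
struct model_dentry {
    atomic_flag d_lock;
    unsigned int d_count; /* plain integer, only touched under d_lock */
    bool reclaimed;
};

static void model_dentry_init(struct model_dentry *d)
{
    atomic_flag_clear(&d->d_lock);
    d->d_count = 0;
    d->reclaimed = false;
}

static void model_lock(struct model_dentry *d)
{
    while (atomic_flag_test_and_set_explicit(&d->d_lock,
                                             memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void model_unlock(struct model_dentry *d)
{
    atomic_flag_clear_explicit(&d->d_lock, memory_order_release);
}

/* Take a reference. */
static void model_dget(struct model_dentry *d)
{
    model_lock(d);
    d->d_count++;
    model_unlock(d);
}

/* Drop a reference. */
static void model_dput(struct model_dentry *d)
{
    model_lock(d);
    if (d->d_count > 0)
        d->d_count--;
    model_unlock(d);
}

/* Reclaim only if unreferenced; the count stays 0 until we unlock. */
static bool model_try_reclaim(struct model_dentry *d)
{
    bool unused;

    model_lock(d);
    unused = (d->d_count == 0);
    if (unused)
        d->reclaimed = true;
    model_unlock(d);
    return unused;
}
```

With an atomic counter and no lock, another thread could raise the count between the zero check and the reclaim step; holding the per-dentry lock across both closes that window without a global lock.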
      
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      b7ab39f6
    • fs: change d_delete semantics · fe15ce44
      Nicholas Piggin authored
      
      
      Change d_delete from a dentry deletion notification to dentry caching
      advice, more like ->drop_inode. Require it to be constant and idempotent,
      and not take d_lock. This is how all existing filesystems use the callback
      anyway.
      
      This makes fine grained dentry locking of dput and dentry lru scanning
      much simpler.
      
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      fe15ce44
    • cgroup fs: avoid switching ->d_op on live dentry · 5adcee1d
      Nicholas Piggin authored
      
      
      Switching d_op on a live dentry is racy in general, so avoid it. In this case
      it is a negative dentry, which is safer, but there are still concurrent ops
      which may be called on d_op in that case (eg. d_revalidate). So in general
      a filesystem may not do this. Fix cgroupfs so as not to do this.
      
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      5adcee1d
  4. Oct 29, 2010
  5. Oct 28, 2010
    • cgroups: add check for strcpy destination string overflow · f4a2589f
      Evgeny Kuznetsov authored
      
      
      "strcpy" is used without checking the maximum allowed source string
      length, which could overflow the destination string.  Add a check of the
      string length before calling "strcpy".  The function now returns an error
      if the source string is longer than the maximum.
      
      akpm: presently considered NotABug, but add the check for general
      future-safeness and robustness.
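The pattern described above can be sketched as follows. This is a generic illustration, not the patched kernel function: the buffer size MODEL_PATH_MAX, the helper name set_path_checked() and the choice of -EINVAL as the error code are all assumptions.

```c
#include <errno.h>
#include <string.h>

#define MODEL_PATH_MAX 64 /* hypothetical destination size for this sketch */

/* Copy src into a fixed-size buffer, rejecting over-long input first. */
static int set_path_checked(char dst[MODEL_PATH_MAX], const char *src)
{
    /* >= because the terminating NUL also needs a byte in dst */
    if (strlen(src) >= MODEL_PATH_MAX)
        return -EINVAL; /* error code chosen for illustration */
    strcpy(dst, src);   /* safe: string plus NUL is known to fit */
    return 0;
}
```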
      
      Signed-off-by: Evgeny Kuznetsov <EXT-Eugeny.Kuznetsov@nokia.com>
      Acked-by: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4a2589f
    • cgroup: make the mount options parsing more accurate · 32a8cf23
      Daniel Lezcano authored
      
      
      Current behavior:
      =================
      
      (1) When we mount a cgroup, we can specify the 'all' option which
          means to enable all the cgroup subsystems.  This is the default option
          when no option is specified.
      
      (2) If we want to mount a cgroup with a subset of the supported cgroup
          subsystems, we have to specify a subsystems name list for the mount
          option.
      
      (3) If we specify another option like 'noprefix' or 'release_agent',
          the current code also requires 'all' or a subsystem name to be
          specified.  This is not critical, but it is unfriendly, as we should
          assume (1) in this case.
      
      (4) Logically, the 'all' option is mutually exclusive with a subsystem
          name, but this is not detected.
      
      In other words:
       succeeds : mount -t cgroup -o all,freezer cgroup /cgroup
      	=> is it 'all' or 'freezer' ?
       fails    : mount -t cgroup -o noprefix cgroup /cgroup
      	=> succeeds if we do '-o noprefix,all'
      
      The following patches tighten the mount options check a bit.
      
      New behavior:
      =============
      
      (1) untouched
      (2) untouched
      (3) 'all' is assumed by default when only options other than a
          subsystem name are specified
      (4) raises an error
      
      In other words:
       fails    : mount -t cgroup -o all,freezer cgroup /cgroup
       succeeds : mount -t cgroup -o noprefix cgroup /cgroup
      
      For the sake of readability, the if ... then ... else ... if ...
      indentation when parsing the options has been changed to:
      if ... then
      	...
      	continue
      fi
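The new rules (3) and (4) and the "if ... continue" parsing style can be sketched in userspace C. This is a simplified model, not the kernel's parser: the subsystem table, struct model_opts, the function name and the error codes are assumptions, and only 'all', 'noprefix' and subsystem names are handled.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

struct model_opts {
    bool all;          /* every subsystem is enabled */
    bool noprefix;
    unsigned int mask; /* bit i set: subsystem i was requested by name */
};

/* Hypothetical subsystem table for the sketch. */
static const char *const model_subsys[] = { "cpuset", "freezer", "memory" };
#define MODEL_NSUBSYS (sizeof(model_subsys) / sizeof(model_subsys[0]))

/* Parse a comma-separated option string; data is modified in place. */
static int model_parse_options(char *data, struct model_opts *opts)
{
    bool all_given = false;
    char *token;
    size_t i;

    memset(opts, 0, sizeof(*opts));

    for (token = data ? strtok(data, ",") : NULL; token;
         token = strtok(NULL, ",")) {
        if (!strcmp(token, "all")) {
            all_given = true;
            continue;
        }
        if (!strcmp(token, "noprefix")) {
            opts->noprefix = true;
            continue;
        }
        for (i = 0; i < MODEL_NSUBSYS; i++) {
            if (!strcmp(token, model_subsys[i])) {
                opts->mask |= 1u << i;
                break;
            }
        }
        if (i == MODEL_NSUBSYS)
            return -ENOENT; /* unknown option or subsystem */
    }

    /* (4) 'all' and an explicit subsystem name are mutually exclusive. */
    if (all_given && opts->mask)
        return -EINVAL;

    /* (3) no subsystem was named: behave as if 'all' had been given. */
    if (!opts->mask)
        opts->all = true;
    return 0;
}
```

Deferring both checks until the whole string has been read keeps the loop body flat and makes the result independent of option order.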
      
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32a8cf23
    • cgroup: add clone_children control file · 97978e6d
      Daniel Lezcano authored
      The ns_cgroup is a control group interacting with the namespaces.  When a
      new namespace is created, a corresponding cgroup is automatically created
      too.  The cgroup name is the pid of the process that called 'unshare' or
      of the child created by 'clone'.
      
      This cgroup is tied to the namespace because it prevents a process from
      escaping the control group, and it uses the post_clone callback so that
      the child cgroup inherits the values of the parent cgroup.
      
      Unfortunately, the more we use this cgroup, the more problems we face
      with it:
      
      (1) when a process unshares, the cgroup name may conflict with a
          previous cgroup with the same pid, so unshare or clone return -EEXIST
      
      (2) the cgroup creation is out of control because an application may
          create several namespaces, and the system will then automatically
          create several cgroups behind its back and leave them on the
          cgroupfs (eg.  a vrf based on the network namespace).
      
      (3) the mix of (1) and (2) forces an administrator to regularly check
          and clean up these cgroups.
      
      This patchset removes the ns_cgroup by adding a new flag to the cgroup and
      the cgroupfs mount option.  It enables the copy of the parent cgroup when
      a child cgroup is created.  We can then safely remove the ns_cgroup as
      this flag provides compatibility.  We now have to manually create a cgroup
      and add the task to it, which is consistent with the cgroup framework.
      
      This patch:
      
      Sent as an answer to a previous thread around the ns_cgroup.
      
      https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html
      
      
      
      It adds a control file 'clone_children' for a cgroup.  This control file
      is a boolean specifying if the child cgroup should be a clone of the
      parent cgroup or not.  The default value is 'false'.
      
      This flag makes the child cgroup call the post_clone callback of every
      subsystem that provides one.
      
      At present, cpuset is the only subsystem that implements the
      post_clone callback.
      
      The option can be set at mount time by specifying the 'clone_children'
      mount option.
      
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Acked-by: Paul Menage <menage@google.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97978e6d
  6. Oct 26, 2010
    • fs: do not assign default i_ino in new_inode · 85fe4025
      Christoph Hellwig authored
      
      
      Instead of always assigning an increasing inode number in new_inode
      move the call to assign it into those callers that actually need it.
      For now the set of callers that need it is estimated conservatively, that is
      the call is added to all filesystems that do not assign an i_ino
      by themselves.  For a few more filesystems we can avoid assigning
      any inode number given that they aren't user visible, and for others
      it could be done lazily when an inode number is actually needed,
      but that's left for later patches.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      85fe4025
  7. Oct 04, 2010
    • BKL: Remove BKL from cgroup · 38d018db
      Jan Blunck authored
      
      
      The BKL is only used in remount_fs and get_sb, which are both protected by
      the superblock's s_umount rw_semaphore. Therefore it is safe to remove the
      BKL entirely.
      
      Signed-off-by: Jan Blunck <jblunck@infradead.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      38d018db
    • BKL: Explicitly add BKL around get_sb/fill_super · db719222
      Jan Blunck authored
      
      
      This patch is a preparation necessary to remove the BKL from do_new_mount().
      It explicitly adds calls to lock_kernel()/unlock_kernel() around
      get_sb/fill_super operations for filesystems that still use the BKL.
      
      I've read through all the code formerly covered by the BKL inside
      do_kern_mount() and have satisfied myself that it doesn't need the BKL
      any more.
      
      do_kern_mount() is already called without the BKL when mounting the rootfs
      and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called
      from various places without BKL: simple_pin_fs(), nfs_do_clone_mount()
      through nfs_follow_mountpoint(), afs_mntpt_do_automount() through
      afs_mntpt_follow_link(). Both of the latter functions are actually the
      filesystem's follow_link inode operation. vfs_kern_mount() is calling the specified
      get_sb function and lets the filesystem do its job by calling the given
      fill_super function.
      
      Therefore I think it is safe to push down the BKL from the VFS to the
      low-level filesystems' get_sb/fill_super operations.
      
      [arnd: do not add the BKL to those file systems that already
             don't use it elsewhere]
      
      Signed-off-by: Jan Blunck <jblunck@infradead.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      db719222
  8. Sep 10, 2010
  9. Sep 05, 2010
    • cgroups: fix API thinko · 73457f0f
      Michael S. Tsirkin authored
      
      
      The cgroup_attach_task_current_cg API that went upstream is backwards: we
      really need an API to attach to the cgroups from another process A to
      the current one.
      
      In our case (vhost), a privileged user wants to attach its task to cgroups
      from a less privileged one; the API makes us run it in the other
      task's context, and this fails.
      
      So let's make the API generic and just pass in 'from' and 'to' tasks.
      Add an inline wrapper for cgroup_attach_task_current_cg to avoid
      breaking bisect.
      
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      73457f0f
  10. Aug 20, 2010
  11. Aug 11, 2010
  12. Aug 05, 2010
  13. Jul 28, 2010
  14. Jun 04, 2010
    • cgroups: alloc_css_id() increments hierarchy depth · 94b3dd0f
      Greg Thelen authored
      
      
      Child groups should have a greater depth than their parents.  Prior to
      this change, the parent would incorrectly report zero memory usage for
      child cgroups when use_hierarchy is enabled.
      
      test script:
        mount -t cgroup none /cgroups -o memory
        cd /cgroups
        mkdir cg1
      
        echo 1 > cg1/memory.use_hierarchy
        mkdir cg1/cg11
      
        echo $$ > cg1/cg11/tasks
        dd if=/dev/zero of=/tmp/foo bs=1M count=1
      
        echo
        echo CHILD
        grep cache cg1/cg11/memory.stat
      
        echo
        echo PARENT
        grep cache cg1/memory.stat
      
        echo $$ > tasks
        rmdir cg1/cg11 cg1
        cd /
        umount /cgroups
      
      Using fae9c791, a recent patch that changed alloc_css_id() depth computation,
      the parent incorrectly reports zero usage:
        root@ubuntu:~# ./test
        1+0 records in
        1+0 records out
        1048576 bytes (1.0 MB) copied, 0.0151844 s, 69.1 MB/s
      
        CHILD
        cache 1048576
        total_cache 1048576
      
        PARENT
        cache 0
        total_cache 0
      
      With this patch, the parent correctly includes child usage:
        root@ubuntu:~# ./test
        1+0 records in
        1+0 records out
        1048576 bytes (1.0 MB) copied, 0.0136827 s, 76.6 MB/s
      
        CHILD
        cache 1052672
        total_cache 1052672
      
        PARENT
        cache 0
        total_cache 1052672
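Why the depth matters can be shown with a simplified model. It assumes, as an illustration only, that each id keeps a stack of ancestor ids indexed by depth and that ancestry is tested by looking at the slot for the candidate ancestor's depth. The structure and function names are hypothetical, and the real kernel data structures may differ.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEPTH 8

/* stack[i] holds the id of the ancestor at depth i; stack[depth] is the
 * id of the node itself. */
struct model_css_id {
    unsigned int id;
    unsigned int depth;
    unsigned int stack[MODEL_MAX_DEPTH];
};

/* Initialise a node one level below parent (or a root if parent is NULL). */
static int model_init_child(struct model_css_id *child,
                            const struct model_css_id *parent,
                            unsigned int id)
{
    unsigned int depth = parent ? parent->depth + 1 : 0;
    unsigned int i;

    if (depth >= MODEL_MAX_DEPTH)
        return -1;
    child->id = id;
    child->depth = depth; /* the child must be deeper than its parent */
    for (i = 0; i < depth; i++)
        child->stack[i] = parent->stack[i];
    child->stack[depth] = id;
    return 0;
}

/* root is an ancestor of child if child's stack names it at root's depth. */
static bool model_is_ancestor(const struct model_css_id *child,
                              const struct model_css_id *root)
{
    if (child->depth < root->depth)
        return false;
    return child->stack[root->depth] == root->id;
}
```

In this model, a child created with the same depth as its parent would store its own id in the slot that should name the parent, so the ancestry test would fail and hierarchical accounting would skip the parent.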
      
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.34.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94b3dd0f
  15. May 27, 2010
  16. May 12, 2010
    • memcg: fix css_is_ancestor() RCU locking · 747388d7
      KAMEZAWA Hiroyuki authored
      
      
      Some callers (in memcontrol.c) call css_is_ancestor() without
      rcu_read_lock.  Because css_is_ancestor() has to access RCU protected
      data, it should be under rcu_read_lock().
      
      This makes css_is_ancestor() itself access the RCU protected area
      safely.  (At least, "root" can have refcnt==0 if it's not an ancestor of
      "child".  So, we need rcu_read_lock().)
      
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      747388d7
    • memcg: fix css_id() RCU locking for real · 7f0f1546
      KAMEZAWA Hiroyuki authored
      
      
      Commit ad4ba375 ("memcg: css_id() must be
      called under rcu_read_lock()") modifies memcontrol.c to fix an RCU check
      message.  But Andrew Morton pointed out that the fix doesn't seem sane
      and that it was just hiding lockdep messages.
      
      This patch does the proper thing.  Checking again, all the places that
      commit fixes accessed css_id() without rcu_read_lock intentionally:
      all callers of css_id() hold a reference count on it.  So, it's not
      necessary to be under rcu_read_lock().
      
      Considering again, we can use rcu_dereference_check for css_id().  We know
      css->id is valid if css->refcnt > 0.  (css->id never changes and is freed
      after css->refcnt goes to 0.)
      
      This patch makes use of rcu_dereference_check() in css_id/depth and removes
      the unnecessary rcu-read-lock added by the commit.
      
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f0f1546
  17. May 11, 2010
    • sched, wait: Use wrapper functions · a93d2f17
      Changli Gao authored
      
      
      epoll should not touch flags in wait_queue_t. This patch introduces a new
      function __add_wait_queue_exclusive() for users who use the wait queue as a
      LIFO queue.
      
      __add_wait_queue_tail_exclusive() is introduced too, replacing
      add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed, as
      it is a duplicate of __remove_wait_queue(), is disliked by users, and has
      fewer users.
      
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: <containers@lists.linux-foundation.org>
      LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a93d2f17
  18. May 04, 2010
  19. Mar 24, 2010
  20. Mar 16, 2010
  21. Mar 12, 2010
    • cgroups: remove events before destroying subsystem state objects · a0a4db54
      Kirill A. Shutemov authored
      
      
      Events should be removed after rmdir of the cgroup directory, but before
      destroying subsystem state objects.  Let's take a reference to the cgroup
      directory dentry to do that.
      
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a0a4db54
    • cgroups: fix race between userspace and kernelspace · 4ab78683
      Kirill A. Shutemov authored
      
      
      Notify userspace about cgroup removal only after rmdir of the cgroup
      directory to avoid a race between userspace and kernelspace.
      
      eventfd is used to notify about two types of event:
       - control file-specific, like crossing a memory threshold;
       - cgroup removal.
      
      To understand what really happened, userspace can check if the cgroup still
      exists.  To avoid a race between userspace and kernelspace we have to
      notify userspace about cgroup removal only after rmdir of the cgroup
      directory.
      
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ab78683
    • cgroup: implement eventfd-based generic API for notifications · 0dea1168
      Kirill A. Shutemov authored
      
      
      This patchset introduces an eventfd-based API for notifications in cgroups
      and implements memory notifications on top of it.
      
      It uses statistics in the memory controller to track memory usage.
      
      Output of time(1) on building kernel on tmpfs:
      
      Root cgroup before changes:
      	make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
      Non-root cgroup before changes:
      	make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
      Root cgroup after changes (0 thresholds):
      	make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
      Non-root cgroup after changes (0 thresholds):
      	make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
      Root cgroup after changes (1 thresholds, never crossed):
      	make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
      Non-root cgroup after changes (1 thresholds, never crossed):
      	make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total
      
      This patch:
      
      Introduce the write-only file "cgroup.event_control" in every cgroup.
      
      To register a new notification handler you need to:
      - create an eventfd;
      - open a control file to be monitored. Callbacks register_event() and
        unregister_event() must be defined for the control file;
      - write "<event_fd> <control_fd> <args>" to cgroup.event_control.
        Interpretation of args is defined by control file implementation.
      
      The eventfd will be woken up by the control file implementation or when the
      cgroup is removed.
      
      To unregister a notification handler just close the eventfd.
      
      If you need notification functionality for a control file you have to
      implement callbacks register_event() and unregister_event() in the
      struct cftype.
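From userspace, the registration steps listed in this message could look roughly like the sketch below (Linux only). The helper names format_event_control() and register_cgroup_event() are hypothetical, the paths are supplied by the caller, and the meaning of args depends on the control file. Closing the control file descriptor right after registration is an assumption of this sketch; keep it open if in doubt.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Build the "<event_fd> <control_fd> <args>" line.
 * Returns its length, or -1 if it does not fit in buf. */
static int format_event_control(char *buf, size_t size, int event_fd,
                                int control_fd, const char *args)
{
    int n = snprintf(buf, size, "%d %d %s", event_fd, control_fd, args);

    return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/* Register for notifications on one control file of a cgroup.
 * Returns the eventfd on success, -1 on failure. */
static int register_cgroup_event(const char *event_control_path,
                                 const char *control_path, const char *args)
{
    char line[128];
    int efd, cfd, ecfd, len, ret = -1;

    efd = eventfd(0, 0);                       /* step 1: create an eventfd */
    if (efd < 0)
        return -1;
    cfd = open(control_path, O_RDONLY);        /* step 2: open control file */
    if (cfd < 0)
        goto out_efd;
    ecfd = open(event_control_path, O_WRONLY); /* cgroup.event_control */
    if (ecfd < 0)
        goto out_cfd;

    len = format_event_control(line, sizeof(line), efd, cfd, args);
    if (len > 0 && write(ecfd, line, (size_t)len) == (ssize_t)len)
        ret = efd;                             /* step 3: registration sent */

    close(ecfd);
out_cfd:
    close(cfd); /* assumption: not needed once registration is done */
out_efd:
    if (ret < 0)
        close(efd);
    return ret;
}
```

A caller would then block in read() on the returned descriptor, reading an 8-byte counter, and close the descriptor to unregister.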
      
      [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0dea1168
    • cgroups: clean up cgroup_pidlist_find() a bit · b70cc5fd
      Li Zefan authored
      
      
      Don't call get_pid_ns() before we locate/alloc the ns.
      
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Acked-by: Paul Menage <menage@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b70cc5fd
    • cgroups: blkio subsystem as module · 67523c48
      Ben Blum authored
      
      
      Modify the Block I/O cgroup subsystem to be able to be built as a module.
      As the CFQ disk scheduler optionally depends on blk-cgroup, config options
      in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
      enhanced to support the new module dependency.
      
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67523c48
    • cgroups: subsystem module unloading · cf5d5941
      Ben Blum authored
      
      
      Provides support for unloading modular subsystems.
      
      This patch adds a new function cgroup_unload_subsys which is to be used
      for removing a loaded subsystem during module deletion.  Reference
      counting of the subsystems' modules is moved from once (at load time) to
      once per attached hierarchy (in parse_cgroupfs_options and
      rebind_subsystems) (i.e., 0 or 1).
      
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf5d5941
    • cgroups: subsystem module loading interface · e6a1105b
      Ben Blum authored
      
      
      Add interface between cgroups subsystem management and module loading
      
      This patch implements rudimentary module-loading support for cgroups -
      namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
      module initcall, and a struct module pointer in struct cgroup_subsys.
      
      Several functions that might be wanted by modules have had EXPORT_SYMBOL
      added to them, but it's unclear exactly which functions want it and which
      won't.
      
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6a1105b
    • cgroups: revamp subsys array · aae8aab4
      Ben Blum authored
      
      
      This patch series provides the ability for cgroup subsystems to be
      compiled as modules both within and outside the kernel tree.  This is
      mainly useful for classifiers and subsystems that hook into components
      that are already modules.  cls_cgroup and blkio-cgroup serve as the
      example use cases for this feature.
      
      It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
      which modular subsystems can use to register and depart during runtime.
      The net_cls classifier subsystem serves as the example for a subsystem
      which can be converted into a module using these changes.
      
      Patch #1 sets up the subsys[] array so its contents can be dynamic as
      modules appear and (eventually) disappear.  Iterations over the array are
      modified to handle when subsystems are absent, and the dynamic section of
      the array is protected by cgroup_mutex.
      
      Patch #2 implements an interface for modules to load subsystems, called
      cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
      pointer in struct cgroup_subsys.
      
      Patch #3 adds a mechanism for unloading modular subsystems, which includes
      a more advanced rework of the rudimentary reference counting introduced in
      patch 2.
      
      Patch #4 modifies the net_cls subsystem, which already had some module
      declarations, to be configurable as a module, which also serves as a
      simple proof-of-concept.
      
      Part of implementing patches 2 and 4 involved updating css pointers in
      each css_set when the module appears or leaves.  In doing this, it was
      discovered that css_sets always remain linked to the dummy cgroup,
      regardless of whether or not any subsystems are actually bound to it
      (i.e., not mounted on an actual hierarchy).  The subsystem loading and
      unloading code therefore should keep in mind the special cases where the
      added subsystem is the only one in the dummy cgroup (and therefore all
      css_sets need to be linked back into it) and where the removed subsys was
      the only one in the dummy cgroup (and therefore all css_sets should be
      unlinked from it) - however, as all css_sets always stay attached to the
      dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
      issue should also make sure these cases are addressed in the subsystem
      loading and unloading code.
      
      This patch:
      
      Make subsys[] able to be dynamically populated to support modular
      subsystems
      
      This patch reworks the way the subsys[] array is used so that subsystems
      can register themselves after boot time, and enables the internals of
      cgroups to be able to handle when subsystems are not present or may
      appear/disappear.
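A small userspace model of a subsystem table with a fixed built-in part and a dynamic part for modules is sketched below. The slot counts, struct and function names are hypothetical, and the locking the message mentions (cgroup_mutex for the dynamic section) is omitted.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_BUILTIN_COUNT 2 /* slots reserved for built-in subsystems */
#define MODEL_SUBSYS_COUNT  4 /* total slots; the rest are dynamic */

struct model_subsys {
    const char *name;
    int subsys_id; /* slot index, assigned when the subsystem is loaded */
};

/* NULL marks an absent subsystem. */
static struct model_subsys *model_subsys_table[MODEL_SUBSYS_COUNT];

/* Register a modular subsystem in the first free dynamic slot.
 * The kernel would hold a mutex here; the sketch is single-threaded. */
static int model_load_subsys(struct model_subsys *ss)
{
    int i;

    for (i = MODEL_BUILTIN_COUNT; i < MODEL_SUBSYS_COUNT; i++) {
        if (!model_subsys_table[i]) {
            ss->subsys_id = i;
            model_subsys_table[i] = ss;
            return 0;
        }
    }
    return -EBUSY; /* no free dynamic slot */
}

/* Remove a previously loaded subsystem from its slot. */
static void model_unload_subsys(struct model_subsys *ss)
{
    model_subsys_table[ss->subsys_id] = NULL;
}

/* Iterate over all slots, skipping subsystems that are absent. */
static int model_count_present(void)
{
    int i, n = 0;

    for (i = 0; i < MODEL_SUBSYS_COUNT; i++) {
        if (!model_subsys_table[i])
            continue;
        n++;
    }
    return n;
}
```

Every loop over the table needs the same NULL check once slots can be empty, which is the kind of change to the iterations that this message describes.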
      
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aae8aab4