diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html deleted file mode 100644 index c30c1957c7e6b866878d49d879f202ca3945813f..0000000000000000000000000000000000000000 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ /dev/null @@ -1,1391 +0,0 @@ - - - A Tour Through TREE_RCU's Data Structures [LWN.net] - - -

December 18, 2016

-

This article was contributed by Paul E. McKenney

- -

Introduction

- -This document describes RCU's major data structures and their relationship -to each other. - -
    -
  1. - Data-Structure Relationships -
  2. - The rcu_state Structure -
  3. - The rcu_node Structure -
  4. - The rcu_segcblist Structure -
  5. - The rcu_data Structure -
  6. - The rcu_head Structure -
  7. - RCU-Specific Fields in the task_struct Structure -
  8. - Accessor Functions -
- -

Data-Structure Relationships

- -

RCU is for all intents and purposes a large state machine, and its -data structures maintain the state in such a way as to allow RCU readers -to execute extremely quickly, while also processing the RCU grace periods -requested by updaters in an efficient and extremely scalable fashion. -The efficiency and scalability of RCU updaters is provided primarily -by a combining tree, as shown below: - -

BigTreeClassicRCU.svg - -

This diagram shows an enclosing rcu_state structure -containing a tree of rcu_node structures. -Each leaf node of the rcu_node tree has up to 16 -rcu_data structures associated with it, so that there -are NR_CPUS number of rcu_data structures, -one for each possible CPU. -This structure is adjusted at boot time, if needed, to handle the -common case where nr_cpu_ids is much less than -NR_CPUs. -For example, a number of Linux distributions set NR_CPUs=4096, -which results in a three-level rcu_node tree. -If the actual hardware has only 16 CPUs, RCU will adjust itself -at boot time, resulting in an rcu_node tree with only a single node. - -

The purpose of this combining tree is to allow per-CPU events -such as quiescent states, dyntick-idle transitions, -and CPU hotplug operations to be processed efficiently -and scalably. -Quiescent states are recorded by the per-CPU rcu_data structures, -and other events are recorded by the leaf-level rcu_node -structures. -All of these events are combined at each level of the tree until finally -grace periods are completed at the tree's root rcu_node -structure. -A grace period can be completed at the root once every CPU -(or, in the case of CONFIG_PREEMPT_RCU, task) -has passed through a quiescent state. -Once a grace period has completed, record of that fact is propagated -back down the tree. - -

As can be seen from the diagram, on a 64-bit system -a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout -of 64 at the root and a fanout of 16 at the leaves. - - - - - - - - -
 
Quick Quiz:
- Why isn't the fanout at the leaves also 64? -
Answer:
- Because there are more types of events that affect the leaf-level - rcu_node structures than further up the tree. - Therefore, if the leaf rcu_node structures have fanout of - 64, the contention on these structures' ->structures - becomes excessive. - Experimentation on a wide variety of systems has shown that a fanout - of 16 works well for the leaves of the rcu_node tree. - - -

Of course, further experience with - systems having hundreds or thousands of CPUs may demonstrate - that the fanout for the non-leaf rcu_node structures - must also be reduced. - Such reduction can be easily carried out when and if it proves - necessary. - In the meantime, if you are using such a system and running into - contention problems on the non-leaf rcu_node structures, - you may use the CONFIG_RCU_FANOUT kernel configuration - parameter to reduce the non-leaf fanout as needed. - - -

Kernels built for systems with - strong NUMA characteristics might also need to adjust - CONFIG_RCU_FANOUT so that the domains of the - rcu_node structures align with hardware boundaries. - However, there has thus far been no need for this. -

 
- -

If your system has more than 1,024 CPUs (or more than 512 CPUs on -a 32-bit system), then RCU will automatically add more levels to the -tree. -For example, if you are crazy enough to build a 64-bit system with 65,536 -CPUs, RCU would configure the rcu_node tree as follows: - -

HugeTreeClassicRCU.svg - -

RCU currently permits up to a four-level tree, which on a 64-bit system -accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for -32-bit systems. -On the other hand, you can set both CONFIG_RCU_FANOUT and -CONFIG_RCU_FANOUT_LEAF to be as small as 2, which would result -in a 16-CPU test using a 4-level tree. -This can be useful for testing large-system capabilities on small test -machines. - -

This multi-level combining tree allows us to get most of the -performance and scalability -benefits of partitioning, even though RCU grace-period detection is -inherently a global operation. -The trick here is that only the last CPU to report a quiescent state -into a given rcu_node structure need advance to the rcu_node -structure at the next level up the tree. -This means that at the leaf-level rcu_node structure, only -one access out of sixteen will progress up the tree. -For the internal rcu_node structures, the situation is even -more extreme: Only one access out of sixty-four will progress up -the tree. -Because the vast majority of the CPUs do not progress up the tree, -the lock contention remains roughly constant up the tree. -No matter how many CPUs there are in the system, at most 64 quiescent-state -reports per grace period will progress all the way to the root -rcu_node structure, thus ensuring that the lock contention -on that root rcu_node structure remains acceptably low. - -

In effect, the combining tree acts like a big shock absorber, -keeping lock contention under control at all tree levels regardless -of the level of loading on the system. - -

RCU updaters wait for normal grace periods by registering -RCU callbacks, either directly via call_rcu() -or indirectly via synchronize_rcu() and friends. -RCU callbacks are represented by rcu_head structures, -which are queued on rcu_data structures while they are -waiting for a grace period to elapse, as shown in the following figure: - -

BigTreePreemptRCUBHdyntickCB.svg - -

This figure shows how TREE_RCU's and -PREEMPT_RCU's major data structures are related. -Lesser data structures will be introduced with the algorithms that -make use of them. - -

Note that each of the data structures in the above figure has -its own synchronization: - -

    -
  1. Each rcu_state structures has a lock and a mutex, - and some fields are protected by the corresponding root - rcu_node structure's lock. -
  2. Each rcu_node structure has a spinlock. -
  3. The fields in rcu_data are private to the corresponding - CPU, although a few can be read and written by other CPUs. -
- -

It is important to note that different data structures can have -very different ideas about the state of RCU at any given time. -For but one example, awareness of the start or end of a given RCU -grace period propagates slowly through the data structures. -This slow propagation is absolutely necessary for RCU to have good -read-side performance. -If this balkanized implementation seems foreign to you, one useful -trick is to consider each instance of these data structures to be -a different person, each having the usual slightly different -view of reality. - -

The general role of each of these data structures is as -follows: - -

    -
  1. rcu_state: - This structure forms the interconnection between the - rcu_node and rcu_data structures, - tracks grace periods, serves as short-term repository - for callbacks orphaned by CPU-hotplug events, - maintains rcu_barrier() state, - tracks expedited grace-period state, - and maintains state used to force quiescent states when - grace periods extend too long, -
  2. rcu_node: This structure forms the combining - tree that propagates quiescent-state - information from the leaves to the root, and also propagates - grace-period information from the root to the leaves. - It provides local copies of the grace-period state in order - to allow this information to be accessed in a synchronized - manner without suffering the scalability limitations that - would otherwise be imposed by global locking. - In CONFIG_PREEMPT_RCU kernels, it manages the lists - of tasks that have blocked while in their current - RCU read-side critical section. - In CONFIG_PREEMPT_RCU with - CONFIG_RCU_BOOST, it manages the - per-rcu_node priority-boosting - kernel threads (kthreads) and state. - Finally, it records CPU-hotplug state in order to determine - which CPUs should be ignored during a given grace period. -
  3. rcu_data: This per-CPU structure is the - focus of quiescent-state detection and RCU callback queuing. - It also tracks its relationship to the corresponding leaf - rcu_node structure to allow more-efficient - propagation of quiescent states up the rcu_node - combining tree. - Like the rcu_node structure, it provides a local - copy of the grace-period information to allow for-free - synchronized - access to this information from the corresponding CPU. - Finally, this structure records past dyntick-idle state - for the corresponding CPU and also tracks statistics. -
  4. rcu_head: - This structure represents RCU callbacks, and is the - only structure allocated and managed by RCU users. - The rcu_head structure is normally embedded - within the RCU-protected data structure. -
- -

If all you wanted from this article was a general notion of how -RCU's data structures are related, you are done. -Otherwise, each of the following sections give more details on -the rcu_state, rcu_node and rcu_data data -structures. - -

-The rcu_state Structure

- -

The rcu_state structure is the base structure that -represents the state of RCU in the system. -This structure forms the interconnection between the -rcu_node and rcu_data structures, -tracks grace periods, contains the lock used to -synchronize with CPU-hotplug events, -and maintains state used to force quiescent states when -grace periods extend too long, - -

A few of the rcu_state structure's fields are discussed, -singly and in groups, in the following sections. -The more specialized fields are covered in the discussion of their -use. - -

Relationship to rcu_node and rcu_data Structures
- -This portion of the rcu_state structure is declared -as follows: - -
-  1   struct rcu_node node[NUM_RCU_NODES];
-  2   struct rcu_node *level[NUM_RCU_LVLS + 1];
-  3   struct rcu_data __percpu *rda;
-
- - - - - - - - -
 
Quick Quiz:
- Wait a minute! - You said that the rcu_node structures formed a tree, - but they are declared as a flat array! - What gives? -
Answer:
- The tree is laid out in the array. - The first node In the array is the head, the next set of nodes in the - array are children of the head node, and so on until the last set of - nodes in the array are the leaves. - - -

See the following diagrams to see how - this works. -

 
- -

The rcu_node tree is embedded into the -->node[] array as shown in the following figure: - -

TreeMapping.svg - -

One interesting consequence of this mapping is that a -breadth-first traversal of the tree is implemented as a simple -linear scan of the array, which is in fact what the -rcu_for_each_node_breadth_first() macro does. -This macro is used at the beginning and ends of grace periods. - -

Each entry of the ->level array references -the first rcu_node structure on the corresponding level -of the tree, for example, as shown below: - -

TreeMappingLevel.svg - -

The zeroth element of the array references the root -rcu_node structure, the first element references the -first child of the root rcu_node, and finally the second -element references the first leaf rcu_node structure. - -

For whatever it is worth, if you draw the tree to be tree-shaped -rather than array-shaped, it is easy to draw a planar representation: - -

TreeLevel.svg - -

Finally, the ->rda field references a per-CPU -pointer to the corresponding CPU's rcu_data structure. - -

All of these fields are constant once initialization is complete, -and therefore need no protection. - -

Grace-Period Tracking
- -

This portion of the rcu_state structure is declared -as follows: - -

-  1   unsigned long gp_seq;
-
- -

RCU grace periods are numbered, and -the ->gp_seq field contains the current grace-period -sequence number. -The bottom two bits are the state of the current grace period, -which can be zero for not yet started or one for in progress. -In other words, if the bottom two bits of ->gp_seq are -zero, then RCU is idle. -Any other value in the bottom two bits indicates that something is broken. -This field is protected by the root rcu_node structure's -->lock field. - -

There are ->gp_seq fields -in the rcu_node and rcu_data structures -as well. -The fields in the rcu_state structure represent the -most current value, and those of the other structures are compared -in order to detect the beginnings and ends of grace periods in a distributed -fashion. -The values flow from rcu_state to rcu_node -(down the tree from the root to the leaves) to rcu_data. - -

Miscellaneous
- -

This portion of the rcu_state structure is declared -as follows: - -

-  1   unsigned long gp_max;
-  2   char abbr;
-  3   char *name;
-
- -

The ->gp_max field tracks the duration of the longest -grace period in jiffies. -It is protected by the root rcu_node's ->lock. - -

The ->name and ->abbr fields distinguish -between preemptible RCU (“rcu_preempt” and “p”) -and non-preemptible RCU (“rcu_sched” and “s”). -These fields are used for diagnostic and tracing purposes. - -

-The rcu_node Structure

- -

The rcu_node structures form the combining -tree that propagates quiescent-state -information from the leaves to the root and also that propagates -grace-period information from the root down to the leaves. -They provides local copies of the grace-period state in order -to allow this information to be accessed in a synchronized -manner without suffering the scalability limitations that -would otherwise be imposed by global locking. -In CONFIG_PREEMPT_RCU kernels, they manage the lists -of tasks that have blocked while in their current -RCU read-side critical section. -In CONFIG_PREEMPT_RCU with -CONFIG_RCU_BOOST, they manage the -per-rcu_node priority-boosting -kernel threads (kthreads) and state. -Finally, they record CPU-hotplug state in order to determine -which CPUs should be ignored during a given grace period. - -

The rcu_node structure's fields are discussed, -singly and in groups, in the following sections. - -

Connection to Combining Tree
- -

This portion of the rcu_node structure is declared -as follows: - -

-  1   struct rcu_node *parent;
-  2   u8 level;
-  3   u8 grpnum;
-  4   unsigned long grpmask;
-  5   int grplo;
-  6   int grphi;
-
- -

The ->parent pointer references the rcu_node -one level up in the tree, and is NULL for the root -rcu_node. -The RCU implementation makes heavy use of this field to push quiescent -states up the tree. -The ->level field gives the level in the tree, with -the root being at level zero, its children at level one, and so on. -The ->grpnum field gives this node's position within -the children of its parent, so this number can range between 0 and 31 -on 32-bit systems and between 0 and 63 on 64-bit systems. -The ->level and ->grpnum fields are -used only during initialization and for tracing. -The ->grpmask field is the bitmask counterpart of -->grpnum, and therefore always has exactly one bit set. -This mask is used to clear the bit corresponding to this rcu_node -structure in its parent's bitmasks, which are described later. -Finally, the ->grplo and ->grphi fields -contain the lowest and highest numbered CPU served by this -rcu_node structure, respectively. - -

All of these fields are constant, and thus do not require any -synchronization. - -

Synchronization
- -

This field of the rcu_node structure is declared -as follows: - -

-  1   raw_spinlock_t lock;
-
- -

This field is used to protect the remaining fields in this structure, -unless otherwise stated. -That said, all of the fields in this structure can be accessed without -locking for tracing purposes. -Yes, this can result in confusing traces, but better some tracing confusion -than to be heisenbugged out of existence. - -

Grace-Period Tracking
- -

This portion of the rcu_node structure is declared -as follows: - -

-  1   unsigned long gp_seq;
-  2   unsigned long gp_seq_needed;
-
- -

The rcu_node structures' ->gp_seq fields are -the counterparts of the field of the same name in the rcu_state -structure. -They each may lag up to one step behind their rcu_state -counterpart. -If the bottom two bits of a given rcu_node structure's -->gp_seq field is zero, then this rcu_node -structure believes that RCU is idle. -

The >gp_seq field of each rcu_node -structure is updated at the beginning and the end -of each grace period. - -

The ->gp_seq_needed fields record the -furthest-in-the-future grace period request seen by the corresponding -rcu_node structure. The request is considered fulfilled when -the value of the ->gp_seq field equals or exceeds that of -the ->gp_seq_needed field. - - - - - - - - -
 
Quick Quiz:
- Suppose that this rcu_node structure doesn't see - a request for a very long time. - Won't wrapping of the ->gp_seq field cause - problems? -
Answer:
- No, because if the ->gp_seq_needed field lags behind the - ->gp_seq field, the ->gp_seq_needed field - will be updated at the end of the grace period. - Modulo-arithmetic comparisons therefore will always get the - correct answer, even with wrapping. -
 
- -

Quiescent-State Tracking
- -

These fields manage the propagation of quiescent states up the -combining tree. - -

This portion of the rcu_node structure has fields -as follows: - -

-  1   unsigned long qsmask;
-  2   unsigned long expmask;
-  3   unsigned long qsmaskinit;
-  4   unsigned long expmaskinit;
-
- -

The ->qsmask field tracks which of this -rcu_node structure's children still need to report -quiescent states for the current normal grace period. -Such children will have a value of 1 in their corresponding bit. -Note that the leaf rcu_node structures should be -thought of as having rcu_data structures as their -children. -Similarly, the ->expmask field tracks which -of this rcu_node structure's children still need to report -quiescent states for the current expedited grace period. -An expedited grace period has -the same conceptual properties as a normal grace period, but the -expedited implementation accepts extreme CPU overhead to obtain -much lower grace-period latency, for example, consuming a few -tens of microseconds worth of CPU time to reduce grace-period -duration from milliseconds to tens of microseconds. -The ->qsmaskinit field tracks which of this -rcu_node structure's children cover for at least -one online CPU. -This mask is used to initialize ->qsmask, -and ->expmaskinit is used to initialize -->expmask and the beginning of the -normal and expedited grace periods, respectively. - - - - - - - - -
 
Quick Quiz:
- Why are these bitmasks protected by locking? - Come on, haven't you heard of atomic instructions??? -
Answer:
- Lockless grace-period computation! Such a tantalizing possibility! - - -

But consider the following sequence of events: - - -

    -
  1. CPU 0 has been in dyntick-idle - mode for quite some time. - When it wakes up, it notices that the current RCU - grace period needs it to report in, so it sets a - flag where the scheduling clock interrupt will find it. -

    -

  2. Meanwhile, CPU 1 is running - force_quiescent_state(), - and notices that CPU 0 has been in dyntick idle mode, - which qualifies as an extended quiescent state. -

    -

  3. CPU 0's scheduling clock - interrupt fires in the - middle of an RCU read-side critical section, and notices - that the RCU core needs something, so commences RCU softirq - processing. - -

    -

  4. CPU 0's softirq handler - executes and is just about ready - to report its quiescent state up the rcu_node - tree. -

    -

  5. But CPU 1 beats it to the punch, - completing the current - grace period and starting a new one. -

    -

  6. CPU 0 now reports its quiescent - state for the wrong - grace period. - That grace period might now end before the RCU read-side - critical section. - If that happens, disaster will ensue. - -
- -

So the locking is absolutely required in - order to coordinate clearing of the bits with updating of the - grace-period sequence number in ->gp_seq. -

 
- -

Blocked-Task Management
- -

PREEMPT_RCU allows tasks to be preempted in the -midst of their RCU read-side critical sections, and these tasks -must be tracked explicitly. -The details of exactly why and how they are tracked will be covered -in a separate article on RCU read-side processing. -For now, it is enough to know that the rcu_node -structure tracks them. - -

-  1   struct list_head blkd_tasks;
-  2   struct list_head *gp_tasks;
-  3   struct list_head *exp_tasks;
-  4   bool wait_blkd_tasks;
-
- -

The ->blkd_tasks field is a list header for -the list of blocked and preempted tasks. -As tasks undergo context switches within RCU read-side critical -sections, their task_struct structures are enqueued -(via the task_struct's ->rcu_node_entry -field) onto the head of the ->blkd_tasks list for the -leaf rcu_node structure corresponding to the CPU -on which the outgoing context switch executed. -As these tasks later exit their RCU read-side critical sections, -they remove themselves from the list. -This list is therefore in reverse time order, so that if one of the tasks -is blocking the current grace period, all subsequent tasks must -also be blocking that same grace period. -Therefore, a single pointer into this list suffices to track -all tasks blocking a given grace period. -That pointer is stored in ->gp_tasks for normal -grace periods and in ->exp_tasks for expedited -grace periods. -These last two fields are NULL if either there is -no grace period in flight or if there are no blocked tasks -preventing that grace period from completing. -If either of these two pointers is referencing a task that -removes itself from the ->blkd_tasks list, -then that task must advance the pointer to the next task on -the list, or set the pointer to NULL if there -are no subsequent tasks on the list. - -

For example, suppose that tasks T1, T2, and T3 are -all hard-affinitied to the largest-numbered CPU in the system. -Then if task T1 blocked in an RCU read-side -critical section, then an expedited grace period started, -then task T2 blocked in an RCU read-side critical section, -then a normal grace period started, and finally task 3 blocked -in an RCU read-side critical section, then the state of the -last leaf rcu_node structure's blocked-task list -would be as shown below: - -

blkd_task.svg - -

Task T1 is blocking both grace periods, task T2 is -blocking only the normal grace period, and task T3 is blocking -neither grace period. -Note that these tasks will not remove themselves from this list -immediately upon resuming execution. -They will instead remain on the list until they execute the outermost -rcu_read_unlock() that ends their RCU read-side critical -section. - -

-The ->wait_blkd_tasks field indicates whether or not -the current grace period is waiting on a blocked task. - -

Sizing the rcu_node Array
- -

The rcu_node array is sized via a series of -C-preprocessor expressions as follows: - -

- 1 #ifdef CONFIG_RCU_FANOUT
- 2 #define RCU_FANOUT CONFIG_RCU_FANOUT
- 3 #else
- 4 # ifdef CONFIG_64BIT
- 5 # define RCU_FANOUT 64
- 6 # else
- 7 # define RCU_FANOUT 32
- 8 # endif
- 9 #endif
-10
-11 #ifdef CONFIG_RCU_FANOUT_LEAF
-12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
-13 #else
-14 # ifdef CONFIG_64BIT
-15 # define RCU_FANOUT_LEAF 64
-16 # else
-17 # define RCU_FANOUT_LEAF 32
-18 # endif
-19 #endif
-20
-21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF)
-22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT)
-23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT)
-24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT)
-25
-26 #if NR_CPUS <= RCU_FANOUT_1
-27 #  define RCU_NUM_LVLS        1
-28 #  define NUM_RCU_LVL_0        1
-29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0
-30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 }
-31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" }
-32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" }
-33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" }
-34 #elif NR_CPUS <= RCU_FANOUT_2
-35 #  define RCU_NUM_LVLS        2
-36 #  define NUM_RCU_LVL_0        1
-37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
-38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
-39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
-40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" }
-41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" }
-42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" }
-43 #elif NR_CPUS <= RCU_FANOUT_3
-44 #  define RCU_NUM_LVLS        3
-45 #  define NUM_RCU_LVL_0        1
-46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
-47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
-48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
-49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
-50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
-51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
-52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
-53 #elif NR_CPUS <= RCU_FANOUT_4
-54 #  define RCU_NUM_LVLS        4
-55 #  define NUM_RCU_LVL_0        1
-56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
-57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
-58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
-59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
-60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
-61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
-62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
-63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
-64 #else
-65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
-66 #endif
-
- -

The maximum number of levels in the rcu_node structure -is currently limited to four, as specified by lines 21-24 -and the structure of the subsequent “if” statement. -For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which -should be sufficient for the next few years at least. -For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which -should see us through the next decade or so. -This four-level tree also allows kernels built with -CONFIG_RCU_FANOUT=8 to support up to 4096 CPUs, -which might be useful in very large systems having eight CPUs per -socket (but please note that no one has yet shown any measurable -performance degradation due to misaligned socket and rcu_node -boundaries). -In addition, building kernels with a full four levels of rcu_node -tree permits better testing of RCU's combining-tree code. - -

The RCU_FANOUT symbol controls how many children -are permitted at each non-leaf level of the rcu_node tree. -If the CONFIG_RCU_FANOUT Kconfig option is not specified, -it is set based on the word size of the system, which is also -the Kconfig default. - -

The RCU_FANOUT_LEAF symbol controls how many CPUs are -handled by each leaf rcu_node structure. -Experience has shown that allowing a given leaf rcu_node -structure to handle 64 CPUs, as permitted by the number of bits in -the ->qsmask field on a 64-bit system, results in -excessive contention for the leaf rcu_node structures' -->lock fields. -The number of CPUs per leaf rcu_node structure is therefore -limited to 16 given the default value of CONFIG_RCU_FANOUT_LEAF. -If CONFIG_RCU_FANOUT_LEAF is unspecified, the value -selected is based on the word size of the system, just as for -CONFIG_RCU_FANOUT. -Lines 11-19 perform this computation. - -

Lines 21-24 compute the maximum number of CPUs supported by -a single-level (which contains a single rcu_node structure), -two-level, three-level, and four-level rcu_node tree, -respectively, given the fanout specified by RCU_FANOUT -and RCU_FANOUT_LEAF. -These numbers of CPUs are retained in the -RCU_FANOUT_1, -RCU_FANOUT_2, -RCU_FANOUT_3, and -RCU_FANOUT_4 -C-preprocessor variables, respectively. - -

These variables are used to control the C-preprocessor #if -statement spanning lines 26-66 that computes the number of -rcu_node structures required for each level of the tree, -as well as the number of levels required. -The number of levels is placed in the NUM_RCU_LVLS -C-preprocessor variable by lines 27, 35, 44, and 54. -The number of rcu_node structures for the topmost level -of the tree is always exactly one, and this value is unconditionally -placed into NUM_RCU_LVL_0 by lines 28, 36, 45, and 55. -The rest of the levels (if any) of the rcu_node tree -are computed by dividing the maximum number of CPUs by the -fanout supported by the number of levels from the current level down, -rounding up. This computation is performed by lines 37, -46-47, and 56-58. -Lines 31-33, 40-42, 50-52, and 62-63 create initializers -for lockdep lock-class names. -Finally, lines 64-66 produce an error if the maximum number of -CPUs is too large for the specified fanout. - -

-The rcu_segcblist Structure

- -The rcu_segcblist structure maintains a segmented list of -callbacks as follows: - -
- 1 #define RCU_DONE_TAIL        0
- 2 #define RCU_WAIT_TAIL        1
- 3 #define RCU_NEXT_READY_TAIL  2
- 4 #define RCU_NEXT_TAIL        3
- 5 #define RCU_CBLIST_NSEGS     4
- 6
- 7 struct rcu_segcblist {
- 8   struct rcu_head *head;
- 9   struct rcu_head **tails[RCU_CBLIST_NSEGS];
-10   unsigned long gp_seq[RCU_CBLIST_NSEGS];
-11   long len;
-12   long len_lazy;
-13 };
-
- -

-The segments are as follows: - -

    -
  1. RCU_DONE_TAIL: Callbacks whose grace periods have elapsed. - These callbacks are ready to be invoked. -
  2. RCU_WAIT_TAIL: Callbacks that are waiting for the - current grace period. - Note that different CPUs can have different ideas about which - grace period is current, hence the ->gp_seq field. -
  3. RCU_NEXT_READY_TAIL: Callbacks waiting for the next - grace period to start. -
  4. RCU_NEXT_TAIL: Callbacks that have not yet been - associated with a grace period. -
- -

-The ->head pointer references the first callback or -is NULL if the list contains no callbacks (which is -not the same as being empty). -Each element of the ->tails[] array references the -->next pointer of the last callback in the corresponding -segment of the list, or the list's ->head pointer if -that segment and all previous segments are empty. -If the corresponding segment is empty but some previous segment is -not empty, then the array element is identical to its predecessor. -Older callbacks are closer to the head of the list, and new callbacks -are added at the tail. -This relationship between the ->head pointer, the -->tails[] array, and the callbacks is shown in this -diagram: - -

nxtlist.svg - -

In this figure, the ->head pointer references the -first -RCU callback in the list. -The ->tails[RCU_DONE_TAIL] array element references -the ->head pointer itself, indicating that none -of the callbacks is ready to invoke. -The ->tails[RCU_WAIT_TAIL] array element references callback -CB 2's ->next pointer, which indicates that -CB 1 and CB 2 are both waiting on the current grace period, -give or take possible disagreements about exactly which grace period -is the current one. -The ->tails[RCU_NEXT_READY_TAIL] array element -references the same RCU callback that ->tails[RCU_WAIT_TAIL] -does, which indicates that there are no callbacks waiting on the next -RCU grace period. -The ->tails[RCU_NEXT_TAIL] array element references -CB 4's ->next pointer, indicating that all the -remaining RCU callbacks have not yet been assigned to an RCU grace -period. -Note that the ->tails[RCU_NEXT_TAIL] array element -always references the last RCU callback's ->next pointer -unless the callback list is empty, in which case it references -the ->head pointer. - -

-There is one additional important special case for the -->tails[RCU_NEXT_TAIL] array element: It can be NULL -when this list is disabled. -Lists are disabled when the corresponding CPU is offline or when -the corresponding CPU's callbacks are offloaded to a kthread, -both of which are described elsewhere. - -

CPUs advance their callbacks from the -RCU_NEXT_TAIL to the RCU_NEXT_READY_TAIL to the -RCU_WAIT_TAIL to the RCU_DONE_TAIL list segments -as grace periods advance. - -

The ->gp_seq[] array records grace-period -numbers corresponding to the list segments. -This is what allows different CPUs to have different ideas as to -which is the current grace period while still avoiding premature -invocation of their callbacks. -In particular, this allows CPUs that go idle for extended periods -to determine which of their callbacks are ready to be invoked after -reawakening. - -

The ->len counter contains the number of -callbacks in ->head, and the -->len_lazy contains the number of those callbacks that -are known to only free memory, and whose invocation can therefore -be safely deferred. - -

Important note: It is the ->len field that -determines whether or not there are callbacks associated with -this rcu_segcblist structure, not the ->head -pointer. -The reason for this is that all the ready-to-invoke callbacks -(that is, those in the RCU_DONE_TAIL segment) are extracted -all at once at callback-invocation time (rcu_do_batch), due -to which ->head may be set to NULL if there are no not-done -callbacks remaining in the rcu_segcblist. -If callback invocation must be postponed, for example, because a -high-priority process just woke up on this CPU, then the remaining -callbacks are placed back on the RCU_DONE_TAIL segment and -->head once again points to the start of the segment. -In short, the head field can briefly be NULL even though the -CPU has callbacks present the entire time. -Therefore, it is not appropriate to test the ->head pointer -for NULL. - -

In contrast, the ->len and ->len_lazy counts -are adjusted only after the corresponding callbacks have been invoked. -This means that the ->len count is zero only if -the rcu_segcblist structure really is devoid of callbacks. -Of course, off-CPU sampling of the ->len count requires -careful use of appropriate synchronization, for example, memory barriers. -This synchronization can be a bit subtle, particularly in the case -of rcu_barrier(). - -

-The rcu_data Structure

- -

The rcu_data maintains the per-CPU state for the RCU subsystem. -The fields in this structure may be accessed only from the corresponding -CPU (and from tracing) unless otherwise stated. -This structure is the -focus of quiescent-state detection and RCU callback queuing. -It also tracks its relationship to the corresponding leaf -rcu_node structure to allow more-efficient -propagation of quiescent states up the rcu_node -combining tree. -Like the rcu_node structure, it provides a local -copy of the grace-period information to allow for-free -synchronized -access to this information from the corresponding CPU. -Finally, this structure records past dyntick-idle state -for the corresponding CPU and also tracks statistics. - -

The rcu_data structure's fields are discussed, -singly and in groups, in the following sections. - -

Connection to Other Data Structures
- -

This portion of the rcu_data structure is declared -as follows: - -

-  1   int cpu;
-  2   struct rcu_node *mynode;
-  3   unsigned long grpmask;
-  4   bool beenonline;
-
- -

The ->cpu field contains the number of the -corresponding CPU and the ->mynode field references the -corresponding rcu_node structure. -The ->mynode is used to propagate quiescent states -up the combining tree. -These two fields are constant and therefore do not require synchronization. - -

The ->grpmask field indicates the bit in -the ->mynode->qsmask corresponding to this -rcu_data structure, and is also used when propagating -quiescent states. -The ->beenonline flag is set whenever the corresponding -CPU comes online, which means that the debugfs tracing need not dump -out any rcu_data structure for which this flag is not set. - -

Quiescent-State and Grace-Period Tracking
- -

This portion of the rcu_data structure is declared -as follows: - -

-  1   unsigned long gp_seq;
-  2   unsigned long gp_seq_needed;
-  3   bool cpu_no_qs;
-  4   bool core_needs_qs;
-  5   bool gpwrap;
-
- -

The ->gp_seq field is the counterpart of the field of the same -name in the rcu_state and rcu_node structures. The -->gp_seq_needed field is the counterpart of the field of the same -name in the rcu_node structure. -They may each lag up to one behind their rcu_node -counterparts, but in CONFIG_NO_HZ_IDLE and -CONFIG_NO_HZ_FULL kernels can lag -arbitrarily far behind for CPUs in dyntick-idle mode (but these counters -will catch up upon exit from dyntick-idle mode). -If the lower two bits of a given rcu_data structure's -->gp_seq are zero, then this rcu_data -structure believes that RCU is idle. - - - - - - - - -
 
Quick Quiz:
- All this replication of the grace period numbers can only cause - massive confusion. - Why not just keep a global sequence number and be done with it??? -
Answer:
- Because if there was only a single global sequence - numbers, there would need to be a single global lock to allow - safely accessing and updating it. - And if we are not going to have a single global lock, we need - to carefully manage the numbers on a per-node basis. - Recall from the answer to a previous Quick Quiz that the consequences - of applying a previously sampled quiescent state to the wrong - grace period are quite severe. -
 
- -

The ->cpu_no_qs flag indicates that the -CPU has not yet passed through a quiescent state, -while the ->core_needs_qs flag indicates that the -RCU core needs a quiescent state from the corresponding CPU. -The ->gpwrap field indicates that the corresponding -CPU has remained idle for so long that the -gp_seq counter is in danger of overflow, which -will cause the CPU to disregard the values of its counters on -its next exit from idle. - -

RCU Callback Handling
- -

In the absence of CPU-hotplug events, RCU callbacks are invoked by -the same CPU that registered them. -This is strictly a cache-locality optimization: callbacks can and -do get invoked on CPUs other than the one that registered them. -After all, if the CPU that registered a given callback has gone -offline before the callback can be invoked, there really is no other -choice. - -

This portion of the rcu_data structure is declared -as follows: - -

- 1 struct rcu_segcblist cblist;
- 2 long qlen_last_fqs_check;
- 3 unsigned long n_cbs_invoked;
- 4 unsigned long n_nocbs_invoked;
- 5 unsigned long n_cbs_orphaned;
- 6 unsigned long n_cbs_adopted;
- 7 unsigned long n_force_qs_snap;
- 8 long blimit;
-
- -

The ->cblist structure is the segmented callback list -described earlier. -The CPU advances the callbacks in its rcu_data structure -whenever it notices that another RCU grace period has completed. -The CPU detects the completion of an RCU grace period by noticing -that the value of its rcu_data structure's -->gp_seq field differs from that of its leaf -rcu_node structure. -Recall that each rcu_node structure's -->gp_seq field is updated at the beginnings and ends of each -grace period. - -

-The ->qlen_last_fqs_check and -->n_force_qs_snap coordinate the forcing of quiescent -states from call_rcu() and friends when callback -lists grow excessively long. - -

The ->n_cbs_invoked, -->n_cbs_orphaned, and ->n_cbs_adopted -fields count the number of callbacks invoked, -sent to other CPUs when this CPU goes offline, -and received from other CPUs when those other CPUs go offline. -The ->n_nocbs_invoked is used when the CPU's callbacks -are offloaded to a kthread. - -

-Finally, the ->blimit counter is the maximum number of -RCU callbacks that may be invoked at a given time. - -

Dyntick-Idle Handling
- -

This portion of the rcu_data structure is declared -as follows: - -

-  1   int dynticks_snap;
-  2   unsigned long dynticks_fqs;
-
- -The ->dynticks_snap field is used to take a snapshot -of the corresponding CPU's dyntick-idle state when forcing -quiescent states, and is therefore accessed from other CPUs. -Finally, the ->dynticks_fqs field is used to -count the number of times this CPU is determined to be in -dyntick-idle state, and is used for tracing and debugging purposes. - -

-This portion of the rcu_data structure is declared as follows: - -

-  1   long dynticks_nesting;
-  2   long dynticks_nmi_nesting;
-  3   atomic_t dynticks;
-  4   bool rcu_need_heavy_qs;
-  5   bool rcu_urgent_qs;
-
- -

These fields in the rcu_data structure maintain the per-CPU dyntick-idle -state for the corresponding CPU. -The fields may be accessed only from the corresponding CPU (and from tracing) -unless otherwise stated. - -

The ->dynticks_nesting field counts the -nesting depth of process execution, so that in normal circumstances -this counter has value zero or one. -NMIs, irqs, and tracers are counted by the ->dynticks_nmi_nesting -field. -Because NMIs cannot be masked, changes to this variable have to be -undertaken carefully using an algorithm provided by Andy Lutomirski. -The initial transition from idle adds one, and nested transitions -add two, so that a nesting level of five is represented by a -->dynticks_nmi_nesting value of nine. -This counter can therefore be thought of as counting the number -of reasons why this CPU cannot be permitted to enter dyntick-idle -mode, aside from process-level transitions. - -

However, it turns out that when running in non-idle kernel context, -the Linux kernel is fully capable of entering interrupt handlers that -never exit and perhaps also vice versa. -Therefore, whenever the ->dynticks_nesting field is -incremented up from zero, the ->dynticks_nmi_nesting field -is set to a large positive number, and whenever the -->dynticks_nesting field is decremented down to zero, -the the ->dynticks_nmi_nesting field is set to zero. -Assuming that the number of misnested interrupts is not sufficient -to overflow the counter, this approach corrects the -->dynticks_nmi_nesting field every time the corresponding -CPU enters the idle loop from process context. - -

The ->dynticks field counts the corresponding -CPU's transitions to and from either dyntick-idle or user mode, so -that this counter has an even value when the CPU is in dyntick-idle -mode or user mode and an odd value otherwise. The transitions to/from -user mode need to be counted for user mode adaptive-ticks support -(see timers/NO_HZ.txt). - -

The ->rcu_need_heavy_qs field is used -to record the fact that the RCU core code would really like to -see a quiescent state from the corresponding CPU, so much so that -it is willing to call for heavy-weight dyntick-counter operations. -This flag is checked by RCU's context-switch and cond_resched() -code, which provide a momentary idle sojourn in response. - -

Finally, the ->rcu_urgent_qs field is used to record -the fact that the RCU core code would really like to see a quiescent state from -the corresponding CPU, with the various other fields indicating just how badly -RCU wants this quiescent state. -This flag is checked by RCU's context-switch path -(rcu_note_context_switch) and the cond_resched code. - - - - - - - - -
 
Quick Quiz:
- Why not simply combine the ->dynticks_nesting - and ->dynticks_nmi_nesting counters into a - single counter that just counts the number of reasons that - the corresponding CPU is non-idle? -
Answer:
- Because this would fail in the presence of interrupts whose - handlers never return and of handlers that manage to return - from a made-up interrupt. -
 
- -

Additional fields are present for some special-purpose -builds, and are discussed separately. - -

-The rcu_head Structure

- -

Each rcu_head structure represents an RCU callback. -These structures are normally embedded within RCU-protected data -structures whose algorithms use asynchronous grace periods. -In contrast, when using algorithms that block waiting for RCU grace periods, -RCU users need not provide rcu_head structures. - -

The rcu_head structure has fields as follows: - -

-  1   struct rcu_head *next;
-  2   void (*func)(struct rcu_head *head);
-
- -

The ->next field is used -to link the rcu_head structures together in the -lists within the rcu_data structures. -The ->func field is a pointer to the function -to be called when the callback is ready to be invoked, and -this function is passed a pointer to the rcu_head -structure. -However, kfree_rcu() uses the ->func -field to record the offset of the rcu_head -structure within the enclosing RCU-protected data structure. - -

Both of these fields are used internally by RCU. -From the viewpoint of RCU users, this structure is an -opaque “cookie”. - - - - - - - - -
 
Quick Quiz:
- Given that the callback function ->func - is passed a pointer to the rcu_head structure, - how is that function supposed to find the beginning of the - enclosing RCU-protected data structure? -
Answer:
- In actual practice, there is a separate callback function per - type of RCU-protected data structure. - The callback function can therefore use the container_of() - macro in the Linux kernel (or other pointer-manipulation facilities - in other software environments) to find the beginning of the - enclosing structure. -
 
- -

-RCU-Specific Fields in the task_struct Structure

- -

The CONFIG_PREEMPT_RCU implementation uses some -additional fields in the task_struct structure: - -

- 1 #ifdef CONFIG_PREEMPT_RCU
- 2   int rcu_read_lock_nesting;
- 3   union rcu_special rcu_read_unlock_special;
- 4   struct list_head rcu_node_entry;
- 5   struct rcu_node *rcu_blocked_node;
- 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
- 7 #ifdef CONFIG_TASKS_RCU
- 8   unsigned long rcu_tasks_nvcsw;
- 9   bool rcu_tasks_holdout;
-10   struct list_head rcu_tasks_holdout_list;
-11   int rcu_tasks_idle_cpu;
-12 #endif /* #ifdef CONFIG_TASKS_RCU */
-
- -

The ->rcu_read_lock_nesting field records the -nesting level for RCU read-side critical sections, and -the ->rcu_read_unlock_special field is a bitmask -that records special conditions that require rcu_read_unlock() -to do additional work. -The ->rcu_node_entry field is used to form lists of -tasks that have blocked within preemptible-RCU read-side critical -sections and the ->rcu_blocked_node field references -the rcu_node structure whose list this task is a member of, -or NULL if it is not blocked within a preemptible-RCU -read-side critical section. - -

The ->rcu_tasks_nvcsw field tracks the number of -voluntary context switches that this task had undergone at the -beginning of the current tasks-RCU grace period, -->rcu_tasks_holdout is set if the current tasks-RCU -grace period is waiting on this task, ->rcu_tasks_holdout_list -is a list element enqueuing this task on the holdout list, -and ->rcu_tasks_idle_cpu tracks which CPU this -idle task is running, but only if the task is currently running, -that is, if the CPU is currently idle. - -

-Accessor Functions

- -

The following listing shows the -rcu_get_root(), rcu_for_each_node_breadth_first and -rcu_for_each_leaf_node() function and macros: - -

-  1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
-  2 {
-  3   return &rsp->node[0];
-  4 }
-  5
-  6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
-  7   for ((rnp) = &(rsp)->node[0]; \
-  8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
-  9
- 10 #define rcu_for_each_leaf_node(rsp, rnp) \
- 11   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
- 12        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
-
- -

The rcu_get_root() simply returns a pointer to the -first element of the specified rcu_state structure's -->node[] array, which is the root rcu_node -structure. - -

As noted earlier, the rcu_for_each_node_breadth_first() -macro takes advantage of the layout of the rcu_node -structures in the rcu_state structure's -->node[] array, performing a breadth-first traversal by -simply traversing the array in order. -Similarly, the rcu_for_each_leaf_node() macro traverses only -the last part of the array, thus traversing only the leaf -rcu_node structures. - - - - - - - - -
 
Quick Quiz:
- What does - rcu_for_each_leaf_node() do if the rcu_node tree - contains only a single node? -
Answer:
- In the single-node case, - rcu_for_each_leaf_node() traverses the single node. -
 
- -

-Summary

- -So the state of RCU is represented by an rcu_state structure, -which contains a combining tree of rcu_node and -rcu_data structures. -Finally, in CONFIG_NO_HZ_IDLE kernels, each CPU's dyntick-idle -state is tracked by dynticks-related fields in the rcu_data structure. - -If you made it this far, you are well prepared to read the code -walkthroughs in the other articles in this series. - -

-Acknowledgments

- -I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul -Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn -for helping me get this document into a more human-readable state. - -

-Legal Statement

- -

This work represents the view of the author and does not necessarily -represent the view of IBM. - -

Linux is a registered trademark of Linus Torvalds. - -

Other company, product, and service names may be trademarks or -service marks of others. - - diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst new file mode 100644 index 0000000000000000000000000000000000000000..4a48e20a46f2b005ce0b31ba01fa4657b0a7172a --- /dev/null +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst @@ -0,0 +1,1163 @@ +=================================================== +A Tour Through TREE_RCU's Data Structures [LWN.net] +=================================================== + +December 18, 2016 + +This article was contributed by Paul E. McKenney + +Introduction +============ + +This document describes RCU's major data structures and their relationship +to each other. + +Data-Structure Relationships +============================ + +RCU is for all intents and purposes a large state machine, and its +data structures maintain the state in such a way as to allow RCU readers +to execute extremely quickly, while also processing the RCU grace periods +requested by updaters in an efficient and extremely scalable fashion. +The efficiency and scalability of RCU updaters is provided primarily +by a combining tree, as shown below: + +.. kernel-figure:: BigTreeClassicRCU.svg + +This diagram shows an enclosing ``rcu_state`` structure containing a tree +of ``rcu_node`` structures. Each leaf node of the ``rcu_node`` tree has up +to 16 ``rcu_data`` structures associated with it, so that there are +``NR_CPUS`` number of ``rcu_data`` structures, one for each possible CPU. +This structure is adjusted at boot time, if needed, to handle the common +case where ``nr_cpu_ids`` is much less than ``NR_CPUs``. +For example, a number of Linux distributions set ``NR_CPUs=4096``, +which results in a three-level ``rcu_node`` tree. +If the actual hardware has only 16 CPUs, RCU will adjust itself +at boot time, resulting in an ``rcu_node`` tree with only a single node. + +The purpose of this combining tree is to allow per-CPU events +such as quiescent states, dyntick-idle transitions, +and CPU hotplug operations to be processed efficiently +and scalably. +Quiescent states are recorded by the per-CPU ``rcu_data`` structures, +and other events are recorded by the leaf-level ``rcu_node`` +structures. +All of these events are combined at each level of the tree until finally +grace periods are completed at the tree's root ``rcu_node`` +structure. +A grace period can be completed at the root once every CPU +(or, in the case of ``CONFIG_PREEMPT_RCU``, task) +has passed through a quiescent state. +Once a grace period has completed, record of that fact is propagated +back down the tree. + +As can be seen from the diagram, on a 64-bit system +a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout +of 64 at the root and a fanout of 16 at the leaves. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why isn't the fanout at the leaves also 64? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because there are more types of events that affect the leaf-level | +| ``rcu_node`` structures than further up the tree. Therefore, if the | +| leaf ``rcu_node`` structures have fanout of 64, the contention on | +| these structures' ``->structures`` becomes excessive. Experimentation | +| on a wide variety of systems has shown that a fanout of 16 works well | +| for the leaves of the ``rcu_node`` tree. | +| | +| Of course, further experience with systems having hundreds or | +| thousands of CPUs may demonstrate that the fanout for the non-leaf | +| ``rcu_node`` structures must also be reduced. Such reduction can be | +| easily carried out when and if it proves necessary. In the meantime, | +| if you are using such a system and running into contention problems | +| on the non-leaf ``rcu_node`` structures, you may use the | +| ``CONFIG_RCU_FANOUT`` kernel configuration parameter to reduce the | +| non-leaf fanout as needed. | +| | +| Kernels built for systems with strong NUMA characteristics might | +| also need to adjust ``CONFIG_RCU_FANOUT`` so that the domains of | +| the ``rcu_node`` structures align with hardware boundaries. | +| However, there has thus far been no need for this. | ++-----------------------------------------------------------------------+ + +If your system has more than 1,024 CPUs (or more than 512 CPUs on a +32-bit system), then RCU will automatically add more levels to the tree. +For example, if you are crazy enough to build a 64-bit system with +65,536 CPUs, RCU would configure the ``rcu_node`` tree as follows: + +.. kernel-figure:: HugeTreeClassicRCU.svg + +RCU currently permits up to a four-level tree, which on a 64-bit system +accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for +32-bit systems. On the other hand, you can set both +``CONFIG_RCU_FANOUT`` and ``CONFIG_RCU_FANOUT_LEAF`` to be as small as +2, which would result in a 16-CPU test using a 4-level tree. This can be +useful for testing large-system capabilities on small test machines. + +This multi-level combining tree allows us to get most of the performance +and scalability benefits of partitioning, even though RCU grace-period +detection is inherently a global operation. The trick here is that only +the last CPU to report a quiescent state into a given ``rcu_node`` +structure need advance to the ``rcu_node`` structure at the next level +up the tree. This means that at the leaf-level ``rcu_node`` structure, +only one access out of sixteen will progress up the tree. For the +internal ``rcu_node`` structures, the situation is even more extreme: +Only one access out of sixty-four will progress up the tree. Because the +vast majority of the CPUs do not progress up the tree, the lock +contention remains roughly constant up the tree. No matter how many CPUs +there are in the system, at most 64 quiescent-state reports per grace +period will progress all the way to the root ``rcu_node`` structure, +thus ensuring that the lock contention on that root ``rcu_node`` +structure remains acceptably low. + +In effect, the combining tree acts like a big shock absorber, keeping +lock contention under control at all tree levels regardless of the level +of loading on the system. + +RCU updaters wait for normal grace periods by registering RCU callbacks, +either directly via ``call_rcu()`` or indirectly via +``synchronize_rcu()`` and friends. RCU callbacks are represented by +``rcu_head`` structures, which are queued on ``rcu_data`` structures +while they are waiting for a grace period to elapse, as shown in the +following figure: + +.. kernel-figure:: BigTreePreemptRCUBHdyntickCB.svg + +This figure shows how ``TREE_RCU``'s and ``PREEMPT_RCU``'s major data +structures are related. Lesser data structures will be introduced with +the algorithms that make use of them. + +Note that each of the data structures in the above figure has its own +synchronization: + +#. Each ``rcu_state`` structures has a lock and a mutex, and some fields + are protected by the corresponding root ``rcu_node`` structure's lock. +#. Each ``rcu_node`` structure has a spinlock. +#. The fields in ``rcu_data`` are private to the corresponding CPU, + although a few can be read and written by other CPUs. + +It is important to note that different data structures can have very +different ideas about the state of RCU at any given time. For but one +example, awareness of the start or end of a given RCU grace period +propagates slowly through the data structures. This slow propagation is +absolutely necessary for RCU to have good read-side performance. If this +balkanized implementation seems foreign to you, one useful trick is to +consider each instance of these data structures to be a different +person, each having the usual slightly different view of reality. + +The general role of each of these data structures is as follows: + +#. ``rcu_state``: This structure forms the interconnection between the + ``rcu_node`` and ``rcu_data`` structures, tracks grace periods, + serves as short-term repository for callbacks orphaned by CPU-hotplug + events, maintains ``rcu_barrier()`` state, tracks expedited + grace-period state, and maintains state used to force quiescent + states when grace periods extend too long, +#. ``rcu_node``: This structure forms the combining tree that propagates + quiescent-state information from the leaves to the root, and also + propagates grace-period information from the root to the leaves. It + provides local copies of the grace-period state in order to allow + this information to be accessed in a synchronized manner without + suffering the scalability limitations that would otherwise be imposed + by global locking. In ``CONFIG_PREEMPT_RCU`` kernels, it manages the + lists of tasks that have blocked while in their current RCU read-side + critical section. In ``CONFIG_PREEMPT_RCU`` with + ``CONFIG_RCU_BOOST``, it manages the per-\ ``rcu_node`` + priority-boosting kernel threads (kthreads) and state. Finally, it + records CPU-hotplug state in order to determine which CPUs should be + ignored during a given grace period. +#. ``rcu_data``: This per-CPU structure is the focus of quiescent-state + detection and RCU callback queuing. It also tracks its relationship + to the corresponding leaf ``rcu_node`` structure to allow + more-efficient propagation of quiescent states up the ``rcu_node`` + combining tree. Like the ``rcu_node`` structure, it provides a local + copy of the grace-period information to allow for-free synchronized + access to this information from the corresponding CPU. Finally, this + structure records past dyntick-idle state for the corresponding CPU + and also tracks statistics. +#. ``rcu_head``: This structure represents RCU callbacks, and is the + only structure allocated and managed by RCU users. The ``rcu_head`` + structure is normally embedded within the RCU-protected data + structure. + +If all you wanted from this article was a general notion of how RCU's +data structures are related, you are done. Otherwise, each of the +following sections give more details on the ``rcu_state``, ``rcu_node`` +and ``rcu_data`` data structures. + +The ``rcu_state`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_state`` structure is the base structure that represents the +state of RCU in the system. This structure forms the interconnection +between the ``rcu_node`` and ``rcu_data`` structures, tracks grace +periods, contains the lock used to synchronize with CPU-hotplug events, +and maintains state used to force quiescent states when grace periods +extend too long, + +A few of the ``rcu_state`` structure's fields are discussed, singly and +in groups, in the following sections. The more specialized fields are +covered in the discussion of their use. + +Relationship to rcu_node and rcu_data Structures +'''''''''''''''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 struct rcu_node node[NUM_RCU_NODES]; + 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; + 3 struct rcu_data __percpu *rda; + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! You said that the ``rcu_node`` structures formed a | +| tree, but they are declared as a flat array! What gives? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| The tree is laid out in the array. The first node In the array is the | +| head, the next set of nodes in the array are children of the head | +| node, and so on until the last set of nodes in the array are the | +| leaves. | +| See the following diagrams to see how this works. | ++-----------------------------------------------------------------------+ + +The ``rcu_node`` tree is embedded into the ``->node[]`` array as shown +in the following figure: + +.. kernel-figure:: TreeMapping.svg + +One interesting consequence of this mapping is that a breadth-first +traversal of the tree is implemented as a simple linear scan of the +array, which is in fact what the ``rcu_for_each_node_breadth_first()`` +macro does. This macro is used at the beginning and ends of grace +periods. + +Each entry of the ``->level`` array references the first ``rcu_node`` +structure on the corresponding level of the tree, for example, as shown +below: + +.. kernel-figure:: TreeMappingLevel.svg + +The zero\ :sup:`th` element of the array references the root +``rcu_node`` structure, the first element references the first child of +the root ``rcu_node``, and finally the second element references the +first leaf ``rcu_node`` structure. + +For whatever it is worth, if you draw the tree to be tree-shaped rather +than array-shaped, it is easy to draw a planar representation: + +.. kernel-figure:: TreeLevel.svg + +Finally, the ``->rda`` field references a per-CPU pointer to the +corresponding CPU's ``rcu_data`` structure. + +All of these fields are constant once initialization is complete, and +therefore need no protection. + +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + +RCU grace periods are numbered, and the ``->gp_seq`` field contains the +current grace-period sequence number. The bottom two bits are the state +of the current grace period, which can be zero for not yet started or +one for in progress. In other words, if the bottom two bits of +``->gp_seq`` are zero, then RCU is idle. Any other value in the bottom +two bits indicates that something is broken. This field is protected by +the root ``rcu_node`` structure's ``->lock`` field. + +There are ``->gp_seq`` fields in the ``rcu_node`` and ``rcu_data`` +structures as well. The fields in the ``rcu_state`` structure represent +the most current value, and those of the other structures are compared +in order to detect the beginnings and ends of grace periods in a +distributed fashion. The values flow from ``rcu_state`` to ``rcu_node`` +(down the tree from the root to the leaves) to ``rcu_data``. + +Miscellaneous +''''''''''''' + +This portion of the ``rcu_state`` structure is declared as follows: + +:: + + 1 unsigned long gp_max; + 2 char abbr; + 3 char *name; + +The ``->gp_max`` field tracks the duration of the longest grace period +in jiffies. It is protected by the root ``rcu_node``'s ``->lock``. + +The ``->name`` and ``->abbr`` fields distinguish between preemptible RCU +(“rcu_preempt” and “p”) and non-preemptible RCU (“rcu_sched” and “s”). +These fields are used for diagnostic and tracing purposes. + +The ``rcu_node`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_node`` structures form the combining tree that propagates +quiescent-state information from the leaves to the root and also that +propagates grace-period information from the root down to the leaves. +They provides local copies of the grace-period state in order to allow +this information to be accessed in a synchronized manner without +suffering the scalability limitations that would otherwise be imposed by +global locking. In ``CONFIG_PREEMPT_RCU`` kernels, they manage the lists +of tasks that have blocked while in their current RCU read-side critical +section. In ``CONFIG_PREEMPT_RCU`` with ``CONFIG_RCU_BOOST``, they +manage the per-\ ``rcu_node`` priority-boosting kernel threads +(kthreads) and state. Finally, they record CPU-hotplug state in order to +determine which CPUs should be ignored during a given grace period. + +The ``rcu_node`` structure's fields are discussed, singly and in groups, +in the following sections. + +Connection to Combining Tree +'''''''''''''''''''''''''''' + +This portion of the ``rcu_node`` structure is declared as follows: + +:: + + 1 struct rcu_node *parent; + 2 u8 level; + 3 u8 grpnum; + 4 unsigned long grpmask; + 5 int grplo; + 6 int grphi; + +The ``->parent`` pointer references the ``rcu_node`` one level up in the +tree, and is ``NULL`` for the root ``rcu_node``. The RCU implementation +makes heavy use of this field to push quiescent states up the tree. The +``->level`` field gives the level in the tree, with the root being at +level zero, its children at level one, and so on. The ``->grpnum`` field +gives this node's position within the children of its parent, so this +number can range between 0 and 31 on 32-bit systems and between 0 and 63 +on 64-bit systems. The ``->level`` and ``->grpnum`` fields are used only +during initialization and for tracing. The ``->grpmask`` field is the +bitmask counterpart of ``->grpnum``, and therefore always has exactly +one bit set. This mask is used to clear the bit corresponding to this +``rcu_node`` structure in its parent's bitmasks, which are described +later. Finally, the ``->grplo`` and ``->grphi`` fields contain the +lowest and highest numbered CPU served by this ``rcu_node`` structure, +respectively. + +All of these fields are constant, and thus do not require any +synchronization. + +Synchronization +''''''''''''''' + +This field of the ``rcu_node`` structure is declared as follows: + +:: + + 1 raw_spinlock_t lock; + +This field is used to protect the remaining fields in this structure, +unless otherwise stated. That said, all of the fields in this structure +can be accessed without locking for tracing purposes. Yes, this can +result in confusing traces, but better some tracing confusion than to be +heisenbugged out of existence. + +.. _grace-period-tracking-1: + +Grace-Period Tracking +''''''''''''''''''''' + +This portion of the ``rcu_node`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + 2 unsigned long gp_seq_needed; + +The ``rcu_node`` structures' ``->gp_seq`` fields are the counterparts of +the field of the same name in the ``rcu_state`` structure. They each may +lag up to one step behind their ``rcu_state`` counterpart. If the bottom +two bits of a given ``rcu_node`` structure's ``->gp_seq`` field is zero, +then this ``rcu_node`` structure believes that RCU is idle. + +The ``>gp_seq`` field of each ``rcu_node`` structure is updated at the +beginning and the end of each grace period. + +The ``->gp_seq_needed`` fields record the furthest-in-the-future grace +period request seen by the corresponding ``rcu_node`` structure. The +request is considered fulfilled when the value of the ``->gp_seq`` field +equals or exceeds that of the ``->gp_seq_needed`` field. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Suppose that this ``rcu_node`` structure doesn't see a request for a | +| very long time. Won't wrapping of the ``->gp_seq`` field cause | +| problems? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, because if the ``->gp_seq_needed`` field lags behind the | +| ``->gp_seq`` field, the ``->gp_seq_needed`` field will be updated at | +| the end of the grace period. Modulo-arithmetic comparisons therefore | +| will always get the correct answer, even with wrapping. | ++-----------------------------------------------------------------------+ + +Quiescent-State Tracking +'''''''''''''''''''''''' + +These fields manage the propagation of quiescent states up the combining +tree. + +This portion of the ``rcu_node`` structure has fields as follows: + +:: + + 1 unsigned long qsmask; + 2 unsigned long expmask; + 3 unsigned long qsmaskinit; + 4 unsigned long expmaskinit; + +The ``->qsmask`` field tracks which of this ``rcu_node`` structure's +children still need to report quiescent states for the current normal +grace period. Such children will have a value of 1 in their +corresponding bit. Note that the leaf ``rcu_node`` structures should be +thought of as having ``rcu_data`` structures as their children. +Similarly, the ``->expmask`` field tracks which of this ``rcu_node`` +structure's children still need to report quiescent states for the +current expedited grace period. An expedited grace period has the same +conceptual properties as a normal grace period, but the expedited +implementation accepts extreme CPU overhead to obtain much lower +grace-period latency, for example, consuming a few tens of microseconds +worth of CPU time to reduce grace-period duration from milliseconds to +tens of microseconds. The ``->qsmaskinit`` field tracks which of this +``rcu_node`` structure's children cover for at least one online CPU. +This mask is used to initialize ``->qsmask``, and ``->expmaskinit`` is +used to initialize ``->expmask`` and the beginning of the normal and +expedited grace periods, respectively. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why are these bitmasks protected by locking? Come on, haven't you | +| heard of atomic instructions??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Lockless grace-period computation! Such a tantalizing possibility! | +| But consider the following sequence of events: | +| | +| #. CPU 0 has been in dyntick-idle mode for quite some time. When it | +| wakes up, it notices that the current RCU grace period needs it to | +| report in, so it sets a flag where the scheduling clock interrupt | +| will find it. | +| #. Meanwhile, CPU 1 is running ``force_quiescent_state()``, and | +| notices that CPU 0 has been in dyntick idle mode, which qualifies | +| as an extended quiescent state. | +| #. CPU 0's scheduling clock interrupt fires in the middle of an RCU | +| read-side critical section, and notices that the RCU core needs | +| something, so commences RCU softirq processing. | +| #. CPU 0's softirq handler executes and is just about ready to report | +| its quiescent state up the ``rcu_node`` tree. | +| #. But CPU 1 beats it to the punch, completing the current grace | +| period and starting a new one. | +| #. CPU 0 now reports its quiescent state for the wrong grace period. | +| That grace period might now end before the RCU read-side critical | +| section. If that happens, disaster will ensue. | +| | +| So the locking is absolutely required in order to coordinate clearing | +| of the bits with updating of the grace-period sequence number in | +| ``->gp_seq``. | ++-----------------------------------------------------------------------+ + +Blocked-Task Management +''''''''''''''''''''''' + +``PREEMPT_RCU`` allows tasks to be preempted in the midst of their RCU +read-side critical sections, and these tasks must be tracked explicitly. +The details of exactly why and how they are tracked will be covered in a +separate article on RCU read-side processing. For now, it is enough to +know that the ``rcu_node`` structure tracks them. + +:: + + 1 struct list_head blkd_tasks; + 2 struct list_head *gp_tasks; + 3 struct list_head *exp_tasks; + 4 bool wait_blkd_tasks; + +The ``->blkd_tasks`` field is a list header for the list of blocked and +preempted tasks. As tasks undergo context switches within RCU read-side +critical sections, their ``task_struct`` structures are enqueued (via +the ``task_struct``'s ``->rcu_node_entry`` field) onto the head of the +``->blkd_tasks`` list for the leaf ``rcu_node`` structure corresponding +to the CPU on which the outgoing context switch executed. As these tasks +later exit their RCU read-side critical sections, they remove themselves +from the list. This list is therefore in reverse time order, so that if +one of the tasks is blocking the current grace period, all subsequent +tasks must also be blocking that same grace period. Therefore, a single +pointer into this list suffices to track all tasks blocking a given +grace period. That pointer is stored in ``->gp_tasks`` for normal grace +periods and in ``->exp_tasks`` for expedited grace periods. These last +two fields are ``NULL`` if either there is no grace period in flight or +if there are no blocked tasks preventing that grace period from +completing. If either of these two pointers is referencing a task that +removes itself from the ``->blkd_tasks`` list, then that task must +advance the pointer to the next task on the list, or set the pointer to +``NULL`` if there are no subsequent tasks on the list. + +For example, suppose that tasks T1, T2, and T3 are all hard-affinitied +to the largest-numbered CPU in the system. Then if task T1 blocked in an +RCU read-side critical section, then an expedited grace period started, +then task T2 blocked in an RCU read-side critical section, then a normal +grace period started, and finally task 3 blocked in an RCU read-side +critical section, then the state of the last leaf ``rcu_node`` +structure's blocked-task list would be as shown below: + +.. kernel-figure:: blkd_task.svg + +Task T1 is blocking both grace periods, task T2 is blocking only the +normal grace period, and task T3 is blocking neither grace period. Note +that these tasks will not remove themselves from this list immediately +upon resuming execution. They will instead remain on the list until they +execute the outermost ``rcu_read_unlock()`` that ends their RCU +read-side critical section. + +The ``->wait_blkd_tasks`` field indicates whether or not the current +grace period is waiting on a blocked task. + +Sizing the ``rcu_node`` Array +''''''''''''''''''''''''''''' + +The ``rcu_node`` array is sized via a series of C-preprocessor +expressions as follows: + +:: + + 1 #ifdef CONFIG_RCU_FANOUT + 2 #define RCU_FANOUT CONFIG_RCU_FANOUT + 3 #else + 4 # ifdef CONFIG_64BIT + 5 # define RCU_FANOUT 64 + 6 # else + 7 # define RCU_FANOUT 32 + 8 # endif + 9 #endif + 10 + 11 #ifdef CONFIG_RCU_FANOUT_LEAF + 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF + 13 #else + 14 # ifdef CONFIG_64BIT + 15 # define RCU_FANOUT_LEAF 64 + 16 # else + 17 # define RCU_FANOUT_LEAF 32 + 18 # endif + 19 #endif + 20 + 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) + 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) + 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) + 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) + 25 + 26 #if NR_CPUS <= RCU_FANOUT_1 + 27 # define RCU_NUM_LVLS 1 + 28 # define NUM_RCU_LVL_0 1 + 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 + 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } + 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } + 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } + 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } + 34 #elif NR_CPUS <= RCU_FANOUT_2 + 35 # define RCU_NUM_LVLS 2 + 36 # define NUM_RCU_LVL_0 1 + 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) + 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } + 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } + 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } + 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } + 43 #elif NR_CPUS <= RCU_FANOUT_3 + 44 # define RCU_NUM_LVLS 3 + 45 # define NUM_RCU_LVL_0 1 + 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) + 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) + 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } + 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } + 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } + 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } + 53 #elif NR_CPUS <= RCU_FANOUT_4 + 54 # define RCU_NUM_LVLS 4 + 55 # define NUM_RCU_LVL_0 1 + 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) + 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) + 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) + 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) + 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } + 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } + 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } + 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } + 64 #else + 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" + 66 #endif + +The maximum number of levels in the ``rcu_node`` structure is currently +limited to four, as specified by lines 21-24 and the structure of the +subsequent “if” statement. For 32-bit systems, this allows +16*32*32*32=524,288 CPUs, which should be sufficient for the next few +years at least. For 64-bit systems, 16*64*64*64=4,194,304 CPUs is +allowed, which should see us through the next decade or so. This +four-level tree also allows kernels built with ``CONFIG_RCU_FANOUT=8`` +to support up to 4096 CPUs, which might be useful in very large systems +having eight CPUs per socket (but please note that no one has yet shown +any measurable performance degradation due to misaligned socket and +``rcu_node`` boundaries). In addition, building kernels with a full four +levels of ``rcu_node`` tree permits better testing of RCU's +combining-tree code. + +The ``RCU_FANOUT`` symbol controls how many children are permitted at +each non-leaf level of the ``rcu_node`` tree. If the +``CONFIG_RCU_FANOUT`` Kconfig option is not specified, it is set based +on the word size of the system, which is also the Kconfig default. + +The ``RCU_FANOUT_LEAF`` symbol controls how many CPUs are handled by +each leaf ``rcu_node`` structure. Experience has shown that allowing a +given leaf ``rcu_node`` structure to handle 64 CPUs, as permitted by the +number of bits in the ``->qsmask`` field on a 64-bit system, results in +excessive contention for the leaf ``rcu_node`` structures' ``->lock`` +fields. The number of CPUs per leaf ``rcu_node`` structure is therefore +limited to 16 given the default value of ``CONFIG_RCU_FANOUT_LEAF``. If +``CONFIG_RCU_FANOUT_LEAF`` is unspecified, the value selected is based +on the word size of the system, just as for ``CONFIG_RCU_FANOUT``. +Lines 11-19 perform this computation. + +Lines 21-24 compute the maximum number of CPUs supported by a +single-level (which contains a single ``rcu_node`` structure), +two-level, three-level, and four-level ``rcu_node`` tree, respectively, +given the fanout specified by ``RCU_FANOUT`` and ``RCU_FANOUT_LEAF``. +These numbers of CPUs are retained in the ``RCU_FANOUT_1``, +``RCU_FANOUT_2``, ``RCU_FANOUT_3``, and ``RCU_FANOUT_4`` C-preprocessor +variables, respectively. + +These variables are used to control the C-preprocessor ``#if`` statement +spanning lines 26-66 that computes the number of ``rcu_node`` structures +required for each level of the tree, as well as the number of levels +required. The number of levels is placed in the ``NUM_RCU_LVLS`` +C-preprocessor variable by lines 27, 35, 44, and 54. The number of +``rcu_node`` structures for the topmost level of the tree is always +exactly one, and this value is unconditionally placed into +``NUM_RCU_LVL_0`` by lines 28, 36, 45, and 55. The rest of the levels +(if any) of the ``rcu_node`` tree are computed by dividing the maximum +number of CPUs by the fanout supported by the number of levels from the +current level down, rounding up. This computation is performed by +lines 37, 46-47, and 56-58. Lines 31-33, 40-42, 50-52, and 62-63 create +initializers for lockdep lock-class names. Finally, lines 64-66 produce +an error if the maximum number of CPUs is too large for the specified +fanout. + +The ``rcu_segcblist`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_segcblist`` structure maintains a segmented list of callbacks +as follows: + +:: + + 1 #define RCU_DONE_TAIL 0 + 2 #define RCU_WAIT_TAIL 1 + 3 #define RCU_NEXT_READY_TAIL 2 + 4 #define RCU_NEXT_TAIL 3 + 5 #define RCU_CBLIST_NSEGS 4 + 6 + 7 struct rcu_segcblist { + 8 struct rcu_head *head; + 9 struct rcu_head **tails[RCU_CBLIST_NSEGS]; + 10 unsigned long gp_seq[RCU_CBLIST_NSEGS]; + 11 long len; + 12 long len_lazy; + 13 }; + +The segments are as follows: + +#. ``RCU_DONE_TAIL``: Callbacks whose grace periods have elapsed. These + callbacks are ready to be invoked. +#. ``RCU_WAIT_TAIL``: Callbacks that are waiting for the current grace + period. Note that different CPUs can have different ideas about which + grace period is current, hence the ``->gp_seq`` field. +#. ``RCU_NEXT_READY_TAIL``: Callbacks waiting for the next grace period + to start. +#. ``RCU_NEXT_TAIL``: Callbacks that have not yet been associated with a + grace period. + +The ``->head`` pointer references the first callback or is ``NULL`` if +the list contains no callbacks (which is *not* the same as being empty). +Each element of the ``->tails[]`` array references the ``->next`` +pointer of the last callback in the corresponding segment of the list, +or the list's ``->head`` pointer if that segment and all previous +segments are empty. If the corresponding segment is empty but some +previous segment is not empty, then the array element is identical to +its predecessor. Older callbacks are closer to the head of the list, and +new callbacks are added at the tail. This relationship between the +``->head`` pointer, the ``->tails[]`` array, and the callbacks is shown +in this diagram: + +.. kernel-figure:: nxtlist.svg + +In this figure, the ``->head`` pointer references the first RCU callback +in the list. The ``->tails[RCU_DONE_TAIL]`` array element references the +``->head`` pointer itself, indicating that none of the callbacks is +ready to invoke. The ``->tails[RCU_WAIT_TAIL]`` array element references +callback CB 2's ``->next`` pointer, which indicates that CB 1 and CB 2 +are both waiting on the current grace period, give or take possible +disagreements about exactly which grace period is the current one. The +``->tails[RCU_NEXT_READY_TAIL]`` array element references the same RCU +callback that ``->tails[RCU_WAIT_TAIL]`` does, which indicates that +there are no callbacks waiting on the next RCU grace period. The +``->tails[RCU_NEXT_TAIL]`` array element references CB 4's ``->next`` +pointer, indicating that all the remaining RCU callbacks have not yet +been assigned to an RCU grace period. Note that the +``->tails[RCU_NEXT_TAIL]`` array element always references the last RCU +callback's ``->next`` pointer unless the callback list is empty, in +which case it references the ``->head`` pointer. + +There is one additional important special case for the +``->tails[RCU_NEXT_TAIL]`` array element: It can be ``NULL`` when this +list is *disabled*. Lists are disabled when the corresponding CPU is +offline or when the corresponding CPU's callbacks are offloaded to a +kthread, both of which are described elsewhere. + +CPUs advance their callbacks from the ``RCU_NEXT_TAIL`` to the +``RCU_NEXT_READY_TAIL`` to the ``RCU_WAIT_TAIL`` to the +``RCU_DONE_TAIL`` list segments as grace periods advance. + +The ``->gp_seq[]`` array records grace-period numbers corresponding to +the list segments. This is what allows different CPUs to have different +ideas as to which is the current grace period while still avoiding +premature invocation of their callbacks. In particular, this allows CPUs +that go idle for extended periods to determine which of their callbacks +are ready to be invoked after reawakening. + +The ``->len`` counter contains the number of callbacks in ``->head``, +and the ``->len_lazy`` contains the number of those callbacks that are +known to only free memory, and whose invocation can therefore be safely +deferred. + +.. important:: + + It is the ``->len`` field that determines whether or + not there are callbacks associated with this ``rcu_segcblist`` + structure, *not* the ``->head`` pointer. The reason for this is that all + the ready-to-invoke callbacks (that is, those in the ``RCU_DONE_TAIL`` + segment) are extracted all at once at callback-invocation time + (``rcu_do_batch``), due to which ``->head`` may be set to NULL if there + are no not-done callbacks remaining in the ``rcu_segcblist``. If + callback invocation must be postponed, for example, because a + high-priority process just woke up on this CPU, then the remaining + callbacks are placed back on the ``RCU_DONE_TAIL`` segment and + ``->head`` once again points to the start of the segment. In short, the + head field can briefly be ``NULL`` even though the CPU has callbacks + present the entire time. Therefore, it is not appropriate to test the + ``->head`` pointer for ``NULL``. + +In contrast, the ``->len`` and ``->len_lazy`` counts are adjusted only +after the corresponding callbacks have been invoked. This means that the +``->len`` count is zero only if the ``rcu_segcblist`` structure really +is devoid of callbacks. Of course, off-CPU sampling of the ``->len`` +count requires careful use of appropriate synchronization, for example, +memory barriers. This synchronization can be a bit subtle, particularly +in the case of ``rcu_barrier()``. + +The ``rcu_data`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``rcu_data`` maintains the per-CPU state for the RCU subsystem. The +fields in this structure may be accessed only from the corresponding CPU +(and from tracing) unless otherwise stated. This structure is the focus +of quiescent-state detection and RCU callback queuing. It also tracks +its relationship to the corresponding leaf ``rcu_node`` structure to +allow more-efficient propagation of quiescent states up the ``rcu_node`` +combining tree. Like the ``rcu_node`` structure, it provides a local +copy of the grace-period information to allow for-free synchronized +access to this information from the corresponding CPU. Finally, this +structure records past dyntick-idle state for the corresponding CPU and +also tracks statistics. + +The ``rcu_data`` structure's fields are discussed, singly and in groups, +in the following sections. + +Connection to Other Data Structures +''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 int cpu; + 2 struct rcu_node *mynode; + 3 unsigned long grpmask; + 4 bool beenonline; + +The ``->cpu`` field contains the number of the corresponding CPU and the +``->mynode`` field references the corresponding ``rcu_node`` structure. +The ``->mynode`` is used to propagate quiescent states up the combining +tree. These two fields are constant and therefore do not require +synchronization. + +The ``->grpmask`` field indicates the bit in the ``->mynode->qsmask`` +corresponding to this ``rcu_data`` structure, and is also used when +propagating quiescent states. The ``->beenonline`` flag is set whenever +the corresponding CPU comes online, which means that the debugfs tracing +need not dump out any ``rcu_data`` structure for which this flag is not +set. + +Quiescent-State and Grace-Period Tracking +''''''''''''''''''''''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 unsigned long gp_seq; + 2 unsigned long gp_seq_needed; + 3 bool cpu_no_qs; + 4 bool core_needs_qs; + 5 bool gpwrap; + +The ``->gp_seq`` field is the counterpart of the field of the same name +in the ``rcu_state`` and ``rcu_node`` structures. The +``->gp_seq_needed`` field is the counterpart of the field of the same +name in the rcu_node structure. They may each lag up to one behind their +``rcu_node`` counterparts, but in ``CONFIG_NO_HZ_IDLE`` and +``CONFIG_NO_HZ_FULL`` kernels can lag arbitrarily far behind for CPUs in +dyntick-idle mode (but these counters will catch up upon exit from +dyntick-idle mode). If the lower two bits of a given ``rcu_data`` +structure's ``->gp_seq`` are zero, then this ``rcu_data`` structure +believes that RCU is idle. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| All this replication of the grace period numbers can only cause | +| massive confusion. Why not just keep a global sequence number and be | +| done with it??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because if there was only a single global sequence numbers, there | +| would need to be a single global lock to allow safely accessing and | +| updating it. And if we are not going to have a single global lock, we | +| need to carefully manage the numbers on a per-node basis. Recall from | +| the answer to a previous Quick Quiz that the consequences of applying | +| a previously sampled quiescent state to the wrong grace period are | +| quite severe. | ++-----------------------------------------------------------------------+ + +The ``->cpu_no_qs`` flag indicates that the CPU has not yet passed +through a quiescent state, while the ``->core_needs_qs`` flag indicates +that the RCU core needs a quiescent state from the corresponding CPU. +The ``->gpwrap`` field indicates that the corresponding CPU has remained +idle for so long that the ``gp_seq`` counter is in danger of overflow, +which will cause the CPU to disregard the values of its counters on its +next exit from idle. + +RCU Callback Handling +''''''''''''''''''''' + +In the absence of CPU-hotplug events, RCU callbacks are invoked by the +same CPU that registered them. This is strictly a cache-locality +optimization: callbacks can and do get invoked on CPUs other than the +one that registered them. After all, if the CPU that registered a given +callback has gone offline before the callback can be invoked, there +really is no other choice. + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 struct rcu_segcblist cblist; + 2 long qlen_last_fqs_check; + 3 unsigned long n_cbs_invoked; + 4 unsigned long n_nocbs_invoked; + 5 unsigned long n_cbs_orphaned; + 6 unsigned long n_cbs_adopted; + 7 unsigned long n_force_qs_snap; + 8 long blimit; + +The ``->cblist`` structure is the segmented callback list described +earlier. The CPU advances the callbacks in its ``rcu_data`` structure +whenever it notices that another RCU grace period has completed. The CPU +detects the completion of an RCU grace period by noticing that the value +of its ``rcu_data`` structure's ``->gp_seq`` field differs from that of +its leaf ``rcu_node`` structure. Recall that each ``rcu_node`` +structure's ``->gp_seq`` field is updated at the beginnings and ends of +each grace period. + +The ``->qlen_last_fqs_check`` and ``->n_force_qs_snap`` coordinate the +forcing of quiescent states from ``call_rcu()`` and friends when +callback lists grow excessively long. + +The ``->n_cbs_invoked``, ``->n_cbs_orphaned``, and ``->n_cbs_adopted`` +fields count the number of callbacks invoked, sent to other CPUs when +this CPU goes offline, and received from other CPUs when those other +CPUs go offline. The ``->n_nocbs_invoked`` is used when the CPU's +callbacks are offloaded to a kthread. + +Finally, the ``->blimit`` counter is the maximum number of RCU callbacks +that may be invoked at a given time. + +Dyntick-Idle Handling +''''''''''''''''''''' + +This portion of the ``rcu_data`` structure is declared as follows: + +:: + + 1 int dynticks_snap; + 2 unsigned long dynticks_fqs; + +The ``->dynticks_snap`` field is used to take a snapshot of the +corresponding CPU's dyntick-idle state when forcing quiescent states, +and is therefore accessed from other CPUs. Finally, the +``->dynticks_fqs`` field is used to count the number of times this CPU +is determined to be in dyntick-idle state, and is used for tracing and +debugging purposes. + +This portion of the rcu_data structure is declared as follows: + +:: + + 1 long dynticks_nesting; + 2 long dynticks_nmi_nesting; + 3 atomic_t dynticks; + 4 bool rcu_need_heavy_qs; + 5 bool rcu_urgent_qs; + +These fields in the rcu_data structure maintain the per-CPU dyntick-idle +state for the corresponding CPU. The fields may be accessed only from +the corresponding CPU (and from tracing) unless otherwise stated. + +The ``->dynticks_nesting`` field counts the nesting depth of process +execution, so that in normal circumstances this counter has value zero +or one. NMIs, irqs, and tracers are counted by the +``->dynticks_nmi_nesting`` field. Because NMIs cannot be masked, changes +to this variable have to be undertaken carefully using an algorithm +provided by Andy Lutomirski. The initial transition from idle adds one, +and nested transitions add two, so that a nesting level of five is +represented by a ``->dynticks_nmi_nesting`` value of nine. This counter +can therefore be thought of as counting the number of reasons why this +CPU cannot be permitted to enter dyntick-idle mode, aside from +process-level transitions. + +However, it turns out that when running in non-idle kernel context, the +Linux kernel is fully capable of entering interrupt handlers that never +exit and perhaps also vice versa. Therefore, whenever the +``->dynticks_nesting`` field is incremented up from zero, the +``->dynticks_nmi_nesting`` field is set to a large positive number, and +whenever the ``->dynticks_nesting`` field is decremented down to zero, +the the ``->dynticks_nmi_nesting`` field is set to zero. Assuming that +the number of misnested interrupts is not sufficient to overflow the +counter, this approach corrects the ``->dynticks_nmi_nesting`` field +every time the corresponding CPU enters the idle loop from process +context. + +The ``->dynticks`` field counts the corresponding CPU's transitions to +and from either dyntick-idle or user mode, so that this counter has an +even value when the CPU is in dyntick-idle mode or user mode and an odd +value otherwise. The transitions to/from user mode need to be counted +for user mode adaptive-ticks support (see timers/NO_HZ.txt). + +The ``->rcu_need_heavy_qs`` field is used to record the fact that the +RCU core code would really like to see a quiescent state from the +corresponding CPU, so much so that it is willing to call for +heavy-weight dyntick-counter operations. This flag is checked by RCU's +context-switch and ``cond_resched()`` code, which provide a momentary +idle sojourn in response. + +Finally, the ``->rcu_urgent_qs`` field is used to record the fact that +the RCU core code would really like to see a quiescent state from the +corresponding CPU, with the various other fields indicating just how +badly RCU wants this quiescent state. This flag is checked by RCU's +context-switch path (``rcu_note_context_switch``) and the cond_resched +code. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why not simply combine the ``->dynticks_nesting`` and | +| ``->dynticks_nmi_nesting`` counters into a single counter that just | +| counts the number of reasons that the corresponding CPU is non-idle? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because this would fail in the presence of interrupts whose handlers | +| never return and of handlers that manage to return from a made-up | +| interrupt. | ++-----------------------------------------------------------------------+ + +Additional fields are present for some special-purpose builds, and are +discussed separately. + +The ``rcu_head`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Each ``rcu_head`` structure represents an RCU callback. These structures +are normally embedded within RCU-protected data structures whose +algorithms use asynchronous grace periods. In contrast, when using +algorithms that block waiting for RCU grace periods, RCU users need not +provide ``rcu_head`` structures. + +The ``rcu_head`` structure has fields as follows: + +:: + + 1 struct rcu_head *next; + 2 void (*func)(struct rcu_head *head); + +The ``->next`` field is used to link the ``rcu_head`` structures +together in the lists within the ``rcu_data`` structures. The ``->func`` +field is a pointer to the function to be called when the callback is +ready to be invoked, and this function is passed a pointer to the +``rcu_head`` structure. However, ``kfree_rcu()`` uses the ``->func`` +field to record the offset of the ``rcu_head`` structure within the +enclosing RCU-protected data structure. + +Both of these fields are used internally by RCU. From the viewpoint of +RCU users, this structure is an opaque “cookie”. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that the callback function ``->func`` is passed a pointer to | +| the ``rcu_head`` structure, how is that function supposed to find the | +| beginning of the enclosing RCU-protected data structure? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In actual practice, there is a separate callback function per type of | +| RCU-protected data structure. The callback function can therefore use | +| the ``container_of()`` macro in the Linux kernel (or other | +| pointer-manipulation facilities in other software environments) to | +| find the beginning of the enclosing structure. | ++-----------------------------------------------------------------------+ + +RCU-Specific Fields in the ``task_struct`` Structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The ``CONFIG_PREEMPT_RCU`` implementation uses some additional fields in +the ``task_struct`` structure: + +:: + + 1 #ifdef CONFIG_PREEMPT_RCU + 2 int rcu_read_lock_nesting; + 3 union rcu_special rcu_read_unlock_special; + 4 struct list_head rcu_node_entry; + 5 struct rcu_node *rcu_blocked_node; + 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ + 7 #ifdef CONFIG_TASKS_RCU + 8 unsigned long rcu_tasks_nvcsw; + 9 bool rcu_tasks_holdout; + 10 struct list_head rcu_tasks_holdout_list; + 11 int rcu_tasks_idle_cpu; + 12 #endif /* #ifdef CONFIG_TASKS_RCU */ + +The ``->rcu_read_lock_nesting`` field records the nesting level for RCU +read-side critical sections, and the ``->rcu_read_unlock_special`` field +is a bitmask that records special conditions that require +``rcu_read_unlock()`` to do additional work. The ``->rcu_node_entry`` +field is used to form lists of tasks that have blocked within +preemptible-RCU read-side critical sections and the +``->rcu_blocked_node`` field references the ``rcu_node`` structure whose +list this task is a member of, or ``NULL`` if it is not blocked within a +preemptible-RCU read-side critical section. + +The ``->rcu_tasks_nvcsw`` field tracks the number of voluntary context +switches that this task had undergone at the beginning of the current +tasks-RCU grace period, ``->rcu_tasks_holdout`` is set if the current +tasks-RCU grace period is waiting on this task, +``->rcu_tasks_holdout_list`` is a list element enqueuing this task on +the holdout list, and ``->rcu_tasks_idle_cpu`` tracks which CPU this +idle task is running, but only if the task is currently running, that +is, if the CPU is currently idle. + +Accessor Functions +~~~~~~~~~~~~~~~~~~ + +The following listing shows the ``rcu_get_root()``, +``rcu_for_each_node_breadth_first`` and ``rcu_for_each_leaf_node()`` +function and macros: + +:: + + 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) + 2 { + 3 return &rsp->node[0]; + 4 } + 5 + 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ + 7 for ((rnp) = &(rsp)->node[0]; \ + 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) + 9 + 10 #define rcu_for_each_leaf_node(rsp, rnp) \ + 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ + 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) + +The ``rcu_get_root()`` simply returns a pointer to the first element of +the specified ``rcu_state`` structure's ``->node[]`` array, which is the +root ``rcu_node`` structure. + +As noted earlier, the ``rcu_for_each_node_breadth_first()`` macro takes +advantage of the layout of the ``rcu_node`` structures in the +``rcu_state`` structure's ``->node[]`` array, performing a breadth-first +traversal by simply traversing the array in order. Similarly, the +``rcu_for_each_leaf_node()`` macro traverses only the last part of the +array, thus traversing only the leaf ``rcu_node`` structures. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What does ``rcu_for_each_leaf_node()`` do if the ``rcu_node`` tree | +| contains only a single node? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In the single-node case, ``rcu_for_each_leaf_node()`` traverses the | +| single node. | ++-----------------------------------------------------------------------+ + +Summary +~~~~~~~ + +So the state of RCU is represented by an ``rcu_state`` structure, which +contains a combining tree of ``rcu_node`` and ``rcu_data`` structures. +Finally, in ``CONFIG_NO_HZ_IDLE`` kernels, each CPU's dyntick-idle state +is tracked by dynticks-related fields in the ``rcu_data`` structure. If +you made it this far, you are well prepared to read the code +walkthroughs in the other articles in this series. + +Acknowledgments +~~~~~~~~~~~~~~~ + +I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul +Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn for +helping me get this document into a more human-readable state. + +Legal Statement +~~~~~~~~~~~~~~~ + +This work represents the view of the author and does not necessarily +represent the view of IBM. + +Linux is a registered trademark of Linus Torvalds. + +Other company, product, and service names may be trademarks or service +marks of others. diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html deleted file mode 100644 index 57300db4b5ff607c30563bed1e8104c55f8e4b8e..0000000000000000000000000000000000000000 --- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html +++ /dev/null @@ -1,668 +0,0 @@ - - - A Tour Through TREE_RCU's Expedited Grace Periods - - -

Introduction

- -This document describes RCU's expedited grace periods. -Unlike RCU's normal grace periods, which accept long latencies to attain -high efficiency and minimal disturbance, expedited grace periods accept -lower efficiency and significant disturbance to attain shorter latencies. - -

-There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier -third RCU-bh flavor having been implemented in terms of the other two. -Each of the two implementations is covered in its own section. - -

    -
  1. - Expedited Grace Period Design -
  2. - RCU-preempt Expedited Grace Periods -
  3. - RCU-sched Expedited Grace Periods -
  4. - Expedited Grace Period and CPU Hotplug -
  5. - Expedited Grace Period Refinements -
- -

-Expedited Grace Period Design

- -

-The expedited RCU grace periods cannot be accused of being subtle, -given that they for all intents and purposes hammer every CPU that -has not yet provided a quiescent state for the current expedited -grace period. -The one saving grace is that the hammer has grown a bit smaller -over time: The old call to try_stop_cpus() has been -replaced with a set of calls to smp_call_function_single(), -each of which results in an IPI to the target CPU. -The corresponding handler function checks the CPU's state, motivating -a faster quiescent state where possible, and triggering a report -of that quiescent state. -As always for RCU, once everything has spent some time in a quiescent -state, the expedited grace period has completed. - -

-The details of the smp_call_function_single() handler's -operation depend on the RCU flavor, as described in the following -sections. - -

-RCU-preempt Expedited Grace Periods

- -

-CONFIG_PREEMPT=y kernels implement RCU-preempt. -The overall flow of the handling of a given CPU by an RCU-preempt -expedited grace period is shown in the following diagram: - -

ExpRCUFlow.svg - -

-The solid arrows denote direct action, for example, a function call. -The dotted arrows denote indirect action, for example, an IPI -or a state that is reached after some time. - -

-If a given CPU is offline or idle, synchronize_rcu_expedited() -will ignore it because idle and offline CPUs are already residing -in quiescent states. -Otherwise, the expedited grace period will use -smp_call_function_single() to send the CPU an IPI, which -is handled by rcu_exp_handler(). - -

-However, because this is preemptible RCU, rcu_exp_handler() -can check to see if the CPU is currently running in an RCU read-side -critical section. -If not, the handler can immediately report a quiescent state. -Otherwise, it sets flags so that the outermost rcu_read_unlock() -invocation will provide the needed quiescent-state report. -This flag-setting avoids the previous forced preemption of all -CPUs that might have RCU read-side critical sections. -In addition, this flag-setting is done so as to avoid increasing -the overhead of the common-case fastpath through the scheduler. - -

-Again because this is preemptible RCU, an RCU read-side critical section -can be preempted. -When that happens, RCU will enqueue the task, which will the continue to -block the current expedited grace period until it resumes and finds its -outermost rcu_read_unlock(). -The CPU will report a quiescent state just after enqueuing the task because -the CPU is no longer blocking the grace period. -It is instead the preempted task doing the blocking. -The list of blocked tasks is managed by rcu_preempt_ctxt_queue(), -which is called from rcu_preempt_note_context_switch(), which -in turn is called from rcu_note_context_switch(), which in -turn is called from the scheduler. - - - - - - - - -
 
Quick Quiz:
- Why not just have the expedited grace period check the - state of all the CPUs? - After all, that would avoid all those real-time-unfriendly IPIs. -
Answer:
- Because we want the RCU read-side critical sections to run fast, - which means no memory barriers. - Therefore, it is not possible to safely check the state from some - other CPU. - And even if it was possible to safely check the state, it would - still be necessary to IPI the CPU to safely interact with the - upcoming rcu_read_unlock() invocation, which means that - the remote state testing would not help the worst-case - latency that real-time applications care about. - -

One way to prevent your real-time - application from getting hit with these IPIs is to - build your kernel with CONFIG_NO_HZ_FULL=y. - RCU would then perceive the CPU running your application - as being idle, and it would be able to safely detect that - state without needing to IPI the CPU. -

 
- -

-Please note that this is just the overall flow: -Additional complications can arise due to races with CPUs going idle -or offline, among other things. - -

-RCU-sched Expedited Grace Periods

- -

-CONFIG_PREEMPT=n kernels implement RCU-sched. -The overall flow of the handling of a given CPU by an RCU-sched -expedited grace period is shown in the following diagram: - -

ExpSchedFlow.svg - -

-As with RCU-preempt, RCU-sched's -synchronize_rcu_expedited() ignores offline and -idle CPUs, again because they are in remotely detectable -quiescent states. -However, because the -rcu_read_lock_sched() and rcu_read_unlock_sched() -leave no trace of their invocation, in general it is not possible to tell -whether or not the current CPU is in an RCU read-side critical section. -The best that RCU-sched's rcu_exp_handler() can do is to check -for idle, on the off-chance that the CPU went idle while the IPI -was in flight. -If the CPU is idle, then rcu_exp_handler() reports -the quiescent state. - -

Otherwise, the handler forces a future context switch by setting the -NEED_RESCHED flag of the current task's thread flag and the CPU preempt -counter. -At the time of the context switch, the CPU reports the quiescent state. -Should the CPU go offline first, it will report the quiescent state -at that time. - -

-Expedited Grace Period and CPU Hotplug

- -

-The expedited nature of expedited grace periods require a much tighter -interaction with CPU hotplug operations than is required for normal -grace periods. -In addition, attempting to IPI offline CPUs will result in splats, but -failing to IPI online CPUs can result in too-short grace periods. -Neither option is acceptable in production kernels. - -

-The interaction between expedited grace periods and CPU hotplug operations -is carried out at several levels: - -

    -
  1. The number of CPUs that have ever been online is tracked - by the rcu_state structure's ->ncpus - field. - The rcu_state structure's ->ncpus_snap - field tracks the number of CPUs that have ever been online - at the beginning of an RCU expedited grace period. - Note that this number never decreases, at least in the absence - of a time machine. -
  2. The identities of the CPUs that have ever been online is - tracked by the rcu_node structure's - ->expmaskinitnext field. - The rcu_node structure's ->expmaskinit - field tracks the identities of the CPUs that were online - at least once at the beginning of the most recent RCU - expedited grace period. - The rcu_state structure's ->ncpus and - ->ncpus_snap fields are used to detect when - new CPUs have come online for the first time, that is, - when the rcu_node structure's ->expmaskinitnext - field has changed since the beginning of the last RCU - expedited grace period, which triggers an update of each - rcu_node structure's ->expmaskinit - field from its ->expmaskinitnext field. -
  3. Each rcu_node structure's ->expmaskinit - field is used to initialize that structure's - ->expmask at the beginning of each RCU - expedited grace period. - This means that only those CPUs that have been online at least - once will be considered for a given grace period. -
  4. Any CPU that goes offline will clear its bit in its leaf - rcu_node structure's ->qsmaskinitnext - field, so any CPU with that bit clear can safely be ignored. - However, it is possible for a CPU coming online or going offline - to have this bit set for some time while cpu_online - returns false. -
  5. For each non-idle CPU that RCU believes is currently online, the grace - period invokes smp_call_function_single(). - If this succeeds, the CPU was fully online. - Failure indicates that the CPU is in the process of coming online - or going offline, in which case it is necessary to wait for a - short time period and try again. - The purpose of this wait (or series of waits, as the case may be) - is to permit a concurrent CPU-hotplug operation to complete. -
  6. In the case of RCU-sched, one of the last acts of an outgoing CPU - is to invoke rcu_report_dead(), which - reports a quiescent state for that CPU. - However, this is likely paranoia-induced redundancy. -
- - - - - - - - -
 
Quick Quiz:
- Why all the dancing around with multiple counters and masks - tracking CPUs that were once online? - Why not just have a single set of masks tracking the currently - online CPUs and be done with it? -
Answer:
- Maintaining single set of masks tracking the online CPUs sounds - easier, at least until you try working out all the race conditions - between grace-period initialization and CPU-hotplug operations. - For example, suppose initialization is progressing down the - tree while a CPU-offline operation is progressing up the tree. - This situation can result in bits set at the top of the tree - that have no counterparts at the bottom of the tree. - Those bits will never be cleared, which will result in - grace-period hangs. - In short, that way lies madness, to say nothing of a great many - bugs, hangs, and deadlocks. - -

- In contrast, the current multi-mask multi-counter scheme ensures - that grace-period initialization will always see consistent masks - up and down the tree, which brings significant simplifications - over the single-mask method. - -

- This is an instance of - - deferring work in order to avoid synchronization. - Lazily recording CPU-hotplug events at the beginning of the next - grace period greatly simplifies maintenance of the CPU-tracking - bitmasks in the rcu_node tree. -

 
- -

-Expedited Grace Period Refinements

- -
    -
  1. Idle-CPU checks. -
  2. - Batching via sequence counter. -
  3. - Funnel locking and wait/wakeup. -
  4. Use of Workqueues. -
  5. Stall warnings. -
  6. Mid-boot operation. -
- -

Idle-CPU Checks

- -

-Each expedited grace period checks for idle CPUs when initially forming -the mask of CPUs to be IPIed and again just before IPIing a CPU -(both checks are carried out by sync_rcu_exp_select_cpus()). -If the CPU is idle at any time between those two times, the CPU will -not be IPIed. -Instead, the task pushing the grace period forward will include the -idle CPUs in the mask passed to rcu_report_exp_cpu_mult(). - -

-For RCU-sched, there is an additional check: -If the IPI has interrupted the idle loop, then -rcu_exp_handler() invokes rcu_report_exp_rdp() -to report the corresponding quiescent state. - -

-For RCU-preempt, there is no specific check for idle in the -IPI handler (rcu_exp_handler()), but because -RCU read-side critical sections are not permitted within the -idle loop, if rcu_exp_handler() sees that the CPU is within -RCU read-side critical section, the CPU cannot possibly be idle. -Otherwise, rcu_exp_handler() invokes -rcu_report_exp_rdp() to report the corresponding quiescent -state, regardless of whether or not that quiescent state was due to -the CPU being idle. - -

-In summary, RCU expedited grace periods check for idle when building -the bitmask of CPUs that must be IPIed, just before sending each IPI, -and (either explicitly or implicitly) within the IPI handler. - -

-Batching via Sequence Counter

- -

-If each grace-period request was carried out separately, expedited -grace periods would have abysmal scalability and -problematic high-load characteristics. -Because each grace-period operation can serve an unlimited number of -updates, it is important to batch requests, so that a single -expedited grace-period operation will cover all requests in the -corresponding batch. - -

-This batching is controlled by a sequence counter named -->expedited_sequence in the rcu_state structure. -This counter has an odd value when there is an expedited grace period -in progress and an even value otherwise, so that dividing the counter -value by two gives the number of completed grace periods. -During any given update request, the counter must transition from -even to odd and then back to even, thus indicating that a grace -period has elapsed. -Therefore, if the initial value of the counter is s, -the updater must wait until the counter reaches at least the -value (s+3)&~0x1. -This counter is managed by the following access functions: - -

    -
  1. rcu_exp_gp_seq_start(), which marks the start of - an expedited grace period. -
  2. rcu_exp_gp_seq_end(), which marks the end of an - expedited grace period. -
  3. rcu_exp_gp_seq_snap(), which obtains a snapshot of - the counter. -
  4. rcu_exp_gp_seq_done(), which returns true - if a full expedited grace period has elapsed since the - corresponding call to rcu_exp_gp_seq_snap(). -
- -

-Again, only one request in a given batch need actually carry out -a grace-period operation, which means there must be an efficient -way to identify which of many concurrent reqeusts will initiate -the grace period, and that there be an efficient way for the -remaining requests to wait for that grace period to complete. -However, that is the topic of the next section. - -

-Funnel Locking and Wait/Wakeup

- -

-The natural way to sort out which of a batch of updaters will initiate -the expedited grace period is to use the rcu_node combining -tree, as implemented by the exp_funnel_lock() function. -The first updater corresponding to a given grace period arriving -at a given rcu_node structure records its desired grace-period -sequence number in the ->exp_seq_rq field and moves up -to the next level in the tree. -Otherwise, if the ->exp_seq_rq field already contains -the sequence number for the desired grace period or some later one, -the updater blocks on one of four wait queues in the -->exp_wq[] array, using the second-from-bottom -and third-from bottom bits as an index. -An ->exp_lock field in the rcu_node structure -synchronizes access to these fields. - -

-An empty rcu_node tree is shown in the following diagram, -with the white cells representing the ->exp_seq_rq field -and the red cells representing the elements of the -->exp_wq[] array. - -

Funnel0.svg - -

-The next diagram shows the situation after the arrival of Task A -and Task B at the leftmost and rightmost leaf rcu_node -structures, respectively. -The current value of the rcu_state structure's -->expedited_sequence field is zero, so adding three and -clearing the bottom bit results in the value two, which both tasks -record in the ->exp_seq_rq field of their respective -rcu_node structures: - -

Funnel1.svg - -

-Each of Tasks A and B will move up to the root -rcu_node structure. -Suppose that Task A wins, recording its desired grace-period sequence -number and resulting in the state shown below: - -

Funnel2.svg - -

-Task A now advances to initiate a new grace period, while Task B -moves up to the root rcu_node structure, and, seeing that -its desired sequence number is already recorded, blocks on -->exp_wq[1]. - - - - - - - - -
 
Quick Quiz:
- Why ->exp_wq[1]? - Given that the value of these tasks' desired sequence number is - two, so shouldn't they instead block on ->exp_wq[2]? -
Answer:
- No. - -

- Recall that the bottom bit of the desired sequence number indicates - whether or not a grace period is currently in progress. - It is therefore necessary to shift the sequence number right one - bit position to obtain the number of the grace period. - This results in ->exp_wq[1]. -

 
- -

-If Tasks C and D also arrive at this point, they will compute the -same desired grace-period sequence number, and see that both leaf -rcu_node structures already have that value recorded. -They will therefore block on their respective rcu_node -structures' ->exp_wq[1] fields, as shown below: - -

Funnel3.svg - -

-Task A now acquires the rcu_state structure's -->exp_mutex and initiates the grace period, which -increments ->expedited_sequence. -Therefore, if Tasks E and F arrive, they will compute -a desired sequence number of 4 and will record this value as -shown below: - -

Funnel4.svg - -

-Tasks E and F will propagate up the rcu_node -combining tree, with Task F blocking on the root rcu_node -structure and Task E wait for Task A to finish so that -it can start the next grace period. -The resulting state is as shown below: - -

Funnel5.svg - -

-Once the grace period completes, Task A -starts waking up the tasks waiting for this grace period to complete, -increments the ->expedited_sequence, -acquires the ->exp_wake_mutex and then releases the -->exp_mutex. -This results in the following state: - -

Funnel6.svg - -

-Task E can then acquire ->exp_mutex and increment -->expedited_sequence to the value three. -If new tasks G and H arrive and moves up the combining tree at the -same time, the state will be as follows: - -

Funnel7.svg - -

-Note that three of the root rcu_node structure's -waitqueues are now occupied. -However, at some point, Task A will wake up the -tasks blocked on the ->exp_wq waitqueues, resulting -in the following state: - -

Funnel8.svg - -

-Execution will continue with Tasks E and H completing -their grace periods and carrying out their wakeups. - - - - - - - - -
 
Quick Quiz:
- What happens if Task A takes so long to do its wakeups - that Task E's grace period completes? -
Answer:
- Then Task E will block on the ->exp_wake_mutex, - which will also prevent it from releasing ->exp_mutex, - which in turn will prevent the next grace period from starting. - This last is important in preventing overflow of the - ->exp_wq[] array. -
 
- -

Use of Workqueues

- -

-In earlier implementations, the task requesting the expedited -grace period also drove it to completion. -This straightforward approach had the disadvantage of needing to -account for POSIX signals sent to user tasks, -so more recent implemementations use the Linux kernel's -workqueues. - -

-The requesting task still does counter snapshotting and funnel-lock -processing, but the task reaching the top of the funnel lock -does a schedule_work() (from _synchronize_rcu_expedited() -so that a workqueue kthread does the actual grace-period processing. -Because workqueue kthreads do not accept POSIX signals, grace-period-wait -processing need not allow for POSIX signals. - -In addition, this approach allows wakeups for the previous expedited -grace period to be overlapped with processing for the next expedited -grace period. -Because there are only four sets of waitqueues, it is necessary to -ensure that the previous grace period's wakeups complete before the -next grace period's wakeups start. -This is handled by having the ->exp_mutex -guard expedited grace-period processing and the -->exp_wake_mutex guard wakeups. -The key point is that the ->exp_mutex is not released -until the first wakeup is complete, which means that the -->exp_wake_mutex has already been acquired at that point. -This approach ensures that the previous grace period's wakeups can -be carried out while the current grace period is in process, but -that these wakeups will complete before the next grace period starts. -This means that only three waitqueues are required, guaranteeing that -the four that are provided are sufficient. - -

Stall Warnings

- -

-Expediting grace periods does nothing to speed things up when RCU -readers take too long, and therefore expedited grace periods check -for stalls just as normal grace periods do. - - - - - - - - -
 
Quick Quiz:
- But why not just let the normal grace-period machinery - detect the stalls, given that a given reader must block - both normal and expedited grace periods? -
Answer:
- Because it is quite possible that at a given time there - is no normal grace period in progress, in which case the - normal grace period cannot emit a stall warning. -
 
- -The synchronize_sched_expedited_wait() function loops waiting -for the expedited grace period to end, but with a timeout set to the -current RCU CPU stall-warning time. -If this time is exceeded, any CPUs or rcu_node structures -blocking the current grace period are printed. -Each stall warning results in another pass through the loop, but the -second and subsequent passes use longer stall times. - -

Mid-boot operation

- -

-The use of workqueues has the advantage that the expedited -grace-period code need not worry about POSIX signals. -Unfortunately, it has the -corresponding disadvantage that workqueues cannot be used until -they are initialized, which does not happen until some time after -the scheduler spawns the first task. -Given that there are parts of the kernel that really do want to -execute grace periods during this mid-boot “dead zone”, -expedited grace periods must do something else during thie time. - -

-What they do is to fall back to the old practice of requiring that the -requesting task drive the expedited grace period, as was the case -before the use of workqueues. -However, the requesting task is only required to drive the grace period -during the mid-boot dead zone. -Before mid-boot, a synchronous grace period is a no-op. -Some time after mid-boot, workqueues are used. - -

-Non-expedited non-SRCU synchronous grace periods must also operate -normally during mid-boot. -This is handled by causing non-expedited grace periods to take the -expedited code path during mid-boot. - -

-The current code assumes that there are no POSIX signals during -the mid-boot dead zone. -However, if an overwhelming need for POSIX signals somehow arises, -appropriate adjustments can be made to the expedited stall-warning code. -One such adjustment would reinstate the pre-workqueue stall-warning -checks, but only during the mid-boot dead zone. - -

-With this refinement, synchronous grace periods can now be used from -task context pretty much any time during the life of the kernel. -That is, aside from some points in the suspend, hibernate, or shutdown -code path. - -

-Summary

- -

-Expedited grace periods use a sequence-number approach to promote -batching, so that a single grace-period operation can serve numerous -requests. -A funnel lock is used to efficiently identify the one task out of -a concurrent group that will request the grace period. -All members of the group will block on waitqueues provided in -the rcu_node structure. -The actual grace-period processing is carried out by a workqueue. - -

-CPU-hotplug operations are noted lazily in order to prevent the need -for tight synchronization between expedited grace periods and -CPU-hotplug operations. -The dyntick-idle counters are used to avoid sending IPIs to idle CPUs, -at least in the common case. -RCU-preempt and RCU-sched use different IPI handlers and different -code to respond to the state changes carried out by those handlers, -but otherwise use common code. - -

-Quiescent states are tracked using the rcu_node tree, -and once all necessary quiescent states have been reported, -all tasks waiting on this expedited grace period are awakened. -A pair of mutexes are used to allow one grace period's wakeups -to proceed concurrently with the next grace period's processing. - -

-This combination of mechanisms allows expedited grace periods to -run reasonably efficiently. -However, for non-time-critical tasks, normal grace periods should be -used instead because their longer duration permits much higher -degrees of batching, and thus much lower per-request overheads. - - diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst new file mode 100644 index 0000000000000000000000000000000000000000..72f0f6fbd53c08e147362ea520456f20f11a0c8f --- /dev/null +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst @@ -0,0 +1,521 @@ +================================================= +A Tour Through TREE_RCU's Expedited Grace Periods +================================================= + +Introduction +============ + +This document describes RCU's expedited grace periods. +Unlike RCU's normal grace periods, which accept long latencies to attain +high efficiency and minimal disturbance, expedited grace periods accept +lower efficiency and significant disturbance to attain shorter latencies. + +There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier +third RCU-bh flavor having been implemented in terms of the other two. +Each of the two implementations is covered in its own section. + +Expedited Grace Period Design +============================= + +The expedited RCU grace periods cannot be accused of being subtle, +given that they for all intents and purposes hammer every CPU that +has not yet provided a quiescent state for the current expedited +grace period. +The one saving grace is that the hammer has grown a bit smaller +over time: The old call to ``try_stop_cpus()`` has been +replaced with a set of calls to ``smp_call_function_single()``, +each of which results in an IPI to the target CPU. +The corresponding handler function checks the CPU's state, motivating +a faster quiescent state where possible, and triggering a report +of that quiescent state. +As always for RCU, once everything has spent some time in a quiescent +state, the expedited grace period has completed. + +The details of the ``smp_call_function_single()`` handler's +operation depend on the RCU flavor, as described in the following +sections. + +RCU-preempt Expedited Grace Periods +=================================== + +``CONFIG_PREEMPT=y`` kernels implement RCU-preempt. +The overall flow of the handling of a given CPU by an RCU-preempt +expedited grace period is shown in the following diagram: + +.. kernel-figure:: ExpRCUFlow.svg + +The solid arrows denote direct action, for example, a function call. +The dotted arrows denote indirect action, for example, an IPI +or a state that is reached after some time. + +If a given CPU is offline or idle, ``synchronize_rcu_expedited()`` +will ignore it because idle and offline CPUs are already residing +in quiescent states. +Otherwise, the expedited grace period will use +``smp_call_function_single()`` to send the CPU an IPI, which +is handled by ``rcu_exp_handler()``. + +However, because this is preemptible RCU, ``rcu_exp_handler()`` +can check to see if the CPU is currently running in an RCU read-side +critical section. +If not, the handler can immediately report a quiescent state. +Otherwise, it sets flags so that the outermost ``rcu_read_unlock()`` +invocation will provide the needed quiescent-state report. +This flag-setting avoids the previous forced preemption of all +CPUs that might have RCU read-side critical sections. +In addition, this flag-setting is done so as to avoid increasing +the overhead of the common-case fastpath through the scheduler. + +Again because this is preemptible RCU, an RCU read-side critical section +can be preempted. +When that happens, RCU will enqueue the task, which will the continue to +block the current expedited grace period until it resumes and finds its +outermost ``rcu_read_unlock()``. +The CPU will report a quiescent state just after enqueuing the task because +the CPU is no longer blocking the grace period. +It is instead the preempted task doing the blocking. +The list of blocked tasks is managed by ``rcu_preempt_ctxt_queue()``, +which is called from ``rcu_preempt_note_context_switch()``, which +in turn is called from ``rcu_note_context_switch()``, which in +turn is called from the scheduler. + + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why not just have the expedited grace period check the state of all | +| the CPUs? After all, that would avoid all those real-time-unfriendly | +| IPIs. | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because we want the RCU read-side critical sections to run fast, | +| which means no memory barriers. Therefore, it is not possible to | +| safely check the state from some other CPU. And even if it was | +| possible to safely check the state, it would still be necessary to | +| IPI the CPU to safely interact with the upcoming | +| ``rcu_read_unlock()`` invocation, which means that the remote state | +| testing would not help the worst-case latency that real-time | +| applications care about. | +| | +| One way to prevent your real-time application from getting hit with | +| these IPIs is to build your kernel with ``CONFIG_NO_HZ_FULL=y``. RCU | +| would then perceive the CPU running your application as being idle, | +| and it would be able to safely detect that state without needing to | +| IPI the CPU. | ++-----------------------------------------------------------------------+ + +Please note that this is just the overall flow: Additional complications +can arise due to races with CPUs going idle or offline, among other +things. + +RCU-sched Expedited Grace Periods +--------------------------------- + +``CONFIG_PREEMPT=n`` kernels implement RCU-sched. The overall flow of +the handling of a given CPU by an RCU-sched expedited grace period is +shown in the following diagram: + +.. kernel-figure:: ExpSchedFlow.svg + +As with RCU-preempt, RCU-sched's ``synchronize_rcu_expedited()`` ignores +offline and idle CPUs, again because they are in remotely detectable +quiescent states. However, because the ``rcu_read_lock_sched()`` and +``rcu_read_unlock_sched()`` leave no trace of their invocation, in +general it is not possible to tell whether or not the current CPU is in +an RCU read-side critical section. The best that RCU-sched's +``rcu_exp_handler()`` can do is to check for idle, on the off-chance +that the CPU went idle while the IPI was in flight. If the CPU is idle, +then ``rcu_exp_handler()`` reports the quiescent state. + +Otherwise, the handler forces a future context switch by setting the +NEED_RESCHED flag of the current task's thread flag and the CPU preempt +counter. At the time of the context switch, the CPU reports the +quiescent state. Should the CPU go offline first, it will report the +quiescent state at that time. + +Expedited Grace Period and CPU Hotplug +-------------------------------------- + +The expedited nature of expedited grace periods require a much tighter +interaction with CPU hotplug operations than is required for normal +grace periods. In addition, attempting to IPI offline CPUs will result +in splats, but failing to IPI online CPUs can result in too-short grace +periods. Neither option is acceptable in production kernels. + +The interaction between expedited grace periods and CPU hotplug +operations is carried out at several levels: + +#. The number of CPUs that have ever been online is tracked by the + ``rcu_state`` structure's ``->ncpus`` field. The ``rcu_state`` + structure's ``->ncpus_snap`` field tracks the number of CPUs that + have ever been online at the beginning of an RCU expedited grace + period. Note that this number never decreases, at least in the + absence of a time machine. +#. The identities of the CPUs that have ever been online is tracked by + the ``rcu_node`` structure's ``->expmaskinitnext`` field. The + ``rcu_node`` structure's ``->expmaskinit`` field tracks the + identities of the CPUs that were online at least once at the + beginning of the most recent RCU expedited grace period. The + ``rcu_state`` structure's ``->ncpus`` and ``->ncpus_snap`` fields are + used to detect when new CPUs have come online for the first time, + that is, when the ``rcu_node`` structure's ``->expmaskinitnext`` + field has changed since the beginning of the last RCU expedited grace + period, which triggers an update of each ``rcu_node`` structure's + ``->expmaskinit`` field from its ``->expmaskinitnext`` field. +#. Each ``rcu_node`` structure's ``->expmaskinit`` field is used to + initialize that structure's ``->expmask`` at the beginning of each + RCU expedited grace period. This means that only those CPUs that have + been online at least once will be considered for a given grace + period. +#. Any CPU that goes offline will clear its bit in its leaf ``rcu_node`` + structure's ``->qsmaskinitnext`` field, so any CPU with that bit + clear can safely be ignored. However, it is possible for a CPU coming + online or going offline to have this bit set for some time while + ``cpu_online`` returns ``false``. +#. For each non-idle CPU that RCU believes is currently online, the + grace period invokes ``smp_call_function_single()``. If this + succeeds, the CPU was fully online. Failure indicates that the CPU is + in the process of coming online or going offline, in which case it is + necessary to wait for a short time period and try again. The purpose + of this wait (or series of waits, as the case may be) is to permit a + concurrent CPU-hotplug operation to complete. +#. In the case of RCU-sched, one of the last acts of an outgoing CPU is + to invoke ``rcu_report_dead()``, which reports a quiescent state for + that CPU. However, this is likely paranoia-induced redundancy. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why all the dancing around with multiple counters and masks tracking | +| CPUs that were once online? Why not just have a single set of masks | +| tracking the currently online CPUs and be done with it? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Maintaining single set of masks tracking the online CPUs *sounds* | +| easier, at least until you try working out all the race conditions | +| between grace-period initialization and CPU-hotplug operations. For | +| example, suppose initialization is progressing down the tree while a | +| CPU-offline operation is progressing up the tree. This situation can | +| result in bits set at the top of the tree that have no counterparts | +| at the bottom of the tree. Those bits will never be cleared, which | +| will result in grace-period hangs. In short, that way lies madness, | +| to say nothing of a great many bugs, hangs, and deadlocks. | +| In contrast, the current multi-mask multi-counter scheme ensures that | +| grace-period initialization will always see consistent masks up and | +| down the tree, which brings significant simplifications over the | +| single-mask method. | +| | +| This is an instance of `deferring work in order to avoid | +| synchronization `__. | +| Lazily recording CPU-hotplug events at the beginning of the next | +| grace period greatly simplifies maintenance of the CPU-tracking | +| bitmasks in the ``rcu_node`` tree. | ++-----------------------------------------------------------------------+ + +Expedited Grace Period Refinements +---------------------------------- + +Idle-CPU Checks +~~~~~~~~~~~~~~~ + +Each expedited grace period checks for idle CPUs when initially forming +the mask of CPUs to be IPIed and again just before IPIing a CPU (both +checks are carried out by ``sync_rcu_exp_select_cpus()``). If the CPU is +idle at any time between those two times, the CPU will not be IPIed. +Instead, the task pushing the grace period forward will include the idle +CPUs in the mask passed to ``rcu_report_exp_cpu_mult()``. + +For RCU-sched, there is an additional check: If the IPI has interrupted +the idle loop, then ``rcu_exp_handler()`` invokes +``rcu_report_exp_rdp()`` to report the corresponding quiescent state. + +For RCU-preempt, there is no specific check for idle in the IPI handler +(``rcu_exp_handler()``), but because RCU read-side critical sections are +not permitted within the idle loop, if ``rcu_exp_handler()`` sees that +the CPU is within RCU read-side critical section, the CPU cannot +possibly be idle. Otherwise, ``rcu_exp_handler()`` invokes +``rcu_report_exp_rdp()`` to report the corresponding quiescent state, +regardless of whether or not that quiescent state was due to the CPU +being idle. + +In summary, RCU expedited grace periods check for idle when building the +bitmask of CPUs that must be IPIed, just before sending each IPI, and +(either explicitly or implicitly) within the IPI handler. + +Batching via Sequence Counter +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If each grace-period request was carried out separately, expedited grace +periods would have abysmal scalability and problematic high-load +characteristics. Because each grace-period operation can serve an +unlimited number of updates, it is important to *batch* requests, so +that a single expedited grace-period operation will cover all requests +in the corresponding batch. + +This batching is controlled by a sequence counter named +``->expedited_sequence`` in the ``rcu_state`` structure. This counter +has an odd value when there is an expedited grace period in progress and +an even value otherwise, so that dividing the counter value by two gives +the number of completed grace periods. During any given update request, +the counter must transition from even to odd and then back to even, thus +indicating that a grace period has elapsed. Therefore, if the initial +value of the counter is ``s``, the updater must wait until the counter +reaches at least the value ``(s+3)&~0x1``. This counter is managed by +the following access functions: + +#. ``rcu_exp_gp_seq_start()``, which marks the start of an expedited + grace period. +#. ``rcu_exp_gp_seq_end()``, which marks the end of an expedited grace + period. +#. ``rcu_exp_gp_seq_snap()``, which obtains a snapshot of the counter. +#. ``rcu_exp_gp_seq_done()``, which returns ``true`` if a full expedited + grace period has elapsed since the corresponding call to + ``rcu_exp_gp_seq_snap()``. + +Again, only one request in a given batch need actually carry out a +grace-period operation, which means there must be an efficient way to +identify which of many concurrent reqeusts will initiate the grace +period, and that there be an efficient way for the remaining requests to +wait for that grace period to complete. However, that is the topic of +the next section. + +Funnel Locking and Wait/Wakeup +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The natural way to sort out which of a batch of updaters will initiate +the expedited grace period is to use the ``rcu_node`` combining tree, as +implemented by the ``exp_funnel_lock()`` function. The first updater +corresponding to a given grace period arriving at a given ``rcu_node`` +structure records its desired grace-period sequence number in the +``->exp_seq_rq`` field and moves up to the next level in the tree. +Otherwise, if the ``->exp_seq_rq`` field already contains the sequence +number for the desired grace period or some later one, the updater +blocks on one of four wait queues in the ``->exp_wq[]`` array, using the +second-from-bottom and third-from bottom bits as an index. An +``->exp_lock`` field in the ``rcu_node`` structure synchronizes access +to these fields. + +An empty ``rcu_node`` tree is shown in the following diagram, with the +white cells representing the ``->exp_seq_rq`` field and the red cells +representing the elements of the ``->exp_wq[]`` array. + +.. kernel-figure:: Funnel0.svg + +The next diagram shows the situation after the arrival of Task A and +Task B at the leftmost and rightmost leaf ``rcu_node`` structures, +respectively. The current value of the ``rcu_state`` structure's +``->expedited_sequence`` field is zero, so adding three and clearing the +bottom bit results in the value two, which both tasks record in the +``->exp_seq_rq`` field of their respective ``rcu_node`` structures: + +.. kernel-figure:: Funnel1.svg + +Each of Tasks A and B will move up to the root ``rcu_node`` structure. +Suppose that Task A wins, recording its desired grace-period sequence +number and resulting in the state shown below: + +.. kernel-figure:: Funnel2.svg + +Task A now advances to initiate a new grace period, while Task B moves +up to the root ``rcu_node`` structure, and, seeing that its desired +sequence number is already recorded, blocks on ``->exp_wq[1]``. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why ``->exp_wq[1]``? Given that the value of these tasks' desired | +| sequence number is two, so shouldn't they instead block on | +| ``->exp_wq[2]``? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No. | +| Recall that the bottom bit of the desired sequence number indicates | +| whether or not a grace period is currently in progress. It is | +| therefore necessary to shift the sequence number right one bit | +| position to obtain the number of the grace period. This results in | +| ``->exp_wq[1]``. | ++-----------------------------------------------------------------------+ + +If Tasks C and D also arrive at this point, they will compute the same +desired grace-period sequence number, and see that both leaf +``rcu_node`` structures already have that value recorded. They will +therefore block on their respective ``rcu_node`` structures' +``->exp_wq[1]`` fields, as shown below: + +.. kernel-figure:: Funnel3.svg + +Task A now acquires the ``rcu_state`` structure's ``->exp_mutex`` and +initiates the grace period, which increments ``->expedited_sequence``. +Therefore, if Tasks E and F arrive, they will compute a desired sequence +number of 4 and will record this value as shown below: + +.. kernel-figure:: Funnel4.svg + +Tasks E and F will propagate up the ``rcu_node`` combining tree, with +Task F blocking on the root ``rcu_node`` structure and Task E wait for +Task A to finish so that it can start the next grace period. The +resulting state is as shown below: + +.. kernel-figure:: Funnel5.svg + +Once the grace period completes, Task A starts waking up the tasks +waiting for this grace period to complete, increments the +``->expedited_sequence``, acquires the ``->exp_wake_mutex`` and then +releases the ``->exp_mutex``. This results in the following state: + +.. kernel-figure:: Funnel6.svg + +Task E can then acquire ``->exp_mutex`` and increment +``->expedited_sequence`` to the value three. If new tasks G and H arrive +and moves up the combining tree at the same time, the state will be as +follows: + +.. kernel-figure:: Funnel7.svg + +Note that three of the root ``rcu_node`` structure's waitqueues are now +occupied. However, at some point, Task A will wake up the tasks blocked +on the ``->exp_wq`` waitqueues, resulting in the following state: + +.. kernel-figure:: Funnel8.svg + +Execution will continue with Tasks E and H completing their grace +periods and carrying out their wakeups. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What happens if Task A takes so long to do its wakeups that Task E's | +| grace period completes? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Then Task E will block on the ``->exp_wake_mutex``, which will also | +| prevent it from releasing ``->exp_mutex``, which in turn will prevent | +| the next grace period from starting. This last is important in | +| preventing overflow of the ``->exp_wq[]`` array. | ++-----------------------------------------------------------------------+ + +Use of Workqueues +~~~~~~~~~~~~~~~~~ + +In earlier implementations, the task requesting the expedited grace +period also drove it to completion. This straightforward approach had +the disadvantage of needing to account for POSIX signals sent to user +tasks, so more recent implemementations use the Linux kernel's +`workqueues `__. + +The requesting task still does counter snapshotting and funnel-lock +processing, but the task reaching the top of the funnel lock does a +``schedule_work()`` (from ``_synchronize_rcu_expedited()`` so that a +workqueue kthread does the actual grace-period processing. Because +workqueue kthreads do not accept POSIX signals, grace-period-wait +processing need not allow for POSIX signals. In addition, this approach +allows wakeups for the previous expedited grace period to be overlapped +with processing for the next expedited grace period. Because there are +only four sets of waitqueues, it is necessary to ensure that the +previous grace period's wakeups complete before the next grace period's +wakeups start. This is handled by having the ``->exp_mutex`` guard +expedited grace-period processing and the ``->exp_wake_mutex`` guard +wakeups. The key point is that the ``->exp_mutex`` is not released until +the first wakeup is complete, which means that the ``->exp_wake_mutex`` +has already been acquired at that point. This approach ensures that the +previous grace period's wakeups can be carried out while the current +grace period is in process, but that these wakeups will complete before +the next grace period starts. This means that only three waitqueues are +required, guaranteeing that the four that are provided are sufficient. + +Stall Warnings +~~~~~~~~~~~~~~ + +Expediting grace periods does nothing to speed things up when RCU +readers take too long, and therefore expedited grace periods check for +stalls just as normal grace periods do. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But why not just let the normal grace-period machinery detect the | +| stalls, given that a given reader must block both normal and | +| expedited grace periods? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because it is quite possible that at a given time there is no normal | +| grace period in progress, in which case the normal grace period | +| cannot emit a stall warning. | ++-----------------------------------------------------------------------+ + +The ``synchronize_sched_expedited_wait()`` function loops waiting for +the expedited grace period to end, but with a timeout set to the current +RCU CPU stall-warning time. If this time is exceeded, any CPUs or +``rcu_node`` structures blocking the current grace period are printed. +Each stall warning results in another pass through the loop, but the +second and subsequent passes use longer stall times. + +Mid-boot operation +~~~~~~~~~~~~~~~~~~ + +The use of workqueues has the advantage that the expedited grace-period +code need not worry about POSIX signals. Unfortunately, it has the +corresponding disadvantage that workqueues cannot be used until they are +initialized, which does not happen until some time after the scheduler +spawns the first task. Given that there are parts of the kernel that +really do want to execute grace periods during this mid-boot “dead +zone”, expedited grace periods must do something else during thie time. + +What they do is to fall back to the old practice of requiring that the +requesting task drive the expedited grace period, as was the case before +the use of workqueues. However, the requesting task is only required to +drive the grace period during the mid-boot dead zone. Before mid-boot, a +synchronous grace period is a no-op. Some time after mid-boot, +workqueues are used. + +Non-expedited non-SRCU synchronous grace periods must also operate +normally during mid-boot. This is handled by causing non-expedited grace +periods to take the expedited code path during mid-boot. + +The current code assumes that there are no POSIX signals during the +mid-boot dead zone. However, if an overwhelming need for POSIX signals +somehow arises, appropriate adjustments can be made to the expedited +stall-warning code. One such adjustment would reinstate the +pre-workqueue stall-warning checks, but only during the mid-boot dead +zone. + +With this refinement, synchronous grace periods can now be used from +task context pretty much any time during the life of the kernel. That +is, aside from some points in the suspend, hibernate, or shutdown code +path. + +Summary +~~~~~~~ + +Expedited grace periods use a sequence-number approach to promote +batching, so that a single grace-period operation can serve numerous +requests. A funnel lock is used to efficiently identify the one task out +of a concurrent group that will request the grace period. All members of +the group will block on waitqueues provided in the ``rcu_node`` +structure. The actual grace-period processing is carried out by a +workqueue. + +CPU-hotplug operations are noted lazily in order to prevent the need for +tight synchronization between expedited grace periods and CPU-hotplug +operations. The dyntick-idle counters are used to avoid sending IPIs to +idle CPUs, at least in the common case. RCU-preempt and RCU-sched use +different IPI handlers and different code to respond to the state +changes carried out by those handlers, but otherwise use common code. + +Quiescent states are tracked using the ``rcu_node`` tree, and once all +necessary quiescent states have been reported, all tasks waiting on this +expedited grace period are awakened. A pair of mutexes are used to allow +one grace period's wakeups to proceed concurrently with the next grace +period's processing. + +This combination of mechanisms allows expedited grace periods to run +reasonably efficiently. However, for non-time-critical tasks, normal +grace periods should be used instead because their longer duration +permits much higher degrees of batching, and thus much lower per-request +overheads. diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html deleted file mode 100644 index e5b42a798ff3bfff98c19c78aaf61770c04e3e3d..0000000000000000000000000000000000000000 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Diagram.html +++ /dev/null @@ -1,9 +0,0 @@ - - - A Diagram of TREE_RCU's Grace-Period Memory Ordering - - -

TreeRCU-gp.svg - - diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html deleted file mode 100644 index c64f8d26609fb64ae34dfdf551624984f38e4c0c..0000000000000000000000000000000000000000 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html +++ /dev/null @@ -1,704 +0,0 @@ - - - A Tour Through TREE_RCU's Grace-Period Memory Ordering - - -

August 8, 2017

-

This article was contributed by Paul E. McKenney

- -

Introduction

- -

This document gives a rough visual overview of how Tree RCU's -grace-period memory ordering guarantee is provided. - -

    -
  1. - What Is Tree RCU's Grace Period Memory Ordering Guarantee? -
  2. - Tree RCU Grace Period Memory Ordering Building Blocks -
  3. - Tree RCU Grace Period Memory Ordering Components -
  4. Putting It All Together -
- -

-What Is Tree RCU's Grace Period Memory Ordering Guarantee?

- -

RCU grace periods provide extremely strong memory-ordering guarantees -for non-idle non-offline code. -Any code that happens after the end of a given RCU grace period is guaranteed -to see the effects of all accesses prior to the beginning of that grace -period that are within RCU read-side critical sections. -Similarly, any code that happens before the beginning of a given RCU grace -period is guaranteed to see the effects of all accesses following the end -of that grace period that are within RCU read-side critical sections. - -

Note well that RCU-sched read-side critical sections include any region -of code for which preemption is disabled. -Given that each individual machine instruction can be thought of as -an extremely small region of preemption-disabled code, one can think of -synchronize_rcu() as smp_mb() on steroids. - -

RCU updaters use this guarantee by splitting their updates into -two phases, one of which is executed before the grace period and -the other of which is executed after the grace period. -In the most common use case, phase one removes an element from -a linked RCU-protected data structure, and phase two frees that element. -For this to work, any readers that have witnessed state prior to the -phase-one update (in the common case, removal) must not witness state -following the phase-two update (in the common case, freeing). - -

The RCU implementation provides this guarantee using a network -of lock-based critical sections, memory barriers, and per-CPU -processing, as is described in the following sections. - -

-Tree RCU Grace Period Memory Ordering Building Blocks

- -

The workhorse for RCU's grace-period memory ordering is the -critical section for the rcu_node structure's -->lock. -These critical sections use helper functions for lock acquisition, including -raw_spin_lock_rcu_node(), -raw_spin_lock_irq_rcu_node(), and -raw_spin_lock_irqsave_rcu_node(). -Their lock-release counterparts are -raw_spin_unlock_rcu_node(), -raw_spin_unlock_irq_rcu_node(), and -raw_spin_unlock_irqrestore_rcu_node(), -respectively. -For completeness, a -raw_spin_trylock_rcu_node() -is also provided. -The key point is that the lock-acquisition functions, including -raw_spin_trylock_rcu_node(), all invoke -smp_mb__after_unlock_lock() immediately after successful -acquisition of the lock. - -

Therefore, for any given rcu_node structure, any access -happening before one of the above lock-release functions will be seen -by all CPUs as happening before any access happening after a later -one of the above lock-acquisition functions. -Furthermore, any access happening before one of the -above lock-release function on any given CPU will be seen by all -CPUs as happening before any access happening after a later one -of the above lock-acquisition functions executing on that same CPU, -even if the lock-release and lock-acquisition functions are operating -on different rcu_node structures. -Tree RCU uses these two ordering guarantees to form an ordering -network among all CPUs that were in any way involved in the grace -period, including any CPUs that came online or went offline during -the grace period in question. - -

The following litmus test exhibits the ordering effects of these -lock-acquisition and lock-release functions: - -

- 1 int x, y, z;
- 2
- 3 void task0(void)
- 4 {
- 5   raw_spin_lock_rcu_node(rnp);
- 6   WRITE_ONCE(x, 1);
- 7   r1 = READ_ONCE(y);
- 8   raw_spin_unlock_rcu_node(rnp);
- 9 }
-10
-11 void task1(void)
-12 {
-13   raw_spin_lock_rcu_node(rnp);
-14   WRITE_ONCE(y, 1);
-15   r2 = READ_ONCE(z);
-16   raw_spin_unlock_rcu_node(rnp);
-17 }
-18
-19 void task2(void)
-20 {
-21   WRITE_ONCE(z, 1);
-22   smp_mb();
-23   r3 = READ_ONCE(x);
-24 }
-25
-26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
-
- -

The WARN_ON() is evaluated at “the end of time”, -after all changes have propagated throughout the system. -Without the smp_mb__after_unlock_lock() provided by the -acquisition functions, this WARN_ON() could trigger, for example -on PowerPC. -The smp_mb__after_unlock_lock() invocations prevent this -WARN_ON() from triggering. - -

This approach must be extended to include idle CPUs, which need -RCU's grace-period memory ordering guarantee to extend to any -RCU read-side critical sections preceding and following the current -idle sojourn. -This case is handled by calls to the strongly ordered -atomic_add_return() read-modify-write atomic operation that -is invoked within rcu_dynticks_eqs_enter() at idle-entry -time and within rcu_dynticks_eqs_exit() at idle-exit time. -The grace-period kthread invokes rcu_dynticks_snap() and -rcu_dynticks_in_eqs_since() (both of which invoke -an atomic_add_return() of zero) to detect idle CPUs. - - - - - - - - -
 
Quick Quiz:
- But what about CPUs that remain offline for the entire - grace period? -
Answer:
- Such CPUs will be offline at the beginning of the grace period, - so the grace period won't expect quiescent states from them. - Races between grace-period start and CPU-hotplug operations - are mediated by the CPU's leaf rcu_node structure's - ->lock as described above. -
 
- -

The approach must be extended to handle one final case, that -of waking a task blocked in synchronize_rcu(). -This task might be affinitied to a CPU that is not yet aware that -the grace period has ended, and thus might not yet be subject to -the grace period's memory ordering. -Therefore, there is an smp_mb() after the return from -wait_for_completion() in the synchronize_rcu() -code path. - - - - - - - - -
 
Quick Quiz:
- What? Where??? - I don't see any smp_mb() after the return from - wait_for_completion()!!! -
Answer:
- That would be because I spotted the need for that - smp_mb() during the creation of this documentation, - and it is therefore unlikely to hit mainline before v4.14. - Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and - Jonathan Cameron for asking questions that sensitized me - to the rather elaborate sequence of events that demonstrate - the need for this memory barrier. -
 
- -

Tree RCU's grace--period memory-ordering guarantees rely most -heavily on the rcu_node structure's ->lock -field, so much so that it is necessary to abbreviate this pattern -in the diagrams in the next section. -For example, consider the rcu_prepare_for_idle() function -shown below, which is one of several functions that enforce ordering -of newly arrived RCU callbacks against future grace periods: - -

- 1 static void rcu_prepare_for_idle(void)
- 2 {
- 3   bool needwake;
- 4   struct rcu_data *rdp;
- 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
- 6   struct rcu_node *rnp;
- 7   struct rcu_state *rsp;
- 8   int tne;
- 9
-10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
-11       rcu_is_nocb_cpu(smp_processor_id()))
-12     return;
-13   tne = READ_ONCE(tick_nohz_active);
-14   if (tne != rdtp->tick_nohz_enabled_snap) {
-15     if (rcu_cpu_has_callbacks(NULL))
-16       invoke_rcu_core();
-17     rdtp->tick_nohz_enabled_snap = tne;
-18     return;
-19   }
-20   if (!tne)
-21     return;
-22   if (rdtp->all_lazy &&
-23       rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
-24     rdtp->all_lazy = false;
-25     rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
-26     invoke_rcu_core();
-27     return;
-28   }
-29   if (rdtp->last_accelerate == jiffies)
-30     return;
-31   rdtp->last_accelerate = jiffies;
-32   for_each_rcu_flavor(rsp) {
-33     rdp = this_cpu_ptr(rsp->rda);
-34     if (rcu_segcblist_pend_cbs(&rdp->cblist))
-35       continue;
-36     rnp = rdp->mynode;
-37     raw_spin_lock_rcu_node(rnp);
-38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
-39     raw_spin_unlock_rcu_node(rnp);
-40     if (needwake)
-41       rcu_gp_kthread_wake(rsp);
-42   }
-43 }
-
- -

But the only part of rcu_prepare_for_idle() that really -matters for this discussion are lines 37–39. -We will therefore abbreviate this function as follows: - -

rcu_node-lock.svg - -

The box represents the rcu_node structure's ->lock -critical section, with the double line on top representing the additional -smp_mb__after_unlock_lock(). - -

-Tree RCU Grace Period Memory Ordering Components

- -

Tree RCU's grace-period memory-ordering guarantee is provided by -a number of RCU components: - -

    -
  1. Callback Registry -
  2. Grace-Period Initialization -
  3. - Self-Reported Quiescent States -
  4. Dynamic Tick Interface -
  5. CPU-Hotplug Interface -
  6. Forcing Quiescent States -
  7. Grace-Period Cleanup -
  8. Callback Invocation -
- -

Each of the following section looks at the corresponding component -in detail. - -

Callback Registry

- -

If RCU's grace-period guarantee is to mean anything at all, any -access that happens before a given invocation of call_rcu() -must also happen before the corresponding grace period. -The implementation of this portion of RCU's grace period guarantee -is shown in the following figure: - -

TreeRCU-callback-registry.svg - -

Because call_rcu() normally acts only on CPU-local state, -it provides no ordering guarantees, either for itself or for -phase one of the update (which again will usually be removal of -an element from an RCU-protected data structure). -It simply enqueues the rcu_head structure on a per-CPU list, -which cannot become associated with a grace period until a later -call to rcu_accelerate_cbs(), as shown in the diagram above. - -

One set of code paths shown on the left invokes -rcu_accelerate_cbs() via -note_gp_changes(), either directly from call_rcu() (if -the current CPU is inundated with queued rcu_head structures) -or more likely from an RCU_SOFTIRQ handler. -Another code path in the middle is taken only in kernels built with -CONFIG_RCU_FAST_NO_HZ=y, which invokes -rcu_accelerate_cbs() via rcu_prepare_for_idle(). -The final code path on the right is taken only in kernels built with -CONFIG_HOTPLUG_CPU=y, which invokes -rcu_accelerate_cbs() via -rcu_advance_cbs(), rcu_migrate_callbacks, -rcutree_migrate_callbacks(), and takedown_cpu(), -which in turn is invoked on a surviving CPU after the outgoing -CPU has been completely offlined. - -

There are a few other code paths within grace-period processing -that opportunistically invoke rcu_accelerate_cbs(). -However, either way, all of the CPU's recently queued rcu_head -structures are associated with a future grace-period number under -the protection of the CPU's lead rcu_node structure's -->lock. -In all cases, there is full ordering against any prior critical section -for that same rcu_node structure's ->lock, and -also full ordering against any of the current task's or CPU's prior critical -sections for any rcu_node structure's ->lock. - -

The next section will show how this ordering ensures that any -accesses prior to the call_rcu() (particularly including phase -one of the update) -happen before the start of the corresponding grace period. - - - - - - - - -
 
Quick Quiz:
- But what about synchronize_rcu()? -
Answer:
- The synchronize_rcu() passes call_rcu() - to wait_rcu_gp(), which invokes it. - So either way, it eventually comes down to call_rcu(). -
 
- -

Grace-Period Initialization

- -

Grace-period initialization is carried out by -the grace-period kernel thread, which makes several passes over the -rcu_node tree within the rcu_gp_init() function. -This means that showing the full flow of ordering through the -grace-period computation will require duplicating this tree. -If you find this confusing, please note that the state of the -rcu_node changes over time, just like Heraclitus's river. -However, to keep the rcu_node river tractable, the -grace-period kernel thread's traversals are presented in multiple -parts, starting in this section with the various phases of -grace-period initialization. - -

The first ordering-related grace-period initialization action is to -advance the rcu_state structure's ->gp_seq -grace-period-number counter, as shown below: - -

TreeRCU-gp-init-1.svg - -

The actual increment is carried out using smp_store_release(), -which helps reject false-positive RCU CPU stall detection. -Note that only the root rcu_node structure is touched. - -

The first pass through the rcu_node tree updates bitmasks -based on CPUs having come online or gone offline since the start of -the previous grace period. -In the common case where the number of online CPUs for this rcu_node -structure has not transitioned to or from zero, -this pass will scan only the leaf rcu_node structures. -However, if the number of online CPUs for a given leaf rcu_node -structure has transitioned from zero, -rcu_init_new_rnp() will be invoked for the first incoming CPU. -Similarly, if the number of online CPUs for a given leaf rcu_node -structure has transitioned to zero, -rcu_cleanup_dead_rnp() will be invoked for the last outgoing CPU. -The diagram below shows the path of ordering if the leftmost -rcu_node structure onlines its first CPU and if the next -rcu_node structure has no online CPUs -(or, alternatively if the leftmost rcu_node structure offlines -its last CPU and if the next rcu_node structure has no online CPUs). - -

TreeRCU-gp-init-1.svg - -

The final rcu_gp_init() pass through the rcu_node -tree traverses breadth-first, setting each rcu_node structure's -->gp_seq field to the newly advanced value from the -rcu_state structure, as shown in the following diagram. - -

TreeRCU-gp-init-1.svg - -

This change will also cause each CPU's next call to -__note_gp_changes() -to notice that a new grace period has started, as described in the next -section. -But because the grace-period kthread started the grace period at the -root (with the advancing of the rcu_state structure's -->gp_seq field) before setting each leaf rcu_node -structure's ->gp_seq field, each CPU's observation of -the start of the grace period will happen after the actual start -of the grace period. - - - - - - - - -
 
Quick Quiz:
- But what about the CPU that started the grace period? - Why wouldn't it see the start of the grace period right when - it started that grace period? -
Answer:
- In some deep philosophical and overly anthromorphized - sense, yes, the CPU starting the grace period is immediately - aware of having done so. - However, if we instead assume that RCU is not self-aware, - then even the CPU starting the grace period does not really - become aware of the start of this grace period until its - first call to __note_gp_changes(). - On the other hand, this CPU potentially gets early notification - because it invokes __note_gp_changes() during its - last rcu_gp_init() pass through its leaf - rcu_node structure. -
 
- -

-Self-Reported Quiescent States

- -

When all entities that might block the grace period have reported -quiescent states (or as described in a later section, had quiescent -states reported on their behalf), the grace period can end. -Online non-idle CPUs report their own quiescent states, as shown -in the following diagram: - -

TreeRCU-qs.svg - -

This is for the last CPU to report a quiescent state, which signals -the end of the grace period. -Earlier quiescent states would push up the rcu_node tree -only until they encountered an rcu_node structure that -is waiting for additional quiescent states. -However, ordering is nevertheless preserved because some later quiescent -state will acquire that rcu_node structure's ->lock. - -

Any number of events can lead up to a CPU invoking -note_gp_changes (or alternatively, directly invoking -__note_gp_changes()), at which point that CPU will notice -the start of a new grace period while holding its leaf -rcu_node lock. -Therefore, all execution shown in this diagram happens after the -start of the grace period. -In addition, this CPU will consider any RCU read-side critical -section that started before the invocation of __note_gp_changes() -to have started before the grace period, and thus a critical -section that the grace period must wait on. - - - - - - - - -
 
Quick Quiz:
- But a RCU read-side critical section might have started - after the beginning of the grace period - (the advancing of ->gp_seq from earlier), so why should - the grace period wait on such a critical section? -
Answer:
- It is indeed not necessary for the grace period to wait on such - a critical section. - However, it is permissible to wait on it. - And it is furthermore important to wait on it, as this - lazy approach is far more scalable than a “big bang” - all-at-once grace-period start could possibly be. -
 
- -

If the CPU does a context switch, a quiescent state will be -noted by rcu_node_context_switch() on the left. -On the other hand, if the CPU takes a scheduler-clock interrupt -while executing in usermode, a quiescent state will be noted by -rcu_sched_clock_irq() on the right. -Either way, the passage through a quiescent state will be noted -in a per-CPU variable. - -

The next time an RCU_SOFTIRQ handler executes on -this CPU (for example, after the next scheduler-clock -interrupt), rcu_core() will invoke -rcu_check_quiescent_state(), which will notice the -recorded quiescent state, and invoke -rcu_report_qs_rdp(). -If rcu_report_qs_rdp() verifies that the quiescent state -really does apply to the current grace period, it invokes -rcu_report_rnp() which traverses up the rcu_node -tree as shown at the bottom of the diagram, clearing bits from -each rcu_node structure's ->qsmask field, -and propagating up the tree when the result is zero. - -

Note that traversal passes upwards out of a given rcu_node -structure only if the current CPU is reporting the last quiescent -state for the subtree headed by that rcu_node structure. -A key point is that if a CPU's traversal stops at a given rcu_node -structure, then there will be a later traversal by another CPU -(or perhaps the same one) that proceeds upwards -from that point, and the rcu_node ->lock -guarantees that the first CPU's quiescent state happens before the -remainder of the second CPU's traversal. -Applying this line of thought repeatedly shows that all CPUs' -quiescent states happen before the last CPU traverses through -the root rcu_node structure, the “last CPU” -being the one that clears the last bit in the root rcu_node -structure's ->qsmask field. - -

Dynamic Tick Interface

- -

Due to energy-efficiency considerations, RCU is forbidden from -disturbing idle CPUs. -CPUs are therefore required to notify RCU when entering or leaving idle -state, which they do via fully ordered value-returning atomic operations -on a per-CPU variable. -The ordering effects are as shown below: - -

TreeRCU-dyntick.svg - -

The RCU grace-period kernel thread samples the per-CPU idleness -variable while holding the corresponding CPU's leaf rcu_node -structure's ->lock. -This means that any RCU read-side critical sections that precede the -idle period (the oval near the top of the diagram above) will happen -before the end of the current grace period. -Similarly, the beginning of the current grace period will happen before -any RCU read-side critical sections that follow the -idle period (the oval near the bottom of the diagram above). - -

Plumbing this into the full grace-period execution is described -below. - -

CPU-Hotplug Interface

- -

RCU is also forbidden from disturbing offline CPUs, which might well -be powered off and removed from the system completely. -CPUs are therefore required to notify RCU of their comings and goings -as part of the corresponding CPU hotplug operations. -The ordering effects are shown below: - -

TreeRCU-hotplug.svg - -

Because CPU hotplug operations are much less frequent than idle transitions, -they are heavier weight, and thus acquire the CPU's leaf rcu_node -structure's ->lock and update this structure's -->qsmaskinitnext. -The RCU grace-period kernel thread samples this mask to detect CPUs -having gone offline since the beginning of this grace period. - -

Plumbing this into the full grace-period execution is described -below. - -

Forcing Quiescent States

- -

As noted above, idle and offline CPUs cannot report their own -quiescent states, and therefore the grace-period kernel thread -must do the reporting on their behalf. -This process is called “forcing quiescent states”, it is -repeated every few jiffies, and its ordering effects are shown below: - -

TreeRCU-gp-fqs.svg - -

Each pass of quiescent state forcing is guaranteed to traverse the -leaf rcu_node structures, and if there are no new quiescent -states due to recently idled and/or offlined CPUs, then only the -leaves are traversed. -However, if there is a newly offlined CPU as illustrated on the left -or a newly idled CPU as illustrated on the right, the corresponding -quiescent state will be driven up towards the root. -As with self-reported quiescent states, the upwards driving stops -once it reaches an rcu_node structure that has quiescent -states outstanding from other CPUs. - - - - - - - - -
 
Quick Quiz:
- The leftmost drive to root stopped before it reached - the root rcu_node structure, which means that - there are still CPUs subordinate to that structure on - which the current grace period is waiting. - Given that, how is it possible that the rightmost drive - to root ended the grace period? -
Answer:
- Good analysis! - It is in fact impossible in the absence of bugs in RCU. - But this diagram is complex enough as it is, so simplicity - overrode accuracy. - You can think of it as poetic license, or you can think of - it as misdirection that is resolved in the - stitched-together diagram. -
 
- -

Grace-Period Cleanup

- -

Grace-period cleanup first scans the rcu_node tree -breadth-first advancing all the ->gp_seq fields, then it -advances the rcu_state structure's ->gp_seq field. -The ordering effects are shown below: - -

TreeRCU-gp-cleanup.svg - -

As indicated by the oval at the bottom of the diagram, once -grace-period cleanup is complete, the next grace period can begin. - - - - - - - - -
 
Quick Quiz:
- But when precisely does the grace period end? -
Answer:
- There is no useful single point at which the grace period - can be said to end. - The earliest reasonable candidate is as soon as the last - CPU has reported its quiescent state, but it may be some - milliseconds before RCU becomes aware of this. - The latest reasonable candidate is once the rcu_state - structure's ->gp_seq field has been updated, - but it is quite possible that some CPUs have already completed - phase two of their updates by that time. - In short, if you are going to work with RCU, you need to - learn to embrace uncertainty. -
 
- - -

Callback Invocation

- -

Once a given CPU's leaf rcu_node structure's -->gp_seq field has been updated, that CPU can begin -invoking its RCU callbacks that were waiting for this grace period -to end. -These callbacks are identified by rcu_advance_cbs(), -which is usually invoked by __note_gp_changes(). -As shown in the diagram below, this invocation can be triggered by -the scheduling-clock interrupt (rcu_sched_clock_irq() on -the left) or by idle entry (rcu_cleanup_after_idle() on -the right, but only for kernels build with -CONFIG_RCU_FAST_NO_HZ=y). -Either way, RCU_SOFTIRQ is raised, which results in -rcu_do_batch() invoking the callbacks, which in turn -allows those callbacks to carry out (either directly or indirectly -via wakeup) the needed phase-two processing for each update. - -

TreeRCU-callback-invocation.svg - -

Please note that callback invocation can also be prompted by any -number of corner-case code paths, for example, when a CPU notes that -it has excessive numbers of callbacks queued. -In all cases, the CPU acquires its leaf rcu_node structure's -->lock before invoking callbacks, which preserves the -required ordering against the newly completed grace period. - -

However, if the callback function communicates to other CPUs, -for example, doing a wakeup, then it is that function's responsibility -to maintain ordering. -For example, if the callback function wakes up a task that runs on -some other CPU, proper ordering must in place in both the callback -function and the task being awakened. -To see why this is important, consider the top half of the -grace-period cleanup diagram. -The callback might be running on a CPU corresponding to the leftmost -leaf rcu_node structure, and awaken a task that is to run on -a CPU corresponding to the rightmost leaf rcu_node structure, -and the grace-period kernel thread might not yet have reached the -rightmost leaf. -In this case, the grace period's memory ordering might not yet have -reached that CPU, so again the callback function and the awakened -task must supply proper ordering. - -

Putting It All Together

- -

A stitched-together diagram is -here. - -

-Legal Statement

- -

This work represents the view of the author and does not necessarily -represent the view of IBM. - -

Linux is a registered trademark of Linus Torvalds. - -

Other company, product, and service names may be trademarks or -service marks of others. - - diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst new file mode 100644 index 0000000000000000000000000000000000000000..1011b5db1b3d95fc5666886765fdfd231c7fd177 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst @@ -0,0 +1,625 @@ +====================================================== +A Tour Through TREE_RCU's Grace-Period Memory Ordering +====================================================== + +August 8, 2017 + +This article was contributed by Paul E. McKenney + +Introduction +============ + +This document gives a rough visual overview of how Tree RCU's +grace-period memory ordering guarantee is provided. + +What Is Tree RCU's Grace Period Memory Ordering Guarantee? +========================================================== + +RCU grace periods provide extremely strong memory-ordering guarantees +for non-idle non-offline code. +Any code that happens after the end of a given RCU grace period is guaranteed +to see the effects of all accesses prior to the beginning of that grace +period that are within RCU read-side critical sections. +Similarly, any code that happens before the beginning of a given RCU grace +period is guaranteed to see the effects of all accesses following the end +of that grace period that are within RCU read-side critical sections. + +Note well that RCU-sched read-side critical sections include any region +of code for which preemption is disabled. +Given that each individual machine instruction can be thought of as +an extremely small region of preemption-disabled code, one can think of +``synchronize_rcu()`` as ``smp_mb()`` on steroids. + +RCU updaters use this guarantee by splitting their updates into +two phases, one of which is executed before the grace period and +the other of which is executed after the grace period. +In the most common use case, phase one removes an element from +a linked RCU-protected data structure, and phase two frees that element. +For this to work, any readers that have witnessed state prior to the +phase-one update (in the common case, removal) must not witness state +following the phase-two update (in the common case, freeing). + +The RCU implementation provides this guarantee using a network +of lock-based critical sections, memory barriers, and per-CPU +processing, as is described in the following sections. + +Tree RCU Grace Period Memory Ordering Building Blocks +===================================================== + +The workhorse for RCU's grace-period memory ordering is the +critical section for the ``rcu_node`` structure's +``->lock``. These critical sections use helper functions for lock +acquisition, including ``raw_spin_lock_rcu_node()``, +``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``. +Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``, +``raw_spin_unlock_irq_rcu_node()``, and +``raw_spin_unlock_irqrestore_rcu_node()``, respectively. +For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided. +The key point is that the lock-acquisition functions, including +``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()`` +immediately after successful acquisition of the lock. + +Therefore, for any given ``rcu_node`` structure, any access +happening before one of the above lock-release functions will be seen +by all CPUs as happening before any access happening after a later +one of the above lock-acquisition functions. +Furthermore, any access happening before one of the +above lock-release function on any given CPU will be seen by all +CPUs as happening before any access happening after a later one +of the above lock-acquisition functions executing on that same CPU, +even if the lock-release and lock-acquisition functions are operating +on different ``rcu_node`` structures. +Tree RCU uses these two ordering guarantees to form an ordering +network among all CPUs that were in any way involved in the grace +period, including any CPUs that came online or went offline during +the grace period in question. + +The following litmus test exhibits the ordering effects of these +lock-acquisition and lock-release functions:: + + 1 int x, y, z; + 2 + 3 void task0(void) + 4 { + 5 raw_spin_lock_rcu_node(rnp); + 6 WRITE_ONCE(x, 1); + 7 r1 = READ_ONCE(y); + 8 raw_spin_unlock_rcu_node(rnp); + 9 } + 10 + 11 void task1(void) + 12 { + 13 raw_spin_lock_rcu_node(rnp); + 14 WRITE_ONCE(y, 1); + 15 r2 = READ_ONCE(z); + 16 raw_spin_unlock_rcu_node(rnp); + 17 } + 18 + 19 void task2(void) + 20 { + 21 WRITE_ONCE(z, 1); + 22 smp_mb(); + 23 r3 = READ_ONCE(x); + 24 } + 25 + 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); + +The ``WARN_ON()`` is evaluated at “the end of time”, +after all changes have propagated throughout the system. +Without the ``smp_mb__after_unlock_lock()`` provided by the +acquisition functions, this ``WARN_ON()`` could trigger, for example +on PowerPC. +The ``smp_mb__after_unlock_lock()`` invocations prevent this +``WARN_ON()`` from triggering. + +This approach must be extended to include idle CPUs, which need +RCU's grace-period memory ordering guarantee to extend to any +RCU read-side critical sections preceding and following the current +idle sojourn. +This case is handled by calls to the strongly ordered +``atomic_add_return()`` read-modify-write atomic operation that +is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry +time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time. +The grace-period kthread invokes ``rcu_dynticks_snap()`` and +``rcu_dynticks_in_eqs_since()`` (both of which invoke +an ``atomic_add_return()`` of zero) to detect idle CPUs. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what about CPUs that remain offline for the entire grace period? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Such CPUs will be offline at the beginning of the grace period, so | +| the grace period won't expect quiescent states from them. Races | +| between grace-period start and CPU-hotplug operations are mediated | +| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described | +| above. | ++-----------------------------------------------------------------------+ + +The approach must be extended to handle one final case, that of waking a +task blocked in ``synchronize_rcu()``. This task might be affinitied to +a CPU that is not yet aware that the grace period has ended, and thus +might not yet be subject to the grace period's memory ordering. +Therefore, there is an ``smp_mb()`` after the return from +``wait_for_completion()`` in the ``synchronize_rcu()`` code path. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What? Where??? I don't see any ``smp_mb()`` after the return from | +| ``wait_for_completion()``!!! | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| That would be because I spotted the need for that ``smp_mb()`` during | +| the creation of this documentation, and it is therefore unlikely to | +| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter | +| Zijlstra, and Jonathan Cameron for asking questions that sensitized | +| me to the rather elaborate sequence of events that demonstrate the | +| need for this memory barrier. | ++-----------------------------------------------------------------------+ + +Tree RCU's grace--period memory-ordering guarantees rely most heavily on +the ``rcu_node`` structure's ``->lock`` field, so much so that it is +necessary to abbreviate this pattern in the diagrams in the next +section. For example, consider the ``rcu_prepare_for_idle()`` function +shown below, which is one of several functions that enforce ordering of +newly arrived RCU callbacks against future grace periods: + +:: + + 1 static void rcu_prepare_for_idle(void) + 2 { + 3 bool needwake; + 4 struct rcu_data *rdp; + 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); + 6 struct rcu_node *rnp; + 7 struct rcu_state *rsp; + 8 int tne; + 9 + 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || + 11 rcu_is_nocb_cpu(smp_processor_id())) + 12 return; + 13 tne = READ_ONCE(tick_nohz_active); + 14 if (tne != rdtp->tick_nohz_enabled_snap) { + 15 if (rcu_cpu_has_callbacks(NULL)) + 16 invoke_rcu_core(); + 17 rdtp->tick_nohz_enabled_snap = tne; + 18 return; + 19 } + 20 if (!tne) + 21 return; + 22 if (rdtp->all_lazy && + 23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { + 24 rdtp->all_lazy = false; + 25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; + 26 invoke_rcu_core(); + 27 return; + 28 } + 29 if (rdtp->last_accelerate == jiffies) + 30 return; + 31 rdtp->last_accelerate = jiffies; + 32 for_each_rcu_flavor(rsp) { + 33 rdp = this_cpu_ptr(rsp->rda); + 34 if (rcu_segcblist_pend_cbs(&rdp->cblist)) + 35 continue; + 36 rnp = rdp->mynode; + 37 raw_spin_lock_rcu_node(rnp); + 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); + 39 raw_spin_unlock_rcu_node(rnp); + 40 if (needwake) + 41 rcu_gp_kthread_wake(rsp); + 42 } + 43 } + +But the only part of ``rcu_prepare_for_idle()`` that really matters for +this discussion are lines 37–39. We will therefore abbreviate this +function as follows: + +.. kernel-figure:: rcu_node-lock.svg + +The box represents the ``rcu_node`` structure's ``->lock`` critical +section, with the double line on top representing the additional +``smp_mb__after_unlock_lock()``. + +Tree RCU Grace Period Memory Ordering Components +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Tree RCU's grace-period memory-ordering guarantee is provided by a +number of RCU components: + +#. `Callback Registry <#Callback%20Registry>`__ +#. `Grace-Period Initialization <#Grace-Period%20Initialization>`__ +#. `Self-Reported Quiescent + States <#Self-Reported%20Quiescent%20States>`__ +#. `Dynamic Tick Interface <#Dynamic%20Tick%20Interface>`__ +#. `CPU-Hotplug Interface <#CPU-Hotplug%20Interface>`__ +#. `Forcing Quiescent States `__ +#. `Grace-Period Cleanup `__ +#. `Callback Invocation `__ + +Each of the following section looks at the corresponding component in +detail. + +Callback Registry +^^^^^^^^^^^^^^^^^ + +If RCU's grace-period guarantee is to mean anything at all, any access +that happens before a given invocation of ``call_rcu()`` must also +happen before the corresponding grace period. The implementation of this +portion of RCU's grace period guarantee is shown in the following +figure: + +.. kernel-figure:: TreeRCU-callback-registry.svg + +Because ``call_rcu()`` normally acts only on CPU-local state, it +provides no ordering guarantees, either for itself or for phase one of +the update (which again will usually be removal of an element from an +RCU-protected data structure). It simply enqueues the ``rcu_head`` +structure on a per-CPU list, which cannot become associated with a grace +period until a later call to ``rcu_accelerate_cbs()``, as shown in the +diagram above. + +One set of code paths shown on the left invokes ``rcu_accelerate_cbs()`` +via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the +current CPU is inundated with queued ``rcu_head`` structures) or more +likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle +is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which +invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The +final code path on the right is taken only in kernels built with +``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via +``rcu_advance_cbs()``, ``rcu_migrate_callbacks``, +``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn +is invoked on a surviving CPU after the outgoing CPU has been completely +offlined. + +There are a few other code paths within grace-period processing that +opportunistically invoke ``rcu_accelerate_cbs()``. However, either way, +all of the CPU's recently queued ``rcu_head`` structures are associated +with a future grace-period number under the protection of the CPU's lead +``rcu_node`` structure's ``->lock``. In all cases, there is full +ordering against any prior critical section for that same ``rcu_node`` +structure's ``->lock``, and also full ordering against any of the +current task's or CPU's prior critical sections for any ``rcu_node`` +structure's ``->lock``. + +The next section will show how this ordering ensures that any accesses +prior to the ``call_rcu()`` (particularly including phase one of the +update) happen before the start of the corresponding grace period. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what about ``synchronize_rcu()``? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, | +| which invokes it. So either way, it eventually comes down to | +| ``call_rcu()``. | ++-----------------------------------------------------------------------+ + +Grace-Period Initialization +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Grace-period initialization is carried out by the grace-period kernel +thread, which makes several passes over the ``rcu_node`` tree within the +``rcu_gp_init()`` function. This means that showing the full flow of +ordering through the grace-period computation will require duplicating +this tree. If you find this confusing, please note that the state of the +``rcu_node`` changes over time, just like Heraclitus's river. However, +to keep the ``rcu_node`` river tractable, the grace-period kernel +thread's traversals are presented in multiple parts, starting in this +section with the various phases of grace-period initialization. + +The first ordering-related grace-period initialization action is to +advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number +counter, as shown below: + +.. kernel-figure:: TreeRCU-gp-init-1.svg + +The actual increment is carried out using ``smp_store_release()``, which +helps reject false-positive RCU CPU stall detection. Note that only the +root ``rcu_node`` structure is touched. + +The first pass through the ``rcu_node`` tree updates bitmasks based on +CPUs having come online or gone offline since the start of the previous +grace period. In the common case where the number of online CPUs for +this ``rcu_node`` structure has not transitioned to or from zero, this +pass will scan only the leaf ``rcu_node`` structures. However, if the +number of online CPUs for a given leaf ``rcu_node`` structure has +transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the +first incoming CPU. Similarly, if the number of online CPUs for a given +leaf ``rcu_node`` structure has transitioned to zero, +``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU. +The diagram below shows the path of ordering if the leftmost +``rcu_node`` structure onlines its first CPU and if the next +``rcu_node`` structure has no online CPUs (or, alternatively if the +leftmost ``rcu_node`` structure offlines its last CPU and if the next +``rcu_node`` structure has no online CPUs). + +.. kernel-figure:: TreeRCU-gp-init-1.svg + +The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses +breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field +to the newly advanced value from the ``rcu_state`` structure, as shown +in the following diagram. + +.. kernel-figure:: TreeRCU-gp-init-1.svg + +This change will also cause each CPU's next call to +``__note_gp_changes()`` to notice that a new grace period has started, +as described in the next section. But because the grace-period kthread +started the grace period at the root (with the advancing of the +``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf +``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of +the start of the grace period will happen after the actual start of the +grace period. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what about the CPU that started the grace period? Why wouldn't it | +| see the start of the grace period right when it started that grace | +| period? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In some deep philosophical and overly anthromorphized sense, yes, the | +| CPU starting the grace period is immediately aware of having done so. | +| However, if we instead assume that RCU is not self-aware, then even | +| the CPU starting the grace period does not really become aware of the | +| start of this grace period until its first call to | +| ``__note_gp_changes()``. On the other hand, this CPU potentially gets | +| early notification because it invokes ``__note_gp_changes()`` during | +| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node`` | +| structure. | ++-----------------------------------------------------------------------+ + +Self-Reported Quiescent States +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +When all entities that might block the grace period have reported +quiescent states (or as described in a later section, had quiescent +states reported on their behalf), the grace period can end. Online +non-idle CPUs report their own quiescent states, as shown in the +following diagram: + +.. kernel-figure:: TreeRCU-qs.svg + +This is for the last CPU to report a quiescent state, which signals the +end of the grace period. Earlier quiescent states would push up the +``rcu_node`` tree only until they encountered an ``rcu_node`` structure +that is waiting for additional quiescent states. However, ordering is +nevertheless preserved because some later quiescent state will acquire +that ``rcu_node`` structure's ``->lock``. + +Any number of events can lead up to a CPU invoking ``note_gp_changes`` +(or alternatively, directly invoking ``__note_gp_changes()``), at which +point that CPU will notice the start of a new grace period while holding +its leaf ``rcu_node`` lock. Therefore, all execution shown in this +diagram happens after the start of the grace period. In addition, this +CPU will consider any RCU read-side critical section that started before +the invocation of ``__note_gp_changes()`` to have started before the +grace period, and thus a critical section that the grace period must +wait on. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But a RCU read-side critical section might have started after the | +| beginning of the grace period (the advancing of ``->gp_seq`` from | +| earlier), so why should the grace period wait on such a critical | +| section? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| It is indeed not necessary for the grace period to wait on such a | +| critical section. However, it is permissible to wait on it. And it is | +| furthermore important to wait on it, as this lazy approach is far | +| more scalable than a “big bang” all-at-once grace-period start could | +| possibly be. | ++-----------------------------------------------------------------------+ + +If the CPU does a context switch, a quiescent state will be noted by +``rcu_node_context_switch()`` on the left. On the other hand, if the CPU +takes a scheduler-clock interrupt while executing in usermode, a +quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right. +Either way, the passage through a quiescent state will be noted in a +per-CPU variable. + +The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for +example, after the next scheduler-clock interrupt), ``rcu_core()`` will +invoke ``rcu_check_quiescent_state()``, which will notice the recorded +quiescent state, and invoke ``rcu_report_qs_rdp()``. If +``rcu_report_qs_rdp()`` verifies that the quiescent state really does +apply to the current grace period, it invokes ``rcu_report_rnp()`` which +traverses up the ``rcu_node`` tree as shown at the bottom of the +diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask`` +field, and propagating up the tree when the result is zero. + +Note that traversal passes upwards out of a given ``rcu_node`` structure +only if the current CPU is reporting the last quiescent state for the +subtree headed by that ``rcu_node`` structure. A key point is that if a +CPU's traversal stops at a given ``rcu_node`` structure, then there will +be a later traversal by another CPU (or perhaps the same one) that +proceeds upwards from that point, and the ``rcu_node`` ``->lock`` +guarantees that the first CPU's quiescent state happens before the +remainder of the second CPU's traversal. Applying this line of thought +repeatedly shows that all CPUs' quiescent states happen before the last +CPU traverses through the root ``rcu_node`` structure, the “last CPU” +being the one that clears the last bit in the root ``rcu_node`` +structure's ``->qsmask`` field. + +Dynamic Tick Interface +^^^^^^^^^^^^^^^^^^^^^^ + +Due to energy-efficiency considerations, RCU is forbidden from +disturbing idle CPUs. CPUs are therefore required to notify RCU when +entering or leaving idle state, which they do via fully ordered +value-returning atomic operations on a per-CPU variable. The ordering +effects are as shown below: + +.. kernel-figure:: TreeRCU-dyntick.svg + +The RCU grace-period kernel thread samples the per-CPU idleness variable +while holding the corresponding CPU's leaf ``rcu_node`` structure's +``->lock``. This means that any RCU read-side critical sections that +precede the idle period (the oval near the top of the diagram above) +will happen before the end of the current grace period. Similarly, the +beginning of the current grace period will happen before any RCU +read-side critical sections that follow the idle period (the oval near +the bottom of the diagram above). + +Plumbing this into the full grace-period execution is described +`below <#Forcing%20Quiescent%20States>`__. + +CPU-Hotplug Interface +^^^^^^^^^^^^^^^^^^^^^ + +RCU is also forbidden from disturbing offline CPUs, which might well be +powered off and removed from the system completely. CPUs are therefore +required to notify RCU of their comings and goings as part of the +corresponding CPU hotplug operations. The ordering effects are shown +below: + +.. kernel-figure:: TreeRCU-hotplug.svg + +Because CPU hotplug operations are much less frequent than idle +transitions, they are heavier weight, and thus acquire the CPU's leaf +``rcu_node`` structure's ``->lock`` and update this structure's +``->qsmaskinitnext``. The RCU grace-period kernel thread samples this +mask to detect CPUs having gone offline since the beginning of this +grace period. + +Plumbing this into the full grace-period execution is described +`below <#Forcing%20Quiescent%20States>`__. + +Forcing Quiescent States +^^^^^^^^^^^^^^^^^^^^^^^^ + +As noted above, idle and offline CPUs cannot report their own quiescent +states, and therefore the grace-period kernel thread must do the +reporting on their behalf. This process is called “forcing quiescent +states”, it is repeated every few jiffies, and its ordering effects are +shown below: + +.. kernel-figure:: TreeRCU-gp-fqs.svg + +Each pass of quiescent state forcing is guaranteed to traverse the leaf +``rcu_node`` structures, and if there are no new quiescent states due to +recently idled and/or offlined CPUs, then only the leaves are traversed. +However, if there is a newly offlined CPU as illustrated on the left or +a newly idled CPU as illustrated on the right, the corresponding +quiescent state will be driven up towards the root. As with +self-reported quiescent states, the upwards driving stops once it +reaches an ``rcu_node`` structure that has quiescent states outstanding +from other CPUs. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| The leftmost drive to root stopped before it reached the root | +| ``rcu_node`` structure, which means that there are still CPUs | +| subordinate to that structure on which the current grace period is | +| waiting. Given that, how is it possible that the rightmost drive to | +| root ended the grace period? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Good analysis! It is in fact impossible in the absence of bugs in | +| RCU. But this diagram is complex enough as it is, so simplicity | +| overrode accuracy. You can think of it as poetic license, or you can | +| think of it as misdirection that is resolved in the | +| `stitched-together diagram <#Putting%20It%20All%20Together>`__. | ++-----------------------------------------------------------------------+ + +Grace-Period Cleanup +^^^^^^^^^^^^^^^^^^^^ + +Grace-period cleanup first scans the ``rcu_node`` tree breadth-first +advancing all the ``->gp_seq`` fields, then it advances the +``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are +shown below: + +.. kernel-figure:: TreeRCU-gp-cleanup.svg + +As indicated by the oval at the bottom of the diagram, once grace-period +cleanup is complete, the next grace period can begin. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But when precisely does the grace period end? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| There is no useful single point at which the grace period can be said | +| to end. The earliest reasonable candidate is as soon as the last CPU | +| has reported its quiescent state, but it may be some milliseconds | +| before RCU becomes aware of this. The latest reasonable candidate is | +| once the ``rcu_state`` structure's ``->gp_seq`` field has been | +| updated, but it is quite possible that some CPUs have already | +| completed phase two of their updates by that time. In short, if you | +| are going to work with RCU, you need to learn to embrace uncertainty. | ++-----------------------------------------------------------------------+ + +Callback Invocation +^^^^^^^^^^^^^^^^^^^ + +Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has +been updated, that CPU can begin invoking its RCU callbacks that were +waiting for this grace period to end. These callbacks are identified by +``rcu_advance_cbs()``, which is usually invoked by +``__note_gp_changes()``. As shown in the diagram below, this invocation +can be triggered by the scheduling-clock interrupt +(``rcu_sched_clock_irq()`` on the left) or by idle entry +(``rcu_cleanup_after_idle()`` on the right, but only for kernels build +with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is +raised, which results in ``rcu_do_batch()`` invoking the callbacks, +which in turn allows those callbacks to carry out (either directly or +indirectly via wakeup) the needed phase-two processing for each update. + +.. kernel-figure:: TreeRCU-callback-invocation.svg + +Please note that callback invocation can also be prompted by any number +of corner-case code paths, for example, when a CPU notes that it has +excessive numbers of callbacks queued. In all cases, the CPU acquires +its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks, +which preserves the required ordering against the newly completed grace +period. + +However, if the callback function communicates to other CPUs, for +example, doing a wakeup, then it is that function's responsibility to +maintain ordering. For example, if the callback function wakes up a task +that runs on some other CPU, proper ordering must in place in both the +callback function and the task being awakened. To see why this is +important, consider the top half of the `grace-period +cleanup <#Grace-Period%20Cleanup>`__ diagram. The callback might be +running on a CPU corresponding to the leftmost leaf ``rcu_node`` +structure, and awaken a task that is to run on a CPU corresponding to +the rightmost leaf ``rcu_node`` structure, and the grace-period kernel +thread might not yet have reached the rightmost leaf. In this case, the +grace period's memory ordering might not yet have reached that CPU, so +again the callback function and the awakened task must supply proper +ordering. + +Putting It All Together +~~~~~~~~~~~~~~~~~~~~~~~ + +A stitched-together diagram is here: + +.. kernel-figure:: TreeRCU-gp.svg + +Legal Statement +~~~~~~~~~~~~~~~ + +This work represents the view of the author and does not necessarily +represent the view of IBM. + +Linux is a registered trademark of Linus Torvalds. + +Other company, product, and service names may be trademarks or service +marks of others. diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html deleted file mode 100644 index 5a9238a2883c6a689fb9c271afd815189369d74e..0000000000000000000000000000000000000000 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ /dev/null @@ -1,3330 +0,0 @@ - - - A Tour Through RCU's Requirements [LWN.net] - - -

A Tour Through RCU's Requirements

- -

Copyright IBM Corporation, 2015

-

Author: Paul E. McKenney

-

The initial version of this document appeared in the -LWN articles -here, -here, and -here.

- -

Introduction

- -

-Read-copy update (RCU) is a synchronization mechanism that is often -used as a replacement for reader-writer locking. -RCU is unusual in that updaters do not block readers, -which means that RCU's read-side primitives can be exceedingly fast -and scalable. -In addition, updaters can make useful forward progress concurrently -with readers. -However, all this concurrency between RCU readers and updaters does raise -the question of exactly what RCU readers are doing, which in turn -raises the question of exactly what RCU's requirements are. - -

-This document therefore summarizes RCU's requirements, and can be thought -of as an informal, high-level specification for RCU. -It is important to understand that RCU's specification is primarily -empirical in nature; -in fact, I learned about many of these requirements the hard way. -This situation might cause some consternation, however, not only -has this learning process been a lot of fun, but it has also been -a great privilege to work with so many people willing to apply -technologies in interesting new ways. - -

-All that aside, here are the categories of currently known RCU requirements: -

- -
    -
  1. - Fundamental Requirements -
  2. Fundamental Non-Requirements -
  3. - Parallelism Facts of Life -
  4. - Quality-of-Implementation Requirements -
  5. - Linux Kernel Complications -
  6. - Software-Engineering Requirements -
  7. - Other RCU Flavors -
  8. - Possible Future Changes -
- -

-This is followed by a summary, -however, the answers to each quick quiz immediately follows the quiz. -Select the big white space with your mouse to see the answer. - -

Fundamental Requirements

- -

-RCU's fundamental requirements are the closest thing RCU has to hard -mathematical requirements. -These are: - -

    -
  1. - Grace-Period Guarantee -
  2. - Publish-Subscribe Guarantee -
  3. - Memory-Barrier Guarantees -
  4. - RCU Primitives Guaranteed to Execute Unconditionally -
  5. - Guaranteed Read-to-Write Upgrade -
- -

Grace-Period Guarantee

- -

-RCU's grace-period guarantee is unusual in being premeditated: -Jack Slingwine and I had this guarantee firmly in mind when we started -work on RCU (then called “rclock”) in the early 1990s. -That said, the past two decades of experience with RCU have produced -a much more detailed understanding of this guarantee. - -

-RCU's grace-period guarantee allows updaters to wait for the completion -of all pre-existing RCU read-side critical sections. -An RCU read-side critical section -begins with the marker rcu_read_lock() and ends with -the marker rcu_read_unlock(). -These markers may be nested, and RCU treats a nested set as one -big RCU read-side critical section. -Production-quality implementations of rcu_read_lock() and -rcu_read_unlock() are extremely lightweight, and in -fact have exactly zero overhead in Linux kernels built for production -use with CONFIG_PREEMPT=n. - -

-This guarantee allows ordering to be enforced with extremely low -overhead to readers, for example: - -

-
- 1 int x, y;
- 2
- 3 void thread0(void)
- 4 {
- 5   rcu_read_lock();
- 6   r1 = READ_ONCE(x);
- 7   r2 = READ_ONCE(y);
- 8   rcu_read_unlock();
- 9 }
-10
-11 void thread1(void)
-12 {
-13   WRITE_ONCE(x, 1);
-14   synchronize_rcu();
-15   WRITE_ONCE(y, 1);
-16 }
-
-
- -

-Because the synchronize_rcu() on line 14 waits for -all pre-existing readers, any instance of thread0() that -loads a value of zero from x must complete before -thread1() stores to y, so that instance must -also load a value of zero from y. -Similarly, any instance of thread0() that loads a value of -one from y must have started after the -synchronize_rcu() started, and must therefore also load -a value of one from x. -Therefore, the outcome: -

-
-(r1 == 0 && r2 == 1)
-
-
-cannot happen. - - - - - - - - -
 
Quick Quiz:
- Wait a minute! - You said that updaters can make useful forward progress concurrently - with readers, but pre-existing readers will block - synchronize_rcu()!!! - Just who are you trying to fool??? -
Answer:
- First, if updaters do not wish to be blocked by readers, they can use - call_rcu() or kfree_rcu(), which will - be discussed later. - Second, even when using synchronize_rcu(), the other - update-side code does run concurrently with readers, whether - pre-existing or not. -
 
- -

-This scenario resembles one of the first uses of RCU in -DYNIX/ptx, -which managed a distributed lock manager's transition into -a state suitable for handling recovery from node failure, -more or less as follows: - -

-
- 1 #define STATE_NORMAL        0
- 2 #define STATE_WANT_RECOVERY 1
- 3 #define STATE_RECOVERING    2
- 4 #define STATE_WANT_NORMAL   3
- 5
- 6 int state = STATE_NORMAL;
- 7
- 8 void do_something_dlm(void)
- 9 {
-10   int state_snap;
-11
-12   rcu_read_lock();
-13   state_snap = READ_ONCE(state);
-14   if (state_snap == STATE_NORMAL)
-15     do_something();
-16   else
-17     do_something_carefully();
-18   rcu_read_unlock();
-19 }
-20
-21 void start_recovery(void)
-22 {
-23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
-24   synchronize_rcu();
-25   WRITE_ONCE(state, STATE_RECOVERING);
-26   recovery();
-27   WRITE_ONCE(state, STATE_WANT_NORMAL);
-28   synchronize_rcu();
-29   WRITE_ONCE(state, STATE_NORMAL);
-30 }
-
-
- -

-The RCU read-side critical section in do_something_dlm() -works with the synchronize_rcu() in start_recovery() -to guarantee that do_something() never runs concurrently -with recovery(), but with little or no synchronization -overhead in do_something_dlm(). - - - - - - - - -
 
Quick Quiz:
- Why is the synchronize_rcu() on line 28 needed? -
Answer:
- Without that extra grace period, memory reordering could result in - do_something_dlm() executing do_something() - concurrently with the last bits of recovery(). -
 
- -

-In order to avoid fatal problems such as deadlocks, -an RCU read-side critical section must not contain calls to -synchronize_rcu(). -Similarly, an RCU read-side critical section must not -contain anything that waits, directly or indirectly, on completion of -an invocation of synchronize_rcu(). - -

-Although RCU's grace-period guarantee is useful in and of itself, with -quite a few use cases, -it would be good to be able to use RCU to coordinate read-side -access to linked data structures. -For this, the grace-period guarantee is not sufficient, as can -be seen in function add_gp_buggy() below. -We will look at the reader's code later, but in the meantime, just think of -the reader as locklessly picking up the gp pointer, -and, if the value loaded is non-NULL, locklessly accessing the -->a and ->b fields. - -

-
- 1 bool add_gp_buggy(int a, int b)
- 2 {
- 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
- 4   if (!p)
- 5     return -ENOMEM;
- 6   spin_lock(&gp_lock);
- 7   if (rcu_access_pointer(gp)) {
- 8     spin_unlock(&gp_lock);
- 9     return false;
-10   }
-11   p->a = a;
-12   p->b = a;
-13   gp = p; /* ORDERING BUG */
-14   spin_unlock(&gp_lock);
-15   return true;
-16 }
-
-
- -

-The problem is that both the compiler and weakly ordered CPUs are within -their rights to reorder this code as follows: - -

-
- 1 bool add_gp_buggy_optimized(int a, int b)
- 2 {
- 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
- 4   if (!p)
- 5     return -ENOMEM;
- 6   spin_lock(&gp_lock);
- 7   if (rcu_access_pointer(gp)) {
- 8     spin_unlock(&gp_lock);
- 9     return false;
-10   }
-11   gp = p; /* ORDERING BUG */
-12   p->a = a;
-13   p->b = a;
-14   spin_unlock(&gp_lock);
-15   return true;
-16 }
-
-
- -

-If an RCU reader fetches gp just after -add_gp_buggy_optimized executes line 11, -it will see garbage in the ->a and ->b -fields. -And this is but one of many ways in which compiler and hardware optimizations -could cause trouble. -Therefore, we clearly need some way to prevent the compiler and the CPU from -reordering in this manner, which brings us to the publish-subscribe -guarantee discussed in the next section. - -

Publish/Subscribe Guarantee

- -

-RCU's publish-subscribe guarantee allows data to be inserted -into a linked data structure without disrupting RCU readers. -The updater uses rcu_assign_pointer() to insert the -new data, and readers use rcu_dereference() to -access data, whether new or old. -The following shows an example of insertion: - -

-
- 1 bool add_gp(int a, int b)
- 2 {
- 3   p = kmalloc(sizeof(*p), GFP_KERNEL);
- 4   if (!p)
- 5     return -ENOMEM;
- 6   spin_lock(&gp_lock);
- 7   if (rcu_access_pointer(gp)) {
- 8     spin_unlock(&gp_lock);
- 9     return false;
-10   }
-11   p->a = a;
-12   p->b = a;
-13   rcu_assign_pointer(gp, p);
-14   spin_unlock(&gp_lock);
-15   return true;
-16 }
-
-
- -

-The rcu_assign_pointer() on line 13 is conceptually -equivalent to a simple assignment statement, but also guarantees -that its assignment will -happen after the two assignments in lines 11 and 12, -similar to the C11 memory_order_release store operation. -It also prevents any number of “interesting” compiler -optimizations, for example, the use of gp as a scratch -location immediately preceding the assignment. - - - - - - - - -
 
Quick Quiz:
- But rcu_assign_pointer() does nothing to prevent the - two assignments to p->a and p->b - from being reordered. - Can't that also cause problems? -
Answer:
- No, it cannot. - The readers cannot see either of these two fields until - the assignment to gp, by which time both fields are - fully initialized. - So reordering the assignments - to p->a and p->b cannot possibly - cause any problems. -
 
- -

-It is tempting to assume that the reader need not do anything special -to control its accesses to the RCU-protected data, -as shown in do_something_gp_buggy() below: - -

-
- 1 bool do_something_gp_buggy(void)
- 2 {
- 3   rcu_read_lock();
- 4   p = gp;  /* OPTIMIZATIONS GALORE!!! */
- 5   if (p) {
- 6     do_something(p->a, p->b);
- 7     rcu_read_unlock();
- 8     return true;
- 9   }
-10   rcu_read_unlock();
-11   return false;
-12 }
-
-
- -

-However, this temptation must be resisted because there are a -surprisingly large number of ways that the compiler -(to say nothing of -DEC Alpha CPUs) -can trip this code up. -For but one example, if the compiler were short of registers, it -might choose to refetch from gp rather than keeping -a separate copy in p as follows: - -

-
- 1 bool do_something_gp_buggy_optimized(void)
- 2 {
- 3   rcu_read_lock();
- 4   if (gp) { /* OPTIMIZATIONS GALORE!!! */
- 5     do_something(gp->a, gp->b);
- 6     rcu_read_unlock();
- 7     return true;
- 8   }
- 9   rcu_read_unlock();
-10   return false;
-11 }
-
-
- -

-If this function ran concurrently with a series of updates that -replaced the current structure with a new one, -the fetches of gp->a -and gp->b might well come from two different structures, -which could cause serious confusion. -To prevent this (and much else besides), do_something_gp() uses -rcu_dereference() to fetch from gp: - -

-
- 1 bool do_something_gp(void)
- 2 {
- 3   rcu_read_lock();
- 4   p = rcu_dereference(gp);
- 5   if (p) {
- 6     do_something(p->a, p->b);
- 7     rcu_read_unlock();
- 8     return true;
- 9   }
-10   rcu_read_unlock();
-11   return false;
-12 }
-
-
- -

-The rcu_dereference() uses volatile casts and (for DEC Alpha) -memory barriers in the Linux kernel. -Should a -high-quality implementation of C11 memory_order_consume [PDF] -ever appear, then rcu_dereference() could be implemented -as a memory_order_consume load. -Regardless of the exact implementation, a pointer fetched by -rcu_dereference() may not be used outside of the -outermost RCU read-side critical section containing that -rcu_dereference(), unless protection of -the corresponding data element has been passed from RCU to some -other synchronization mechanism, most commonly locking or -reference counting. - -

-In short, updaters use rcu_assign_pointer() and readers -use rcu_dereference(), and these two RCU API elements -work together to ensure that readers have a consistent view of -newly added data elements. - -

-Of course, it is also necessary to remove elements from RCU-protected -data structures, for example, using the following process: - -

    -
  1. Remove the data element from the enclosing structure. -
  2. Wait for all pre-existing RCU read-side critical sections - to complete (because only pre-existing readers can possibly have - a reference to the newly removed data element). -
  3. At this point, only the updater has a reference to the - newly removed data element, so it can safely reclaim - the data element, for example, by passing it to kfree(). -
- -This process is implemented by remove_gp_synchronous(): - -
-
- 1 bool remove_gp_synchronous(void)
- 2 {
- 3   struct foo *p;
- 4
- 5   spin_lock(&gp_lock);
- 6   p = rcu_access_pointer(gp);
- 7   if (!p) {
- 8     spin_unlock(&gp_lock);
- 9     return false;
-10   }
-11   rcu_assign_pointer(gp, NULL);
-12   spin_unlock(&gp_lock);
-13   synchronize_rcu();
-14   kfree(p);
-15   return true;
-16 }
-
-
- -

-This function is straightforward, with line 13 waiting for a grace -period before line 14 frees the old data element. -This waiting ensures that readers will reach line 7 of -do_something_gp() before the data element referenced by -p is freed. -The rcu_access_pointer() on line 6 is similar to -rcu_dereference(), except that: - -

    -
  1. The value returned by rcu_access_pointer() - cannot be dereferenced. - If you want to access the value pointed to as well as - the pointer itself, use rcu_dereference() - instead of rcu_access_pointer(). -
  2. The call to rcu_access_pointer() need not be - protected. - In contrast, rcu_dereference() must either be - within an RCU read-side critical section or in a code - segment where the pointer cannot change, for example, in - code protected by the corresponding update-side lock. -
- - - - - - - - -
 
Quick Quiz:
- Without the rcu_dereference() or the - rcu_access_pointer(), what destructive optimizations - might the compiler make use of? -
Answer:
- Let's start with what happens to do_something_gp() - if it fails to use rcu_dereference(). - It could reuse a value formerly fetched from this same pointer. - It could also fetch the pointer from gp in a byte-at-a-time - manner, resulting in load tearing, in turn resulting a bytewise - mash-up of two distinct pointer values. - It might even use value-speculation optimizations, where it makes - a wrong guess, but by the time it gets around to checking the - value, an update has changed the pointer to match the wrong guess. - Too bad about any dereferences that returned pre-initialization garbage - in the meantime! - - -

- For remove_gp_synchronous(), as long as all modifications - to gp are carried out while holding gp_lock, - the above optimizations are harmless. - However, sparse will complain if you - define gp with __rcu and then - access it without using - either rcu_access_pointer() or rcu_dereference(). -

 
- -

-In short, RCU's publish-subscribe guarantee is provided by the combination -of rcu_assign_pointer() and rcu_dereference(). -This guarantee allows data elements to be safely added to RCU-protected -linked data structures without disrupting RCU readers. -This guarantee can be used in combination with the grace-period -guarantee to also allow data elements to be removed from RCU-protected -linked data structures, again without disrupting RCU readers. - -

-This guarantee was only partially premeditated. -DYNIX/ptx used an explicit memory barrier for publication, but had nothing -resembling rcu_dereference() for subscription, nor did it -have anything resembling the smp_read_barrier_depends() -that was later subsumed into rcu_dereference() and later -still into READ_ONCE(). -The need for these operations made itself known quite suddenly at a -late-1990s meeting with the DEC Alpha architects, back in the days when -DEC was still a free-standing company. -It took the Alpha architects a good hour to convince me that any sort -of barrier would ever be needed, and it then took me a good two hours -to convince them that their documentation did not make this point clear. -More recent work with the C and C++ standards committees have provided -much education on tricks and traps from the compiler. -In short, compilers were much less tricky in the early 1990s, but in -2015, don't even think about omitting rcu_dereference()! - -

Memory-Barrier Guarantees

- -

-The previous section's simple linked-data-structure scenario clearly -demonstrates the need for RCU's stringent memory-ordering guarantees on -systems with more than one CPU: - -

    -
  1. Each CPU that has an RCU read-side critical section that - begins before synchronize_rcu() starts is - guaranteed to execute a full memory barrier between the time - that the RCU read-side critical section ends and the time that - synchronize_rcu() returns. - Without this guarantee, a pre-existing RCU read-side critical section - might hold a reference to the newly removed struct foo - after the kfree() on line 14 of - remove_gp_synchronous(). -
  2. Each CPU that has an RCU read-side critical section that ends - after synchronize_rcu() returns is guaranteed - to execute a full memory barrier between the time that - synchronize_rcu() begins and the time that the RCU - read-side critical section begins. - Without this guarantee, a later RCU read-side critical section - running after the kfree() on line 14 of - remove_gp_synchronous() might - later run do_something_gp() and find the - newly deleted struct foo. -
  3. If the task invoking synchronize_rcu() remains - on a given CPU, then that CPU is guaranteed to execute a full - memory barrier sometime during the execution of - synchronize_rcu(). - This guarantee ensures that the kfree() on - line 14 of remove_gp_synchronous() really does - execute after the removal on line 11. -
  4. If the task invoking synchronize_rcu() migrates - among a group of CPUs during that invocation, then each of the - CPUs in that group is guaranteed to execute a full memory barrier - sometime during the execution of synchronize_rcu(). - This guarantee also ensures that the kfree() on - line 14 of remove_gp_synchronous() really does - execute after the removal on - line 11, but also in the case where the thread executing the - synchronize_rcu() migrates in the meantime. -
- - - - - - - - -
 
Quick Quiz:
- Given that multiple CPUs can start RCU read-side critical sections - at any time without any ordering whatsoever, how can RCU possibly - tell whether or not a given RCU read-side critical section starts - before a given instance of synchronize_rcu()? -
Answer:
- If RCU cannot tell whether or not a given - RCU read-side critical section starts before a - given instance of synchronize_rcu(), - then it must assume that the RCU read-side critical section - started first. - In other words, a given instance of synchronize_rcu() - can avoid waiting on a given RCU read-side critical section only - if it can prove that synchronize_rcu() started first. - - -

- A related question is “When rcu_read_lock() - doesn't generate any code, why does it matter how it relates - to a grace period?” - The answer is that it is not the relationship of - rcu_read_lock() itself that is important, but rather - the relationship of the code within the enclosed RCU read-side - critical section to the code preceding and following the - grace period. - If we take this viewpoint, then a given RCU read-side critical - section begins before a given grace period when some access - preceding the grace period observes the effect of some access - within the critical section, in which case none of the accesses - within the critical section may observe the effects of any - access following the grace period. - - -

- As of late 2016, mathematical models of RCU take this - viewpoint, for example, see slides 62 and 63 - of the - 2016 LinuxCon EU - presentation. -

 
- - - - - - - - -
 
Quick Quiz:
- The first and second guarantees require unbelievably strict ordering! - Are all these memory barriers really required? -
Answer:
- Yes, they really are required. - To see why the first guarantee is required, consider the following - sequence of events: - - -
    -
  1. - CPU 1: rcu_read_lock() - -
  2. - CPU 1: q = rcu_dereference(gp); - /* Very likely to return p. */ - -
  3. - CPU 0: list_del_rcu(p); - -
  4. - CPU 0: synchronize_rcu() starts. - -
  5. - CPU 1: do_something_with(q->a); - /* No smp_mb(), so might happen after kfree(). */ - -
  6. - CPU 1: rcu_read_unlock() - -
  7. - CPU 0: synchronize_rcu() returns. - -
  8. - CPU 0: kfree(p); - -
- -

- Therefore, there absolutely must be a full memory barrier between the - end of the RCU read-side critical section and the end of the - grace period. - - -

- The sequence of events demonstrating the necessity of the second rule - is roughly similar: - - -

    -
  1. CPU 0: list_del_rcu(p); - -
  2. CPU 0: synchronize_rcu() starts. - -
  3. CPU 1: rcu_read_lock() - -
  4. CPU 1: q = rcu_dereference(gp); - /* Might return p if no memory barrier. */ - -
  5. CPU 0: synchronize_rcu() returns. - -
  6. CPU 0: kfree(p); - -
  7. - CPU 1: do_something_with(q->a); /* Boom!!! */ - -
  8. CPU 1: rcu_read_unlock() - -
- -

- And similarly, without a memory barrier between the beginning of the - grace period and the beginning of the RCU read-side critical section, - CPU 1 might end up accessing the freelist. - - -

- The “as if” rule of course applies, so that any - implementation that acts as if the appropriate memory barriers - were in place is a correct implementation. - That said, it is much easier to fool yourself into believing - that you have adhered to the as-if rule than it is to actually - adhere to it! -

 
- - - - - - - - -
 
Quick Quiz:
- You claim that rcu_read_lock() and rcu_read_unlock() - generate absolutely no code in some kernel builds. - This means that the compiler might arbitrarily rearrange consecutive - RCU read-side critical sections. - Given such rearrangement, if a given RCU read-side critical section - is done, how can you be sure that all prior RCU read-side critical - sections are done? - Won't the compiler rearrangements make that impossible to determine? -
Answer:
- In cases where rcu_read_lock() and rcu_read_unlock() - generate absolutely no code, RCU infers quiescent states only at - special locations, for example, within the scheduler. - Because calls to schedule() had better prevent calling-code - accesses to shared variables from being rearranged across the call to - schedule(), if RCU detects the end of a given RCU read-side - critical section, it will necessarily detect the end of all prior - RCU read-side critical sections, no matter how aggressively the - compiler scrambles the code. - - -

- Again, this all assumes that the compiler cannot scramble code across - calls to the scheduler, out of interrupt handlers, into the idle loop, - into user-mode code, and so on. - But if your kernel build allows that sort of scrambling, you have broken - far more than just RCU! -

 
- -

-Note that these memory-barrier requirements do not replace the fundamental -RCU requirement that a grace period wait for all pre-existing readers. -On the contrary, the memory barriers called out in this section must operate in -such a way as to enforce this fundamental requirement. -Of course, different implementations enforce this requirement in different -ways, but enforce it they must. - -

RCU Primitives Guaranteed to Execute Unconditionally

- -

-The common-case RCU primitives are unconditional. -They are invoked, they do their job, and they return, with no possibility -of error, and no need to retry. -This is a key RCU design philosophy. - -

-However, this philosophy is pragmatic rather than pigheaded. -If someone comes up with a good justification for a particular conditional -RCU primitive, it might well be implemented and added. -After all, this guarantee was reverse-engineered, not premeditated. -The unconditional nature of the RCU primitives was initially an -accident of implementation, and later experience with synchronization -primitives with conditional primitives caused me to elevate this -accident to a guarantee. -Therefore, the justification for adding a conditional primitive to -RCU would need to be based on detailed and compelling use cases. - -

Guaranteed Read-to-Write Upgrade

- -

-As far as RCU is concerned, it is always possible to carry out an -update within an RCU read-side critical section. -For example, that RCU read-side critical section might search for -a given data element, and then might acquire the update-side -spinlock in order to update that element, all while remaining -in that RCU read-side critical section. -Of course, it is necessary to exit the RCU read-side critical section -before invoking synchronize_rcu(), however, this -inconvenience can be avoided through use of the -call_rcu() and kfree_rcu() API members -described later in this document. - - - - - - - - -
 
Quick Quiz:
- But how does the upgrade-to-write operation exclude other readers? -
Answer:
- It doesn't, just like normal RCU updates, which also do not exclude - RCU readers. -
 
- -

-This guarantee allows lookup code to be shared between read-side -and update-side code, and was premeditated, appearing in the earliest -DYNIX/ptx RCU documentation. - -

Fundamental Non-Requirements

- -

-RCU provides extremely lightweight readers, and its read-side guarantees, -though quite useful, are correspondingly lightweight. -It is therefore all too easy to assume that RCU is guaranteeing more -than it really is. -Of course, the list of things that RCU does not guarantee is infinitely -long, however, the following sections list a few non-guarantees that -have caused confusion. -Except where otherwise noted, these non-guarantees were premeditated. - -

    -
  1. - Readers Impose Minimal Ordering -
  2. - Readers Do Not Exclude Updaters -
  3. - Updaters Only Wait For Old Readers -
  4. - Grace Periods Don't Partition Read-Side Critical Sections -
  5. - Read-Side Critical Sections Don't Partition Grace Periods -
- -

Readers Impose Minimal Ordering

- -

-Reader-side markers such as rcu_read_lock() and -rcu_read_unlock() provide absolutely no ordering guarantees -except through their interaction with the grace-period APIs such as -synchronize_rcu(). -To see this, consider the following pair of threads: - -

-
- 1 void thread0(void)
- 2 {
- 3   rcu_read_lock();
- 4   WRITE_ONCE(x, 1);
- 5   rcu_read_unlock();
- 6   rcu_read_lock();
- 7   WRITE_ONCE(y, 1);
- 8   rcu_read_unlock();
- 9 }
-10
-11 void thread1(void)
-12 {
-13   rcu_read_lock();
-14   r1 = READ_ONCE(y);
-15   rcu_read_unlock();
-16   rcu_read_lock();
-17   r2 = READ_ONCE(x);
-18   rcu_read_unlock();
-19 }
-
-
- -

-After thread0() and thread1() execute -concurrently, it is quite possible to have - -

-
-(r1 == 1 && r2 == 0)
-
-
- -(that is, y appears to have been assigned before x), -which would not be possible if rcu_read_lock() and -rcu_read_unlock() had much in the way of ordering -properties. -But they do not, so the CPU is within its rights -to do significant reordering. -This is by design: Any significant ordering constraints would slow down -these fast-path APIs. - - - - - - - - -
 
Quick Quiz:
- Can't the compiler also reorder this code? -
Answer:
- No, the volatile casts in READ_ONCE() and - WRITE_ONCE() prevent the compiler from reordering in - this particular case. -
 
- -

Readers Do Not Exclude Updaters

- -

-Neither rcu_read_lock() nor rcu_read_unlock() -exclude updates. -All they do is to prevent grace periods from ending. -The following example illustrates this: - -

-
- 1 void thread0(void)
- 2 {
- 3   rcu_read_lock();
- 4   r1 = READ_ONCE(y);
- 5   if (r1) {
- 6     do_something_with_nonzero_x();
- 7     r2 = READ_ONCE(x);
- 8     WARN_ON(!r2); /* BUG!!! */
- 9   }
-10   rcu_read_unlock();
-11 }
-12
-13 void thread1(void)
-14 {
-15   spin_lock(&my_lock);
-16   WRITE_ONCE(x, 1);
-17   WRITE_ONCE(y, 1);
-18   spin_unlock(&my_lock);
-19 }
-
-
- -

-If the thread0() function's rcu_read_lock() -excluded the thread1() function's update, -the WARN_ON() could never fire. -But the fact is that rcu_read_lock() does not exclude -much of anything aside from subsequent grace periods, of which -thread1() has none, so the -WARN_ON() can and does fire. - -

Updaters Only Wait For Old Readers

- -

-It might be tempting to assume that after synchronize_rcu() -completes, there are no readers executing. -This temptation must be avoided because -new readers can start immediately after synchronize_rcu() -starts, and synchronize_rcu() is under no -obligation to wait for these new readers. - - - - - - - - -
 
Quick Quiz:
- Suppose that synchronize_rcu() did wait until all - readers had completed instead of waiting only on - pre-existing readers. - For how long would the updater be able to rely on there - being no readers? -
Answer:
- For no time at all. - Even if synchronize_rcu() were to wait until - all readers had completed, a new reader might start immediately after - synchronize_rcu() completed. - Therefore, the code following - synchronize_rcu() can never rely on there being - no readers. -
 
- -

-Grace Periods Don't Partition Read-Side Critical Sections

- -

-It is tempting to assume that if any part of one RCU read-side critical -section precedes a given grace period, and if any part of another RCU -read-side critical section follows that same grace period, then all of -the first RCU read-side critical section must precede all of the second. -However, this just isn't the case: A single grace period does not -partition the set of RCU read-side critical sections. -An example of this situation can be illustrated as follows, where -x, y, and z are initially all zero: - -

-
- 1 void thread0(void)
- 2 {
- 3   rcu_read_lock();
- 4   WRITE_ONCE(a, 1);
- 5   WRITE_ONCE(b, 1);
- 6   rcu_read_unlock();
- 7 }
- 8
- 9 void thread1(void)
-10 {
-11   r1 = READ_ONCE(a);
-12   synchronize_rcu();
-13   WRITE_ONCE(c, 1);
-14 }
-15
-16 void thread2(void)
-17 {
-18   rcu_read_lock();
-19   r2 = READ_ONCE(b);
-20   r3 = READ_ONCE(c);
-21   rcu_read_unlock();
-22 }
-
-
- -

-It turns out that the outcome: - -

-
-(r1 == 1 && r2 == 0 && r3 == 1)
-
-
- -is entirely possible. -The following figure show how this can happen, with each circled -QS indicating the point at which RCU recorded a -quiescent state for each thread, that is, a state in which -RCU knows that the thread cannot be in the midst of an RCU read-side -critical section that started before the current grace period: - -

GPpartitionReaders1.svg

- -

-If it is necessary to partition RCU read-side critical sections in this -manner, it is necessary to use two grace periods, where the first -grace period is known to end before the second grace period starts: - -

-
- 1 void thread0(void)
- 2 {
- 3   rcu_read_lock();
- 4   WRITE_ONCE(a, 1);
- 5   WRITE_ONCE(b, 1);
- 6   rcu_read_unlock();
- 7 }
- 8
- 9 void thread1(void)
-10 {
-11   r1 = READ_ONCE(a);
-12   synchronize_rcu();
-13   WRITE_ONCE(c, 1);
-14 }
-15
-16 void thread2(void)
-17 {
-18   r2 = READ_ONCE(c);
-19   synchronize_rcu();
-20   WRITE_ONCE(d, 1);
-21 }
-22
-23 void thread3(void)
-24 {
-25   rcu_read_lock();
-26   r3 = READ_ONCE(b);
-27   r4 = READ_ONCE(d);
-28   rcu_read_unlock();
-29 }
-
-
- -

-Here, if (r1 == 1), then -thread0()'s write to b must happen -before the end of thread1()'s grace period. -If in addition (r4 == 1), then -thread3()'s read from b must happen -after the beginning of thread2()'s grace period. -If it is also the case that (r2 == 1), then the -end of thread1()'s grace period must precede the -beginning of thread2()'s grace period. -This mean that the two RCU read-side critical sections cannot overlap, -guaranteeing that (r3 == 1). -As a result, the outcome: - -

-
-(r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)
-
-
- -cannot happen. - -

-This non-requirement was also non-premeditated, but became apparent -when studying RCU's interaction with memory ordering. - -

-Read-Side Critical Sections Don't Partition Grace Periods

- -

-It is also tempting to assume that if an RCU read-side critical section -happens between a pair of grace periods, then those grace periods cannot -overlap. -However, this temptation leads nowhere good, as can be illustrated by -the following, with all variables initially zero: - -

-
- 1 void thread0(void)
- 2 {
- 3   rcu_read_lock();
- 4   WRITE_ONCE(a, 1);
- 5   WRITE_ONCE(b, 1);
- 6   rcu_read_unlock();
- 7 }
- 8
- 9 void thread1(void)
-10 {
-11   r1 = READ_ONCE(a);
-12   synchronize_rcu();
-13   WRITE_ONCE(c, 1);
-14 }
-15
-16 void thread2(void)
-17 {
-18   rcu_read_lock();
-19   WRITE_ONCE(d, 1);
-20   r2 = READ_ONCE(c);
-21   rcu_read_unlock();
-22 }
-23
-24 void thread3(void)
-25 {
-26   r3 = READ_ONCE(d);
-27   synchronize_rcu();
-28   WRITE_ONCE(e, 1);
-29 }
-30
-31 void thread4(void)
-32 {
-33   rcu_read_lock();
-34   r4 = READ_ONCE(b);
-35   r5 = READ_ONCE(e);
-36   rcu_read_unlock();
-37 }
-
-
- -

-In this case, the outcome: - -

-
-(r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)
-
-
- -is entirely possible, as illustrated below: - -

ReadersPartitionGP1.svg

- -

-Again, an RCU read-side critical section can overlap almost all of a -given grace period, just so long as it does not overlap the entire -grace period. -As a result, an RCU read-side critical section cannot partition a pair -of RCU grace periods. - - - - - - - - -
 
Quick Quiz:
- How long a sequence of grace periods, each separated by an RCU - read-side critical section, would be required to partition the RCU - read-side critical sections at the beginning and end of the chain? -
Answer:
- In theory, an infinite number. - In practice, an unknown number that is sensitive to both implementation - details and timing considerations. - Therefore, even in practice, RCU users must abide by the - theoretical rather than the practical answer. -
 
- -

Parallelism Facts of Life

- -

-These parallelism facts of life are by no means specific to RCU, but -the RCU implementation must abide by them. -They therefore bear repeating: - -

    -
  1. Any CPU or task may be delayed at any time, - and any attempts to avoid these delays by disabling - preemption, interrupts, or whatever are completely futile. - This is most obvious in preemptible user-level - environments and in virtualized environments (where - a given guest OS's VCPUs can be preempted at any time by - the underlying hypervisor), but can also happen in bare-metal - environments due to ECC errors, NMIs, and other hardware - events. - Although a delay of more than about 20 seconds can result - in splats, the RCU implementation is obligated to use - algorithms that can tolerate extremely long delays, but where - “extremely long” is not long enough to allow - wrap-around when incrementing a 64-bit counter. -
  2. Both the compiler and the CPU can reorder memory accesses. - Where it matters, RCU must use compiler directives and - memory-barrier instructions to preserve ordering. -
  3. Conflicting writes to memory locations in any given cache line - will result in expensive cache misses. - Greater numbers of concurrent writes and more-frequent - concurrent writes will result in more dramatic slowdowns. - RCU is therefore obligated to use algorithms that have - sufficient locality to avoid significant performance and - scalability problems. -
  4. As a rough rule of thumb, only one CPU's worth of processing - may be carried out under the protection of any given exclusive - lock. - RCU must therefore use scalable locking designs. -
  5. Counters are finite, especially on 32-bit systems. - RCU's use of counters must therefore tolerate counter wrap, - or be designed such that counter wrap would take way more - time than a single system is likely to run. - An uptime of ten years is quite possible, a runtime - of a century much less so. - As an example of the latter, RCU's dyntick-idle nesting counter - allows 54 bits for interrupt nesting level (this counter - is 64 bits even on a 32-bit system). - Overflowing this counter requires 254 - half-interrupts on a given CPU without that CPU ever going idle. - If a half-interrupt happened every microsecond, it would take - 570 years of runtime to overflow this counter, which is currently - believed to be an acceptably long time. -
  6. Linux systems can have thousands of CPUs running a single - Linux kernel in a single shared-memory environment. - RCU must therefore pay close attention to high-end scalability. -
- -

-This last parallelism fact of life means that RCU must pay special -attention to the preceding facts of life. -The idea that Linux might scale to systems with thousands of CPUs would -have been met with some skepticism in the 1990s, but these requirements -would have otherwise have been unsurprising, even in the early 1990s. - -

Quality-of-Implementation Requirements

- -

-These sections list quality-of-implementation requirements. -Although an RCU implementation that ignores these requirements could -still be used, it would likely be subject to limitations that would -make it inappropriate for industrial-strength production use. -Classes of quality-of-implementation requirements are as follows: - -

    -
  1. Specialization -
  2. Performance and Scalability -
  3. Forward Progress -
  4. Composability -
  5. Corner Cases -
- -

-These classes is covered in the following sections. - -

Specialization

- -

-RCU is and always has been intended primarily for read-mostly situations, -which means that RCU's read-side primitives are optimized, often at the -expense of its update-side primitives. -Experience thus far is captured by the following list of situations: - -

    -
  1. Read-mostly data, where stale and inconsistent data is not - a problem: RCU works great! -
  2. Read-mostly data, where data must be consistent: - RCU works well. -
  3. Read-write data, where data must be consistent: - RCU might work OK. - Or not. -
  4. Write-mostly data, where data must be consistent: - RCU is very unlikely to be the right tool for the job, - with the following exceptions, where RCU can provide: -
      -
    1. Existence guarantees for update-friendly mechanisms. -
    2. Wait-free read-side primitives for real-time use. -
    -
- -

-This focus on read-mostly situations means that RCU must interoperate -with other synchronization primitives. -For example, the add_gp() and remove_gp_synchronous() -examples discussed earlier use RCU to protect readers and locking to -coordinate updaters. -However, the need extends much farther, requiring that a variety of -synchronization primitives be legal within RCU read-side critical sections, -including spinlocks, sequence locks, atomic operations, reference -counters, and memory barriers. - - - - - - - - -
 
Quick Quiz:
- What about sleeping locks? -
Answer:
- These are forbidden within Linux-kernel RCU read-side critical - sections because it is not legal to place a quiescent state - (in this case, voluntary context switch) within an RCU read-side - critical section. - However, sleeping locks may be used within userspace RCU read-side - critical sections, and also within Linux-kernel sleepable RCU - (SRCU) - read-side critical sections. - In addition, the -rt patchset turns spinlocks into a - sleeping locks so that the corresponding critical sections - can be preempted, which also means that these sleeplockified - spinlocks (but not other sleeping locks!) may be acquire within - -rt-Linux-kernel RCU read-side critical sections. - - -

- Note that it is legal for a normal RCU read-side - critical section to conditionally acquire a sleeping locks - (as in mutex_trylock()), but only as long as it does - not loop indefinitely attempting to conditionally acquire that - sleeping locks. - The key point is that things like mutex_trylock() - either return with the mutex held, or return an error indication if - the mutex was not immediately available. - Either way, mutex_trylock() returns immediately without - sleeping. -

 
- -

-It often comes as a surprise that many algorithms do not require a -consistent view of data, but many can function in that mode, -with network routing being the poster child. -Internet routing algorithms take significant time to propagate -updates, so that by the time an update arrives at a given system, -that system has been sending network traffic the wrong way for -a considerable length of time. -Having a few threads continue to send traffic the wrong way for a -few more milliseconds is clearly not a problem: In the worst case, -TCP retransmissions will eventually get the data where it needs to go. -In general, when tracking the state of the universe outside of the -computer, some level of inconsistency must be tolerated due to -speed-of-light delays if nothing else. - -

-Furthermore, uncertainty about external state is inherent in many cases. -For example, a pair of veterinarians might use heartbeat to determine -whether or not a given cat was alive. -But how long should they wait after the last heartbeat to decide that -the cat is in fact dead? -Waiting less than 400 milliseconds makes no sense because this would -mean that a relaxed cat would be considered to cycle between death -and life more than 100 times per minute. -Moreover, just as with human beings, a cat's heart might stop for -some period of time, so the exact wait period is a judgment call. -One of our pair of veterinarians might wait 30 seconds before pronouncing -the cat dead, while the other might insist on waiting a full minute. -The two veterinarians would then disagree on the state of the cat during -the final 30 seconds of the minute following the last heartbeat. - -

-Interestingly enough, this same situation applies to hardware. -When push comes to shove, how do we tell whether or not some -external server has failed? -We send messages to it periodically, and declare it failed if we -don't receive a response within a given period of time. -Policy decisions can usually tolerate short -periods of inconsistency. -The policy was decided some time ago, and is only now being put into -effect, so a few milliseconds of delay is normally inconsequential. - -

-However, there are algorithms that absolutely must see consistent data. -For example, the translation between a user-level SystemV semaphore -ID to the corresponding in-kernel data structure is protected by RCU, -but it is absolutely forbidden to update a semaphore that has just been -removed. -In the Linux kernel, this need for consistency is accommodated by acquiring -spinlocks located in the in-kernel data structure from within -the RCU read-side critical section, and this is indicated by the -green box in the figure above. -Many other techniques may be used, and are in fact used within the -Linux kernel. - -

-In short, RCU is not required to maintain consistency, and other -mechanisms may be used in concert with RCU when consistency is required. -RCU's specialization allows it to do its job extremely well, and its -ability to interoperate with other synchronization mechanisms allows -the right mix of synchronization tools to be used for a given job. - -

Performance and Scalability

- -

-Energy efficiency is a critical component of performance today, -and Linux-kernel RCU implementations must therefore avoid unnecessarily -awakening idle CPUs. -I cannot claim that this requirement was premeditated. -In fact, I learned of it during a telephone conversation in which I -was given “frank and open” feedback on the importance -of energy efficiency in battery-powered systems and on specific -energy-efficiency shortcomings of the Linux-kernel RCU implementation. -In my experience, the battery-powered embedded community will consider -any unnecessary wakeups to be extremely unfriendly acts. -So much so that mere Linux-kernel-mailing-list posts are -insufficient to vent their ire. - -

-Memory consumption is not particularly important for in most -situations, and has become decreasingly -so as memory sizes have expanded and memory -costs have plummeted. -However, as I learned from Matt Mackall's -bloatwatch -efforts, memory footprint is critically important on single-CPU systems with -non-preemptible (CONFIG_PREEMPT=n) kernels, and thus -tiny RCU -was born. -Josh Triplett has since taken over the small-memory banner with his -Linux kernel tinification -project, which resulted in -SRCU -becoming optional for those kernels not needing it. - -

-The remaining performance requirements are, for the most part, -unsurprising. -For example, in keeping with RCU's read-side specialization, -rcu_dereference() should have negligible overhead (for -example, suppression of a few minor compiler optimizations). -Similarly, in non-preemptible environments, rcu_read_lock() and -rcu_read_unlock() should have exactly zero overhead. - -

-In preemptible environments, in the case where the RCU read-side -critical section was not preempted (as will be the case for the -highest-priority real-time process), rcu_read_lock() and -rcu_read_unlock() should have minimal overhead. -In particular, they should not contain atomic read-modify-write -operations, memory-barrier instructions, preemption disabling, -interrupt disabling, or backwards branches. -However, in the case where the RCU read-side critical section was preempted, -rcu_read_unlock() may acquire spinlocks and disable interrupts. -This is why it is better to nest an RCU read-side critical section -within a preempt-disable region than vice versa, at least in cases -where that critical section is short enough to avoid unduly degrading -real-time latencies. - -

-The synchronize_rcu() grace-period-wait primitive is -optimized for throughput. -It may therefore incur several milliseconds of latency in addition to -the duration of the longest RCU read-side critical section. -On the other hand, multiple concurrent invocations of -synchronize_rcu() are required to use batching optimizations -so that they can be satisfied by a single underlying grace-period-wait -operation. -For example, in the Linux kernel, it is not unusual for a single -grace-period-wait operation to serve more than -1,000 separate invocations -of synchronize_rcu(), thus amortizing the per-invocation -overhead down to nearly zero. -However, the grace-period optimization is also required to avoid -measurable degradation of real-time scheduling and interrupt latencies. - -

-In some cases, the multi-millisecond synchronize_rcu() -latencies are unacceptable. -In these cases, synchronize_rcu_expedited() may be used -instead, reducing the grace-period latency down to a few tens of -microseconds on small systems, at least in cases where the RCU read-side -critical sections are short. -There are currently no special latency requirements for -synchronize_rcu_expedited() on large systems, but, -consistent with the empirical nature of the RCU specification, -that is subject to change. -However, there most definitely are scalability requirements: -A storm of synchronize_rcu_expedited() invocations on 4096 -CPUs should at least make reasonable forward progress. -In return for its shorter latencies, synchronize_rcu_expedited() -is permitted to impose modest degradation of real-time latency -on non-idle online CPUs. -Here, “modest” means roughly the same latency -degradation as a scheduling-clock interrupt. - -

-There are a number of situations where even -synchronize_rcu_expedited()'s reduced grace-period -latency is unacceptable. -In these situations, the asynchronous call_rcu() can be -used in place of synchronize_rcu() as follows: - -

-
- 1 struct foo {
- 2   int a;
- 3   int b;
- 4   struct rcu_head rh;
- 5 };
- 6
- 7 static void remove_gp_cb(struct rcu_head *rhp)
- 8 {
- 9   struct foo *p = container_of(rhp, struct foo, rh);
-10
-11   kfree(p);
-12 }
-13
-14 bool remove_gp_asynchronous(void)
-15 {
-16   struct foo *p;
-17
-18   spin_lock(&gp_lock);
-19   p = rcu_access_pointer(gp);
-20   if (!p) {
-21     spin_unlock(&gp_lock);
-22     return false;
-23   }
-24   rcu_assign_pointer(gp, NULL);
-25   call_rcu(&p->rh, remove_gp_cb);
-26   spin_unlock(&gp_lock);
-27   return true;
-28 }
-
-
- -

-A definition of struct foo is finally needed, and appears -on lines 1-5. -The function remove_gp_cb() is passed to call_rcu() -on line 25, and will be invoked after the end of a subsequent -grace period. -This gets the same effect as remove_gp_synchronous(), -but without forcing the updater to wait for a grace period to elapse. -The call_rcu() function may be used in a number of -situations where neither synchronize_rcu() nor -synchronize_rcu_expedited() would be legal, -including within preempt-disable code, local_bh_disable() code, -interrupt-disable code, and interrupt handlers. -However, even call_rcu() is illegal within NMI handlers -and from idle and offline CPUs. -The callback function (remove_gp_cb() in this case) will be -executed within softirq (software interrupt) environment within the -Linux kernel, -either within a real softirq handler or under the protection -of local_bh_disable(). -In both the Linux kernel and in userspace, it is bad practice to -write an RCU callback function that takes too long. -Long-running operations should be relegated to separate threads or -(in the Linux kernel) workqueues. - - - - - - - - -
 
Quick Quiz:
- Why does line 19 use rcu_access_pointer()? - After all, call_rcu() on line 25 stores into the - structure, which would interact badly with concurrent insertions. - Doesn't this mean that rcu_dereference() is required? -
Answer:
- Presumably the ->gp_lock acquired on line 18 excludes - any changes, including any insertions that rcu_dereference() - would protect against. - Therefore, any insertions will be delayed until after - ->gp_lock - is released on line 25, which in turn means that - rcu_access_pointer() suffices. -
 
- -

-However, all that remove_gp_cb() is doing is -invoking kfree() on the data element. -This is a common idiom, and is supported by kfree_rcu(), -which allows “fire and forget” operation as shown below: - -

-
- 1 struct foo {
- 2   int a;
- 3   int b;
- 4   struct rcu_head rh;
- 5 };
- 6
- 7 bool remove_gp_faf(void)
- 8 {
- 9   struct foo *p;
-10
-11   spin_lock(&gp_lock);
-12   p = rcu_dereference(gp);
-13   if (!p) {
-14     spin_unlock(&gp_lock);
-15     return false;
-16   }
-17   rcu_assign_pointer(gp, NULL);
-18   kfree_rcu(p, rh);
-19   spin_unlock(&gp_lock);
-20   return true;
-21 }
-
-
- -

-Note that remove_gp_faf() simply invokes -kfree_rcu() and proceeds, without any need to pay any -further attention to the subsequent grace period and kfree(). -It is permissible to invoke kfree_rcu() from the same -environments as for call_rcu(). -Interestingly enough, DYNIX/ptx had the equivalents of -call_rcu() and kfree_rcu(), but not -synchronize_rcu(). -This was due to the fact that RCU was not heavily used within DYNIX/ptx, -so the very few places that needed something like -synchronize_rcu() simply open-coded it. - - - - - - - - -
 
Quick Quiz:
- Earlier it was claimed that call_rcu() and - kfree_rcu() allowed updaters to avoid being blocked - by readers. - But how can that be correct, given that the invocation of the callback - and the freeing of the memory (respectively) must still wait for - a grace period to elapse? -
Answer:
- We could define things this way, but keep in mind that this sort of - definition would say that updates in garbage-collected languages - cannot complete until the next time the garbage collector runs, - which does not seem at all reasonable. - The key point is that in most cases, an updater using either - call_rcu() or kfree_rcu() can proceed to the - next update as soon as it has invoked call_rcu() or - kfree_rcu(), without having to wait for a subsequent - grace period. -
 
- -

-But what if the updater must wait for the completion of code to be -executed after the end of the grace period, but has other tasks -that can be carried out in the meantime? -The polling-style get_state_synchronize_rcu() and -cond_synchronize_rcu() functions may be used for this -purpose, as shown below: - -

-
- 1 bool remove_gp_poll(void)
- 2 {
- 3   struct foo *p;
- 4   unsigned long s;
- 5
- 6   spin_lock(&gp_lock);
- 7   p = rcu_access_pointer(gp);
- 8   if (!p) {
- 9     spin_unlock(&gp_lock);
-10     return false;
-11   }
-12   rcu_assign_pointer(gp, NULL);
-13   spin_unlock(&gp_lock);
-14   s = get_state_synchronize_rcu();
-15   do_something_while_waiting();
-16   cond_synchronize_rcu(s);
-17   kfree(p);
-18   return true;
-19 }
-
-
- -

-On line 14, get_state_synchronize_rcu() obtains a -“cookie” from RCU, -then line 15 carries out other tasks, -and finally, line 16 returns immediately if a grace period has -elapsed in the meantime, but otherwise waits as required. -The need for get_state_synchronize_rcu and -cond_synchronize_rcu() has appeared quite recently, -so it is too early to tell whether they will stand the test of time. - -

-RCU thus provides a range of tools to allow updaters to strike the -required tradeoff between latency, flexibility and CPU overhead. - -

Forward Progress

- -

-In theory, delaying grace-period completion and callback invocation -is harmless. -In practice, not only are memory sizes finite but also callbacks sometimes -do wakeups, and sufficiently deferred wakeups can be difficult -to distinguish from system hangs. -Therefore, RCU must provide a number of mechanisms to promote forward -progress. - -

-These mechanisms are not foolproof, nor can they be. -For one simple example, an infinite loop in an RCU read-side critical -section must by definition prevent later grace periods from ever completing. -For a more involved example, consider a 64-CPU system built with -CONFIG_RCU_NOCB_CPU=y and booted with rcu_nocbs=1-63, -where CPUs 1 through 63 spin in tight loops that invoke -call_rcu(). -Even if these tight loops also contain calls to cond_resched() -(thus allowing grace periods to complete), CPU 0 simply will -not be able to invoke callbacks as fast as the other 63 CPUs can -register them, at least not until the system runs out of memory. -In both of these examples, the Spiderman principle applies: With great -power comes great responsibility. -However, short of this level of abuse, RCU is required to -ensure timely completion of grace periods and timely invocation of -callbacks. - -

-RCU takes the following steps to encourage timely completion of -grace periods: - -

    -
  1. If a grace period fails to complete within 100 milliseconds, - RCU causes future invocations of cond_resched() on - the holdout CPUs to provide an RCU quiescent state. - RCU also causes those CPUs' need_resched() invocations - to return true, but only after the corresponding CPU's - next scheduling-clock. -
  2. CPUs mentioned in the nohz_full kernel boot parameter - can run indefinitely in the kernel without scheduling-clock - interrupts, which defeats the above need_resched() - strategem. - RCU will therefore invoke resched_cpu() on any - nohz_full CPUs still holding out after - 109 milliseconds. -
  3. In kernels built with CONFIG_RCU_BOOST=y, if a given - task that has been preempted within an RCU read-side critical - section is holding out for more than 500 milliseconds, - RCU will resort to priority boosting. -
  4. If a CPU is still holding out 10 seconds into the grace - period, RCU will invoke resched_cpu() on it regardless - of its nohz_full state. -
- -

-The above values are defaults for systems running with HZ=1000. -They will vary as the value of HZ varies, and can also be -changed using the relevant Kconfig options and kernel boot parameters. -RCU currently does not do much sanity checking of these -parameters, so please use caution when changing them. -Note that these forward-progress measures are provided only for RCU, -not for -SRCU or -Tasks RCU. - -

-RCU takes the following steps in call_rcu() to encourage timely -invocation of callbacks when any given non-rcu_nocbs CPU has -10,000 callbacks, or has 10,000 more callbacks than it had the last time -encouragement was provided: - -

    -
  1. Starts a grace period, if one is not already in progress. -
  2. Forces immediate checking for quiescent states, rather than - waiting for three milliseconds to have elapsed since the - beginning of the grace period. -
  3. Immediately tags the CPU's callbacks with their grace period - completion numbers, rather than waiting for the RCU_SOFTIRQ - handler to get around to it. -
  4. Lifts callback-execution batch limits, which speeds up callback - invocation at the expense of degrading realtime response. -
- -

-Again, these are default values when running at HZ=1000, -and can be overridden. -Again, these forward-progress measures are provided only for RCU, -not for -SRCU or -Tasks RCU. -Even for RCU, callback-invocation forward progress for rcu_nocbs -CPUs is much less well-developed, in part because workloads benefiting -from rcu_nocbs CPUs tend to invoke call_rcu() -relatively infrequently. -If workloads emerge that need both rcu_nocbs CPUs and high -call_rcu() invocation rates, then additional forward-progress -work will be required. - -

Composability

- -

-Composability has received much attention in recent years, perhaps in part -due to the collision of multicore hardware with object-oriented techniques -designed in single-threaded environments for single-threaded use. -And in theory, RCU read-side critical sections may be composed, and in -fact may be nested arbitrarily deeply. -In practice, as with all real-world implementations of composable -constructs, there are limitations. - -

-Implementations of RCU for which rcu_read_lock() -and rcu_read_unlock() generate no code, such as -Linux-kernel RCU when CONFIG_PREEMPT=n, can be -nested arbitrarily deeply. -After all, there is no overhead. -Except that if all these instances of rcu_read_lock() -and rcu_read_unlock() are visible to the compiler, -compilation will eventually fail due to exhausting memory, -mass storage, or user patience, whichever comes first. -If the nesting is not visible to the compiler, as is the case with -mutually recursive functions each in its own translation unit, -stack overflow will result. -If the nesting takes the form of loops, perhaps in the guise of tail -recursion, either the control variable -will overflow or (in the Linux kernel) you will get an RCU CPU stall warning. -Nevertheless, this class of RCU implementations is one -of the most composable constructs in existence. - -

-RCU implementations that explicitly track nesting depth -are limited by the nesting-depth counter. -For example, the Linux kernel's preemptible RCU limits nesting to -INT_MAX. -This should suffice for almost all practical purposes. -That said, a consecutive pair of RCU read-side critical sections -between which there is an operation that waits for a grace period -cannot be enclosed in another RCU read-side critical section. -This is because it is not legal to wait for a grace period within -an RCU read-side critical section: To do so would result either -in deadlock or -in RCU implicitly splitting the enclosing RCU read-side critical -section, neither of which is conducive to a long-lived and prosperous -kernel. - -

-It is worth noting that RCU is not alone in limiting composability. -For example, many transactional-memory implementations prohibit -composing a pair of transactions separated by an irrevocable -operation (for example, a network receive operation). -For another example, lock-based critical sections can be composed -surprisingly freely, but only if deadlock is avoided. - -

-In short, although RCU read-side critical sections are highly composable, -care is required in some situations, just as is the case for any other -composable synchronization mechanism. - -

Corner Cases

- -

-A given RCU workload might have an endless and intense stream of -RCU read-side critical sections, perhaps even so intense that there -was never a point in time during which there was not at least one -RCU read-side critical section in flight. -RCU cannot allow this situation to block grace periods: As long as -all the RCU read-side critical sections are finite, grace periods -must also be finite. - -

-That said, preemptible RCU implementations could potentially result -in RCU read-side critical sections being preempted for long durations, -which has the effect of creating a long-duration RCU read-side -critical section. -This situation can arise only in heavily loaded systems, but systems using -real-time priorities are of course more vulnerable. -Therefore, RCU priority boosting is provided to help deal with this -case. -That said, the exact requirements on RCU priority boosting will likely -evolve as more experience accumulates. - -

-Other workloads might have very high update rates. -Although one can argue that such workloads should instead use -something other than RCU, the fact remains that RCU must -handle such workloads gracefully. -This requirement is another factor driving batching of grace periods, -but it is also the driving force behind the checks for large numbers -of queued RCU callbacks in the call_rcu() code path. -Finally, high update rates should not delay RCU read-side critical -sections, although some small read-side delays can occur when using -synchronize_rcu_expedited(), courtesy of this function's use -of smp_call_function_single(). - -

-Although all three of these corner cases were understood in the early -1990s, a simple user-level test consisting of close(open(path)) -in a tight loop -in the early 2000s suddenly provided a much deeper appreciation of the -high-update-rate corner case. -This test also motivated addition of some RCU code to react to high update -rates, for example, if a given CPU finds itself with more than 10,000 -RCU callbacks queued, it will cause RCU to take evasive action by -more aggressively starting grace periods and more aggressively forcing -completion of grace-period processing. -This evasive action causes the grace period to complete more quickly, -but at the cost of restricting RCU's batching optimizations, thus -increasing the CPU overhead incurred by that grace period. - -

-Software-Engineering Requirements

- -

-Between Murphy's Law and “To err is human”, it is necessary to -guard against mishaps and misuse: - -

    -
  1. It is all too easy to forget to use rcu_read_lock() - everywhere that it is needed, so kernels built with - CONFIG_PROVE_RCU=y will splat if - rcu_dereference() is used outside of an - RCU read-side critical section. - Update-side code can use rcu_dereference_protected(), - which takes a - lockdep expression - to indicate what is providing the protection. - If the indicated protection is not provided, a lockdep splat - is emitted. - -

    - Code shared between readers and updaters can use - rcu_dereference_check(), which also takes a - lockdep expression, and emits a lockdep splat if neither - rcu_read_lock() nor the indicated protection - is in place. - In addition, rcu_dereference_raw() is used in those - (hopefully rare) cases where the required protection cannot - be easily described. - Finally, rcu_read_lock_held() is provided to - allow a function to verify that it has been invoked within - an RCU read-side critical section. - I was made aware of this set of requirements shortly after Thomas - Gleixner audited a number of RCU uses. -

  2. A given function might wish to check for RCU-related preconditions - upon entry, before using any other RCU API. - The rcu_lockdep_assert() does this job, - asserting the expression in kernels having lockdep enabled - and doing nothing otherwise. -
  3. It is also easy to forget to use rcu_assign_pointer() - and rcu_dereference(), perhaps (incorrectly) - substituting a simple assignment. - To catch this sort of error, a given RCU-protected pointer may be - tagged with __rcu, after which sparse - will complain about simple-assignment accesses to that pointer. - Arnd Bergmann made me aware of this requirement, and also - supplied the needed - patch series. -
  4. Kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y - will splat if a data element is passed to call_rcu() - twice in a row, without a grace period in between. - (This error is similar to a double free.) - The corresponding rcu_head structures that are - dynamically allocated are automatically tracked, but - rcu_head structures allocated on the stack - must be initialized with init_rcu_head_on_stack() - and cleaned up with destroy_rcu_head_on_stack(). - Similarly, statically allocated non-stack rcu_head - structures must be initialized with init_rcu_head() - and cleaned up with destroy_rcu_head(). - Mathieu Desnoyers made me aware of this requirement, and also - supplied the needed - patch. -
  5. An infinite loop in an RCU read-side critical section will - eventually trigger an RCU CPU stall warning splat, with - the duration of “eventually” being controlled by the - RCU_CPU_STALL_TIMEOUT Kconfig option, or, - alternatively, by the - rcupdate.rcu_cpu_stall_timeout boot/sysfs - parameter. - However, RCU is not obligated to produce this splat - unless there is a grace period waiting on that particular - RCU read-side critical section. -

    - Some extreme workloads might intentionally delay - RCU grace periods, and systems running those workloads can - be booted with rcupdate.rcu_cpu_stall_suppress - to suppress the splats. - This kernel parameter may also be set via sysfs. - Furthermore, RCU CPU stall warnings are counter-productive - during sysrq dumps and during panics. - RCU therefore supplies the rcu_sysrq_start() and - rcu_sysrq_end() API members to be called before - and after long sysrq dumps. - RCU also supplies the rcu_panic() notifier that is - automatically invoked at the beginning of a panic to suppress - further RCU CPU stall warnings. - -

    - This requirement made itself known in the early 1990s, pretty - much the first time that it was necessary to debug a CPU stall. - That said, the initial implementation in DYNIX/ptx was quite - generic in comparison with that of Linux. -

  6. Although it would be very good to detect pointers leaking out - of RCU read-side critical sections, there is currently no - good way of doing this. - One complication is the need to distinguish between pointers - leaking and pointers that have been handed off from RCU to - some other synchronization mechanism, for example, reference - counting. -
  7. In kernels built with CONFIG_RCU_TRACE=y, RCU-related - information is provided via event tracing. -
  8. Open-coded use of rcu_assign_pointer() and - rcu_dereference() to create typical linked - data structures can be surprisingly error-prone. - Therefore, RCU-protected - linked lists - and, more recently, RCU-protected - hash tables - are available. - Many other special-purpose RCU-protected data structures are - available in the Linux kernel and the userspace RCU library. -
  9. Some linked structures are created at compile time, but still - require __rcu checking. - The RCU_POINTER_INITIALIZER() macro serves this - purpose. -
  10. It is not necessary to use rcu_assign_pointer() - when creating linked structures that are to be published via - a single external pointer. - The RCU_INIT_POINTER() macro is provided for - this task and also for assigning NULL pointers - at runtime. -
- -

-This not a hard-and-fast list: RCU's diagnostic capabilities will -continue to be guided by the number and type of usage bugs found -in real-world RCU usage. - -

Linux Kernel Complications

- -

-The Linux kernel provides an interesting environment for all kinds of -software, including RCU. -Some of the relevant points of interest are as follows: - -

    -
  1. Configuration. -
  2. Firmware Interface. -
  3. Early Boot. -
  4. - Interrupts and non-maskable interrupts (NMIs). -
  5. Loadable Modules. -
  6. Hotplug CPU. -
  7. Scheduler and RCU. -
  8. Tracing and RCU. -
  9. Energy Efficiency. -
  10. - Scheduling-Clock Interrupts and RCU. -
  11. Memory Efficiency. -
  12. - Performance, Scalability, Response Time, and Reliability. -
- -

-This list is probably incomplete, but it does give a feel for the -most notable Linux-kernel complications. -Each of the following sections covers one of the above topics. - -

Configuration

- -

-RCU's goal is automatic configuration, so that almost nobody -needs to worry about RCU's Kconfig options. -And for almost all users, RCU does in fact work well -“out of the box.” - -

-However, there are specialized use cases that are handled by -kernel boot parameters and Kconfig options. -Unfortunately, the Kconfig system will explicitly ask users -about new Kconfig options, which requires almost all of them -be hidden behind a CONFIG_RCU_EXPERT Kconfig option. - -

-This all should be quite obvious, but the fact remains that -Linus Torvalds recently had to -remind -me of this requirement. - -

Firmware Interface

- -

-In many cases, kernel obtains information about the system from the -firmware, and sometimes things are lost in translation. -Or the translation is accurate, but the original message is bogus. - -

-For example, some systems' firmware overreports the number of CPUs, -sometimes by a large factor. -If RCU naively believed the firmware, as it used to do, -it would create too many per-CPU kthreads. -Although the resulting system will still run correctly, the extra -kthreads needlessly consume memory and can cause confusion -when they show up in ps listings. - -

-RCU must therefore wait for a given CPU to actually come online before -it can allow itself to believe that the CPU actually exists. -The resulting “ghost CPUs” (which are never going to -come online) cause a number of -interesting complications. - -

Early Boot

- -

-The Linux kernel's boot sequence is an interesting process, -and RCU is used early, even before rcu_init() -is invoked. -In fact, a number of RCU's primitives can be used as soon as the -initial task's task_struct is available and the -boot CPU's per-CPU variables are set up. -The read-side primitives (rcu_read_lock(), -rcu_read_unlock(), rcu_dereference(), -and rcu_access_pointer()) will operate normally very early on, -as will rcu_assign_pointer(). - -

-Although call_rcu() may be invoked at any -time during boot, callbacks are not guaranteed to be invoked until after -all of RCU's kthreads have been spawned, which occurs at -early_initcall() time. -This delay in callback invocation is due to the fact that RCU does not -invoke callbacks until it is fully initialized, and this full initialization -cannot occur until after the scheduler has initialized itself to the -point where RCU can spawn and run its kthreads. -In theory, it would be possible to invoke callbacks earlier, -however, this is not a panacea because there would be severe restrictions -on what operations those callbacks could invoke. - -

-Perhaps surprisingly, synchronize_rcu() and -synchronize_rcu_expedited(), -will operate normally -during very early boot, the reason being that there is only one CPU -and preemption is disabled. -This means that the call synchronize_rcu() (or friends) -itself is a quiescent -state and thus a grace period, so the early-boot implementation can -be a no-op. - -

-However, once the scheduler has spawned its first kthread, this early -boot trick fails for synchronize_rcu() (as well as for -synchronize_rcu_expedited()) in CONFIG_PREEMPT=y -kernels. -The reason is that an RCU read-side critical section might be preempted, -which means that a subsequent synchronize_rcu() really does have -to wait for something, as opposed to simply returning immediately. -Unfortunately, synchronize_rcu() can't do this until all of -its kthreads are spawned, which doesn't happen until some time during -early_initcalls() time. -But this is no excuse: RCU is nevertheless required to correctly handle -synchronous grace periods during this time period. -Once all of its kthreads are up and running, RCU starts running -normally. - - - - - - - - -
 
Quick Quiz:
- How can RCU possibly handle grace periods before all of its - kthreads have been spawned??? -
Answer:
- Very carefully! - - -

- During the “dead zone” between the time that the - scheduler spawns the first task and the time that all of RCU's - kthreads have been spawned, all synchronous grace periods are - handled by the expedited grace-period mechanism. - At runtime, this expedited mechanism relies on workqueues, but - during the dead zone the requesting task itself drives the - desired expedited grace period. - Because dead-zone execution takes place within task context, - everything works. - Once the dead zone ends, expedited grace periods go back to - using workqueues, as is required to avoid problems that would - otherwise occur when a user task received a POSIX signal while - driving an expedited grace period. - - -

- And yes, this does mean that it is unhelpful to send POSIX - signals to random tasks between the time that the scheduler - spawns its first kthread and the time that RCU's kthreads - have all been spawned. - If there ever turns out to be a good reason for sending POSIX - signals during that time, appropriate adjustments will be made. - (If it turns out that POSIX signals are sent during this time for - no good reason, other adjustments will be made, appropriate - or otherwise.) -

 
- -

-I learned of these boot-time requirements as a result of a series of -system hangs. - -

Interrupts and NMIs

- -

-The Linux kernel has interrupts, and RCU read-side critical sections are -legal within interrupt handlers and within interrupt-disabled regions -of code, as are invocations of call_rcu(). - -

-Some Linux-kernel architectures can enter an interrupt handler from -non-idle process context, and then just never leave it, instead stealthily -transitioning back to process context. -This trick is sometimes used to invoke system calls from inside the kernel. -These “half-interrupts” mean that RCU has to be very careful -about how it counts interrupt nesting levels. -I learned of this requirement the hard way during a rewrite -of RCU's dyntick-idle code. - -

-The Linux kernel has non-maskable interrupts (NMIs), and -RCU read-side critical sections are legal within NMI handlers. -Thankfully, RCU update-side primitives, including -call_rcu(), are prohibited within NMI handlers. - -

-The name notwithstanding, some Linux-kernel architectures -can have nested NMIs, which RCU must handle correctly. -Andy Lutomirski -surprised me -with this requirement; -he also kindly surprised me with -an algorithm -that meets this requirement. - -

-Furthermore, NMI handlers can be interrupted by what appear to RCU -to be normal interrupts. -One way that this can happen is for code that directly invokes -rcu_irq_enter() and rcu_irq_exit() to be called -from an NMI handler. -This astonishing fact of life prompted the current code structure, -which has rcu_irq_enter() invoking rcu_nmi_enter() -and rcu_irq_exit() invoking rcu_nmi_exit(). -And yes, I also learned of this requirement the hard way. - -

Loadable Modules

- -

-The Linux kernel has loadable modules, and these modules can -also be unloaded. -After a given module has been unloaded, any attempt to call -one of its functions results in a segmentation fault. -The module-unload functions must therefore cancel any -delayed calls to loadable-module functions, for example, -any outstanding mod_timer() must be dealt with -via del_timer_sync() or similar. - -

-Unfortunately, there is no way to cancel an RCU callback; -once you invoke call_rcu(), the callback function is -eventually going to be invoked, unless the system goes down first. -Because it is normally considered socially irresponsible to crash the system -in response to a module unload request, we need some other way -to deal with in-flight RCU callbacks. - -

-RCU therefore provides -rcu_barrier(), -which waits until all in-flight RCU callbacks have been invoked. -If a module uses call_rcu(), its exit function should therefore -prevent any future invocation of call_rcu(), then invoke -rcu_barrier(). -In theory, the underlying module-unload code could invoke -rcu_barrier() unconditionally, but in practice this would -incur unacceptable latencies. - -

-Nikita Danilov noted this requirement for an analogous filesystem-unmount -situation, and Dipankar Sarma incorporated rcu_barrier() into RCU. -The need for rcu_barrier() for module unloading became -apparent later. - -

-Important note: The rcu_barrier() function is not, -repeat, not, obligated to wait for a grace period. -It is instead only required to wait for RCU callbacks that have -already been posted. -Therefore, if there are no RCU callbacks posted anywhere in the system, -rcu_barrier() is within its rights to return immediately. -Even if there are callbacks posted, rcu_barrier() does not -necessarily need to wait for a grace period. - - - - - - - - -
 
Quick Quiz:
- Wait a minute! - Each RCU callbacks must wait for a grace period to complete, - and rcu_barrier() must wait for each pre-existing - callback to be invoked. - Doesn't rcu_barrier() therefore need to wait for - a full grace period if there is even one callback posted anywhere - in the system? -
Answer:
- Absolutely not!!! - - -

- Yes, each RCU callbacks must wait for a grace period to complete, - but it might well be partly (or even completely) finished waiting - by the time rcu_barrier() is invoked. - In that case, rcu_barrier() need only wait for the - remaining portion of the grace period to elapse. - So even if there are quite a few callbacks posted, - rcu_barrier() might well return quite quickly. - - -

- So if you need to wait for a grace period as well as for all - pre-existing callbacks, you will need to invoke both - synchronize_rcu() and rcu_barrier(). - If latency is a concern, you can always use workqueues - to invoke them concurrently. -

 
- -

Hotplug CPU

- -

-The Linux kernel supports CPU hotplug, which means that CPUs -can come and go. -It is of course illegal to use any RCU API member from an offline CPU, -with the exception of SRCU read-side -critical sections. -This requirement was present from day one in DYNIX/ptx, but -on the other hand, the Linux kernel's CPU-hotplug implementation -is “interesting.” - -

-The Linux-kernel CPU-hotplug implementation has notifiers that -are used to allow the various kernel subsystems (including RCU) -to respond appropriately to a given CPU-hotplug operation. -Most RCU operations may be invoked from CPU-hotplug notifiers, -including even synchronous grace-period operations such as -synchronize_rcu() and synchronize_rcu_expedited(). - -

-However, all-callback-wait operations such as -rcu_barrier() are also not supported, due to the -fact that there are phases of CPU-hotplug operations where -the outgoing CPU's callbacks will not be invoked until after -the CPU-hotplug operation ends, which could also result in deadlock. -Furthermore, rcu_barrier() blocks CPU-hotplug operations -during its execution, which results in another type of deadlock -when invoked from a CPU-hotplug notifier. - -

Scheduler and RCU

- -

-RCU depends on the scheduler, and the scheduler uses RCU to -protect some of its data structures. -The preemptible-RCU rcu_read_unlock() -implementation must therefore be written carefully to avoid deadlocks -involving the scheduler's runqueue and priority-inheritance locks. -In particular, rcu_read_unlock() must tolerate an -interrupt where the interrupt handler invokes both -rcu_read_lock() and rcu_read_unlock(). -This possibility requires rcu_read_unlock() to use -negative nesting levels to avoid destructive recursion via -interrupt handler's use of RCU. - -

-This scheduler-RCU requirement came as a -complete surprise. - -

-As noted above, RCU makes use of kthreads, and it is necessary to -avoid excessive CPU-time accumulation by these kthreads. -This requirement was no surprise, but RCU's violation of it -when running context-switch-heavy workloads when built with -CONFIG_NO_HZ_FULL=y -did come as a surprise [PDF]. -RCU has made good progress towards meeting this requirement, even -for context-switch-heavy CONFIG_NO_HZ_FULL=y workloads, -but there is room for further improvement. - -

-It is forbidden to hold any of scheduler's runqueue or priority-inheritance -spinlocks across an rcu_read_unlock() unless interrupts have been -disabled across the entire RCU read-side critical section, that is, -up to and including the matching rcu_read_lock(). -Violating this restriction can result in deadlocks involving these -scheduler spinlocks. -There was hope that this restriction might be lifted when interrupt-disabled -calls to rcu_read_unlock() started deferring the reporting of -the resulting RCU-preempt quiescent state until the end of the corresponding -interrupts-disabled region. -Unfortunately, timely reporting of the corresponding quiescent state -to expedited grace periods requires a call to raise_softirq(), -which can acquire these scheduler spinlocks. -In addition, real-time systems using RCU priority boosting -need this restriction to remain in effect because deferred -quiescent-state reporting would also defer deboosting, which in turn -would degrade real-time latencies. - -

-In theory, if a given RCU read-side critical section could be -guaranteed to be less than one second in duration, holding a scheduler -spinlock across that critical section's rcu_read_unlock() -would require only that preemption be disabled across the entire -RCU read-side critical section, not interrupts. -Unfortunately, given the possibility of vCPU preemption, long-running -interrupts, and so on, it is not possible in practice to guarantee -that a given RCU read-side critical section will complete in less than -one second. -Therefore, as noted above, if scheduler spinlocks are held across -a given call to rcu_read_unlock(), interrupts must be -disabled across the entire RCU read-side critical section. - -

Tracing and RCU

- -

-It is possible to use tracing on RCU code, but tracing itself -uses RCU. -For this reason, rcu_dereference_raw_notrace() -is provided for use by tracing, which avoids the destructive -recursion that could otherwise ensue. -This API is also used by virtualization in some architectures, -where RCU readers execute in environments in which tracing -cannot be used. -The tracing folks both located the requirement and provided the -needed fix, so this surprise requirement was relatively painless. - -

Energy Efficiency

- -

-Interrupting idle CPUs is considered socially unacceptable, -especially by people with battery-powered embedded systems. -RCU therefore conserves energy by detecting which CPUs are -idle, including tracking CPUs that have been interrupted from idle. -This is a large part of the energy-efficiency requirement, -so I learned of this via an irate phone call. - -

-Because RCU avoids interrupting idle CPUs, it is illegal to -execute an RCU read-side critical section on an idle CPU. -(Kernels built with CONFIG_PROVE_RCU=y will splat -if you try it.) -The RCU_NONIDLE() macro and _rcuidle -event tracing is provided to work around this restriction. -In addition, rcu_is_watching() may be used to -test whether or not it is currently legal to run RCU read-side -critical sections on this CPU. -I learned of the need for diagnostics on the one hand -and RCU_NONIDLE() on the other while inspecting -idle-loop code. -Steven Rostedt supplied _rcuidle event tracing, -which is used quite heavily in the idle loop. -However, there are some restrictions on the code placed within -RCU_NONIDLE(): - -

    -
  1. Blocking is prohibited. - In practice, this is not a serious restriction given that idle - tasks are prohibited from blocking to begin with. -
  2. Although nesting RCU_NONIDLE() is permitted, they cannot - nest indefinitely deeply. - However, given that they can be nested on the order of a million - deep, even on 32-bit systems, this should not be a serious - restriction. - This nesting limit would probably be reached long after the - compiler OOMed or the stack overflowed. -
  3. Any code path that enters RCU_NONIDLE() must sequence - out of that same RCU_NONIDLE(). - For example, the following is grossly illegal: - -
    -
    - 1     RCU_NONIDLE({
    - 2       do_something();
    - 3       goto bad_idea;  /* BUG!!! */
    - 4       do_something_else();});
    - 5   bad_idea:
    -	
    -
    - -

    - It is just as illegal to transfer control into the middle of - RCU_NONIDLE()'s argument. - Yes, in theory, you could transfer in as long as you also - transferred out, but in practice you could also expect to get sharply - worded review comments. -

- -

-It is similarly socially unacceptable to interrupt an -nohz_full CPU running in userspace. -RCU must therefore track nohz_full userspace -execution. -RCU must therefore be able to sample state at two points in -time, and be able to determine whether or not some other CPU spent -any time idle and/or executing in userspace. - -

-These energy-efficiency requirements have proven quite difficult to -understand and to meet, for example, there have been more than five -clean-sheet rewrites of RCU's energy-efficiency code, the last of -which was finally able to demonstrate -real energy savings running on real hardware [PDF]. -As noted earlier, -I learned of many of these requirements via angry phone calls: -Flaming me on the Linux-kernel mailing list was apparently not -sufficient to fully vent their ire at RCU's energy-efficiency bugs! - -

-Scheduling-Clock Interrupts and RCU

- -

-The kernel transitions between in-kernel non-idle execution, userspace -execution, and the idle loop. -Depending on kernel configuration, RCU handles these states differently: - - - - - - - - - - - - - - - - - - -
HZ KconfigIn-KernelUsermodeIdle
HZ_PERIODICCan rely on scheduling-clock interrupt.Can rely on scheduling-clock interrupt and its - detection of interrupt from usermode.Can rely on RCU's dyntick-idle detection.
NO_HZ_IDLECan rely on scheduling-clock interrupt.Can rely on scheduling-clock interrupt and its - detection of interrupt from usermode.Can rely on RCU's dyntick-idle detection.
NO_HZ_FULLCan only sometimes rely on scheduling-clock interrupt. - In other cases, it is necessary to bound kernel execution - times and/or use IPIs.Can rely on RCU's dyntick-idle detection.Can rely on RCU's dyntick-idle detection.
- - - - - - - - -
 
Quick Quiz:
- Why can't NO_HZ_FULL in-kernel execution rely on the - scheduling-clock interrupt, just like HZ_PERIODIC - and NO_HZ_IDLE do? -
Answer:
- Because, as a performance optimization, NO_HZ_FULL - does not necessarily re-enable the scheduling-clock interrupt - on entry to each and every system call. -
 
- -

-However, RCU must be reliably informed as to whether any given -CPU is currently in the idle loop, and, for NO_HZ_FULL, -also whether that CPU is executing in usermode, as discussed -earlier. -It also requires that the scheduling-clock interrupt be enabled when -RCU needs it to be: - -

    -
  1. If a CPU is either idle or executing in usermode, and RCU believes - it is non-idle, the scheduling-clock tick had better be running. - Otherwise, you will get RCU CPU stall warnings. Or at best, - very long (11-second) grace periods, with a pointless IPI waking - the CPU from time to time. -
  2. If a CPU is in a portion of the kernel that executes RCU read-side - critical sections, and RCU believes this CPU to be idle, you will get - random memory corruption. DON'T DO THIS!!! - -
    This is one reason to test with lockdep, which will complain - about this sort of thing. -
  3. If a CPU is in a portion of the kernel that is absolutely - positively no-joking guaranteed to never execute any RCU read-side - critical sections, and RCU believes this CPU to to be idle, - no problem. This sort of thing is used by some architectures - for light-weight exception handlers, which can then avoid the - overhead of rcu_irq_enter() and rcu_irq_exit() - at exception entry and exit, respectively. - Some go further and avoid the entireties of irq_enter() - and irq_exit(). - -
    Just make very sure you are running some of your tests with - CONFIG_PROVE_RCU=y, just in case one of your code paths - was in fact joking about not doing RCU read-side critical sections. -
  4. If a CPU is executing in the kernel with the scheduling-clock - interrupt disabled and RCU believes this CPU to be non-idle, - and if the CPU goes idle (from an RCU perspective) every few - jiffies, no problem. It is usually OK for there to be the - occasional gap between idle periods of up to a second or so. - -
    If the gap grows too long, you get RCU CPU stall warnings. -
  5. If a CPU is either idle or executing in usermode, and RCU believes - it to be idle, of course no problem. -
  6. If a CPU is executing in the kernel, the kernel code - path is passing through quiescent states at a reasonable - frequency (preferably about once per few jiffies, but the - occasional excursion to a second or so is usually OK) and the - scheduling-clock interrupt is enabled, of course no problem. - -
    If the gap between a successive pair of quiescent states grows - too long, you get RCU CPU stall warnings. -
- - - - - - - - -
 
Quick Quiz:
- But what if my driver has a hardware interrupt handler - that can run for many seconds? - I cannot invoke schedule() from an hardware - interrupt handler, after all! -
Answer:
- One approach is to do rcu_irq_exit();rcu_irq_enter(); - every so often. - But given that long-running interrupt handlers can cause - other problems, not least for response time, shouldn't you - work to keep your interrupt handler's runtime within reasonable - bounds? -
 
- -

-But as long as RCU is properly informed of kernel state transitions between -in-kernel execution, usermode execution, and idle, and as long as the -scheduling-clock interrupt is enabled when RCU needs it to be, you -can rest assured that the bugs you encounter will be in some other -part of RCU or some other part of the kernel! - -

Memory Efficiency

- -

-Although small-memory non-realtime systems can simply use Tiny RCU, -code size is only one aspect of memory efficiency. -Another aspect is the size of the rcu_head structure -used by call_rcu() and kfree_rcu(). -Although this structure contains nothing more than a pair of pointers, -it does appear in many RCU-protected data structures, including -some that are size critical. -The page structure is a case in point, as evidenced by -the many occurrences of the union keyword within that structure. - -

-This need for memory efficiency is one reason that RCU uses hand-crafted -singly linked lists to track the rcu_head structures that -are waiting for a grace period to elapse. -It is also the reason why rcu_head structures do not contain -debug information, such as fields tracking the file and line of the -call_rcu() or kfree_rcu() that posted them. -Although this information might appear in debug-only kernel builds at some -point, in the meantime, the ->func field will often provide -the needed debug information. - -

-However, in some cases, the need for memory efficiency leads to even -more extreme measures. -Returning to the page structure, the rcu_head field -shares storage with a great many other structures that are used at -various points in the corresponding page's lifetime. -In order to correctly resolve certain -race conditions, -the Linux kernel's memory-management subsystem needs a particular bit -to remain zero during all phases of grace-period processing, -and that bit happens to map to the bottom bit of the -rcu_head structure's ->next field. -RCU makes this guarantee as long as call_rcu() -is used to post the callback, as opposed to kfree_rcu() -or some future “lazy” -variant of call_rcu() that might one day be created for -energy-efficiency purposes. - -

-That said, there are limits. -RCU requires that the rcu_head structure be aligned to a -two-byte boundary, and passing a misaligned rcu_head -structure to one of the call_rcu() family of functions -will result in a splat. -It is therefore necessary to exercise caution when packing -structures containing fields of type rcu_head. -Why not a four-byte or even eight-byte alignment requirement? -Because the m68k architecture provides only two-byte alignment, -and thus acts as alignment's least common denominator. - -

-The reason for reserving the bottom bit of pointers to -rcu_head structures is to leave the door open to -“lazy” callbacks whose invocations can safely be deferred. -Deferring invocation could potentially have energy-efficiency -benefits, but only if the rate of non-lazy callbacks decreases -significantly for some important workload. -In the meantime, reserving the bottom bit keeps this option open -in case it one day becomes useful. - -

-Performance, Scalability, Response Time, and Reliability

- -

-Expanding on the -earlier discussion, -RCU is used heavily by hot code paths in performance-critical -portions of the Linux kernel's networking, security, virtualization, -and scheduling code paths. -RCU must therefore use efficient implementations, especially in its -read-side primitives. -To that end, it would be good if preemptible RCU's implementation -of rcu_read_lock() could be inlined, however, doing -this requires resolving #include issues with the -task_struct structure. - -

-The Linux kernel supports hardware configurations with up to -4096 CPUs, which means that RCU must be extremely scalable. -Algorithms that involve frequent acquisitions of global locks or -frequent atomic operations on global variables simply cannot be -tolerated within the RCU implementation. -RCU therefore makes heavy use of a combining tree based on the -rcu_node structure. -RCU is required to tolerate all CPUs continuously invoking any -combination of RCU's runtime primitives with minimal per-operation -overhead. -In fact, in many cases, increasing load must decrease the -per-operation overhead, witness the batching optimizations for -synchronize_rcu(), call_rcu(), -synchronize_rcu_expedited(), and rcu_barrier(). -As a general rule, RCU must cheerfully accept whatever the -rest of the Linux kernel decides to throw at it. - -

-The Linux kernel is used for real-time workloads, especially -in conjunction with the --rt patchset. -The real-time-latency response requirements are such that the -traditional approach of disabling preemption across RCU -read-side critical sections is inappropriate. -Kernels built with CONFIG_PREEMPT=y therefore -use an RCU implementation that allows RCU read-side critical -sections to be preempted. -This requirement made its presence known after users made it -clear that an earlier -real-time patch -did not meet their needs, in conjunction with some -RCU issues -encountered by a very early version of the -rt patchset. - -

-In addition, RCU must make do with a sub-100-microsecond real-time latency -budget. -In fact, on smaller systems with the -rt patchset, the Linux kernel -provides sub-20-microsecond real-time latencies for the whole kernel, -including RCU. -RCU's scalability and latency must therefore be sufficient for -these sorts of configurations. -To my surprise, the sub-100-microsecond real-time latency budget - -applies to even the largest systems [PDF], -up to and including systems with 4096 CPUs. -This real-time requirement motivated the grace-period kthread, which -also simplified handling of a number of race conditions. - -

-RCU must avoid degrading real-time response for CPU-bound threads, whether -executing in usermode (which is one use case for -CONFIG_NO_HZ_FULL=y) or in the kernel. -That said, CPU-bound loops in the kernel must execute -cond_resched() at least once per few tens of milliseconds -in order to avoid receiving an IPI from RCU. - -

-Finally, RCU's status as a synchronization primitive means that -any RCU failure can result in arbitrary memory corruption that can be -extremely difficult to debug. -This means that RCU must be extremely reliable, which in -practice also means that RCU must have an aggressive stress-test -suite. -This stress-test suite is called rcutorture. - -

-Although the need for rcutorture was no surprise, -the current immense popularity of the Linux kernel is posing -interesting—and perhaps unprecedented—validation -challenges. -To see this, keep in mind that there are well over one billion -instances of the Linux kernel running today, given Android -smartphones, Linux-powered televisions, and servers. -This number can be expected to increase sharply with the advent of -the celebrated Internet of Things. - -

-Suppose that RCU contains a race condition that manifests on average -once per million years of runtime. -This bug will be occurring about three times per day across -the installed base. -RCU could simply hide behind hardware error rates, given that no one -should really expect their smartphone to last for a million years. -However, anyone taking too much comfort from this thought should -consider the fact that in most jurisdictions, a successful multi-year -test of a given mechanism, which might include a Linux kernel, -suffices for a number of types of safety-critical certifications. -In fact, rumor has it that the Linux kernel is already being used -in production for safety-critical applications. -I don't know about you, but I would feel quite bad if a bug in RCU -killed someone. -Which might explain my recent focus on validation and verification. - -

Other RCU Flavors

- -

-One of the more surprising things about RCU is that there are now -no fewer than five flavors, or API families. -In addition, the primary flavor that has been the sole focus up to -this point has two different implementations, non-preemptible and -preemptible. -The other four flavors are listed below, with requirements for each -described in a separate section. - -

    -
  1. Bottom-Half Flavor (Historical) -
  2. Sched Flavor (Historical) -
  3. Sleepable RCU -
  4. Tasks RCU -
- -

Bottom-Half Flavor (Historical)

- -

-The RCU-bh flavor of RCU has since been expressed in terms of -the other RCU flavors as part of a consolidation of the three -flavors into a single flavor. -The read-side API remains, and continues to disable softirq and to -be accounted for by lockdep. -Much of the material in this section is therefore strictly historical -in nature. - -

-The softirq-disable (AKA “bottom-half”, -hence the “_bh” abbreviations) -flavor of RCU, or RCU-bh, was developed by -Dipankar Sarma to provide a flavor of RCU that could withstand the -network-based denial-of-service attacks researched by Robert -Olsson. -These attacks placed so much networking load on the system -that some of the CPUs never exited softirq execution, -which in turn prevented those CPUs from ever executing a context switch, -which, in the RCU implementation of that time, prevented grace periods -from ever ending. -The result was an out-of-memory condition and a system hang. - -

-The solution was the creation of RCU-bh, which does -local_bh_disable() -across its read-side critical sections, and which uses the transition -from one type of softirq processing to another as a quiescent state -in addition to context switch, idle, user mode, and offline. -This means that RCU-bh grace periods can complete even when some of -the CPUs execute in softirq indefinitely, thus allowing algorithms -based on RCU-bh to withstand network-based denial-of-service attacks. - -

-Because -rcu_read_lock_bh() and rcu_read_unlock_bh() -disable and re-enable softirq handlers, any attempt to start a softirq -handlers during the -RCU-bh read-side critical section will be deferred. -In this case, rcu_read_unlock_bh() -will invoke softirq processing, which can take considerable time. -One can of course argue that this softirq overhead should be associated -with the code following the RCU-bh read-side critical section rather -than rcu_read_unlock_bh(), but the fact -is that most profiling tools cannot be expected to make this sort -of fine distinction. -For example, suppose that a three-millisecond-long RCU-bh read-side -critical section executes during a time of heavy networking load. -There will very likely be an attempt to invoke at least one softirq -handler during that three milliseconds, but any such invocation will -be delayed until the time of the rcu_read_unlock_bh(). -This can of course make it appear at first glance as if -rcu_read_unlock_bh() was executing very slowly. - -

-The -RCU-bh API -includes -rcu_read_lock_bh(), -rcu_read_unlock_bh(), -rcu_dereference_bh(), -rcu_dereference_bh_check(), -synchronize_rcu_bh(), -synchronize_rcu_bh_expedited(), -call_rcu_bh(), -rcu_barrier_bh(), and -rcu_read_lock_bh_held(). -However, the update-side APIs are now simple wrappers for other RCU -flavors, namely RCU-sched in CONFIG_PREEMPT=n kernels and RCU-preempt -otherwise. - -

Sched Flavor (Historical)

- -

-The RCU-sched flavor of RCU has since been expressed in terms of -the other RCU flavors as part of a consolidation of the three -flavors into a single flavor. -The read-side API remains, and continues to disable preemption and to -be accounted for by lockdep. -Much of the material in this section is therefore strictly historical -in nature. - -

-Before preemptible RCU, waiting for an RCU grace period had the -side effect of also waiting for all pre-existing interrupt -and NMI handlers. -However, there are legitimate preemptible-RCU implementations that -do not have this property, given that any point in the code outside -of an RCU read-side critical section can be a quiescent state. -Therefore, RCU-sched was created, which follows “classic” -RCU in that an RCU-sched grace period waits for for pre-existing -interrupt and NMI handlers. -In kernels built with CONFIG_PREEMPT=n, the RCU and RCU-sched -APIs have identical implementations, while kernels built with -CONFIG_PREEMPT=y provide a separate implementation for each. - -

-Note well that in CONFIG_PREEMPT=y kernels, -rcu_read_lock_sched() and rcu_read_unlock_sched() -disable and re-enable preemption, respectively. -This means that if there was a preemption attempt during the -RCU-sched read-side critical section, rcu_read_unlock_sched() -will enter the scheduler, with all the latency and overhead entailed. -Just as with rcu_read_unlock_bh(), this can make it look -as if rcu_read_unlock_sched() was executing very slowly. -However, the highest-priority task won't be preempted, so that task -will enjoy low-overhead rcu_read_unlock_sched() invocations. - -

-The -RCU-sched API -includes -rcu_read_lock_sched(), -rcu_read_unlock_sched(), -rcu_read_lock_sched_notrace(), -rcu_read_unlock_sched_notrace(), -rcu_dereference_sched(), -rcu_dereference_sched_check(), -synchronize_sched(), -synchronize_rcu_sched_expedited(), -call_rcu_sched(), -rcu_barrier_sched(), and -rcu_read_lock_sched_held(). -However, anything that disables preemption also marks an RCU-sched -read-side critical section, including -preempt_disable() and preempt_enable(), -local_irq_save() and local_irq_restore(), -and so on. - -

Sleepable RCU

- -

-For well over a decade, someone saying “I need to block within -an RCU read-side critical section” was a reliable indication -that this someone did not understand RCU. -After all, if you are always blocking in an RCU read-side critical -section, you can probably afford to use a higher-overhead synchronization -mechanism. -However, that changed with the advent of the Linux kernel's notifiers, -whose RCU read-side critical -sections almost never sleep, but sometimes need to. -This resulted in the introduction of -sleepable RCU, -or SRCU. - -

-SRCU allows different domains to be defined, with each such domain -defined by an instance of an srcu_struct structure. -A pointer to this structure must be passed in to each SRCU function, -for example, synchronize_srcu(&ss), where -ss is the srcu_struct structure. -The key benefit of these domains is that a slow SRCU reader in one -domain does not delay an SRCU grace period in some other domain. -That said, one consequence of these domains is that read-side code -must pass a “cookie” from srcu_read_lock() -to srcu_read_unlock(), for example, as follows: - -

-
- 1 int idx;
- 2
- 3 idx = srcu_read_lock(&ss);
- 4 do_something();
- 5 srcu_read_unlock(&ss, idx);
-
-
- -

-As noted above, it is legal to block within SRCU read-side critical sections, -however, with great power comes great responsibility. -If you block forever in one of a given domain's SRCU read-side critical -sections, then that domain's grace periods will also be blocked forever. -Of course, one good way to block forever is to deadlock, which can -happen if any operation in a given domain's SRCU read-side critical -section can wait, either directly or indirectly, for that domain's -grace period to elapse. -For example, this results in a self-deadlock: - -

-
- 1 int idx;
- 2
- 3 idx = srcu_read_lock(&ss);
- 4 do_something();
- 5 synchronize_srcu(&ss);
- 6 srcu_read_unlock(&ss, idx);
-
-
- -

-However, if line 5 acquired a mutex that was held across -a synchronize_srcu() for domain ss, -deadlock would still be possible. -Furthermore, if line 5 acquired a mutex that was held across -a synchronize_srcu() for some other domain ss1, -and if an ss1-domain SRCU read-side critical section -acquired another mutex that was held across as ss-domain -synchronize_srcu(), -deadlock would again be possible. -Such a deadlock cycle could extend across an arbitrarily large number -of different SRCU domains. -Again, with great power comes great responsibility. - -

-Unlike the other RCU flavors, SRCU read-side critical sections can -run on idle and even offline CPUs. -This ability requires that srcu_read_lock() and -srcu_read_unlock() contain memory barriers, which means -that SRCU readers will run a bit slower than would RCU readers. -It also motivates the smp_mb__after_srcu_read_unlock() -API, which, in combination with srcu_read_unlock(), -guarantees a full memory barrier. - -

-Also unlike other RCU flavors, synchronize_srcu() may not -be invoked from CPU-hotplug notifiers, due to the fact that SRCU grace -periods make use of timers and the possibility of timers being temporarily -“stranded” on the outgoing CPU. -This stranding of timers means that timers posted to the outgoing CPU -will not fire until late in the CPU-hotplug process. -The problem is that if a notifier is waiting on an SRCU grace period, -that grace period is waiting on a timer, and that timer is stranded on the -outgoing CPU, then the notifier will never be awakened, in other words, -deadlock has occurred. -This same situation of course also prohibits srcu_barrier() -from being invoked from CPU-hotplug notifiers. - -

-SRCU also differs from other RCU flavors in that SRCU's expedited and -non-expedited grace periods are implemented by the same mechanism. -This means that in the current SRCU implementation, expediting a -future grace period has the side effect of expediting all prior -grace periods that have not yet completed. -(But please note that this is a property of the current implementation, -not necessarily of future implementations.) -In addition, if SRCU has been idle for longer than the interval -specified by the srcutree.exp_holdoff kernel boot parameter -(25 microseconds by default), -and if a synchronize_srcu() invocation ends this idle period, -that invocation will be automatically expedited. - -

-As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating -a locking bottleneck present in prior kernel versions. -Although this will allow users to put much heavier stress on -call_srcu(), it is important to note that SRCU does not -yet take any special steps to deal with callback flooding. -So if you are posting (say) 10,000 SRCU callbacks per second per CPU, -you are probably totally OK, but if you intend to post (say) 1,000,000 -SRCU callbacks per second per CPU, please run some tests first. -SRCU just might need a few adjustment to deal with that sort of load. -Of course, your mileage may vary based on the speed of your CPUs and -the size of your memory. - -

-The -SRCU API -includes -srcu_read_lock(), -srcu_read_unlock(), -srcu_dereference(), -srcu_dereference_check(), -synchronize_srcu(), -synchronize_srcu_expedited(), -call_srcu(), -srcu_barrier(), and -srcu_read_lock_held(). -It also includes -DEFINE_SRCU(), -DEFINE_STATIC_SRCU(), and -init_srcu_struct() -APIs for defining and initializing srcu_struct structures. - -

Tasks RCU

- -

-Some forms of tracing use “trampolines” to handle the -binary rewriting required to install different types of probes. -It would be good to be able to free old trampolines, which sounds -like a job for some form of RCU. -However, because it is necessary to be able to install a trace -anywhere in the code, it is not possible to use read-side markers -such as rcu_read_lock() and rcu_read_unlock(). -In addition, it does not work to have these markers in the trampoline -itself, because there would need to be instructions following -rcu_read_unlock(). -Although synchronize_rcu() would guarantee that execution -reached the rcu_read_unlock(), it would not be able to -guarantee that execution had completely left the trampoline. - -

-The solution, in the form of -Tasks RCU, -is to have implicit -read-side critical sections that are delimited by voluntary context -switches, that is, calls to schedule(), -cond_resched(), and -synchronize_rcu_tasks(). -In addition, transitions to and from userspace execution also delimit -tasks-RCU read-side critical sections. - -

-The tasks-RCU API is quite compact, consisting only of -call_rcu_tasks(), -synchronize_rcu_tasks(), and -rcu_barrier_tasks(). -In CONFIG_PREEMPT=n kernels, trampolines cannot be preempted, -so these APIs map to -call_rcu(), -synchronize_rcu(), and -rcu_barrier(), respectively. -In CONFIG_PREEMPT=y kernels, trampolines can be preempted, -and these three APIs are therefore implemented by separate functions -that check for voluntary context switches. - -

Possible Future Changes

- -

-One of the tricks that RCU uses to attain update-side scalability is -to increase grace-period latency with increasing numbers of CPUs. -If this becomes a serious problem, it will be necessary to rework the -grace-period state machine so as to avoid the need for the additional -latency. - -

-RCU disables CPU hotplug in a few places, perhaps most notably in the -rcu_barrier() operations. -If there is a strong reason to use rcu_barrier() in CPU-hotplug -notifiers, it will be necessary to avoid disabling CPU hotplug. -This would introduce some complexity, so there had better be a very -good reason. - -

-The tradeoff between grace-period latency on the one hand and interruptions -of other CPUs on the other hand may need to be re-examined. -The desire is of course for zero grace-period latency as well as zero -interprocessor interrupts undertaken during an expedited grace period -operation. -While this ideal is unlikely to be achievable, it is quite possible that -further improvements can be made. - -

-The multiprocessor implementations of RCU use a combining tree that -groups CPUs so as to reduce lock contention and increase cache locality. -However, this combining tree does not spread its memory across NUMA -nodes nor does it align the CPU groups with hardware features such -as sockets or cores. -Such spreading and alignment is currently believed to be unnecessary -because the hotpath read-side primitives do not access the combining -tree, nor does call_rcu() in the common case. -If you believe that your architecture needs such spreading and alignment, -then your architecture should also benefit from the -rcutree.rcu_fanout_leaf boot parameter, which can be set -to the number of CPUs in a socket, NUMA node, or whatever. -If the number of CPUs is too large, use a fraction of the number of -CPUs. -If the number of CPUs is a large prime number, well, that certainly -is an “interesting” architectural choice! -More flexible arrangements might be considered, but only if -rcutree.rcu_fanout_leaf has proven inadequate, and only -if the inadequacy has been demonstrated by a carefully run and -realistic system-level workload. - -

-Please note that arrangements that require RCU to remap CPU numbers will -require extremely good demonstration of need and full exploration of -alternatives. - -

-RCU's various kthreads are reasonably recent additions. -It is quite likely that adjustments will be required to more gracefully -handle extreme loads. -It might also be necessary to be able to relate CPU utilization by -RCU's kthreads and softirq handlers to the code that instigated this -CPU utilization. -For example, RCU callback overhead might be charged back to the -originating call_rcu() instance, though probably not -in production kernels. - -

-Additional work may be required to provide reasonable forward-progress -guarantees under heavy load for grace periods and for callback -invocation. - -

Summary

- -

-This document has presented more than two decade's worth of RCU -requirements. -Given that the requirements keep changing, this will not be the last -word on this subject, but at least it serves to get an important -subset of the requirements set forth. - -

Acknowledgments

- -I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, -Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and -Andy Lutomirski for their help in rendering -this article human readable, and to Michelle Rankin for her support -of this effort. -Other contributions are acknowledged in the Linux kernel's git archive. - - diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst new file mode 100644 index 0000000000000000000000000000000000000000..876e0038bb58ff01e1557ca9d149045677f7fd29 --- /dev/null +++ b/Documentation/RCU/Design/Requirements/Requirements.rst @@ -0,0 +1,2662 @@ +================================= +A Tour Through RCU's Requirements +================================= + +Copyright IBM Corporation, 2015 + +Author: Paul E. McKenney + +The initial version of this document appeared in the +`LWN `_ on those articles: +`part 1 `_, +`part 2 `_, and +`part 3 `_. + +Introduction +------------ + +Read-copy update (RCU) is a synchronization mechanism that is often used +as a replacement for reader-writer locking. RCU is unusual in that +updaters do not block readers, which means that RCU's read-side +primitives can be exceedingly fast and scalable. In addition, updaters +can make useful forward progress concurrently with readers. However, all +this concurrency between RCU readers and updaters does raise the +question of exactly what RCU readers are doing, which in turn raises the +question of exactly what RCU's requirements are. + +This document therefore summarizes RCU's requirements, and can be +thought of as an informal, high-level specification for RCU. It is +important to understand that RCU's specification is primarily empirical +in nature; in fact, I learned about many of these requirements the hard +way. This situation might cause some consternation, however, not only +has this learning process been a lot of fun, but it has also been a +great privilege to work with so many people willing to apply +technologies in interesting new ways. + +All that aside, here are the categories of currently known RCU +requirements: + +#. `Fundamental Requirements <#Fundamental%20Requirements>`__ +#. `Fundamental Non-Requirements <#Fundamental%20Non-Requirements>`__ +#. `Parallelism Facts of Life <#Parallelism%20Facts%20of%20Life>`__ +#. `Quality-of-Implementation + Requirements <#Quality-of-Implementation%20Requirements>`__ +#. `Linux Kernel Complications <#Linux%20Kernel%20Complications>`__ +#. `Software-Engineering + Requirements <#Software-Engineering%20Requirements>`__ +#. `Other RCU Flavors <#Other%20RCU%20Flavors>`__ +#. `Possible Future Changes <#Possible%20Future%20Changes>`__ + +This is followed by a `summary <#Summary>`__, however, the answers to +each quick quiz immediately follows the quiz. Select the big white space +with your mouse to see the answer. + +Fundamental Requirements +------------------------ + +RCU's fundamental requirements are the closest thing RCU has to hard +mathematical requirements. These are: + +#. `Grace-Period Guarantee <#Grace-Period%20Guarantee>`__ +#. `Publish-Subscribe Guarantee <#Publish-Subscribe%20Guarantee>`__ +#. `Memory-Barrier Guarantees <#Memory-Barrier%20Guarantees>`__ +#. `RCU Primitives Guaranteed to Execute + Unconditionally <#RCU%20Primitives%20Guaranteed%20to%20Execute%20Unconditionally>`__ +#. `Guaranteed Read-to-Write + Upgrade <#Guaranteed%20Read-to-Write%20Upgrade>`__ + +Grace-Period Guarantee +~~~~~~~~~~~~~~~~~~~~~~ + +RCU's grace-period guarantee is unusual in being premeditated: Jack +Slingwine and I had this guarantee firmly in mind when we started work +on RCU (then called “rclock”) in the early 1990s. That said, the past +two decades of experience with RCU have produced a much more detailed +understanding of this guarantee. + +RCU's grace-period guarantee allows updaters to wait for the completion +of all pre-existing RCU read-side critical sections. An RCU read-side +critical section begins with the marker ``rcu_read_lock()`` and ends +with the marker ``rcu_read_unlock()``. These markers may be nested, and +RCU treats a nested set as one big RCU read-side critical section. +Production-quality implementations of ``rcu_read_lock()`` and +``rcu_read_unlock()`` are extremely lightweight, and in fact have +exactly zero overhead in Linux kernels built for production use with +``CONFIG_PREEMPT=n``. + +This guarantee allows ordering to be enforced with extremely low +overhead to readers, for example: + + :: + + 1 int x, y; + 2 + 3 void thread0(void) + 4 { + 5 rcu_read_lock(); + 6 r1 = READ_ONCE(x); + 7 r2 = READ_ONCE(y); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 WRITE_ONCE(x, 1); + 14 synchronize_rcu(); + 15 WRITE_ONCE(y, 1); + 16 } + +Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing +readers, any instance of ``thread0()`` that loads a value of zero from +``x`` must complete before ``thread1()`` stores to ``y``, so that +instance must also load a value of zero from ``y``. Similarly, any +instance of ``thread0()`` that loads a value of one from ``y`` must have +started after the ``synchronize_rcu()`` started, and must therefore also +load a value of one from ``x``. Therefore, the outcome: + + :: + + (r1 == 0 && r2 == 1) + +cannot happen. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! You said that updaters can make useful forward | +| progress concurrently with readers, but pre-existing readers will | +| block ``synchronize_rcu()``!!! | +| Just who are you trying to fool??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| First, if updaters do not wish to be blocked by readers, they can use | +| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later. | +| Second, even when using ``synchronize_rcu()``, the other update-side | +| code does run concurrently with readers, whether pre-existing or not. | ++-----------------------------------------------------------------------+ + +This scenario resembles one of the first uses of RCU in +`DYNIX/ptx `__, which managed a +distributed lock manager's transition into a state suitable for handling +recovery from node failure, more or less as follows: + + :: + + 1 #define STATE_NORMAL 0 + 2 #define STATE_WANT_RECOVERY 1 + 3 #define STATE_RECOVERING 2 + 4 #define STATE_WANT_NORMAL 3 + 5 + 6 int state = STATE_NORMAL; + 7 + 8 void do_something_dlm(void) + 9 { + 10 int state_snap; + 11 + 12 rcu_read_lock(); + 13 state_snap = READ_ONCE(state); + 14 if (state_snap == STATE_NORMAL) + 15 do_something(); + 16 else + 17 do_something_carefully(); + 18 rcu_read_unlock(); + 19 } + 20 + 21 void start_recovery(void) + 22 { + 23 WRITE_ONCE(state, STATE_WANT_RECOVERY); + 24 synchronize_rcu(); + 25 WRITE_ONCE(state, STATE_RECOVERING); + 26 recovery(); + 27 WRITE_ONCE(state, STATE_WANT_NORMAL); + 28 synchronize_rcu(); + 29 WRITE_ONCE(state, STATE_NORMAL); + 30 } + +The RCU read-side critical section in ``do_something_dlm()`` works with +the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that +``do_something()`` never runs concurrently with ``recovery()``, but with +little or no synchronization overhead in ``do_something_dlm()``. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why is the ``synchronize_rcu()`` on line 28 needed? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Without that extra grace period, memory reordering could result in | +| ``do_something_dlm()`` executing ``do_something()`` concurrently with | +| the last bits of ``recovery()``. | ++-----------------------------------------------------------------------+ + +In order to avoid fatal problems such as deadlocks, an RCU read-side +critical section must not contain calls to ``synchronize_rcu()``. +Similarly, an RCU read-side critical section must not contain anything +that waits, directly or indirectly, on completion of an invocation of +``synchronize_rcu()``. + +Although RCU's grace-period guarantee is useful in and of itself, with +`quite a few use cases `__, it would +be good to be able to use RCU to coordinate read-side access to linked +data structures. For this, the grace-period guarantee is not sufficient, +as can be seen in function ``add_gp_buggy()`` below. We will look at the +reader's code later, but in the meantime, just think of the reader as +locklessly picking up the ``gp`` pointer, and, if the value loaded is +non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields. + + :: + + 1 bool add_gp_buggy(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 p->a = a; + 12 p->b = a; + 13 gp = p; /* ORDERING BUG */ + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +The problem is that both the compiler and weakly ordered CPUs are within +their rights to reorder this code as follows: + + :: + + 1 bool add_gp_buggy_optimized(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 gp = p; /* ORDERING BUG */ + 12 p->a = a; + 13 p->b = a; + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized`` +executes line 11, it will see garbage in the ``->a`` and ``->b`` fields. +And this is but one of many ways in which compiler and hardware +optimizations could cause trouble. Therefore, we clearly need some way +to prevent the compiler and the CPU from reordering in this manner, +which brings us to the publish-subscribe guarantee discussed in the next +section. + +Publish/Subscribe Guarantee +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +RCU's publish-subscribe guarantee allows data to be inserted into a +linked data structure without disrupting RCU readers. The updater uses +``rcu_assign_pointer()`` to insert the new data, and readers use +``rcu_dereference()`` to access data, whether new or old. The following +shows an example of insertion: + + :: + + 1 bool add_gp(int a, int b) + 2 { + 3 p = kmalloc(sizeof(*p), GFP_KERNEL); + 4 if (!p) + 5 return -ENOMEM; + 6 spin_lock(&gp_lock); + 7 if (rcu_access_pointer(gp)) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 p->a = a; + 12 p->b = a; + 13 rcu_assign_pointer(gp, p); + 14 spin_unlock(&gp_lock); + 15 return true; + 16 } + +The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a +simple assignment statement, but also guarantees that its assignment +will happen after the two assignments in lines 11 and 12, similar to the +C11 ``memory_order_release`` store operation. It also prevents any +number of “interesting” compiler optimizations, for example, the use of +``gp`` as a scratch location immediately preceding the assignment. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But ``rcu_assign_pointer()`` does nothing to prevent the two | +| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that | +| also cause problems? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, it cannot. The readers cannot see either of these two fields | +| until the assignment to ``gp``, by which time both fields are fully | +| initialized. So reordering the assignments to ``p->a`` and ``p->b`` | +| cannot possibly cause any problems. | ++-----------------------------------------------------------------------+ + +It is tempting to assume that the reader need not do anything special to +control its accesses to the RCU-protected data, as shown in +``do_something_gp_buggy()`` below: + + :: + + 1 bool do_something_gp_buggy(void) + 2 { + 3 rcu_read_lock(); + 4 p = gp; /* OPTIMIZATIONS GALORE!!! */ + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } + 10 rcu_read_unlock(); + 11 return false; + 12 } + +However, this temptation must be resisted because there are a +surprisingly large number of ways that the compiler (to say nothing of +`DEC Alpha CPUs `__) +can trip this code up. For but one example, if the compiler were short +of registers, it might choose to refetch from ``gp`` rather than keeping +a separate copy in ``p`` as follows: + + :: + + 1 bool do_something_gp_buggy_optimized(void) + 2 { + 3 rcu_read_lock(); + 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */ + 5 do_something(gp->a, gp->b); + 6 rcu_read_unlock(); + 7 return true; + 8 } + 9 rcu_read_unlock(); + 10 return false; + 11 } + +If this function ran concurrently with a series of updates that replaced +the current structure with a new one, the fetches of ``gp->a`` and +``gp->b`` might well come from two different structures, which could +cause serious confusion. To prevent this (and much else besides), +``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``: + + :: + + 1 bool do_something_gp(void) + 2 { + 3 rcu_read_lock(); + 4 p = rcu_dereference(gp); + 5 if (p) { + 6 do_something(p->a, p->b); + 7 rcu_read_unlock(); + 8 return true; + 9 } + 10 rcu_read_unlock(); + 11 return false; + 12 } + +The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory +barriers in the Linux kernel. Should a `high-quality implementation of +C11 ``memory_order_consume`` +[PDF] `__ +ever appear, then ``rcu_dereference()`` could be implemented as a +``memory_order_consume`` load. Regardless of the exact implementation, a +pointer fetched by ``rcu_dereference()`` may not be used outside of the +outermost RCU read-side critical section containing that +``rcu_dereference()``, unless protection of the corresponding data +element has been passed from RCU to some other synchronization +mechanism, most commonly locking or `reference +counting `__. + +In short, updaters use ``rcu_assign_pointer()`` and readers use +``rcu_dereference()``, and these two RCU API elements work together to +ensure that readers have a consistent view of newly added data elements. + +Of course, it is also necessary to remove elements from RCU-protected +data structures, for example, using the following process: + +#. Remove the data element from the enclosing structure. +#. Wait for all pre-existing RCU read-side critical sections to complete + (because only pre-existing readers can possibly have a reference to + the newly removed data element). +#. At this point, only the updater has a reference to the newly removed + data element, so it can safely reclaim the data element, for example, + by passing it to ``kfree()``. + +This process is implemented by ``remove_gp_synchronous()``: + + :: + + 1 bool remove_gp_synchronous(void) + 2 { + 3 struct foo *p; + 4 + 5 spin_lock(&gp_lock); + 6 p = rcu_access_pointer(gp); + 7 if (!p) { + 8 spin_unlock(&gp_lock); + 9 return false; + 10 } + 11 rcu_assign_pointer(gp, NULL); + 12 spin_unlock(&gp_lock); + 13 synchronize_rcu(); + 14 kfree(p); + 15 return true; + 16 } + +This function is straightforward, with line 13 waiting for a grace +period before line 14 frees the old data element. This waiting ensures +that readers will reach line 7 of ``do_something_gp()`` before the data +element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on +line 6 is similar to ``rcu_dereference()``, except that: + +#. The value returned by ``rcu_access_pointer()`` cannot be + dereferenced. If you want to access the value pointed to as well as + the pointer itself, use ``rcu_dereference()`` instead of + ``rcu_access_pointer()``. +#. The call to ``rcu_access_pointer()`` need not be protected. In + contrast, ``rcu_dereference()`` must either be within an RCU + read-side critical section or in a code segment where the pointer + cannot change, for example, in code protected by the corresponding + update-side lock. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``, | +| what destructive optimizations might the compiler make use of? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Let's start with what happens to ``do_something_gp()`` if it fails to | +| use ``rcu_dereference()``. It could reuse a value formerly fetched | +| from this same pointer. It could also fetch the pointer from ``gp`` | +| in a byte-at-a-time manner, resulting in *load tearing*, in turn | +| resulting a bytewise mash-up of two distinct pointer values. It might | +| even use value-speculation optimizations, where it makes a wrong | +| guess, but by the time it gets around to checking the value, an | +| update has changed the pointer to match the wrong guess. Too bad | +| about any dereferences that returned pre-initialization garbage in | +| the meantime! | +| For ``remove_gp_synchronous()``, as long as all modifications to | +| ``gp`` are carried out while holding ``gp_lock``, the above | +| optimizations are harmless. However, ``sparse`` will complain if you | +| define ``gp`` with ``__rcu`` and then access it without using either | +| ``rcu_access_pointer()`` or ``rcu_dereference()``. | ++-----------------------------------------------------------------------+ + +In short, RCU's publish-subscribe guarantee is provided by the +combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This +guarantee allows data elements to be safely added to RCU-protected +linked data structures without disrupting RCU readers. This guarantee +can be used in combination with the grace-period guarantee to also allow +data elements to be removed from RCU-protected linked data structures, +again without disrupting RCU readers. + +This guarantee was only partially premeditated. DYNIX/ptx used an +explicit memory barrier for publication, but had nothing resembling +``rcu_dereference()`` for subscription, nor did it have anything +resembling the ``smp_read_barrier_depends()`` that was later subsumed +into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The +need for these operations made itself known quite suddenly at a +late-1990s meeting with the DEC Alpha architects, back in the days when +DEC was still a free-standing company. It took the Alpha architects a +good hour to convince me that any sort of barrier would ever be needed, +and it then took me a good *two* hours to convince them that their +documentation did not make this point clear. More recent work with the C +and C++ standards committees have provided much education on tricks and +traps from the compiler. In short, compilers were much less tricky in +the early 1990s, but in 2015, don't even think about omitting +``rcu_dereference()``! + +Memory-Barrier Guarantees +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The previous section's simple linked-data-structure scenario clearly +demonstrates the need for RCU's stringent memory-ordering guarantees on +systems with more than one CPU: + +#. Each CPU that has an RCU read-side critical section that begins + before ``synchronize_rcu()`` starts is guaranteed to execute a full + memory barrier between the time that the RCU read-side critical + section ends and the time that ``synchronize_rcu()`` returns. Without + this guarantee, a pre-existing RCU read-side critical section might + hold a reference to the newly removed ``struct foo`` after the + ``kfree()`` on line 14 of ``remove_gp_synchronous()``. +#. Each CPU that has an RCU read-side critical section that ends after + ``synchronize_rcu()`` returns is guaranteed to execute a full memory + barrier between the time that ``synchronize_rcu()`` begins and the + time that the RCU read-side critical section begins. Without this + guarantee, a later RCU read-side critical section running after the + ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might later run + ``do_something_gp()`` and find the newly deleted ``struct foo``. +#. If the task invoking ``synchronize_rcu()`` remains on a given CPU, + then that CPU is guaranteed to execute a full memory barrier sometime + during the execution of ``synchronize_rcu()``. This guarantee ensures + that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really + does execute after the removal on line 11. +#. If the task invoking ``synchronize_rcu()`` migrates among a group of + CPUs during that invocation, then each of the CPUs in that group is + guaranteed to execute a full memory barrier sometime during the + execution of ``synchronize_rcu()``. This guarantee also ensures that + the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does + execute after the removal on line 11, but also in the case where the + thread executing the ``synchronize_rcu()`` migrates in the meantime. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Given that multiple CPUs can start RCU read-side critical sections at | +| any time without any ordering whatsoever, how can RCU possibly tell | +| whether or not a given RCU read-side critical section starts before a | +| given instance of ``synchronize_rcu()``? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| If RCU cannot tell whether or not a given RCU read-side critical | +| section starts before a given instance of ``synchronize_rcu()``, then | +| it must assume that the RCU read-side critical section started first. | +| In other words, a given instance of ``synchronize_rcu()`` can avoid | +| waiting on a given RCU read-side critical section only if it can | +| prove that ``synchronize_rcu()`` started first. | +| A related question is “When ``rcu_read_lock()`` doesn't generate any | +| code, why does it matter how it relates to a grace period?” The | +| answer is that it is not the relationship of ``rcu_read_lock()`` | +| itself that is important, but rather the relationship of the code | +| within the enclosed RCU read-side critical section to the code | +| preceding and following the grace period. If we take this viewpoint, | +| then a given RCU read-side critical section begins before a given | +| grace period when some access preceding the grace period observes the | +| effect of some access within the critical section, in which case none | +| of the accesses within the critical section may observe the effects | +| of any access following the grace period. | +| | +| As of late 2016, mathematical models of RCU take this viewpoint, for | +| example, see slides 62 and 63 of the `2016 LinuxCon | +| EU `__ | +| presentation. | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| The first and second guarantees require unbelievably strict ordering! | +| Are all these memory barriers *really* required? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Yes, they really are required. To see why the first guarantee is | +| required, consider the following sequence of events: | +| | +| #. CPU 1: ``rcu_read_lock()`` | +| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: ``synchronize_rcu()`` starts. | +| #. CPU 1: ``do_something_with(q->a);`` | +| ``/* No smp_mb(), so might happen after kfree(). */`` | +| #. CPU 1: ``rcu_read_unlock()`` | +| #. CPU 0: ``synchronize_rcu()`` returns. | +| #. CPU 0: ``kfree(p);`` | +| | +| Therefore, there absolutely must be a full memory barrier between the | +| end of the RCU read-side critical section and the end of the grace | +| period. | +| | +| The sequence of events demonstrating the necessity of the second rule | +| is roughly similar: | +| | +| #. CPU 0: ``list_del_rcu(p);`` | +| #. CPU 0: ``synchronize_rcu()`` starts. | +| #. CPU 1: ``rcu_read_lock()`` | +| #. CPU 1: ``q = rcu_dereference(gp);`` | +| ``/* Might return p if no memory barrier. */`` | +| #. CPU 0: ``synchronize_rcu()`` returns. | +| #. CPU 0: ``kfree(p);`` | +| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` | +| #. CPU 1: ``rcu_read_unlock()`` | +| | +| And similarly, without a memory barrier between the beginning of the | +| grace period and the beginning of the RCU read-side critical section, | +| CPU 1 might end up accessing the freelist. | +| | +| The “as if” rule of course applies, so that any implementation that | +| acts as if the appropriate memory barriers were in place is a correct | +| implementation. That said, it is much easier to fool yourself into | +| believing that you have adhered to the as-if rule than it is to | +| actually adhere to it! | ++-----------------------------------------------------------------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | +| absolutely no code in some kernel builds. This means that the | +| compiler might arbitrarily rearrange consecutive RCU read-side | +| critical sections. Given such rearrangement, if a given RCU read-side | +| critical section is done, how can you be sure that all prior RCU | +| read-side critical sections are done? Won't the compiler | +| rearrangements make that impossible to determine? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate | +| absolutely no code, RCU infers quiescent states only at special | +| locations, for example, within the scheduler. Because calls to | +| ``schedule()`` had better prevent calling-code accesses to shared | +| variables from being rearranged across the call to ``schedule()``, if | +| RCU detects the end of a given RCU read-side critical section, it | +| will necessarily detect the end of all prior RCU read-side critical | +| sections, no matter how aggressively the compiler scrambles the code. | +| Again, this all assumes that the compiler cannot scramble code across | +| calls to the scheduler, out of interrupt handlers, into the idle | +| loop, into user-mode code, and so on. But if your kernel build allows | +| that sort of scrambling, you have broken far more than just RCU! | ++-----------------------------------------------------------------------+ + +Note that these memory-barrier requirements do not replace the +fundamental RCU requirement that a grace period wait for all +pre-existing readers. On the contrary, the memory barriers called out in +this section must operate in such a way as to *enforce* this fundamental +requirement. Of course, different implementations enforce this +requirement in different ways, but enforce it they must. + +RCU Primitives Guaranteed to Execute Unconditionally +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The common-case RCU primitives are unconditional. They are invoked, they +do their job, and they return, with no possibility of error, and no need +to retry. This is a key RCU design philosophy. + +However, this philosophy is pragmatic rather than pigheaded. If someone +comes up with a good justification for a particular conditional RCU +primitive, it might well be implemented and added. After all, this +guarantee was reverse-engineered, not premeditated. The unconditional +nature of the RCU primitives was initially an accident of +implementation, and later experience with synchronization primitives +with conditional primitives caused me to elevate this accident to a +guarantee. Therefore, the justification for adding a conditional +primitive to RCU would need to be based on detailed and compelling use +cases. + +Guaranteed Read-to-Write Upgrade +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As far as RCU is concerned, it is always possible to carry out an update +within an RCU read-side critical section. For example, that RCU +read-side critical section might search for a given data element, and +then might acquire the update-side spinlock in order to update that +element, all while remaining in that RCU read-side critical section. Of +course, it is necessary to exit the RCU read-side critical section +before invoking ``synchronize_rcu()``, however, this inconvenience can +be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API +members described later in this document. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But how does the upgrade-to-write operation exclude other readers? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| It doesn't, just like normal RCU updates, which also do not exclude | +| RCU readers. | ++-----------------------------------------------------------------------+ + +This guarantee allows lookup code to be shared between read-side and +update-side code, and was premeditated, appearing in the earliest +DYNIX/ptx RCU documentation. + +Fundamental Non-Requirements +---------------------------- + +RCU provides extremely lightweight readers, and its read-side +guarantees, though quite useful, are correspondingly lightweight. It is +therefore all too easy to assume that RCU is guaranteeing more than it +really is. Of course, the list of things that RCU does not guarantee is +infinitely long, however, the following sections list a few +non-guarantees that have caused confusion. Except where otherwise noted, +these non-guarantees were premeditated. + +#. `Readers Impose Minimal + Ordering <#Readers%20Impose%20Minimal%20Ordering>`__ +#. `Readers Do Not Exclude + Updaters <#Readers%20Do%20Not%20Exclude%20Updaters>`__ +#. `Updaters Only Wait For Old + Readers <#Updaters%20Only%20Wait%20For%20Old%20Readers>`__ +#. `Grace Periods Don't Partition Read-Side Critical + Sections <#Grace%20Periods%20Don't%20Partition%20Read-Side%20Critical%20Sections>`__ +#. `Read-Side Critical Sections Don't Partition Grace + Periods <#Read-Side%20Critical%20Sections%20Don't%20Partition%20Grace%20Periods>`__ + +Readers Impose Minimal Ordering +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Reader-side markers such as ``rcu_read_lock()`` and +``rcu_read_unlock()`` provide absolutely no ordering guarantees except +through their interaction with the grace-period APIs such as +``synchronize_rcu()``. To see this, consider the following pair of +threads: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(x, 1); + 5 rcu_read_unlock(); + 6 rcu_read_lock(); + 7 WRITE_ONCE(y, 1); + 8 rcu_read_unlock(); + 9 } + 10 + 11 void thread1(void) + 12 { + 13 rcu_read_lock(); + 14 r1 = READ_ONCE(y); + 15 rcu_read_unlock(); + 16 rcu_read_lock(); + 17 r2 = READ_ONCE(x); + 18 rcu_read_unlock(); + 19 } + +After ``thread0()`` and ``thread1()`` execute concurrently, it is quite +possible to have + + :: + + (r1 == 1 && r2 == 0) + +(that is, ``y`` appears to have been assigned before ``x``), which would +not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had +much in the way of ordering properties. But they do not, so the CPU is +within its rights to do significant reordering. This is by design: Any +significant ordering constraints would slow down these fast-path APIs. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Can't the compiler also reorder this code? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()`` | +| prevent the compiler from reordering in this particular case. | ++-----------------------------------------------------------------------+ + +Readers Do Not Exclude Updaters +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates. +All they do is to prevent grace periods from ending. The following +example illustrates this: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 r1 = READ_ONCE(y); + 5 if (r1) { + 6 do_something_with_nonzero_x(); + 7 r2 = READ_ONCE(x); + 8 WARN_ON(!r2); /* BUG!!! */ + 9 } + 10 rcu_read_unlock(); + 11 } + 12 + 13 void thread1(void) + 14 { + 15 spin_lock(&my_lock); + 16 WRITE_ONCE(x, 1); + 17 WRITE_ONCE(y, 1); + 18 spin_unlock(&my_lock); + 19 } + +If the ``thread0()`` function's ``rcu_read_lock()`` excluded the +``thread1()`` function's update, the ``WARN_ON()`` could never fire. But +the fact is that ``rcu_read_lock()`` does not exclude much of anything +aside from subsequent grace periods, of which ``thread1()`` has none, so +the ``WARN_ON()`` can and does fire. + +Updaters Only Wait For Old Readers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It might be tempting to assume that after ``synchronize_rcu()`` +completes, there are no readers executing. This temptation must be +avoided because new readers can start immediately after +``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no +obligation to wait for these new readers. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Suppose that synchronize_rcu() did wait until *all* readers had | +| completed instead of waiting only on pre-existing readers. For how | +| long would the updater be able to rely on there being no readers? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| For no time at all. Even if ``synchronize_rcu()`` were to wait until | +| all readers had completed, a new reader might start immediately after | +| ``synchronize_rcu()`` completed. Therefore, the code following | +| ``synchronize_rcu()`` can *never* rely on there being no readers. | ++-----------------------------------------------------------------------+ + +Grace Periods Don't Partition Read-Side Critical Sections +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is tempting to assume that if any part of one RCU read-side critical +section precedes a given grace period, and if any part of another RCU +read-side critical section follows that same grace period, then all of +the first RCU read-side critical section must precede all of the second. +However, this just isn't the case: A single grace period does not +partition the set of RCU read-side critical sections. An example of this +situation can be illustrated as follows, where ``x``, ``y``, and ``z`` +are initially all zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 r2 = READ_ONCE(b); + 20 r3 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + +It turns out that the outcome: + + :: + + (r1 == 1 && r2 == 0 && r3 == 1) + +is entirely possible. The following figure show how this can happen, +with each circled ``QS`` indicating the point at which RCU recorded a +*quiescent state* for each thread, that is, a state in which RCU knows +that the thread cannot be in the midst of an RCU read-side critical +section that started before the current grace period: + +.. kernel-figure:: GPpartitionReaders1.svg + +If it is necessary to partition RCU read-side critical sections in this +manner, it is necessary to use two grace periods, where the first grace +period is known to end before the second grace period starts: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 r2 = READ_ONCE(c); + 19 synchronize_rcu(); + 20 WRITE_ONCE(d, 1); + 21 } + 22 + 23 void thread3(void) + 24 { + 25 rcu_read_lock(); + 26 r3 = READ_ONCE(b); + 27 r4 = READ_ONCE(d); + 28 rcu_read_unlock(); + 29 } + +Here, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen +before the end of ``thread1()``'s grace period. If in addition +``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after +the beginning of ``thread2()``'s grace period. If it is also the case +that ``(r2 == 1)``, then the end of ``thread1()``'s grace period must +precede the beginning of ``thread2()``'s grace period. This mean that +the two RCU read-side critical sections cannot overlap, guaranteeing +that ``(r3 == 1)``. As a result, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1) + +cannot happen. + +This non-requirement was also non-premeditated, but became apparent when +studying RCU's interaction with memory ordering. + +Read-Side Critical Sections Don't Partition Grace Periods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is also tempting to assume that if an RCU read-side critical section +happens between a pair of grace periods, then those grace periods cannot +overlap. However, this temptation leads nowhere good, as can be +illustrated by the following, with all variables initially zero: + + :: + + 1 void thread0(void) + 2 { + 3 rcu_read_lock(); + 4 WRITE_ONCE(a, 1); + 5 WRITE_ONCE(b, 1); + 6 rcu_read_unlock(); + 7 } + 8 + 9 void thread1(void) + 10 { + 11 r1 = READ_ONCE(a); + 12 synchronize_rcu(); + 13 WRITE_ONCE(c, 1); + 14 } + 15 + 16 void thread2(void) + 17 { + 18 rcu_read_lock(); + 19 WRITE_ONCE(d, 1); + 20 r2 = READ_ONCE(c); + 21 rcu_read_unlock(); + 22 } + 23 + 24 void thread3(void) + 25 { + 26 r3 = READ_ONCE(d); + 27 synchronize_rcu(); + 28 WRITE_ONCE(e, 1); + 29 } + 30 + 31 void thread4(void) + 32 { + 33 rcu_read_lock(); + 34 r4 = READ_ONCE(b); + 35 r5 = READ_ONCE(e); + 36 rcu_read_unlock(); + 37 } + +In this case, the outcome: + + :: + + (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1) + +is entirely possible, as illustrated below: + +.. kernel-figure:: ReadersPartitionGP1.svg + +Again, an RCU read-side critical section can overlap almost all of a +given grace period, just so long as it does not overlap the entire grace +period. As a result, an RCU read-side critical section cannot partition +a pair of RCU grace periods. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How long a sequence of grace periods, each separated by an RCU | +| read-side critical section, would be required to partition the RCU | +| read-side critical sections at the beginning and end of the chain? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| In theory, an infinite number. In practice, an unknown number that is | +| sensitive to both implementation details and timing considerations. | +| Therefore, even in practice, RCU users must abide by the theoretical | +| rather than the practical answer. | ++-----------------------------------------------------------------------+ + +Parallelism Facts of Life +------------------------- + +These parallelism facts of life are by no means specific to RCU, but the +RCU implementation must abide by them. They therefore bear repeating: + +#. Any CPU or task may be delayed at any time, and any attempts to avoid + these delays by disabling preemption, interrupts, or whatever are + completely futile. This is most obvious in preemptible user-level + environments and in virtualized environments (where a given guest + OS's VCPUs can be preempted at any time by the underlying + hypervisor), but can also happen in bare-metal environments due to + ECC errors, NMIs, and other hardware events. Although a delay of more + than about 20 seconds can result in splats, the RCU implementation is + obligated to use algorithms that can tolerate extremely long delays, + but where “extremely long” is not long enough to allow wrap-around + when incrementing a 64-bit counter. +#. Both the compiler and the CPU can reorder memory accesses. Where it + matters, RCU must use compiler directives and memory-barrier + instructions to preserve ordering. +#. Conflicting writes to memory locations in any given cache line will + result in expensive cache misses. Greater numbers of concurrent + writes and more-frequent concurrent writes will result in more + dramatic slowdowns. RCU is therefore obligated to use algorithms that + have sufficient locality to avoid significant performance and + scalability problems. +#. As a rough rule of thumb, only one CPU's worth of processing may be + carried out under the protection of any given exclusive lock. RCU + must therefore use scalable locking designs. +#. Counters are finite, especially on 32-bit systems. RCU's use of + counters must therefore tolerate counter wrap, or be designed such + that counter wrap would take way more time than a single system is + likely to run. An uptime of ten years is quite possible, a runtime of + a century much less so. As an example of the latter, RCU's + dyntick-idle nesting counter allows 54 bits for interrupt nesting + level (this counter is 64 bits even on a 32-bit system). Overflowing + this counter requires 2\ :sup:`54` half-interrupts on a given CPU + without that CPU ever going idle. If a half-interrupt happened every + microsecond, it would take 570 years of runtime to overflow this + counter, which is currently believed to be an acceptably long time. +#. Linux systems can have thousands of CPUs running a single Linux + kernel in a single shared-memory environment. RCU must therefore pay + close attention to high-end scalability. + +This last parallelism fact of life means that RCU must pay special +attention to the preceding facts of life. The idea that Linux might +scale to systems with thousands of CPUs would have been met with some +skepticism in the 1990s, but these requirements would have otherwise +have been unsurprising, even in the early 1990s. + +Quality-of-Implementation Requirements +-------------------------------------- + +These sections list quality-of-implementation requirements. Although an +RCU implementation that ignores these requirements could still be used, +it would likely be subject to limitations that would make it +inappropriate for industrial-strength production use. Classes of +quality-of-implementation requirements are as follows: + +#. `Specialization <#Specialization>`__ +#. `Performance and Scalability <#Performance%20and%20Scalability>`__ +#. `Forward Progress <#Forward%20Progress>`__ +#. `Composability <#Composability>`__ +#. `Corner Cases <#Corner%20Cases>`__ + +These classes is covered in the following sections. + +Specialization +~~~~~~~~~~~~~~ + +RCU is and always has been intended primarily for read-mostly +situations, which means that RCU's read-side primitives are optimized, +often at the expense of its update-side primitives. Experience thus far +is captured by the following list of situations: + +#. Read-mostly data, where stale and inconsistent data is not a problem: + RCU works great! +#. Read-mostly data, where data must be consistent: RCU works well. +#. Read-write data, where data must be consistent: RCU *might* work OK. + Or not. +#. Write-mostly data, where data must be consistent: RCU is very + unlikely to be the right tool for the job, with the following + exceptions, where RCU can provide: + + a. Existence guarantees for update-friendly mechanisms. + b. Wait-free read-side primitives for real-time use. + +This focus on read-mostly situations means that RCU must interoperate +with other synchronization primitives. For example, the ``add_gp()`` and +``remove_gp_synchronous()`` examples discussed earlier use RCU to +protect readers and locking to coordinate updaters. However, the need +extends much farther, requiring that a variety of synchronization +primitives be legal within RCU read-side critical sections, including +spinlocks, sequence locks, atomic operations, reference counters, and +memory barriers. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| What about sleeping locks? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| These are forbidden within Linux-kernel RCU read-side critical | +| sections because it is not legal to place a quiescent state (in this | +| case, voluntary context switch) within an RCU read-side critical | +| section. However, sleeping locks may be used within userspace RCU | +| read-side critical sections, and also within Linux-kernel sleepable | +| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In | +| addition, the -rt patchset turns spinlocks into a sleeping locks so | +| that the corresponding critical sections can be preempted, which also | +| means that these sleeplockified spinlocks (but not other sleeping | +| locks!) may be acquire within -rt-Linux-kernel RCU read-side critical | +| sections. | +| Note that it *is* legal for a normal RCU read-side critical section | +| to conditionally acquire a sleeping locks (as in | +| ``mutex_trylock()``), but only as long as it does not loop | +| indefinitely attempting to conditionally acquire that sleeping locks. | +| The key point is that things like ``mutex_trylock()`` either return | +| with the mutex held, or return an error indication if the mutex was | +| not immediately available. Either way, ``mutex_trylock()`` returns | +| immediately without sleeping. | ++-----------------------------------------------------------------------+ + +It often comes as a surprise that many algorithms do not require a +consistent view of data, but many can function in that mode, with +network routing being the poster child. Internet routing algorithms take +significant time to propagate updates, so that by the time an update +arrives at a given system, that system has been sending network traffic +the wrong way for a considerable length of time. Having a few threads +continue to send traffic the wrong way for a few more milliseconds is +clearly not a problem: In the worst case, TCP retransmissions will +eventually get the data where it needs to go. In general, when tracking +the state of the universe outside of the computer, some level of +inconsistency must be tolerated due to speed-of-light delays if nothing +else. + +Furthermore, uncertainty about external state is inherent in many cases. +For example, a pair of veterinarians might use heartbeat to determine +whether or not a given cat was alive. But how long should they wait +after the last heartbeat to decide that the cat is in fact dead? Waiting +less than 400 milliseconds makes no sense because this would mean that a +relaxed cat would be considered to cycle between death and life more +than 100 times per minute. Moreover, just as with human beings, a cat's +heart might stop for some period of time, so the exact wait period is a +judgment call. One of our pair of veterinarians might wait 30 seconds +before pronouncing the cat dead, while the other might insist on waiting +a full minute. The two veterinarians would then disagree on the state of +the cat during the final 30 seconds of the minute following the last +heartbeat. + +Interestingly enough, this same situation applies to hardware. When push +comes to shove, how do we tell whether or not some external server has +failed? We send messages to it periodically, and declare it failed if we +don't receive a response within a given period of time. Policy decisions +can usually tolerate short periods of inconsistency. The policy was +decided some time ago, and is only now being put into effect, so a few +milliseconds of delay is normally inconsequential. + +However, there are algorithms that absolutely must see consistent data. +For example, the translation between a user-level SystemV semaphore ID +to the corresponding in-kernel data structure is protected by RCU, but +it is absolutely forbidden to update a semaphore that has just been +removed. In the Linux kernel, this need for consistency is accommodated +by acquiring spinlocks located in the in-kernel data structure from +within the RCU read-side critical section, and this is indicated by the +green box in the figure above. Many other techniques may be used, and +are in fact used within the Linux kernel. + +In short, RCU is not required to maintain consistency, and other +mechanisms may be used in concert with RCU when consistency is required. +RCU's specialization allows it to do its job extremely well, and its +ability to interoperate with other synchronization mechanisms allows the +right mix of synchronization tools to be used for a given job. + +Performance and Scalability +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Energy efficiency is a critical component of performance today, and +Linux-kernel RCU implementations must therefore avoid unnecessarily +awakening idle CPUs. I cannot claim that this requirement was +premeditated. In fact, I learned of it during a telephone conversation +in which I was given “frank and open” feedback on the importance of +energy efficiency in battery-powered systems and on specific +energy-efficiency shortcomings of the Linux-kernel RCU implementation. +In my experience, the battery-powered embedded community will consider +any unnecessary wakeups to be extremely unfriendly acts. So much so that +mere Linux-kernel-mailing-list posts are insufficient to vent their ire. + +Memory consumption is not particularly important for in most situations, +and has become decreasingly so as memory sizes have expanded and memory +costs have plummeted. However, as I learned from Matt Mackall's +`bloatwatch `__ efforts, memory +footprint is critically important on single-CPU systems with +non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny +RCU `__ +was born. Josh Triplett has since taken over the small-memory banner +with his `Linux kernel tinification `__ +project, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional +for those kernels not needing it. + +The remaining performance requirements are, for the most part, +unsurprising. For example, in keeping with RCU's read-side +specialization, ``rcu_dereference()`` should have negligible overhead +(for example, suppression of a few minor compiler optimizations). +Similarly, in non-preemptible environments, ``rcu_read_lock()`` and +``rcu_read_unlock()`` should have exactly zero overhead. + +In preemptible environments, in the case where the RCU read-side +critical section was not preempted (as will be the case for the +highest-priority real-time process), ``rcu_read_lock()`` and +``rcu_read_unlock()`` should have minimal overhead. In particular, they +should not contain atomic read-modify-write operations, memory-barrier +instructions, preemption disabling, interrupt disabling, or backwards +branches. However, in the case where the RCU read-side critical section +was preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable +interrupts. This is why it is better to nest an RCU read-side critical +section within a preempt-disable region than vice versa, at least in +cases where that critical section is short enough to avoid unduly +degrading real-time latencies. + +The ``synchronize_rcu()`` grace-period-wait primitive is optimized for +throughput. It may therefore incur several milliseconds of latency in +addition to the duration of the longest RCU read-side critical section. +On the other hand, multiple concurrent invocations of +``synchronize_rcu()`` are required to use batching optimizations so that +they can be satisfied by a single underlying grace-period-wait +operation. For example, in the Linux kernel, it is not unusual for a +single grace-period-wait operation to serve more than `1,000 separate +invocations `__ +of ``synchronize_rcu()``, thus amortizing the per-invocation overhead +down to nearly zero. However, the grace-period optimization is also +required to avoid measurable degradation of real-time scheduling and +interrupt latencies. + +In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are +unacceptable. In these cases, ``synchronize_rcu_expedited()`` may be +used instead, reducing the grace-period latency down to a few tens of +microseconds on small systems, at least in cases where the RCU read-side +critical sections are short. There are currently no special latency +requirements for ``synchronize_rcu_expedited()`` on large systems, but, +consistent with the empirical nature of the RCU specification, that is +subject to change. However, there most definitely are scalability +requirements: A storm of ``synchronize_rcu_expedited()`` invocations on +4096 CPUs should at least make reasonable forward progress. In return +for its shorter latencies, ``synchronize_rcu_expedited()`` is permitted +to impose modest degradation of real-time latency on non-idle online +CPUs. Here, “modest” means roughly the same latency degradation as a +scheduling-clock interrupt. + +There are a number of situations where even +``synchronize_rcu_expedited()``'s reduced grace-period latency is +unacceptable. In these situations, the asynchronous ``call_rcu()`` can +be used in place of ``synchronize_rcu()`` as follows: + + :: + + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 static void remove_gp_cb(struct rcu_head *rhp) + 8 { + 9 struct foo *p = container_of(rhp, struct foo, rh); + 10 + 11 kfree(p); + 12 } + 13 + 14 bool remove_gp_asynchronous(void) + 15 { + 16 struct foo *p; + 17 + 18 spin_lock(&gp_lock); + 19 p = rcu_access_pointer(gp); + 20 if (!p) { + 21 spin_unlock(&gp_lock); + 22 return false; + 23 } + 24 rcu_assign_pointer(gp, NULL); + 25 call_rcu(&p->rh, remove_gp_cb); + 26 spin_unlock(&gp_lock); + 27 return true; + 28 } + +A definition of ``struct foo`` is finally needed, and appears on +lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()`` +on line 25, and will be invoked after the end of a subsequent grace +period. This gets the same effect as ``remove_gp_synchronous()``, but +without forcing the updater to wait for a grace period to elapse. The +``call_rcu()`` function may be used in a number of situations where +neither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would +be legal, including within preempt-disable code, ``local_bh_disable()`` +code, interrupt-disable code, and interrupt handlers. However, even +``call_rcu()`` is illegal within NMI handlers and from idle and offline +CPUs. The callback function (``remove_gp_cb()`` in this case) will be +executed within softirq (software interrupt) environment within the +Linux kernel, either within a real softirq handler or under the +protection of ``local_bh_disable()``. In both the Linux kernel and in +userspace, it is bad practice to write an RCU callback function that +takes too long. Long-running operations should be relegated to separate +threads or (in the Linux kernel) workqueues. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why does line 19 use ``rcu_access_pointer()``? After all, | +| ``call_rcu()`` on line 25 stores into the structure, which would | +| interact badly with concurrent insertions. Doesn't this mean that | +| ``rcu_dereference()`` is required? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Presumably the ``->gp_lock`` acquired on line 18 excludes any | +| changes, including any insertions that ``rcu_dereference()`` would | +| protect against. Therefore, any insertions will be delayed until | +| after ``->gp_lock`` is released on line 25, which in turn means that | +| ``rcu_access_pointer()`` suffices. | ++-----------------------------------------------------------------------+ + +However, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on +the data element. This is a common idiom, and is supported by +``kfree_rcu()``, which allows “fire and forget” operation as shown +below: + + :: + + 1 struct foo { + 2 int a; + 3 int b; + 4 struct rcu_head rh; + 5 }; + 6 + 7 bool remove_gp_faf(void) + 8 { + 9 struct foo *p; + 10 + 11 spin_lock(&gp_lock); + 12 p = rcu_dereference(gp); + 13 if (!p) { + 14 spin_unlock(&gp_lock); + 15 return false; + 16 } + 17 rcu_assign_pointer(gp, NULL); + 18 kfree_rcu(p, rh); + 19 spin_unlock(&gp_lock); + 20 return true; + 21 } + +Note that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and +proceeds, without any need to pay any further attention to the +subsequent grace period and ``kfree()``. It is permissible to invoke +``kfree_rcu()`` from the same environments as for ``call_rcu()``. +Interestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()`` +and ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the +fact that RCU was not heavily used within DYNIX/ptx, so the very few +places that needed something like ``synchronize_rcu()`` simply +open-coded it. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()`` | +| allowed updaters to avoid being blocked by readers. But how can that | +| be correct, given that the invocation of the callback and the freeing | +| of the memory (respectively) must still wait for a grace period to | +| elapse? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| We could define things this way, but keep in mind that this sort of | +| definition would say that updates in garbage-collected languages | +| cannot complete until the next time the garbage collector runs, which | +| does not seem at all reasonable. The key point is that in most cases, | +| an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed | +| to the next update as soon as it has invoked ``call_rcu()`` or | +| ``kfree_rcu()``, without having to wait for a subsequent grace | +| period. | ++-----------------------------------------------------------------------+ + +But what if the updater must wait for the completion of code to be +executed after the end of the grace period, but has other tasks that can +be carried out in the meantime? The polling-style +``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions +may be used for this purpose, as shown below: + + :: + + 1 bool remove_gp_poll(void) + 2 { + 3 struct foo *p; + 4 unsigned long s; + 5 + 6 spin_lock(&gp_lock); + 7 p = rcu_access_pointer(gp); + 8 if (!p) { + 9 spin_unlock(&gp_lock); + 10 return false; + 11 } + 12 rcu_assign_pointer(gp, NULL); + 13 spin_unlock(&gp_lock); + 14 s = get_state_synchronize_rcu(); + 15 do_something_while_waiting(); + 16 cond_synchronize_rcu(s); + 17 kfree(p); + 18 return true; + 19 } + +On line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU, +then line 15 carries out other tasks, and finally, line 16 returns +immediately if a grace period has elapsed in the meantime, but otherwise +waits as required. The need for ``get_state_synchronize_rcu`` and +``cond_synchronize_rcu()`` has appeared quite recently, so it is too +early to tell whether they will stand the test of time. + +RCU thus provides a range of tools to allow updaters to strike the +required tradeoff between latency, flexibility and CPU overhead. + +Forward Progress +~~~~~~~~~~~~~~~~ + +In theory, delaying grace-period completion and callback invocation is +harmless. In practice, not only are memory sizes finite but also +callbacks sometimes do wakeups, and sufficiently deferred wakeups can be +difficult to distinguish from system hangs. Therefore, RCU must provide +a number of mechanisms to promote forward progress. + +These mechanisms are not foolproof, nor can they be. For one simple +example, an infinite loop in an RCU read-side critical section must by +definition prevent later grace periods from ever completing. For a more +involved example, consider a 64-CPU system built with +``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where +CPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even +if these tight loops also contain calls to ``cond_resched()`` (thus +allowing grace periods to complete), CPU 0 simply will not be able to +invoke callbacks as fast as the other 63 CPUs can register them, at +least not until the system runs out of memory. In both of these +examples, the Spiderman principle applies: With great power comes great +responsibility. However, short of this level of abuse, RCU is required +to ensure timely completion of grace periods and timely invocation of +callbacks. + +RCU takes the following steps to encourage timely completion of grace +periods: + +#. If a grace period fails to complete within 100 milliseconds, RCU + causes future invocations of ``cond_resched()`` on the holdout CPUs + to provide an RCU quiescent state. RCU also causes those CPUs' + ``need_resched()`` invocations to return ``true``, but only after the + corresponding CPU's next scheduling-clock. +#. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run + indefinitely in the kernel without scheduling-clock interrupts, which + defeats the above ``need_resched()`` strategem. RCU will therefore + invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out + after 109 milliseconds. +#. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that + has been preempted within an RCU read-side critical section is + holding out for more than 500 milliseconds, RCU will resort to + priority boosting. +#. If a CPU is still holding out 10 seconds into the grace period, RCU + will invoke ``resched_cpu()`` on it regardless of its ``nohz_full`` + state. + +The above values are defaults for systems running with ``HZ=1000``. They +will vary as the value of ``HZ`` varies, and can also be changed using +the relevant Kconfig options and kernel boot parameters. RCU currently +does not do much sanity checking of these parameters, so please use +caution when changing them. Note that these forward-progress measures +are provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks +RCU <#Tasks%20RCU>`__. + +RCU takes the following steps in ``call_rcu()`` to encourage timely +invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has +10,000 callbacks, or has 10,000 more callbacks than it had the last time +encouragement was provided: + +#. Starts a grace period, if one is not already in progress. +#. Forces immediate checking for quiescent states, rather than waiting + for three milliseconds to have elapsed since the beginning of the + grace period. +#. Immediately tags the CPU's callbacks with their grace period + completion numbers, rather than waiting for the ``RCU_SOFTIRQ`` + handler to get around to it. +#. Lifts callback-execution batch limits, which speeds up callback + invocation at the expense of degrading realtime response. + +Again, these are default values when running at ``HZ=1000``, and can be +overridden. Again, these forward-progress measures are provided only for +RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks +RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward +progress for ``rcu_nocbs`` CPUs is much less well-developed, in part +because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke +``call_rcu()`` relatively infrequently. If workloads emerge that need +both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then +additional forward-progress work will be required. + +Composability +~~~~~~~~~~~~~ + +Composability has received much attention in recent years, perhaps in +part due to the collision of multicore hardware with object-oriented +techniques designed in single-threaded environments for single-threaded +use. And in theory, RCU read-side critical sections may be composed, and +in fact may be nested arbitrarily deeply. In practice, as with all +real-world implementations of composable constructs, there are +limitations. + +Implementations of RCU for which ``rcu_read_lock()`` and +``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when +``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there +is no overhead. Except that if all these instances of +``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the +compiler, compilation will eventually fail due to exhausting memory, +mass storage, or user patience, whichever comes first. If the nesting is +not visible to the compiler, as is the case with mutually recursive +functions each in its own translation unit, stack overflow will result. +If the nesting takes the form of loops, perhaps in the guise of tail +recursion, either the control variable will overflow or (in the Linux +kernel) you will get an RCU CPU stall warning. Nevertheless, this class +of RCU implementations is one of the most composable constructs in +existence. + +RCU implementations that explicitly track nesting depth are limited by +the nesting-depth counter. For example, the Linux kernel's preemptible +RCU limits nesting to ``INT_MAX``. This should suffice for almost all +practical purposes. That said, a consecutive pair of RCU read-side +critical sections between which there is an operation that waits for a +grace period cannot be enclosed in another RCU read-side critical +section. This is because it is not legal to wait for a grace period +within an RCU read-side critical section: To do so would result either +in deadlock or in RCU implicitly splitting the enclosing RCU read-side +critical section, neither of which is conducive to a long-lived and +prosperous kernel. + +It is worth noting that RCU is not alone in limiting composability. For +example, many transactional-memory implementations prohibit composing a +pair of transactions separated by an irrevocable operation (for example, +a network receive operation). For another example, lock-based critical +sections can be composed surprisingly freely, but only if deadlock is +avoided. + +In short, although RCU read-side critical sections are highly +composable, care is required in some situations, just as is the case for +any other composable synchronization mechanism. + +Corner Cases +~~~~~~~~~~~~ + +A given RCU workload might have an endless and intense stream of RCU +read-side critical sections, perhaps even so intense that there was +never a point in time during which there was not at least one RCU +read-side critical section in flight. RCU cannot allow this situation to +block grace periods: As long as all the RCU read-side critical sections +are finite, grace periods must also be finite. + +That said, preemptible RCU implementations could potentially result in +RCU read-side critical sections being preempted for long durations, +which has the effect of creating a long-duration RCU read-side critical +section. This situation can arise only in heavily loaded systems, but +systems using real-time priorities are of course more vulnerable. +Therefore, RCU priority boosting is provided to help deal with this +case. That said, the exact requirements on RCU priority boosting will +likely evolve as more experience accumulates. + +Other workloads might have very high update rates. Although one can +argue that such workloads should instead use something other than RCU, +the fact remains that RCU must handle such workloads gracefully. This +requirement is another factor driving batching of grace periods, but it +is also the driving force behind the checks for large numbers of queued +RCU callbacks in the ``call_rcu()`` code path. Finally, high update +rates should not delay RCU read-side critical sections, although some +small read-side delays can occur when using +``synchronize_rcu_expedited()``, courtesy of this function's use of +``smp_call_function_single()``. + +Although all three of these corner cases were understood in the early +1990s, a simple user-level test consisting of ``close(open(path))`` in a +tight loop in the early 2000s suddenly provided a much deeper +appreciation of the high-update-rate corner case. This test also +motivated addition of some RCU code to react to high update rates, for +example, if a given CPU finds itself with more than 10,000 RCU callbacks +queued, it will cause RCU to take evasive action by more aggressively +starting grace periods and more aggressively forcing completion of +grace-period processing. This evasive action causes the grace period to +complete more quickly, but at the cost of restricting RCU's batching +optimizations, thus increasing the CPU overhead incurred by that grace +period. + +Software-Engineering Requirements +--------------------------------- + +Between Murphy's Law and “To err is human”, it is necessary to guard +against mishaps and misuse: + +#. It is all too easy to forget to use ``rcu_read_lock()`` everywhere + that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will + splat if ``rcu_dereference()`` is used outside of an RCU read-side + critical section. Update-side code can use + ``rcu_dereference_protected()``, which takes a `lockdep + expression `__ to indicate what is + providing the protection. If the indicated protection is not + provided, a lockdep splat is emitted. + Code shared between readers and updaters can use + ``rcu_dereference_check()``, which also takes a lockdep expression, + and emits a lockdep splat if neither ``rcu_read_lock()`` nor the + indicated protection is in place. In addition, + ``rcu_dereference_raw()`` is used in those (hopefully rare) cases + where the required protection cannot be easily described. Finally, + ``rcu_read_lock_held()`` is provided to allow a function to verify + that it has been invoked within an RCU read-side critical section. I + was made aware of this set of requirements shortly after Thomas + Gleixner audited a number of RCU uses. +#. A given function might wish to check for RCU-related preconditions + upon entry, before using any other RCU API. The + ``rcu_lockdep_assert()`` does this job, asserting the expression in + kernels having lockdep enabled and doing nothing otherwise. +#. It is also easy to forget to use ``rcu_assign_pointer()`` and + ``rcu_dereference()``, perhaps (incorrectly) substituting a simple + assignment. To catch this sort of error, a given RCU-protected + pointer may be tagged with ``__rcu``, after which sparse will + complain about simple-assignment accesses to that pointer. Arnd + Bergmann made me aware of this requirement, and also supplied the + needed `patch series `__. +#. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if + a data element is passed to ``call_rcu()`` twice in a row, without a + grace period in between. (This error is similar to a double free.) + The corresponding ``rcu_head`` structures that are dynamically + allocated are automatically tracked, but ``rcu_head`` structures + allocated on the stack must be initialized with + ``init_rcu_head_on_stack()`` and cleaned up with + ``destroy_rcu_head_on_stack()``. Similarly, statically allocated + non-stack ``rcu_head`` structures must be initialized with + ``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``. + Mathieu Desnoyers made me aware of this requirement, and also + supplied the needed + `patch `__. +#. An infinite loop in an RCU read-side critical section will eventually + trigger an RCU CPU stall warning splat, with the duration of + “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT`` + ``Kconfig`` option, or, alternatively, by the + ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU + is not obligated to produce this splat unless there is a grace period + waiting on that particular RCU read-side critical section. + + Some extreme workloads might intentionally delay RCU grace periods, + and systems running those workloads can be booted with + ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This + kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU + stall warnings are counter-productive during sysrq dumps and during + panics. RCU therefore supplies the ``rcu_sysrq_start()`` and + ``rcu_sysrq_end()`` API members to be called before and after long + sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is + automatically invoked at the beginning of a panic to suppress further + RCU CPU stall warnings. + + This requirement made itself known in the early 1990s, pretty much + the first time that it was necessary to debug a CPU stall. That said, + the initial implementation in DYNIX/ptx was quite generic in + comparison with that of Linux. + +#. Although it would be very good to detect pointers leaking out of RCU + read-side critical sections, there is currently no good way of doing + this. One complication is the need to distinguish between pointers + leaking and pointers that have been handed off from RCU to some other + synchronization mechanism, for example, reference counting. +#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information + is provided via event tracing. +#. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()`` + to create typical linked data structures can be surprisingly + error-prone. Therefore, RCU-protected `linked + lists `__ and, + more recently, RCU-protected `hash + tables `__ are available. Many + other special-purpose RCU-protected data structures are available in + the Linux kernel and the userspace RCU library. +#. Some linked structures are created at compile time, but still require + ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves + this purpose. +#. It is not necessary to use ``rcu_assign_pointer()`` when creating + linked structures that are to be published via a single external + pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task + and also for assigning ``NULL`` pointers at runtime. + +This not a hard-and-fast list: RCU's diagnostic capabilities will +continue to be guided by the number and type of usage bugs found in +real-world RCU usage. + +Linux Kernel Complications +-------------------------- + +The Linux kernel provides an interesting environment for all kinds of +software, including RCU. Some of the relevant points of interest are as +follows: + +#. `Configuration <#Configuration>`__. +#. `Firmware Interface <#Firmware%20Interface>`__. +#. `Early Boot <#Early%20Boot>`__. +#. `Interrupts and non-maskable interrupts + (NMIs) <#Interrupts%20and%20NMIs>`__. +#. `Loadable Modules <#Loadable%20Modules>`__. +#. `Hotplug CPU <#Hotplug%20CPU>`__. +#. `Scheduler and RCU <#Scheduler%20and%20RCU>`__. +#. `Tracing and RCU <#Tracing%20and%20RCU>`__. +#. `Energy Efficiency <#Energy%20Efficiency>`__. +#. `Scheduling-Clock Interrupts and + RCU <#Scheduling-Clock%20Interrupts%20and%20RCU>`__. +#. `Memory Efficiency <#Memory%20Efficiency>`__. +#. `Performance, Scalability, Response Time, and + Reliability <#Performance,%20Scalability,%20Response%20Time,%20and%20Reliability>`__. + +This list is probably incomplete, but it does give a feel for the most +notable Linux-kernel complications. Each of the following sections +covers one of the above topics. + +Configuration +~~~~~~~~~~~~~ + +RCU's goal is automatic configuration, so that almost nobody needs to +worry about RCU's ``Kconfig`` options. And for almost all users, RCU +does in fact work well “out of the box.” + +However, there are specialized use cases that are handled by kernel boot +parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig`` +system will explicitly ask users about new ``Kconfig`` options, which +requires almost all of them be hidden behind a ``CONFIG_RCU_EXPERT`` +``Kconfig`` option. + +This all should be quite obvious, but the fact remains that Linus +Torvalds recently had to +`remind `__ +me of this requirement. + +Firmware Interface +~~~~~~~~~~~~~~~~~~ + +In many cases, kernel obtains information about the system from the +firmware, and sometimes things are lost in translation. Or the +translation is accurate, but the original message is bogus. + +For example, some systems' firmware overreports the number of CPUs, +sometimes by a large factor. If RCU naively believed the firmware, as it +used to do, it would create too many per-CPU kthreads. Although the +resulting system will still run correctly, the extra kthreads needlessly +consume memory and can cause confusion when they show up in ``ps`` +listings. + +RCU must therefore wait for a given CPU to actually come online before +it can allow itself to believe that the CPU actually exists. The +resulting “ghost CPUs” (which are never going to come online) cause a +number of `interesting +complications `__. + +Early Boot +~~~~~~~~~~ + +The Linux kernel's boot sequence is an interesting process, and RCU is +used early, even before ``rcu_init()`` is invoked. In fact, a number of +RCU's primitives can be used as soon as the initial task's +``task_struct`` is available and the boot CPU's per-CPU variables are +set up. The read-side primitives (``rcu_read_lock()``, +``rcu_read_unlock()``, ``rcu_dereference()``, and +``rcu_access_pointer()``) will operate normally very early on, as will +``rcu_assign_pointer()``. + +Although ``call_rcu()`` may be invoked at any time during boot, +callbacks are not guaranteed to be invoked until after all of RCU's +kthreads have been spawned, which occurs at ``early_initcall()`` time. +This delay in callback invocation is due to the fact that RCU does not +invoke callbacks until it is fully initialized, and this full +initialization cannot occur until after the scheduler has initialized +itself to the point where RCU can spawn and run its kthreads. In theory, +it would be possible to invoke callbacks earlier, however, this is not a +panacea because there would be severe restrictions on what operations +those callbacks could invoke. + +Perhaps surprisingly, ``synchronize_rcu()`` and +``synchronize_rcu_expedited()``, will operate normally during very early +boot, the reason being that there is only one CPU and preemption is +disabled. This means that the call ``synchronize_rcu()`` (or friends) +itself is a quiescent state and thus a grace period, so the early-boot +implementation can be a no-op. + +However, once the scheduler has spawned its first kthread, this early +boot trick fails for ``synchronize_rcu()`` (as well as for +``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The +reason is that an RCU read-side critical section might be preempted, +which means that a subsequent ``synchronize_rcu()`` really does have to +wait for something, as opposed to simply returning immediately. +Unfortunately, ``synchronize_rcu()`` can't do this until all of its +kthreads are spawned, which doesn't happen until some time during +``early_initcalls()`` time. But this is no excuse: RCU is nevertheless +required to correctly handle synchronous grace periods during this time +period. Once all of its kthreads are up and running, RCU starts running +normally. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| How can RCU possibly handle grace periods before all of its kthreads | +| have been spawned??? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Very carefully! | +| During the “dead zone” between the time that the scheduler spawns the | +| first task and the time that all of RCU's kthreads have been spawned, | +| all synchronous grace periods are handled by the expedited | +| grace-period mechanism. At runtime, this expedited mechanism relies | +| on workqueues, but during the dead zone the requesting task itself | +| drives the desired expedited grace period. Because dead-zone | +| execution takes place within task context, everything works. Once the | +| dead zone ends, expedited grace periods go back to using workqueues, | +| as is required to avoid problems that would otherwise occur when a | +| user task received a POSIX signal while driving an expedited grace | +| period. | +| | +| And yes, this does mean that it is unhelpful to send POSIX signals to | +| random tasks between the time that the scheduler spawns its first | +| kthread and the time that RCU's kthreads have all been spawned. If | +| there ever turns out to be a good reason for sending POSIX signals | +| during that time, appropriate adjustments will be made. (If it turns | +| out that POSIX signals are sent during this time for no good reason, | +| other adjustments will be made, appropriate or otherwise.) | ++-----------------------------------------------------------------------+ + +I learned of these boot-time requirements as a result of a series of +system hangs. + +Interrupts and NMIs +~~~~~~~~~~~~~~~~~~~ + +The Linux kernel has interrupts, and RCU read-side critical sections are +legal within interrupt handlers and within interrupt-disabled regions of +code, as are invocations of ``call_rcu()``. + +Some Linux-kernel architectures can enter an interrupt handler from +non-idle process context, and then just never leave it, instead +stealthily transitioning back to process context. This trick is +sometimes used to invoke system calls from inside the kernel. These +“half-interrupts” mean that RCU has to be very careful about how it +counts interrupt nesting levels. I learned of this requirement the hard +way during a rewrite of RCU's dyntick-idle code. + +The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side +critical sections are legal within NMI handlers. Thankfully, RCU +update-side primitives, including ``call_rcu()``, are prohibited within +NMI handlers. + +The name notwithstanding, some Linux-kernel architectures can have +nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised +me `__ +with this requirement; he also kindly surprised me with `an +algorithm `__ +that meets this requirement. + +Furthermore, NMI handlers can be interrupted by what appear to RCU to be +normal interrupts. One way that this can happen is for code that +directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called +from an NMI handler. This astonishing fact of life prompted the current +code structure, which has ``rcu_irq_enter()`` invoking +``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``. +And yes, I also learned of this requirement the hard way. + +Loadable Modules +~~~~~~~~~~~~~~~~ + +The Linux kernel has loadable modules, and these modules can also be +unloaded. After a given module has been unloaded, any attempt to call +one of its functions results in a segmentation fault. The module-unload +functions must therefore cancel any delayed calls to loadable-module +functions, for example, any outstanding ``mod_timer()`` must be dealt +with via ``del_timer_sync()`` or similar. + +Unfortunately, there is no way to cancel an RCU callback; once you +invoke ``call_rcu()``, the callback function is eventually going to be +invoked, unless the system goes down first. Because it is normally +considered socially irresponsible to crash the system in response to a +module unload request, we need some other way to deal with in-flight RCU +callbacks. + +RCU therefore provides ``rcu_barrier()``, which waits until all +in-flight RCU callbacks have been invoked. If a module uses +``call_rcu()``, its exit function should therefore prevent any future +invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory, +the underlying module-unload code could invoke ``rcu_barrier()`` +unconditionally, but in practice this would incur unacceptable +latencies. + +Nikita Danilov noted this requirement for an analogous +filesystem-unmount situation, and Dipankar Sarma incorporated +``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module +unloading became apparent later. + +.. important:: + + The ``rcu_barrier()`` function is not, repeat, + *not*, obligated to wait for a grace period. It is instead only required + to wait for RCU callbacks that have already been posted. Therefore, if + there are no RCU callbacks posted anywhere in the system, + ``rcu_barrier()`` is within its rights to return immediately. Even if + there are callbacks posted, ``rcu_barrier()`` does not necessarily need + to wait for a grace period. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Wait a minute! Each RCU callbacks must wait for a grace period to | +| complete, and ``rcu_barrier()`` must wait for each pre-existing | +| callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to | +| wait for a full grace period if there is even one callback posted | +| anywhere in the system? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Absolutely not!!! | +| Yes, each RCU callbacks must wait for a grace period to complete, but | +| it might well be partly (or even completely) finished waiting by the | +| time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()`` | +| need only wait for the remaining portion of the grace period to | +| elapse. So even if there are quite a few callbacks posted, | +| ``rcu_barrier()`` might well return quite quickly. | +| | +| So if you need to wait for a grace period as well as for all | +| pre-existing callbacks, you will need to invoke both | +| ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, | +| you can always use workqueues to invoke them concurrently. | ++-----------------------------------------------------------------------+ + +Hotplug CPU +~~~~~~~~~~~ + +The Linux kernel supports CPU hotplug, which means that CPUs can come +and go. It is of course illegal to use any RCU API member from an +offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side +critical sections. This requirement was present from day one in +DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug +implementation is “interesting.” + +The Linux-kernel CPU-hotplug implementation has notifiers that are used +to allow the various kernel subsystems (including RCU) to respond +appropriately to a given CPU-hotplug operation. Most RCU operations may +be invoked from CPU-hotplug notifiers, including even synchronous +grace-period operations such as ``synchronize_rcu()`` and +``synchronize_rcu_expedited()``. + +However, all-callback-wait operations such as ``rcu_barrier()`` are also +not supported, due to the fact that there are phases of CPU-hotplug +operations where the outgoing CPU's callbacks will not be invoked until +after the CPU-hotplug operation ends, which could also result in +deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations +during its execution, which results in another type of deadlock when +invoked from a CPU-hotplug notifier. + +Scheduler and RCU +~~~~~~~~~~~~~~~~~ + +RCU depends on the scheduler, and the scheduler uses RCU to protect some +of its data structures. The preemptible-RCU ``rcu_read_unlock()`` +implementation must therefore be written carefully to avoid deadlocks +involving the scheduler's runqueue and priority-inheritance locks. In +particular, ``rcu_read_unlock()`` must tolerate an interrupt where the +interrupt handler invokes both ``rcu_read_lock()`` and +``rcu_read_unlock()``. This possibility requires ``rcu_read_unlock()`` +to use negative nesting levels to avoid destructive recursion via +interrupt handler's use of RCU. + +This scheduler-RCU requirement came as a `complete +surprise `__. + +As noted above, RCU makes use of kthreads, and it is necessary to avoid +excessive CPU-time accumulation by these kthreads. This requirement was +no surprise, but RCU's violation of it when running context-switch-heavy +workloads when built with ``CONFIG_NO_HZ_FULL=y`` `did come as a +surprise +[PDF] `__. +RCU has made good progress towards meeting this requirement, even for +context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is +room for further improvement. + +It is forbidden to hold any of scheduler's runqueue or +priority-inheritance spinlocks across an ``rcu_read_unlock()`` unless +interrupts have been disabled across the entire RCU read-side critical +section, that is, up to and including the matching ``rcu_read_lock()``. +Violating this restriction can result in deadlocks involving these +scheduler spinlocks. There was hope that this restriction might be +lifted when interrupt-disabled calls to ``rcu_read_unlock()`` started +deferring the reporting of the resulting RCU-preempt quiescent state +until the end of the corresponding interrupts-disabled region. +Unfortunately, timely reporting of the corresponding quiescent state to +expedited grace periods requires a call to ``raise_softirq()``, which +can acquire these scheduler spinlocks. In addition, real-time systems +using RCU priority boosting need this restriction to remain in effect +because deferred quiescent-state reporting would also defer deboosting, +which in turn would degrade real-time latencies. + +In theory, if a given RCU read-side critical section could be guaranteed +to be less than one second in duration, holding a scheduler spinlock +across that critical section's ``rcu_read_unlock()`` would require only +that preemption be disabled across the entire RCU read-side critical +section, not interrupts. Unfortunately, given the possibility of vCPU +preemption, long-running interrupts, and so on, it is not possible in +practice to guarantee that a given RCU read-side critical section will +complete in less than one second. Therefore, as noted above, if +scheduler spinlocks are held across a given call to +``rcu_read_unlock()``, interrupts must be disabled across the entire RCU +read-side critical section. + +Tracing and RCU +~~~~~~~~~~~~~~~ + +It is possible to use tracing on RCU code, but tracing itself uses RCU. +For this reason, ``rcu_dereference_raw_notrace()`` is provided for use +by tracing, which avoids the destructive recursion that could otherwise +ensue. This API is also used by virtualization in some architectures, +where RCU readers execute in environments in which tracing cannot be +used. The tracing folks both located the requirement and provided the +needed fix, so this surprise requirement was relatively painless. + +Energy Efficiency +~~~~~~~~~~~~~~~~~ + +Interrupting idle CPUs is considered socially unacceptable, especially +by people with battery-powered embedded systems. RCU therefore conserves +energy by detecting which CPUs are idle, including tracking CPUs that +have been interrupted from idle. This is a large part of the +energy-efficiency requirement, so I learned of this via an irate phone +call. + +Because RCU avoids interrupting idle CPUs, it is illegal to execute an +RCU read-side critical section on an idle CPU. (Kernels built with +``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()`` +macro and ``_rcuidle`` event tracing is provided to work around this +restriction. In addition, ``rcu_is_watching()`` may be used to test +whether or not it is currently legal to run RCU read-side critical +sections on this CPU. I learned of the need for diagnostics on the one +hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code. +Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite +heavily in the idle loop. However, there are some restrictions on the +code placed within ``RCU_NONIDLE()``: + +#. Blocking is prohibited. In practice, this is not a serious + restriction given that idle tasks are prohibited from blocking to + begin with. +#. Although nesting ``RCU_NONIDLE()`` is permitted, they cannot nest + indefinitely deeply. However, given that they can be nested on the + order of a million deep, even on 32-bit systems, this should not be a + serious restriction. This nesting limit would probably be reached + long after the compiler OOMed or the stack overflowed. +#. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that + same ``RCU_NONIDLE()``. For example, the following is grossly + illegal: + + :: + + 1 RCU_NONIDLE({ + 2 do_something(); + 3 goto bad_idea; /* BUG!!! */ + 4 do_something_else();}); + 5 bad_idea: + + + It is just as illegal to transfer control into the middle of + ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in + as long as you also transferred out, but in practice you could also + expect to get sharply worded review comments. + +It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU +running in userspace. RCU must therefore track ``nohz_full`` userspace +execution. RCU must therefore be able to sample state at two points in +time, and be able to determine whether or not some other CPU spent any +time idle and/or executing in userspace. + +These energy-efficiency requirements have proven quite difficult to +understand and to meet, for example, there have been more than five +clean-sheet rewrites of RCU's energy-efficiency code, the last of which +was finally able to demonstrate `real energy savings running on real +hardware +[PDF] `__. +As noted earlier, I learned of many of these requirements via angry +phone calls: Flaming me on the Linux-kernel mailing list was apparently +not sufficient to fully vent their ire at RCU's energy-efficiency bugs! + +Scheduling-Clock Interrupts and RCU +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The kernel transitions between in-kernel non-idle execution, userspace +execution, and the idle loop. Depending on kernel configuration, RCU +handles these states differently: + ++-----------------+------------------+------------------+-----------------+ +| ``HZ`` Kconfig | In-Kernel | Usermode | Idle | ++=================+==================+==================+=================+ +| ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on | +| | scheduling-clock | scheduling-clock | RCU's | +| | interrupt. | interrupt and | dyntick-idle | +| | | its detection | detection. | +| | | of interrupt | | +| | | from usermode. | | ++-----------------+------------------+------------------+-----------------+ +| ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on | +| | sometimes rely | RCU's | RCU's | +| | on | dyntick-idle | dyntick-idle | +| | scheduling-clock | detection. | detection. | +| | interrupt. In | | | +| | other cases, it | | | +| | is necessary to | | | +| | bound kernel | | | +| | execution times | | | +| | and/or use | | | +| | IPIs. | | | ++-----------------+------------------+------------------+-----------------+ + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| Why can't ``NO_HZ_FULL`` in-kernel execution rely on the | +| scheduling-clock interrupt, just like ``HZ_PERIODIC`` and | +| ``NO_HZ_IDLE`` do? | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| Because, as a performance optimization, ``NO_HZ_FULL`` does not | +| necessarily re-enable the scheduling-clock interrupt on entry to each | +| and every system call. | ++-----------------------------------------------------------------------+ + +However, RCU must be reliably informed as to whether any given CPU is +currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that +CPU is executing in usermode, as discussed +`earlier <#Energy%20Efficiency>`__. It also requires that the +scheduling-clock interrupt be enabled when RCU needs it to be: + +#. If a CPU is either idle or executing in usermode, and RCU believes it + is non-idle, the scheduling-clock tick had better be running. + Otherwise, you will get RCU CPU stall warnings. Or at best, very long + (11-second) grace periods, with a pointless IPI waking the CPU from + time to time. +#. If a CPU is in a portion of the kernel that executes RCU read-side + critical sections, and RCU believes this CPU to be idle, you will get + random memory corruption. **DON'T DO THIS!!!** + This is one reason to test with lockdep, which will complain about + this sort of thing. +#. If a CPU is in a portion of the kernel that is absolutely positively + no-joking guaranteed to never execute any RCU read-side critical + sections, and RCU believes this CPU to to be idle, no problem. This + sort of thing is used by some architectures for light-weight + exception handlers, which can then avoid the overhead of + ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and + exit, respectively. Some go further and avoid the entireties of + ``irq_enter()`` and ``irq_exit()``. + Just make very sure you are running some of your tests with + ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in + fact joking about not doing RCU read-side critical sections. +#. If a CPU is executing in the kernel with the scheduling-clock + interrupt disabled and RCU believes this CPU to be non-idle, and if + the CPU goes idle (from an RCU perspective) every few jiffies, no + problem. It is usually OK for there to be the occasional gap between + idle periods of up to a second or so. + If the gap grows too long, you get RCU CPU stall warnings. +#. If a CPU is either idle or executing in usermode, and RCU believes it + to be idle, of course no problem. +#. If a CPU is executing in the kernel, the kernel code path is passing + through quiescent states at a reasonable frequency (preferably about + once per few jiffies, but the occasional excursion to a second or so + is usually OK) and the scheduling-clock interrupt is enabled, of + course no problem. + If the gap between a successive pair of quiescent states grows too + long, you get RCU CPU stall warnings. + ++-----------------------------------------------------------------------+ +| **Quick Quiz**: | ++-----------------------------------------------------------------------+ +| But what if my driver has a hardware interrupt handler that can run | +| for many seconds? I cannot invoke ``schedule()`` from an hardware | +| interrupt handler, after all! | ++-----------------------------------------------------------------------+ +| **Answer**: | ++-----------------------------------------------------------------------+ +| One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so | +| often. But given that long-running interrupt handlers can cause other | +| problems, not least for response time, shouldn't you work to keep | +| your interrupt handler's runtime within reasonable bounds? | ++-----------------------------------------------------------------------+ + +But as long as RCU is properly informed of kernel state transitions +between in-kernel execution, usermode execution, and idle, and as long +as the scheduling-clock interrupt is enabled when RCU needs it to be, +you can rest assured that the bugs you encounter will be in some other +part of RCU or some other part of the kernel! + +Memory Efficiency +~~~~~~~~~~~~~~~~~ + +Although small-memory non-realtime systems can simply use Tiny RCU, code +size is only one aspect of memory efficiency. Another aspect is the size +of the ``rcu_head`` structure used by ``call_rcu()`` and +``kfree_rcu()``. Although this structure contains nothing more than a +pair of pointers, it does appear in many RCU-protected data structures, +including some that are size critical. The ``page`` structure is a case +in point, as evidenced by the many occurrences of the ``union`` keyword +within that structure. + +This need for memory efficiency is one reason that RCU uses hand-crafted +singly linked lists to track the ``rcu_head`` structures that are +waiting for a grace period to elapse. It is also the reason why +``rcu_head`` structures do not contain debug information, such as fields +tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that +posted them. Although this information might appear in debug-only kernel +builds at some point, in the meantime, the ``->func`` field will often +provide the needed debug information. + +However, in some cases, the need for memory efficiency leads to even +more extreme measures. Returning to the ``page`` structure, the +``rcu_head`` field shares storage with a great many other structures +that are used at various points in the corresponding page's lifetime. In +order to correctly resolve certain `race +conditions `__, +the Linux kernel's memory-management subsystem needs a particular bit to +remain zero during all phases of grace-period processing, and that bit +happens to map to the bottom bit of the ``rcu_head`` structure's +``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is +used to post the callback, as opposed to ``kfree_rcu()`` or some future +“lazy” variant of ``call_rcu()`` that might one day be created for +energy-efficiency purposes. + +That said, there are limits. RCU requires that the ``rcu_head`` +structure be aligned to a two-byte boundary, and passing a misaligned +``rcu_head`` structure to one of the ``call_rcu()`` family of functions +will result in a splat. It is therefore necessary to exercise caution +when packing structures containing fields of type ``rcu_head``. Why not +a four-byte or even eight-byte alignment requirement? Because the m68k +architecture provides only two-byte alignment, and thus acts as +alignment's least common denominator. + +The reason for reserving the bottom bit of pointers to ``rcu_head`` +structures is to leave the door open to “lazy” callbacks whose +invocations can safely be deferred. Deferring invocation could +potentially have energy-efficiency benefits, but only if the rate of +non-lazy callbacks decreases significantly for some important workload. +In the meantime, reserving the bottom bit keeps this option open in case +it one day becomes useful. + +Performance, Scalability, Response Time, and Reliability +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Expanding on the `earlier +discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by +hot code paths in performance-critical portions of the Linux kernel's +networking, security, virtualization, and scheduling code paths. RCU +must therefore use efficient implementations, especially in its +read-side primitives. To that end, it would be good if preemptible RCU's +implementation of ``rcu_read_lock()`` could be inlined, however, doing +this requires resolving ``#include`` issues with the ``task_struct`` +structure. + +The Linux kernel supports hardware configurations with up to 4096 CPUs, +which means that RCU must be extremely scalable. Algorithms that involve +frequent acquisitions of global locks or frequent atomic operations on +global variables simply cannot be tolerated within the RCU +implementation. RCU therefore makes heavy use of a combining tree based +on the ``rcu_node`` structure. RCU is required to tolerate all CPUs +continuously invoking any combination of RCU's runtime primitives with +minimal per-operation overhead. In fact, in many cases, increasing load +must *decrease* the per-operation overhead, witness the batching +optimizations for ``synchronize_rcu()``, ``call_rcu()``, +``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general +rule, RCU must cheerfully accept whatever the rest of the Linux kernel +decides to throw at it. + +The Linux kernel is used for real-time workloads, especially in +conjunction with the `-rt +patchset `__. The +real-time-latency response requirements are such that the traditional +approach of disabling preemption across RCU read-side critical sections +is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use +an RCU implementation that allows RCU read-side critical sections to be +preempted. This requirement made its presence known after users made it +clear that an earlier `real-time +patch `__ did not meet their needs, in +conjunction with some `RCU +issues `__ +encountered by a very early version of the -rt patchset. + +In addition, RCU must make do with a sub-100-microsecond real-time +latency budget. In fact, on smaller systems with the -rt patchset, the +Linux kernel provides sub-20-microsecond real-time latencies for the +whole kernel, including RCU. RCU's scalability and latency must +therefore be sufficient for these sorts of configurations. To my +surprise, the sub-100-microsecond real-time latency budget `applies to +even the largest systems +[PDF] `__, +up to and including systems with 4096 CPUs. This real-time requirement +motivated the grace-period kthread, which also simplified handling of a +number of race conditions. + +RCU must avoid degrading real-time response for CPU-bound threads, +whether executing in usermode (which is one use case for +``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in +the kernel must execute ``cond_resched()`` at least once per few tens of +milliseconds in order to avoid receiving an IPI from RCU. + +Finally, RCU's status as a synchronization primitive means that any RCU +failure can result in arbitrary memory corruption that can be extremely +difficult to debug. This means that RCU must be extremely reliable, +which in practice also means that RCU must have an aggressive +stress-test suite. This stress-test suite is called ``rcutorture``. + +Although the need for ``rcutorture`` was no surprise, the current +immense popularity of the Linux kernel is posing interesting—and perhaps +unprecedented—validation challenges. To see this, keep in mind that +there are well over one billion instances of the Linux kernel running +today, given Android smartphones, Linux-powered televisions, and +servers. This number can be expected to increase sharply with the advent +of the celebrated Internet of Things. + +Suppose that RCU contains a race condition that manifests on average +once per million years of runtime. This bug will be occurring about +three times per *day* across the installed base. RCU could simply hide +behind hardware error rates, given that no one should really expect +their smartphone to last for a million years. However, anyone taking too +much comfort from this thought should consider the fact that in most +jurisdictions, a successful multi-year test of a given mechanism, which +might include a Linux kernel, suffices for a number of types of +safety-critical certifications. In fact, rumor has it that the Linux +kernel is already being used in production for safety-critical +applications. I don't know about you, but I would feel quite bad if a +bug in RCU killed someone. Which might explain my recent focus on +validation and verification. + +Other RCU Flavors +----------------- + +One of the more surprising things about RCU is that there are now no +fewer than five *flavors*, or API families. In addition, the primary +flavor that has been the sole focus up to this point has two different +implementations, non-preemptible and preemptible. The other four flavors +are listed below, with requirements for each described in a separate +section. + +#. `Bottom-Half Flavor (Historical) <#Bottom-Half%20Flavor>`__ +#. `Sched Flavor (Historical) <#Sched%20Flavor>`__ +#. `Sleepable RCU <#Sleepable%20RCU>`__ +#. `Tasks RCU <#Tasks%20RCU>`__ + +Bottom-Half Flavor (Historical) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The RCU-bh flavor of RCU has since been expressed in terms of the other +RCU flavors as part of a consolidation of the three flavors into a +single flavor. The read-side API remains, and continues to disable +softirq and to be accounted for by lockdep. Much of the material in this +section is therefore strictly historical in nature. + +The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations) +flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a +flavor of RCU that could withstand the network-based denial-of-service +attacks researched by Robert Olsson. These attacks placed so much +networking load on the system that some of the CPUs never exited softirq +execution, which in turn prevented those CPUs from ever executing a +context switch, which, in the RCU implementation of that time, prevented +grace periods from ever ending. The result was an out-of-memory +condition and a system hang. + +The solution was the creation of RCU-bh, which does +``local_bh_disable()`` across its read-side critical sections, and which +uses the transition from one type of softirq processing to another as a +quiescent state in addition to context switch, idle, user mode, and +offline. This means that RCU-bh grace periods can complete even when +some of the CPUs execute in softirq indefinitely, thus allowing +algorithms based on RCU-bh to withstand network-based denial-of-service +attacks. + +Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and +re-enable softirq handlers, any attempt to start a softirq handlers +during the RCU-bh read-side critical section will be deferred. In this +case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can +take considerable time. One can of course argue that this softirq +overhead should be associated with the code following the RCU-bh +read-side critical section rather than ``rcu_read_unlock_bh()``, but the +fact is that most profiling tools cannot be expected to make this sort +of fine distinction. For example, suppose that a three-millisecond-long +RCU-bh read-side critical section executes during a time of heavy +networking load. There will very likely be an attempt to invoke at least +one softirq handler during that three milliseconds, but any such +invocation will be delayed until the time of the +``rcu_read_unlock_bh()``. This can of course make it appear at first +glance as if ``rcu_read_unlock_bh()`` was executing very slowly. + +The `RCU-bh +API `__ +includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``, +``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``, +``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``, +``call_rcu_bh()``, ``rcu_barrier_bh()``, and +``rcu_read_lock_bh_held()``. However, the update-side APIs are now +simple wrappers for other RCU flavors, namely RCU-sched in +CONFIG_PREEMPT=n kernels and RCU-preempt otherwise. + +Sched Flavor (Historical) +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The RCU-sched flavor of RCU has since been expressed in terms of the +other RCU flavors as part of a consolidation of the three flavors into a +single flavor. The read-side API remains, and continues to disable +preemption and to be accounted for by lockdep. Much of the material in +this section is therefore strictly historical in nature. + +Before preemptible RCU, waiting for an RCU grace period had the side +effect of also waiting for all pre-existing interrupt and NMI handlers. +However, there are legitimate preemptible-RCU implementations that do +not have this property, given that any point in the code outside of an +RCU read-side critical section can be a quiescent state. Therefore, +*RCU-sched* was created, which follows “classic” RCU in that an +RCU-sched grace period waits for for pre-existing interrupt and NMI +handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and +RCU-sched APIs have identical implementations, while kernels built with +``CONFIG_PREEMPT=y`` provide a separate implementation for each. + +Note well that in ``CONFIG_PREEMPT=y`` kernels, +``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and +re-enable preemption, respectively. This means that if there was a +preemption attempt during the RCU-sched read-side critical section, +``rcu_read_unlock_sched()`` will enter the scheduler, with all the +latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``, +this can make it look as if ``rcu_read_unlock_sched()`` was executing +very slowly. However, the highest-priority task won't be preempted, so +that task will enjoy low-overhead ``rcu_read_unlock_sched()`` +invocations. + +The `RCU-sched +API `__ +includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``, +``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``, +``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``, +``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``, +``call_rcu_sched()``, ``rcu_barrier_sched()``, and +``rcu_read_lock_sched_held()``. However, anything that disables +preemption also marks an RCU-sched read-side critical section, including +``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and +``local_irq_restore()``, and so on. + +Sleepable RCU +~~~~~~~~~~~~~ + +For well over a decade, someone saying “I need to block within an RCU +read-side critical section” was a reliable indication that this someone +did not understand RCU. After all, if you are always blocking in an RCU +read-side critical section, you can probably afford to use a +higher-overhead synchronization mechanism. However, that changed with +the advent of the Linux kernel's notifiers, whose RCU read-side critical +sections almost never sleep, but sometimes need to. This resulted in the +introduction of `sleepable RCU `__, or +*SRCU*. + +SRCU allows different domains to be defined, with each such domain +defined by an instance of an ``srcu_struct`` structure. A pointer to +this structure must be passed in to each SRCU function, for example, +``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct`` +structure. The key benefit of these domains is that a slow SRCU reader +in one domain does not delay an SRCU grace period in some other domain. +That said, one consequence of these domains is that read-side code must +pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for +example, as follows: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 srcu_read_unlock(&ss, idx); + +As noted above, it is legal to block within SRCU read-side critical +sections, however, with great power comes great responsibility. If you +block forever in one of a given domain's SRCU read-side critical +sections, then that domain's grace periods will also be blocked forever. +Of course, one good way to block forever is to deadlock, which can +happen if any operation in a given domain's SRCU read-side critical +section can wait, either directly or indirectly, for that domain's grace +period to elapse. For example, this results in a self-deadlock: + + :: + + 1 int idx; + 2 + 3 idx = srcu_read_lock(&ss); + 4 do_something(); + 5 synchronize_srcu(&ss); + 6 srcu_read_unlock(&ss, idx); + +However, if line 5 acquired a mutex that was held across a +``synchronize_srcu()`` for domain ``ss``, deadlock would still be +possible. Furthermore, if line 5 acquired a mutex that was held across a +``synchronize_srcu()`` for some other domain ``ss1``, and if an +``ss1``-domain SRCU read-side critical section acquired another mutex +that was held across as ``ss``-domain ``synchronize_srcu()``, deadlock +would again be possible. Such a deadlock cycle could extend across an +arbitrarily large number of different SRCU domains. Again, with great +power comes great responsibility. + +Unlike the other RCU flavors, SRCU read-side critical sections can run +on idle and even offline CPUs. This ability requires that +``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers, +which means that SRCU readers will run a bit slower than would RCU +readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API, +which, in combination with ``srcu_read_unlock()``, guarantees a full +memory barrier. + +Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be +invoked from CPU-hotplug notifiers, due to the fact that SRCU grace +periods make use of timers and the possibility of timers being +temporarily “stranded” on the outgoing CPU. This stranding of timers +means that timers posted to the outgoing CPU will not fire until late in +the CPU-hotplug process. The problem is that if a notifier is waiting on +an SRCU grace period, that grace period is waiting on a timer, and that +timer is stranded on the outgoing CPU, then the notifier will never be +awakened, in other words, deadlock has occurred. This same situation of +course also prohibits ``srcu_barrier()`` from being invoked from +CPU-hotplug notifiers. + +SRCU also differs from other RCU flavors in that SRCU's expedited and +non-expedited grace periods are implemented by the same mechanism. This +means that in the current SRCU implementation, expediting a future grace +period has the side effect of expediting all prior grace periods that +have not yet completed. (But please note that this is a property of the +current implementation, not necessarily of future implementations.) In +addition, if SRCU has been idle for longer than the interval specified +by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds +by default), and if a ``synchronize_srcu()`` invocation ends this idle +period, that invocation will be automatically expedited. + +As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a +locking bottleneck present in prior kernel versions. Although this will +allow users to put much heavier stress on ``call_srcu()``, it is +important to note that SRCU does not yet take any special steps to deal +with callback flooding. So if you are posting (say) 10,000 SRCU +callbacks per second per CPU, you are probably totally OK, but if you +intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please +run some tests first. SRCU just might need a few adjustment to deal with +that sort of load. Of course, your mileage may vary based on the speed +of your CPUs and the size of your memory. + +The `SRCU +API `__ +includes ``srcu_read_lock()``, ``srcu_read_unlock()``, +``srcu_dereference()``, ``srcu_dereference_check()``, +``synchronize_srcu()``, ``synchronize_srcu_expedited()``, +``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It +also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and +``init_srcu_struct()`` APIs for defining and initializing +``srcu_struct`` structures. + +Tasks RCU +~~~~~~~~~ + +Some forms of tracing use “trampolines” to handle the binary rewriting +required to install different types of probes. It would be good to be +able to free old trampolines, which sounds like a job for some form of +RCU. However, because it is necessary to be able to install a trace +anywhere in the code, it is not possible to use read-side markers such +as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does +not work to have these markers in the trampoline itself, because there +would need to be instructions following ``rcu_read_unlock()``. Although +``synchronize_rcu()`` would guarantee that execution reached the +``rcu_read_unlock()``, it would not be able to guarantee that execution +had completely left the trampoline. + +The solution, in the form of `Tasks +RCU `__, is to have implicit read-side +critical sections that are delimited by voluntary context switches, that +is, calls to ``schedule()``, ``cond_resched()``, and +``synchronize_rcu_tasks()``. In addition, transitions to and from +userspace execution also delimit tasks-RCU read-side critical sections. + +The tasks-RCU API is quite compact, consisting only of +``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and +``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines +cannot be preempted, so these APIs map to ``call_rcu()``, +``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In +``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these +three APIs are therefore implemented by separate functions that check +for voluntary context switches. + +Possible Future Changes +----------------------- + +One of the tricks that RCU uses to attain update-side scalability is to +increase grace-period latency with increasing numbers of CPUs. If this +becomes a serious problem, it will be necessary to rework the +grace-period state machine so as to avoid the need for the additional +latency. + +RCU disables CPU hotplug in a few places, perhaps most notably in the +``rcu_barrier()`` operations. If there is a strong reason to use +``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to +avoid disabling CPU hotplug. This would introduce some complexity, so +there had better be a *very* good reason. + +The tradeoff between grace-period latency on the one hand and +interruptions of other CPUs on the other hand may need to be +re-examined. The desire is of course for zero grace-period latency as +well as zero interprocessor interrupts undertaken during an expedited +grace period operation. While this ideal is unlikely to be achievable, +it is quite possible that further improvements can be made. + +The multiprocessor implementations of RCU use a combining tree that +groups CPUs so as to reduce lock contention and increase cache locality. +However, this combining tree does not spread its memory across NUMA +nodes nor does it align the CPU groups with hardware features such as +sockets or cores. Such spreading and alignment is currently believed to +be unnecessary because the hotpath read-side primitives do not access +the combining tree, nor does ``call_rcu()`` in the common case. If you +believe that your architecture needs such spreading and alignment, then +your architecture should also benefit from the +``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the +number of CPUs in a socket, NUMA node, or whatever. If the number of +CPUs is too large, use a fraction of the number of CPUs. If the number +of CPUs is a large prime number, well, that certainly is an +“interesting” architectural choice! More flexible arrangements might be +considered, but only if ``rcutree.rcu_fanout_leaf`` has proven +inadequate, and only if the inadequacy has been demonstrated by a +carefully run and realistic system-level workload. + +Please note that arrangements that require RCU to remap CPU numbers will +require extremely good demonstration of need and full exploration of +alternatives. + +RCU's various kthreads are reasonably recent additions. It is quite +likely that adjustments will be required to more gracefully handle +extreme loads. It might also be necessary to be able to relate CPU +utilization by RCU's kthreads and softirq handlers to the code that +instigated this CPU utilization. For example, RCU callback overhead +might be charged back to the originating ``call_rcu()`` instance, though +probably not in production kernels. + +Additional work may be required to provide reasonable forward-progress +guarantees under heavy load for grace periods and for callback +invocation. + +Summary +------- + +This document has presented more than two decade's worth of RCU +requirements. Given that the requirements keep changing, this will not +be the last word on this subject, but at least it serves to get an +important subset of the requirements set forth. + +Acknowledgments +--------------- + +I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg +Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy +Lutomirski for their help in rendering this article human readable, and +to Michelle Rankin for her support of this effort. Other contributions +are acknowledged in the Linux kernel's git archive. diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst index 340a9725676cd2c42af68a4553d298ac13637e1a..94427dc1f23d6b66d63360331578c562fcf76e0d 100644 --- a/Documentation/RCU/index.rst +++ b/Documentation/RCU/index.rst @@ -11,6 +11,11 @@ RCU concepts listRCU UP + Design/Memory-Ordering/Tree-RCU-Memory-Ordering + Design/Expedited-Grace-Periods/Expedited-Grace-Periods + Design/Requirements/Requirements + Design/Data-Structures/Data-Structures + .. only:: subproject and html Indices diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 7e1a8721637ab764c10106b8e947b5fce9c4358d..17f48319ee16d631023753f6ae0949052f0bf68a 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -302,7 +302,7 @@ rcu_dereference() must prohibit. The rcu_dereference_protected() variant takes a lockdep expression to indicate which locks must be acquired by the caller. If the indicated protection is not provided, - a lockdep splat is emitted. See RCU/Design/Requirements/Requirements.html + a lockdep splat is emitted. See Documentation/RCU/Design/Requirements/Requirements.rst and the API's code comments for more details and example usage. The following diagram shows how each API communicates among the @@ -630,7 +630,7 @@ been able to write-acquire the lock otherwise. The smp_mb__after_spinlock() promotes synchronize_rcu() to a full memory barrier in compliance with the "Memory-Barrier Guarantees" listed in: - Documentation/RCU/Design/Requirements/Requirements.html. + Documentation/RCU/Design/Requirements/Requirements.rst It is possible to nest rcu_read_lock(), since reader-writer locks may be recursively acquired. Note also that rcu_read_lock() is immune