Documentation/nvdimm/nvdimm.txt +28 −21

@@ -62,6 +62,12 @@ DAX: File system extensions to bypass the page cache and block layer to
 mmap persistent memory, from a PMEM block device, directly into a
 process address space.

+DSM: Device Specific Method: ACPI method to to control specific
+device - in this case the firmware.
+
+DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
+It defines a vendor-id, device-id, and interface format for a given DIMM.
+
 BTT: Block Translation Table: Persistent memory is byte addressable.
 Existing software may have an expectation that the power-fail-atomicity
 of writes is at least one sector, 512 bytes. The BTT is an indirection
@@ -133,16 +139,16 @@ device driver:
 registered, can be immediately attached to nd_pmem.

 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
-defined apertures. A set of apertures will all access just one DIMM.
-Multiple windows allow multiple concurrent accesses, much like
+defined apertures. A set of apertures will access just one DIMM.
+Multiple windows (apertures) allow multiple concurrent accesses, much like
 tagged-command-queuing, and would likely be used by different threads or
 different CPUs.

 The NFIT specification defines a standard format for a BLK-aperture, but
 the spec also allows for vendor specific layouts, and non-NFIT BLK
-implementations may other designs for BLK I/O. For this reason "nd_blk"
-calls back into platform-specific code to perform the I/O. One such
-implementation is defined in the "Driver Writer's Guide" and "DSM
+implementations may have other designs for BLK I/O. For this reason
+"nd_blk" calls back into platform-specific code to perform the I/O.
+One such implementation is defined in the "Driver Writer's Guide" and "DSM
 Interface Example".
@@ -152,7 +158,7 @@ Why BLK?
 While PMEM provides direct byte-addressable CPU-load/store access to
 NVDIMM storage, it does not provide the best system RAS (recovery,
 availability, and serviceability) model. An access to a corrupted
-system-physical-address address causes a cpu exception while an access
+system-physical-address address causes a CPU exception while an access
 to a corrupted address through an BLK-aperture causes that block window
 to raise an error status in a register. The latter is more aligned with
 the standard error model that host-bus-adapter attached disks present.
@@ -162,7 +168,7 @@ data could be interleaved in an opaque hardware specific manner across
 several DIMMs.

 PMEM vs BLK
-BLK-apertures solve this RAS problem, but their presence is also the
+BLK-apertures solve these RAS problems, but their presence is also the
 major contributing factor to the complexity of the ND subsystem. They
 complicate the implementation because PMEM and BLK alias in DPA space.
 Any given DIMM's DPA-range may contribute to one or more
@@ -220,8 +226,8 @@ socket. Each unique interface (BLK or PMEM) to DPA space is identified
 by a region device with a dynamically assigned id (REGION0 - REGION5).

 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
-single PMEM namespace is created in the REGION0-SPA-range that spans
-DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
+single PMEM namespace is created in the REGION0-SPA-range that spans most
+of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
 interleaved system-physical-address range is reclaimed as BLK-aperture
 accessed space starting at DPA-offset (a) into each DIMM. In that
 reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
@@ -230,13 +236,13 @@ by a region device with a dynamically assigned id (REGION0 - REGION5).

 2. In the last portion of DIMM0 and DIMM1 we have an interleaved
 system-physical-address range, REGION1, that spans those two DIMMs as
-well as DIMM2 and DIMM3. Some of REGION1 allocated to a PMEM namespace
-named "pm1.0" the rest is reclaimed in 4 BLK-aperture namespaces (for
+well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
+named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
 each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
 "blk5.0".

 3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1
-interleaved system-physical-address range (i.e. the DPA address below
+interleaved system-physical-address range (i.e. the DPA address past
 offset (b) are also included in the "blk4.0" and "blk5.0" namespaces.
 Note, that this example shows that BLK-aperture namespaces don't need to
 be contiguous in DPA-space.
@@ -252,15 +258,15 @@ LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API

 What follows is a description of the LIBNVDIMM sysfs layout and a
 corresponding object hierarchy diagram as viewed through the LIBNDCTL
-api. The example sysfs paths and diagrams are relative to the Example
+API. The example sysfs paths and diagrams are relative to the Example
 NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
 test.

 LIBNDCTL: Context
-Every api call in the LIBNDCTL library requires a context that holds the
+Every API call in the LIBNDCTL library requires a context that holds the
 logging parameters and other library instance state. The library is
 based on the libabc template:
-https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git

 LIBNDCTL: instantiate a new library context example

@@ -509,7 +515,7 @@ At first glance it seems since NFIT defines just PMEM and BLK interface
 types that we should simply name REGION devices with something derived
 from those type names. However, the ND subsystem explicitly keeps the
 REGION name generic and expects userspace to always consider the
-region-attributes for 4 reasons:
+region-attributes for four reasons:

 1. There are already more than two REGION and "namespace" types. For
 PMEM there are two subtypes. As mentioned previously we have PMEM where
@@ -698,8 +704,8 @@ static int configure_namespace(struct ndctl_region *region,

 Why the Term "namespace"?

-1. Why not "volume" for instance? "volume" ran the risk of confusing ND
-as a volume manager like device-mapper.
+1. Why not "volume" for instance? "volume" ran the risk of confusing
+ND (libnvdimm subsystem) to a volume manager like device-mapper.

 2. The term originated to describe the sub-devices that can be created
 within a NVME controller (see the nvme specification:
@@ -774,13 +780,14 @@ block" needs to be destroyed. Note, that to destroy a BTT the media
 needs to be written in raw mode. By default, the kernel will autodetect
 the presence of a BTT and disable raw mode. This autodetect behavior
 can be suppressed by enabling raw mode for the namespace via the
-ndctl_namespace_set_raw_mode() api.
+ndctl_namespace_set_raw_mode() API.


 Summary LIBNDCTL Diagram
 ------------------------

-For the given example above, here is the view of the objects as seen by the LIBNDCTL api:
+For the given example above, here is the view of the objects as seen by the
+LIBNDCTL API:
 +---+
 |CTX|    +---------+   +--------------+  +---------------+
 +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |