Commits · 2bbafda405c04cfed1b57b761d13ada3154c0f89 · jan.koester / Linux

Jun 15, 2021

libnvdimm: Drop unused device power management support · 2bbafda4

Dan Williams authored Jun 15, 2021

LIBNVDIMM device objects register sysfs power attributes despite nothing
requiring that support. Clean up sysfs by removing the power/ attribute
group. This requires a device_create() and a device_register() usage to
be converted to the device_initialize() + device_add() pattern.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162379910795.2993820.10130417680551632288.stgit@dwillia2-desk3.amr.corp.intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


libnvdimm: Export nvdimm shutdown helper, nvdimm_delete() · fd14602d

Dan Williams authored Jun 15, 2021

CXL is a hotplug bus and arranges for nvdimm devices to be dynamically
discovered and removed. The libnvdimm core manages shutdown of nvdimm
security operations when the device is unregistered. That functionality
is moved to nvdimm_delete() and invoked by the CXL-to-nvdimm glue code.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162379910271.2993820.2955889139842401250.stgit@dwillia2-desk3.amr.corp.intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/pmem: Add initial infrastructure for pmem support · 8fdcb170

Dan Williams authored Jun 15, 2021

Register an 'nvdimm-bridge' device to act as an anchor for a libnvdimm
bus hierarchy. Also, flesh out the cxl_bus definition to allow a
cxl_nvdimm_bridge_driver to attach to the bridge and trigger the
nvdimm-bus registration.

The creation of the bridge is gated on the detection of a PMEM capable
address space registered to the root. The bridge indirection allows the
libnvdimm module to remain unloaded on platforms without PMEM support.

Given that the probing of ACPI0017 is asynchronous to CXL endpoint
devices, and the expectation that CXL endpoint devices register other
PMEM resources on the 'CXL' nvdimm bus, a workqueue is added. The
workqueue is needed to run bus_rescan_devices() outside of the
device_lock() of the nvdimm-bridge device to rendezvous nvdimm resources
as they arrive. For now only the bus is taken online/offline in the
workqueue.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162379909706.2993820.14051258608641140169.stgit@dwillia2-desk3.amr.corp.intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/core: Add cxl-bus driver infrastructure · 6af7139c

Dan Williams authored Jun 15, 2021



Enable devices on the 'cxl' bus to be attached to drivers. The initial
user of this functionality is a driver for an 'nvdimm-bridge' device
that anchors a libnvdimm hierarchy attached to CXL persistent memory
resources. Other device types that will leverage this include:

cxl_port: map and use component register functionality (HDM Decoders)

cxl_nvdimm: translate CXL memory expander endpoints to libnvdimm
	    'nvdimm' objects

cxl_region: translate CXL interleave sets to libnvdimm 'region' objects

The pairing of devices to drivers is handled through the cxl_device_id()
matching to cxl_driver.id values. A cxl_device_id() of '0' indicates no
driver support.
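
As a rough illustration of the matching rule described above, the
following userspace sketch models device and driver ids as plain
integers. The names cxl_match_sketch(), struct cxl_driver_sketch, and the
id constants are hypothetical and are not taken from the kernel source;
only the rule that an id of '0' means no driver support comes from the
commit message.

```c
/* Hypothetical device-type ids; 0 is reserved for "no driver support". */
enum {
	CXL_ID_NONE_SKETCH = 0,
	CXL_ID_NVDIMM_BRIDGE_SKETCH = 1,
};

/* Hypothetical stand-in for a driver registered on the 'cxl' bus. */
struct cxl_driver_sketch {
	const char *name;
	int id;		/* device type this driver binds to */
};

/* Return 1 when the driver should bind to a device with this id, else 0. */
static int cxl_match_sketch(int device_id, const struct cxl_driver_sketch *drv)
{
	if (device_id == CXL_ID_NONE_SKETCH)
		return 0;	/* an id of 0 never matches any driver */
	return device_id == drv->id;
}
```

In this sketch a bus ->match() callback would call cxl_match_sketch()
once per (device, driver) pair and proceed to ->probe() only when it
returns 1.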

In addition to ->match(), ->probe(), and ->remove() support for the
'cxl' bus introduce MODULE_ALIAS_CXL() to autoload modules containing
cxl-drivers. Drivers are added in follow-on changes.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162379909190.2993820.6134168109678004186.stgit@dwillia2-desk3.amr.corp.intel.com


Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/pci: Add media provisioning required commands · 87815ee9

Ben Widawsky authored Apr 13, 2021

Some of the commands have already been defined for the support of RAW
commands (to be blocked). Unlike their usage in the RAW interface, when
used through the supported interface, they will be coordinated and
marshalled along with other commands being issued by userspace and the
driver itself. That coordination will be added later.

The list of commands was determined based on lessons learned from
libnvdimm, and the list was provided directly by Dan.

Recommended-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20210413140907.534404-1-ben.widawsky@intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


Jun 12, 2021

cxl/component_regs: Fix offset · ba268647

Ben Widawsky authored Jun 10, 2021

The CXL.cache and CXL.mem registers begin after the CXL.io registers
which occupy the first 0x1000 bytes. The current code wasn't setting
this up properly for future users of the component registers. It was
correct for the probing code however.

Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Fixes: 08422378 ("cxl/pci: Add HDM decoder capabilities")
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/20210611051113.224328-1-ben.widawsky@intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/hdm: Fix decoder count calculation · 6423035f

Ben Widawsky authored Jun 11, 2021



The decoder count in the HDM decoder capability structure is an encoded
field. As defined in the spec:

Decoder Count: Reports the number of memory address decoders implemented
by the component.
0 – 1 Decoder
1 – 2 Decoders
2 – 4 Decoders
3 – 6 Decoders
4 – 8 Decoders
5 – 10 Decoders
All other values are reserved

Nothing is actually fixed by this as nothing actually used this mapping
yet.
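
The encoding quoted above can be sketched as a small decode helper. This
is a minimal userspace illustration, not the kernel's implementation; the
function name hdm_decoder_count_sketch() is hypothetical, and returning 0
for reserved encodings is an assumption made here for clarity.

```c
/*
 * Translate the encoded "Decoder Count" field into a number of decoders.
 * Encoding 0 means 1 decoder; encodings 1..5 mean twice the field value.
 * Reserved encodings return 0 (an assumption of this sketch).
 */
static unsigned int hdm_decoder_count_sketch(unsigned int field)
{
	if (field == 0)
		return 1;
	if (field <= 5)
		return field * 2;
	return 0;	/* all other values are reserved */
}
```

For example, a field value of 3 decodes to 6 decoders, matching the
table from the spec.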

Cc: Ira Weiny <ira.weiny@intel.com>
Fixes: 08422378 ("cxl/pci: Add HDM decoder capabilities")
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Link: https://lore.kernel.org/r/20210611190111.121295-1-ben.widawsky@intel.com


Signed-off-by: Dan Williams <dan.j.williams@intel.com>


Jun 10, 2021

cxl/acpi: Introduce cxl_decoder objects · 40ba17af

Dan Williams authored Jun 09, 2021



A cxl_decoder is a child of a cxl_port. It represents a hardware decoder
configuration of an upstream port to one or more of its downstream
ports. The decoder is either represented in CXL standard HDM decoder
registers (see CXL 2.0 section 8.2.5.12 CXL HDM Decoder Capability
Structure), or it is a static decode configuration communicated by
platform firmware (see the CXL Early Discovery Table: Fixed Memory
Window Structure).

The firmware-described and hardware-described decoders differ slightly,
leading to 2 different sub-types of decoders, cxl_decoder_root and
cxl_decoder_switch. At the root level the decode capabilities restrict
what can be mapped beneath them. Mid-level switch decoders are
configured for either accelerator (type-2) or memory-expander (type-3)
operation, but they are otherwise agnostic to the type of memory
(volatile vs persistent) being mapped.

Here is an example topology from a single-ported host-bridge environment
without CFMWS decodes enumerated.

    /sys/bus/cxl/devices/root0
    ├── devtype
    ├── dport0 -> ../../../LNXSYSTM:00/LNXSYBUS:00/ACPI0016:00
    ├── port1
    │   ├── decoder1.0
    │   │   ├── devtype
    │   │   ├── locked
    │   │   ├── size
    │   │   ├── start
    │   │   ├── subsystem -> ../../../../../../bus/cxl
    │   │   ├── target_list
    │   │   ├── target_type
    │   │   └── uevent
    │   ├── devtype
    │   ├── dport0 -> ../../../../pci0000:34/0000:34:00.0
    │   ├── subsystem -> ../../../../../bus/cxl
    │   ├── uevent
    │   └── uport -> ../../../../LNXSYSTM:00/LNXSYBUS:00/ACPI0016:00
    ├── subsystem -> ../../../../bus/cxl
    ├── uevent
    └── uport -> ../../ACPI0017:00

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162325695128.2293823.17519927266014762694.stgit@dwillia2-desk3.amr.corp.intel.com


Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/acpi: Enumerate host bridge root ports · 3b94ce7b

Dan Williams authored Jun 09, 2021

While the resources enumerated by the CEDT.CFMWS identify a cxl_port
with host bridges as downstream ports, host bridges themselves are
upstream ports that decode to downstream ports represented by PCIe Root
Ports. Walk the PCIe Root Ports connected to a CXL Host Bridge,
identified by the ACPI0016 _HID, and add each one as a cxl_dport of the
host bridge cxl_port.

For now, component registers are not enumerated, only the first order
uport / dport relationships.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162325451145.2293126.10149150938788969381.stgit@dwillia2-desk3.amr.corp.intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/acpi: Add downstream port data to cxl_port instances · 7d4b5ca2

Dan Williams authored Jun 09, 2021

In preparation for infrastructure that enumerates and configures the CXL
decode mechanism of an upstream port to its downstream ports, add a
representation of a CXL downstream port.

On ACPI systems the top-most logical downstream ports in the hierarchy
are the host bridges (ACPI0016 devices) that decode the memory windows
described by the CXL Early Discovery Table Fixed Memory Window
Structures (CEDT.CFMWS).

Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Link: https://lore.kernel.org/r/162325450624.2293126.3533006409920271718.stgit@dwillia2-desk3.amr.corp.intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/Kconfig: Default drivers to CONFIG_CXL_BUS · 3feaa2d3

Dan Williams authored Jun 09, 2021

CONFIG_CXL_BUS is default 'n' as expected for new functionality. When
that is enabled do not make the end user hunt for all the expected
sub-options to enable. For example CONFIG_CXL_BUS without CONFIG_CXL_MEM
is an odd/expert configuration, so is CONFIG_CXL_MEM without
CONFIG_CXL_ACPI (on ACPI capable platforms). Default CONFIG_CXL_MEM and
CONFIG_CXL_ACPI to CONFIG_CXL_BUS.

Acked-by: Ben Widawsky <ben.widawsky@intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162325450105.2293126.17046356425194082921.stgit@dwillia2-desk3.amr.corp.intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/acpi: Introduce the root of a cxl_port topology · 4812be97

Dan Williams authored Jun 09, 2021

While CXL builds upon the PCI software model for enumeration and
endpoint control, a static platform component is required to bootstrap
the CXL memory layout. Similar to how ACPI identifies root-level PCI
memory resources, ACPI data enumerates the address space and interleave
configuration for CXL Memory.

In addition to identifying host bridges, ACPI is responsible for
enumerating the CXL memory space that can be addressed by downstream
decoders. This is similar to the requirement for ACPI to publish
resources via the _CRS method for PCI host bridges. Specifically, ACPI
publishes a table, CXL Early Discovery Table (CEDT), which includes a
list of CXL Memory resources, CXL Fixed Memory Window Structures
(CFMWS).

For now, introduce the core infrastructure for a cxl_port hierarchy
starting with a root level anchor represented by the ACPI0017 device.

Follow on changes model support for the configurable decode capabilities
of cxl_port instances, i.e. CXL switch support.

Co-developed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162325449515.2293126.15303270193010154608.stgit@dwillia2-desk3.amr.corp.intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


Jun 07, 2021

ACPICA: Use ACPI_FALLTHROUGH · b5e77403

Wei Ming Chen authored Jun 04, 2021

ACPICA commit 2296edd39b4ce2d2dd691c1f309c4da00843ecc9

Replace /* FALLTHROUGH */ comment with ACPI_FALLTHROUGH

Link: https://github.com/acpica/acpica/commit/2296edd3


Signed-off-by: Wei Ming Chen <jj251510319013@gmail.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Erik Kaneda <erik.kaneda@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


ACPICA: Fix memory leak caused by _CID repair function · c27bac03

Erik Kaneda authored Jun 04, 2021

ACPICA commit 180cb53963aa876c782a6f52cc155d951b26051a

According to the ACPI spec, _CID returns a package containing
hardware ID's. Each element of an ASL package contains a reference
count from the parent package as well as the element itself.

Name (TEST, Package() {
    "String object" // this package element has a reference count of 2
})

A memory leak was caused in the _CID repair function because it did
not decrement the reference count created by the package. Fix the
memory leak by calling acpi_ut_remove_reference on _CID package elements
that represent a hardware ID (_HID).

Link: https://github.com/acpica/acpica/commit/180cb539


Tested-by: Shawn Guo <shawn.guo@linaro.org>
Signed-off-by: Erik Kaneda <erik.kaneda@intel.com>
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


Jun 06, 2021

cxl/pci: Fixup devm_cxl_iomap_block() to take a 'struct device *' · 605a5e41

Dan Williams authored May 27, 2021



The expectation is that devm functions take 'struct device *' and pci
functions take 'struct pci_dev *'. Swap out the @pdev argument for @dev
and fixup related helpers.

Cc: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162216592374.3833641.13281743585064451514.stgit@dwillia2-desk3.amr.corp.intel.com


Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/pci: Add HDM decoder capabilities · 08422378

Ben Widawsky authored May 27, 2021

An HDM decoder is defined in the CXL 2.0 specification as a mechanism
that allows devices and upstream ports to claim memory address ranges and
participate in interleave sets. HDM decoder registers are within the
component register block defined in CXL 2.0 8.2.3 CXL 2.0 Component
Registers as part of the CXL.cache and CXL.mem subregion.

The Component Register Block is found via the Register Locator DVSEC
in a similar fashion to how the CXL Device Register Block is found. The
primary difference is the capability id size of the Component Register
Block is a single DWORD instead of 4 DWORDS.

It's now possible to configure a CXL type 3 device's HDM decoder. Such
programming is expected for CXL devices with persistent memory, and hot
plugged CXL devices that participate in CXL.mem with volatile memory.

Add probe and mapping functions for the component register blocks.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Co-developed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Link: https://lore.kernel.org/r/20210528004922.3980613-6-ira.weiny@intel.com

Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/pci: Reserve individual register block regions · 9a016527

Ira Weiny authored Jun 03, 2021



Some hardware implementations mix component and device registers into
the same BAR and the driver stack is going to need independent mapping
implementations for those 2 cases.  Furthermore, it will be nice to have
finer grained mappings should user space want to map some register
blocks.

Now that individual register blocks are mapped, those blocks' regions
should be reserved individually to fully separate the register blocks.

Release the 'global' memory reservation and create individual register
block region reservations through devm.

NOTE: pci_release_mem_regions() is still compatible with
pcim_enable_device() because it removes the automatic region release
when called.  So preserve the pcim_enable_device() so that the pcim
interface can be called if needed.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20210604005316.4187340-1-ira.weiny@intel.com


Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/pci: Map registers based on capabilities · 30af9729

Ira Weiny authored Jun 03, 2021



The information required to map registers based on capabilities is
contained within the BARs themselves.  This means the BAR must be mapped
to read the information needed and then unmapped to map the individual
parts of the BAR based on capabilities.

Change cxl_setup_device_regs() to return a new cxl_register_map, change
the name to cxl_probe_device_regs().  Allocate and place
cxl_register_maps on a list while processing all of the specified
register blocks.

After probing all the register blocks, go back and map smaller register
blocks based on their capabilities and dispose of the cxl_register_maps.

NOTE: pci_iomap() is not managed automatically via pcim_enable_device()
so be careful to call pci_iounmap() correctly.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20210604005036.4187184-1-ira.weiny@intel.com


Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/pci: Reserve all device regions at once · f8a7e8c2

Ira Weiny authored May 27, 2021



In order to remap individual register sets, each BAR region must be
reserved prior to mapping.  Because the details of individual register
sets are contained within the BARs themselves, the BAR must be mapped 2
times, once to extract this information and a second time for each
register set.

Rather than attempt to reserve each BAR individually and track whether
that BAR has been reserved, open code pcim_iomap_regions() by first
reserving all memory regions on the device and then mapping the BARs
individually as needed.

NOTE: pci_request_mem_regions() does not need a corresponding
pci_release_mem_regions() because the pci device is managed via
pcim_enable_device().

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20210528004922.3980613-3-ira.weiny@intel.com


Signed-off-by: Dan Williams <dan.j.williams@intel.com>


cxl/pci: Introduce cxl_decode_register_block() · 07d62eac

Ira Weiny authored May 27, 2021



Each register block located in the DVSEC needs to be decoded from 2
words, 'register offset high' and 'register offset low'.

Create a function, cxl_decode_register_block() to perform this decode
and return the bar, offset, and register type of the register block.

Then use the values decoded in cxl_mem_map_regblock() instead of passing
the raw registers.
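
The decode step described above can be sketched in userspace as follows.
The field layout used here (BAR indicator in bits 2:0, register block
identifier in bits 15:8, and the low offset bits in bits 31:16 of the
'register offset low' word) is an assumption of this sketch and should
be checked against the CXL 2.0 Register Locator DVSEC definition. The
struct and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed field layout of the 'register offset low' word. */
#define REGLOC_BIR_MASK_SKETCH   0x00000007u	/* bits 2:0: BAR indicator */
#define REGLOC_RBI_MASK_SKETCH   0x0000ff00u	/* bits 15:8: block identifier */
#define REGLOC_RBI_SHIFT_SKETCH  8
#define REGLOC_ADDR_MASK_SKETCH  0xffff0000u	/* bits 31:16: offset, low part */

/* Hypothetical container for the decoded values. */
struct regblock_sketch {
	uint8_t bar;		/* which BAR holds the register block */
	uint8_t reg_type;	/* register block identifier */
	uint64_t offset;	/* byte offset of the block within the BAR */
};

/* Combine the low and high words into bar, type, and a 64-bit offset. */
static struct regblock_sketch
decode_register_block_sketch(uint32_t reg_lo, uint32_t reg_hi)
{
	struct regblock_sketch rb;

	rb.bar = (uint8_t)(reg_lo & REGLOC_BIR_MASK_SKETCH);
	rb.reg_type = (uint8_t)((reg_lo & REGLOC_RBI_MASK_SKETCH) >>
				REGLOC_RBI_SHIFT_SKETCH);
	rb.offset = ((uint64_t)reg_hi << 32) |
		    (reg_lo & REGLOC_ADDR_MASK_SKETCH);
	return rb;
}
```

Returning the three values together keeps callers from having to pass
the raw register words around, which is the point of the change.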

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20210528004922.3980613-2-ira.weiny@intel.com


Signed-off-by: Dan Williams <dan.j.williams@intel.com>


Jun 05, 2021

drivers/base/memory: fix trying offlining memory blocks with memory holes on aarch64 · 92813053

David Hildenbrand authored Jun 04, 2021

offline_pages() properly checks for memory holes and bails out.
However, we do a page_zone(pfn_to_page(start_pfn)) before calling
offline_pages() when offlining a memory block.

We should not unconditionally call page_zone(pfn_to_page(start_pfn)) on
aarch64 in offlining code, otherwise we can trigger a BUG when hitting a
memory hole:

   kernel BUG at include/linux/mm.h:1383!
   Internal error: Oops - BUG: 0 [#1] SMP
   Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb nvme i2c_algo_bit mlx5_core i2c_core nvme_core firmware_class
   CPU: 13 PID: 1694 Comm: ranbug Not tainted 5.12.0-next-20210524+ #4
   Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
   pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
   pc : memory_subsys_offline+0x1f8/0x250
   lr : memory_subsys_offline+0x1f8/0x250
   Call trace:
     memory_subsys_offline+0x1f8/0x250
     device_offline+0x154/0x1d8
     online_store+0xa4/0x118
     dev_attr_store+0x44/0x78
     sysfs_kf_write+0xe8/0x138
     kernfs_fop_write_iter+0x26c/0x3d0
     new_sync_write+0x2bc/0x4f8
     vfs_write+0x718/0xc88
     ksys_write+0xf8/0x1e0
     __arm64_sys_write+0x74/0xa8
     invoke_syscall.constprop.0+0x78/0x1e8
     do_el0_svc+0xe4/0x298
     el0_svc+0x20/0x30
     el0_sync_handler+0xb0/0xb8
     el0_sync+0x178/0x180
   Kernel panic - not syncing: Oops - BUG: Fatal exception
   SMP: stopping secondary CPUs
   Kernel Offset: disabled
   CPU features: 0x00000251,20000846
   Memory Limit: none

If nr_vmemmap_pages is set, we know that we are dealing with hotplugged
memory that doesn't have any holes.  So call
page_zone(pfn_to_page(start_pfn)) only when really necessary -- when
nr_vmemmap_pages is set and we actually adjust the present pages.

Link: https://lkml.kernel.org/r/20210526075226.5572-1-david@redhat.com


Fixes: a08a2ae3 ("mm,memory_hotplug: allocate memmap from the added memory range")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Qian Cai (QUIC) <quic_qiancai@quicinc.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


Jun 04, 2021

cxgb4: avoid link re-train during TC-MQPRIO configuration · 3822d067

Rahul Lakkireddy authored Jun 04, 2021



When configuring TC-MQPRIO offload, only turn off netdev carrier and
don't bring physical link down in hardware. Otherwise, when the
physical link is brought up again after configuration, it gets
re-trained and stalls ongoing traffic.

Also, when firmware is no longer accessible or crashed, avoid sending
FLOWC and waiting for reply that will never come.

Fix following hung_task_timeout_secs trace seen in these cases.

INFO: task tc:20807 blocked for more than 122 seconds.
      Tainted: G S                5.13.0-rc3+ #122
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:tc   state:D stack:14768 pid:20807 ppid: 19366 flags:0x00000000
Call Trace:
 __schedule+0x27b/0x6a0
 schedule+0x37/0xa0
 schedule_preempt_disabled+0x5/0x10
 __mutex_lock.isra.14+0x2a0/0x4a0
 ? netlink_lookup+0x120/0x1a0
 ? rtnl_fill_ifinfo+0x10f0/0x10f0
 __netlink_dump_start+0x70/0x250
 rtnetlink_rcv_msg+0x28b/0x380
 ? rtnl_fill_ifinfo+0x10f0/0x10f0
 ? rtnl_calcit.isra.42+0x120/0x120
 netlink_rcv_skb+0x4b/0xf0
 netlink_unicast+0x1a0/0x280
 netlink_sendmsg+0x216/0x440
 sock_sendmsg+0x56/0x60
 __sys_sendto+0xe9/0x150
 ? handle_mm_fault+0x6d/0x1b0
 ? do_user_addr_fault+0x1c5/0x620
 __x64_sys_sendto+0x1f/0x30
 do_syscall_64+0x3c/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7f73218321
RSP: 002b:00007ffd19626208 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 000055b7c0a8b240 RCX: 00007f7f73218321
RDX: 0000000000000028 RSI: 00007ffd19626210 RDI: 0000000000000003
RBP: 000055b7c08680ff R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000055b7c085f5f6
R13: 000055b7c085f60a R14: 00007ffd19636470 R15: 00007ffd196262a0

Fixes: b1396c2b ("cxgb4: parse and configure TC-MQPRIO offload")
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


wireguard: allowedips: free empty intermediate nodes when removing single node · bf7b042d

Jason A. Donenfeld authored Jun 04, 2021

When removing single nodes, it's possible that that node's parent is an
empty intermediate node, in which case, it too should be removed.
Otherwise the trie fills up and never is fully emptied, leading to
gradual memory leaks over time for tries that are modified often. There
was originally code to do this, but it was removed during refactoring in
2016 and never reworked. Now that we have proper parent pointers from
the previous commits, we can implement this properly.

In order to reduce branching and expensive comparisons, we want to keep
the double pointer for parent assignment (which lets us easily chain up
to the root), but we still need to actually get the parent's base
address. So encode the bit number into the last two bits of the pointer,
and pack and unpack it as needed. This is a little bit clumsy but is the
fastest and least memory-wasteful of the compromises. Note that we align
the root struct here to a minimum of 4, because it's embedded into a
larger struct, and we're relying on having the bottom two bits for our
flag, which would only be 16-bit aligned on m68k.
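
The pack/unpack idea described above can be sketched in userspace like
this. It is a simplified illustration, not the WireGuard code: the node
type and the function names are hypothetical, and the sketch packs a
pointer to the parent node itself together with a 2-bit value, relying
on the pointer being at least 4-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a trie node with two child slots. */
struct node_sketch {
	struct node_sketch *bit[2];
};

/* Store a small value (0..3) in the two low bits of an aligned pointer. */
static uintptr_t pack_parent_sketch(const struct node_sketch *parent,
				    unsigned int bit)
{
	uintptr_t p = (uintptr_t)parent;

	assert((p & 3u) == 0);	/* requires at least 4-byte alignment */
	assert(bit <= 3u);
	return p | bit;
}

/* Recover the parent's base address by masking off the two low bits. */
static struct node_sketch *unpack_parent_sketch(uintptr_t packed)
{
	return (struct node_sketch *)(packed & ~(uintptr_t)3);
}

/* Recover the small value stored in the two low bits. */
static unsigned int unpack_bit_sketch(uintptr_t packed)
{
	return (unsigned int)(packed & 3u);
}
```

The alignment assertion mirrors the reason the commit forces a minimum
alignment of 4: without it the two low bits of the address are not
guaranteed to be zero.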

The existing macro-based helpers were a bit unwieldy for adding the bit
packing to, so this commit replaces them with safer and clearer ordinary
functions.

We add a test to the randomized/fuzzer part of the selftests, to free
the randomized tries by-peer, refuzz it, and repeat, until it's supposed
to be empty, and then see if that actually resulted in the whole
thing being emptied. That combined with kmemcheck should hopefully make
sure this commit is doing what it should. Along the way this resulted in
various other cleanups of the tests and fixes for recent graphviz.

Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


wireguard: allowedips: allocate nodes in kmem_cache · dc680de2

Jason A. Donenfeld authored Jun 04, 2021



The previous commit moved from O(n) to O(1) for removal, but in the
process introduced an additional pointer member to a struct that
increased the size from 60 to 68 bytes, putting nodes in the 128-byte
slab. With deployed systems having as many as 2 million nodes, this
represents a significant doubling in memory usage (128 MiB -> 256 MiB).
Fix this by using our own kmem_cache, that's sized exactly right. This
also makes wireguard's memory usage more transparent in tools like
slabtop and /proc/slabinfo.
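
The size arithmetic above can be reproduced with a short userspace
sketch. It assumes pure power-of-two rounding, as the commit message
describes; the size classes actually available to the kernel allocator
depend on configuration and architecture. The function names are
hypothetical, and "2 million" is taken as 2^21 nodes for the example.

```c
#include <stddef.h>

/* Round a request up to the next power of two (assumed bucket policy). */
static size_t pow2_bucket_sketch(size_t size)
{
	size_t bucket = 1;

	while (bucket < size)
		bucket <<= 1;
	return bucket;
}

/* Total bytes consumed by `count` objects of `size` bytes each. */
static size_t pow2_footprint_sketch(size_t size, size_t count)
{
	return pow2_bucket_sketch(size) * count;
}
```

Under these assumptions a 60-byte node lands in a 64-byte bucket and a
68-byte node in a 128-byte bucket, so 2^21 nodes grow from 128 MiB to
256 MiB. A dedicated cache sized to the object avoids that rounding.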

Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


wireguard: allowedips: remove nodes in O(1) · f634f418

Jason A. Donenfeld authored Jun 04, 2021



Previously, deleting peers would require traversing the entire trie in
order to rebalance nodes and safely free them. This meant that removing
1000 peers from a trie with a half million nodes would take an extremely
long time, during which we're holding the rtnl lock. Large-scale users
were reporting 200ms latencies added to the networking stack as a whole
every time their userspace software would queue up significant removals.
That's a serious situation.

This commit fixes that by maintaining a double pointer to the parent's
bit pointer for each node, and then using the already existing node list
belonging to each peer to go directly to the node, fix up its pointers,
and free it with RCU. This means removal is O(1) instead of O(n), and we
don't use gobs of stack.

The removal algorithm has the same downside as the code that it fixes:
it won't collapse needlessly long runs of fillers.  We can enhance that
in the future if it ever becomes a problem. This commit documents that
limitation with a TODO comment in code, a small but meaningful
improvement over the prior situation.

Currently the biggest flaw, which the next commit addresses, is that
this increases the node size on 64-bit machines from 60 bytes to
68 bytes. 60 rounds up to 64, but 68 rounds up to 128. So we wind up
using twice as much memory per node, because of power-of-two
allocations, which is a big bummer. We'll need to figure something out
there.

Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


wireguard: allowedips: initialize list head in selftest · 46cfe8ee

Jason A. Donenfeld authored Jun 04, 2021



The randomized trie tests weren't initializing the dummy peer list head,
resulting in a NULL pointer dereference when used. Fix this by
initializing it in the randomized trie test, just like we do for the
static unit test.

While we're at it, all of the other strings like this have the word
"self-test", so add it to the missing place here.

Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


wireguard: peer: allocate in kmem_cache · a4e9f8e3

Jason A. Donenfeld authored Jun 04, 2021



With deployments having upwards of 600k peers now, this somewhat heavy
structure could benefit from more fine-grained allocations.
Specifically, instead of using a 2048-byte slab for a 1544-byte object,
we can now use 1544-byte objects directly, thus saving almost 25%
per-peer, or with 600k peers, that's a savings of 303 MiB. This also
makes wireguard's memory usage more transparent in tools like slabtop
and /proc/slabinfo.

Fixes: 8b5553ac ("wireguard: queueing: get rid of per-peer ring buffers")
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


wireguard: use synchronize_net rather than synchronize_rcu · 24b70eee

Jason A. Donenfeld authored Jun 04, 2021



Many of the synchronization points are sometimes called under the rtnl
lock, which means we should use synchronize_net rather than
synchronize_rcu. Under the hood, this expands to using the expedited
flavor of function in the event that rtnl is held, in order to not stall
other concurrent changes.

This fixes some very, very long delays when removing multiple peers at
once, which would cause some operations to take several minutes.

Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


wireguard: do not use -O3 · cc5060ca

Jason A. Donenfeld authored Jun 04, 2021

Apparently, various versions of gcc have O3-related miscompiles. Looking
at the difference between -O2 and -O3 for gcc 11 doesn't indicate
miscompiles, but the difference also doesn't seem so significant for
performance that it's worth risking.

Link: https://lore.kernel.org/lkml/CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com/
Link: https://lore.kernel.org/lkml/CAHmME9otB5Wwxp7H8bR_i2uH2esEMvoBMC8uEXBMH9p0q1s6Bw@mail.gmail.com/

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: e7096c13 ("net: WireGuard secure network tunnel")
Cc: stable@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


i2c: qcom-geni: Suspend and resume the bus during SYSTEM_SLEEP_PM ops · 57648e86

Roja Rani Yarubandi authored May 25, 2021



Mark the bus as suspended during system suspend to block future
transfers. Implement geni_i2c_resume_noirq() to resume the bus.

Fixes: 37692de5 ("i2c: i2c-qcom-geni: Add bus driver for the Qualcomm GENI I2C controller")
Signed-off-by: Roja Rani Yarubandi <rojay@codeaurora.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Wolfram Sang <wsa@kernel.org>

57648e86

i2c: qcom-geni: Add shutdown callback for i2c · 9f78c607

Roja Rani Yarubandi authored May 25, 2021



If the hardware is still accessing memory after SMMU translation
is disabled (as part of smmu shutdown callback), then the
IOVAs (I/O virtual addresses) which it was using will go on the bus
as the physical addresses which will result in unknown crashes
like NoC/interconnect errors.

So, implement shutdown callback for i2c driver to suspend the bus
during system "reboot" or "shutdown".

Fixes: 37692de5 ("i2c: i2c-qcom-geni: Add bus driver for the Qualcomm GENI I2C controller")
Signed-off-by: Roja Rani Yarubandi <rojay@codeaurora.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Wolfram Sang <wsa@kernel.org>

9f78c607

ice: Allow all LLDP packets from PF to Tx · f9f83202

Dave Ertman authored May 05, 2021



Currently in the ice driver, the check whether to
allow an LLDP packet to egress the interface from the
PF_VSI is based on the SKB's priority field.
It checks whether the packet's priority is equal to
TC_PRIO_CONTROL.  Injected LLDP packets do not always
meet this condition.

SCAPY defaults to a sk_buff->protocol value of ETH_P_ALL
(0x0003) and does not set the priority field.  There will
be other injection methods (even ones used by end users)
that will not correctly configure the socket so that
SKB fields are correctly populated.

The Ethernet header has to have the correct value for
the protocol, though.

Add a check to also allow packets whose ethhdr->h_proto
matches ETH_P_LLDP (0x88CC).
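
A minimal userspace sketch of the combined check follows. It is not the driver's code: `is_allowed_control_frame()` is a hypothetical helper, the frame is assumed to be an untagged Ethernet frame held in a plain byte buffer, and the priority is passed in as an ordinary integer rather than read from an SKB.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN	14	/* destination MAC + source MAC + EtherType */
#define ETH_P_LLDP	0x88CC	/* LLDP EtherType, as given in the commit */
#define TC_PRIO_CONTROL	7	/* value defined in linux/pkt_sched.h */

/*
 * Hypothetical helper: allow the frame if its priority marks it as
 * control traffic, or if the Ethernet header itself says it is LLDP.
 * Assumes an untagged frame (no VLAN tag before the EtherType).
 */
static bool is_allowed_control_frame(const uint8_t *frame, size_t len,
				     uint32_t priority)
{
	uint16_t h_proto;

	if (priority == TC_PRIO_CONTROL)
		return true;

	if (len < ETH_HLEN)
		return false;

	/* The EtherType is stored big-endian in bytes 12 and 13. */
	h_proto = (uint16_t)((frame[12] << 8) | frame[13]);
	return h_proto == ETH_P_LLDP;
}
```

The second test is what lets an injected frame through when the sender never set the priority field.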

Fixes: 0c3a6101 ("ice: Allow egress control packets from PF_VSI")
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

f9f83202

ice: report supported and advertised autoneg using PHY capabilities · 5cd349c3

Paul Greenwalt authored May 05, 2021

Ethtool incorrectly reported supported and advertised auto-negotiation
settings for a backplane PHY image which did not support auto-negotiation.
This can occur when using media or PHY type for reporting ethtool
supported and advertised auto-negotiation settings.

Remove setting supported and advertised auto-negotiation settings based
on PHY type in ice_phy_type_to_ethtool(), and MAC type in
ice_get_link_ksettings().

Ethtool supported and advertised auto-negotiation settings should be
based on the PHY image using the AQ command get PHY capabilities with
media. Add setting supported and advertised auto-negotiation settings
based on get PHY capabilities with media in ice_get_link_ksettings().

Fixes: 48cb27f2 ("ice: Implement handlers for ethtool PHY/link operations")
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

5cd349c3

ice: handle the VF VSI rebuild failure · c7ee6ce1

Haiyue Wang authored Feb 26, 2021



VSI rebuild can fail for LAN queue config, in which case the VF's VSI
will be NULL. The VF reset should then be stopped, with the VF entering
the disabled state.

Fixes: 12bb018c ("ice: Refactor VF reset")
Signed-off-by: Haiyue Wang <haiyue.wang@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

c7ee6ce1

ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared · 8679f07a

Brett Creeley authored Feb 26, 2021



Some AVF drivers expect the VF_MBX_ATQLEN register to be cleared for any
type of VFR/VFLR. Fix this by clearing the VF_MBX_ATQLEN register at the
same time as VF_MBX_ARQLEN.

Fixes: 82ba0128 ("ice: clear VF ARQLEN register on reset")
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

8679f07a

ice: Fix allowing VF to request more/less queues via virtchnl · f0457690

Brett Creeley authored Feb 26, 2021



Commit 12bb018c ("ice: Refactor VF reset") caused a regression
that removes the ability for a VF to request a different number of
queues via VIRTCHNL_OP_REQUEST_QUEUES. This prevents VF drivers from
either increasing or decreasing the number of queue pairs they are
allocated. Fix this by using the variable vf->num_req_qs when
determining the vf->num_vf_qs during VF VSI creation.
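
The idea can be sketched as below. The struct is a reduced, hypothetical stand-in that keeps only the two fields named in the commit message, and `vf_queues_for_vsi()` is an illustrative helper, not a function from the driver. It assumes a value of 0 in `num_req_qs` means that no request is pending.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the driver's per-VF state. */
struct vf_state {
	uint16_t num_vf_qs;	/* queue pairs currently allocated to the VF */
	uint16_t num_req_qs;	/* queue pairs requested by the VF, 0 if none */
};

/*
 * Illustrative helper: when the VF has asked for a different number of
 * queue pairs, honour that request while sizing the VF VSI; otherwise
 * keep the existing allocation.
 */
static uint16_t vf_queues_for_vsi(struct vf_state *vf)
{
	if (vf->num_req_qs)
		vf->num_vf_qs = vf->num_req_qs;

	return vf->num_vf_qs;
}
```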

Fixes: 12bb018c ("ice: Refactor VF reset")
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

f0457690

Jun 03, 2021

virtio-net: fix for skb_over_panic inside big mode · 1a802423

Xuan Zhuo authored Jun 04, 2021



In virtio-net's large packet mode, there is a hole in the space behind
buf.

    hdr_padded_len - hdr_len

We must take this into account when calculating tailroom.
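
As a rough illustration of the arithmetic only (not the driver's actual code), the two helpers below compare a tailroom estimate that ignores the hole with one that accounts for it. The function names and the simplified buffer layout (header, hole, payload, tailroom) are assumptions; both helpers also assume the header and payload fit inside `truesize`.

```c
#include <stddef.h>

/*
 * Naive estimate: treats the payload as starting right after hdr_len
 * bytes, so the hole behind the header is counted as free tailroom.
 */
static size_t tailroom_ignoring_hole(size_t truesize, size_t hdr_len,
				     size_t payload_len)
{
	return truesize - hdr_len - payload_len;
}

/*
 * Estimate that accounts for the hole: the payload starts after
 * hdr_padded_len bytes, so the hole is not available as tailroom.
 */
static size_t tailroom_with_hole(size_t truesize, size_t hdr_padded_len,
				 size_t payload_len)
{
	return truesize - hdr_padded_len - payload_len;
}
```

The naive value is larger by exactly `hdr_padded_len - hdr_len`, which is how a write that appears to fit can run past the end of the buffer.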

[   44.544385] skb_put.cold (net/core/skbuff.c:5254 (discriminator 1) net/core/skbuff.c:5252 (discriminator 1))
[   44.544864] page_to_skb (drivers/net/virtio_net.c:485)
[   44.545361] receive_buf (drivers/net/virtio_net.c:849 drivers/net/virtio_net.c:1131)
[   44.545870] ? netif_receive_skb_list_internal (net/core/dev.c:5714)
[   44.546628] ? dev_gro_receive (net/core/dev.c:6103)
[   44.547135] ? napi_complete_done (./include/linux/list.h:35 net/core/dev.c:5867 net/core/dev.c:5862 net/core/dev.c:6565)
[   44.547672] virtnet_poll (drivers/net/virtio_net.c:1427 drivers/net/virtio_net.c:1525)
[   44.548251] __napi_poll (net/core/dev.c:6985)
[   44.548744] net_rx_action (net/core/dev.c:7054 net/core/dev.c:7139)
[   44.549264] __do_softirq (./arch/x86/include/asm/jump_label.h:19 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:560)
[   44.549762] irq_exit_rcu (kernel/softirq.c:433 kernel/softirq.c:637 kernel/softirq.c:649)
[   44.551384] common_interrupt (arch/x86/kernel/irq.c:240 (discriminator 13))
[   44.551991] ? asm_common_interrupt (./arch/x86/include/asm/idtentry.h:638)
[   44.552654] asm_common_interrupt (./arch/x86/include/asm/idtentry.h:638)

Fixes: fb32856b ("virtio-net: page_to_skb() use build_skb when there's sufficient tailroom")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reported-by: Corentin Noël <corentin.noel@collabora.com>
Tested-by: Corentin Noël <corentin.noel@collabora.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1a802423

cxgb4: fix regression with HASH tc prio value update · a27fb314

Rahul Lakkireddy authored Jun 02, 2021

commit db43b30c ("cxgb4: add ethtool n-tuple filter deletion")
has moved searching for next highest priority HASH filter rule to
cxgb4_flow_rule_destroy(), which searches the rhashtable before the
rule is removed from it and hence always finds at least 1 entry.
Fix by removing the rule from rhashtable first before calling
cxgb4_flow_rule_destroy() and hence avoid fetching stale info.
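
The ordering problem can be shown with a toy table. Everything here is hypothetical: a fixed-size array replaces the rhashtable, `table_max_prio()` stands in for the search for the remaining rule's prio value, and treating the largest value as the one of interest is an assumption made only to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_RULES 8

/* Hypothetical stand-in for the rhashtable of HASH filter rules. */
struct rule_table {
	uint32_t prio[MAX_RULES];
	int used[MAX_RULES];
};

/* Largest prio among rules still in the table, or 0 when it is empty. */
static uint32_t table_max_prio(const struct rule_table *t)
{
	uint32_t max = 0;
	size_t i;

	for (i = 0; i < MAX_RULES; i++)
		if (t->used[i] && t->prio[i] > max)
			max = t->prio[i];

	return max;
}

/* Ordering described in the fix: remove the rule first, then search. */
static uint32_t delete_rule_and_refresh(struct rule_table *t, size_t idx)
{
	t->used[idx] = 0;
	return table_max_prio(t);
}
```

Calling `table_max_prio()` before clearing the entry would still see the rule being deleted, which is the stale result the commit describes.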

Fixes: db43b30c ("cxgb4: add ethtool n-tuple filter deletion")
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a27fb314

Bluetooth: btusb: Fix failing to init controllers with operation firmware · 1f14a620

Luiz Augusto von Dentz authored Apr 30, 2021



Some firmware, when operational, may report broken versions, leading to
errors like the following:

[    6.176482] Bluetooth: hci0: Firmware revision 0.0 build 121 week 7 2021
[    6.177906] bluetooth hci0: Direct firmware load for intel/ibt-20-0-0.sfi failed with error -2
[    6.177910] Bluetooth: hci0: Failed to load Intel firmware file intel/ibt-20-0-0.sfi (-2)

Since we load the firmware file just to check whether its version has
changed compared to the one already loaded, we can skip the load, since
the firmware is already operational.

Fixes: ac056546 ("Bluetooth: btintel: Check firmware version before download")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

1f14a620

i2c: tegra-bpmp: Demote kernel-doc abuses · de2646f3

Lee Jones authored May 20, 2021

Fixes the following W=1 kernel build warning(s):

drivers/i2c/busses/i2c-tegra-bpmp.c:86: warning: Function parameter or member 'i2c' not described in 'tegra_bpmp_serialize_i2c_msg'
drivers/i2c/busses/i2c-tegra-bpmp.c:86: warning: Function parameter or member 'request' not described in 'tegra_bpmp_serialize_i2c_msg'
drivers/i2c/busses/i2c-tegra-bpmp.c:86: warning: Function parameter or member 'msgs' not described in 'tegra_bpmp_serialize_i2c_msg'
drivers/i2c/busses/i2c-tegra-bpmp.c:86: warning: Function parameter or member 'num' not described in 'tegra_bpmp_serialize_i2c_msg'
drivers/i2c/busses/i2c-tegra-bpmp.c:86: warning: expecting prototype for The serialized I2C format is simply the following(). Prototype was for tegra_bpmp_serialize_i2c_msg() instead
drivers/i2c/busses/i2c-tegra-bpmp.c:130: warning: Function parameter or member 'i2c' not described in 'tegra_bpmp_i2c_deserialize'
drivers/i2c/busses/i2c-tegra-bpmp.c:130: warning: Function parameter or member 'response' not described in 'tegra_bpmp_i2c_deserialize'
drivers/i2c/busses/i2c-tegra-bpmp.c:130: warning: Function parameter or member 'msgs' not described in 'tegra_bpmp_i2c_deserialize'
drivers/i2c/busses/i2c-tegra-bpmp.c:130: warning: Function parameter or member 'num' not described in 'tegra_bpmp_i2c_deserialize'
drivers/i2c/busses/i2c-tegra-bpmp.c:130: warning: expecting prototype for The data in the BPMP(). Prototype was for tegra_bpmp_i2c_deserialize() instead

Signed-off-by: Lee Jones <lee.jones@linaro.org>
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>

de2646f3