Commit 9b9e2113 authored by Linus Torvalds's avatar Linus Torvalds
Browse files
Pull arm64 updates from Catalin Marinas:

 - KCSAN enabled for arm64.

 - Additional kselftests to exercise the syscall ABI w.r.t. SVE/FPSIMD.

 - Some more SVE clean-ups and refactoring in preparation for SME
   support (scalable matrix extensions).

 - BTI clean-ups (SYM_FUNC macros etc.)

 - arm64 atomics clean-up and codegen improvements.

 - HWCAPs for FEAT_AFP (alternate floating point behaviour) and
   FEAT_RPRESS (increased precision of reciprocal estimate and
   reciprocal square root estimate).

 - Use SHA3 instructions to speed-up XOR.

 - arm64 unwind code refactoring/unification.

 - Avoid DC (data cache maintenance) instructions when DCZID_EL0.DZP ==
   1 (potentially set by a hypervisor; user-space already does this).

 - Perf updates for arm64: support for CI-700, HiSilicon PCIe PMU,
   Marvell CN10K LLC-TAD PMU, miscellaneous clean-ups.

 - Other fixes and clean-ups; highlights: fix the handling of erratum
   1418040, correct the calculation of the nomap region boundaries,
   introduce io_stop_wc() mapped to the new DGH instruction (data
   gathering hint).

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (81 commits)
  arm64: Use correct method to calculate nomap region boundaries
  arm64: Drop outdated links in comments
  arm64: perf: Don't register user access sysctl handler multiple times
  drivers: perf: marvell_cn10k: fix an IS_ERR() vs NULL check
  perf/smmuv3: Fix unused variable warning when CONFIG_OF=n
  arm64: errata: Fix exec handling in erratum 1418040 workaround
  arm64: Unhash early pointer print plus improve comment
  asm-generic: introduce io_stop_wc() and add implementation for ARM64
  arm64: Ensure that the 'bti' macro is defined where linkage.h is included
  arm64: remove __dma_*_area() aliases
  docs/arm64: delete a space from tagged-address-abi
  arm64: Enable KCSAN
  kselftest/arm64: Add pidbench for floating point syscall cases
  arm64/fp: Add comments documenting the usage of state restore functions
  kselftest/arm64: Add a test program to exercise the syscall ABI
  kselftest/arm64: Allow signal tests to trigger from a function
  kselftest/arm64: Parameterise ptrace vector length information
  arm64/sve: Minor clarification of ABI documentation
  arm64/sve: Generalise vector length configuration prctl() for SME
  arm64/sve: Make sysctl interface for SVE reusable by SME
  ...
parents a7ac3140 945409a6
Loading
Loading
Loading
Loading
+106 −0
Original line number Diff line number Diff line
================================================
HiSilicon PCIe Performance Monitoring Unit (PMU)
================================================

On Hip09, HiSilicon PCIe Performance Monitoring Unit (PMU) could monitor
bandwidth, latency, bus utilization and buffer occupancy data of PCIe.

Each PCIe Core has a PMU to monitor multi Root Ports of this PCIe Core and
all Endpoints downstream these Root Ports.


HiSilicon PCIe PMU driver
=========================

The PCIe PMU driver registers a perf PMU with the name of its sicl-id and PCIe
Core id.::

  /sys/bus/event_source/hisi_pcie<sicl>_<core>

PMU driver provides description of available events and filter options in sysfs,
see /sys/bus/event_source/devices/hisi_pcie<sicl>_<core>.

The "format" directory describes all formats of the config (events) and config1
(filter options) fields of the perf_event_attr structure. The "events" directory
describes all documented events shown in perf list.

The "identifier" sysfs file allows users to identify the version of the
PMU hardware device.

The "bus" sysfs file allows users to get the bus number of Root Ports
monitored by PMU.

Example usage of perf::

  $# perf list
  hisi_pcie0_0/rx_mwr_latency/ [kernel PMU event]
  hisi_pcie0_0/rx_mwr_cnt/ [kernel PMU event]
  ------------------------------------------

  $# perf stat -e hisi_pcie0_0/rx_mwr_latency/
  $# perf stat -e hisi_pcie0_0/rx_mwr_cnt/
  $# perf stat -g -e hisi_pcie0_0/rx_mwr_latency/ -e hisi_pcie0_0/rx_mwr_cnt/

The current driver does not support sampling. So "perf record" is unsupported.
Also attach to a task is unsupported for PCIe PMU.

Filter options
--------------

1. Target filter
PMU could only monitor the performance of traffic downstream target Root Ports
or downstream target Endpoint. PCIe PMU driver support "port" and "bdf"
interfaces for users, and these two interfaces aren't supported at the same
time.

-port
"port" filter can be used in all PCIe PMU events, target Root Port can be
selected by configuring the 16-bits-bitmap "port". Multi ports can be selected
for AP-layer-events, and only one port can be selected for TL/DL-layer-events.

For example, if target Root Port is 0000:00:00.0 (x8 lanes), bit0 of bitmap
should be set, port=0x1; if target Root Port is 0000:00:04.0 (x4 lanes),
bit8 is set, port=0x100; if these two Root Ports are both monitored, port=0x101.

Example usage of perf::

  $# perf stat -e hisi_pcie0_0/rx_mwr_latency,port=0x1/ sleep 5

-bdf

"bdf" filter can only be used in bandwidth events, target Endpoint is selected
by configuring BDF to "bdf". Counter only counts the bandwidth of message
requested by target Endpoint.

For example, "bdf=0x3900" means BDF of target Endpoint is 0000:39:00.0.

Example usage of perf::

  $# perf stat -e hisi_pcie0_0/rx_mrd_flux,bdf=0x3900/ sleep 5

2. Trigger filter
Event statistics start when the first time TLP length is greater/smaller
than trigger condition. You can set the trigger condition by writing "trig_len",
and set the trigger mode by writing "trig_mode". This filter can only be used
in bandwidth events.

For example, "trig_len=4" means trigger condition is 2^4 DW, "trig_mode=0"
means statistics start when TLP length > trigger condition, "trig_mode=1"
means start when TLP length < condition.

Example usage of perf::

  $# perf stat -e hisi_pcie0_0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5

3. Threshold filter
Counter counts when TLP length within the specified range. You can set the
threshold by writing "thr_len", and set the threshold mode by writing
"thr_mode". This filter can only be used in bandwidth events.

For example, "thr_len=4" means threshold is 2^4 DW, "thr_mode=0" means
counter counts when TLP length >= threshold, and "thr_mode=1" means counts
when TLP length < threshold.

Example usage of perf::

  $# perf stat -e hisi_pcie0_0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5
+11 −0
Original line number Diff line number Diff line
@@ -905,6 +905,17 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
The default value is 8.


perf_user_access (arm64 only)
=================================

Controls user space access for reading perf event counters. When set to 1,
user space can read performance monitor counter registers directly.

The default value is 0 (access disabled).

See Documentation/arm64/perf.rst for more information.


pid_max
=======

+17 −0
Original line number Diff line number Diff line
@@ -275,6 +275,23 @@ infrastructure:
     | SVEVer                       | [3-0]   |    y    |
     +------------------------------+---------+---------+

  8) ID_AA64MMFR1_EL1 - Memory model feature register 1

     +------------------------------+---------+---------+
     | Name                         |  bits   | visible |
     +------------------------------+---------+---------+
     | AFP                          | [47-44] |    y    |
     +------------------------------+---------+---------+

  9) ID_AA64ISAR2_EL1 - Instruction set attribute register 2

     +------------------------------+---------+---------+
     | Name                         |  bits   | visible |
     +------------------------------+---------+---------+
     | RPRES                        | [7-4]   |    y    |
     +------------------------------+---------+---------+


Appendix I: Example
-------------------

+8 −0
Original line number Diff line number Diff line
@@ -251,6 +251,14 @@ HWCAP2_ECV

    Functionality implied by ID_AA64MMFR0_EL1.ECV == 0b0001.

HWCAP2_AFP

    Functionality implied by ID_AA64MFR1_EL1.AFP == 0b0001.

HWCAP2_RPRES

    Functionality implied by ID_AA64ISAR2_EL1.RPRES == 0b0001.

4. Unused AT_HWCAP bits
-----------------------

+77 −1
Original line number Diff line number Diff line
@@ -2,7 +2,10 @@

.. _perf_index:

=====================
====
Perf
====

Perf Event Attributes
=====================

@@ -88,3 +91,76 @@ exclude_host. However when using !exclude_hv there is a small blackout
window at the guest entry/exit where host events are not captured.

On VHE systems there are no blackout windows.

Perf Userspace PMU Hardware Counter Access
==========================================

Overview
--------
The perf userspace tool relies on the PMU to monitor events. It offers an
abstraction layer over the hardware counters since the underlying
implementation is cpu-dependent.
Arm64 allows userspace tools to have access to the registers storing the
hardware counters' values directly.

This targets specifically self-monitoring tasks in order to reduce the overhead
by directly accessing the registers without having to go through the kernel.

How-to
------
The focus is set on the armv8 PMUv3 which makes sure that the access to the pmu
registers is enabled and that the userspace has access to the relevant
information in order to use them.

In order to have access to the hardware counters, the global sysctl
kernel/perf_user_access must first be enabled:

.. code-block:: sh

  echo 1 > /proc/sys/kernel/perf_user_access

It is necessary to open the event using the perf tool interface with config1:1
attr bit set: the sys_perf_event_open syscall returns a fd which can
subsequently be used with the mmap syscall in order to retrieve a page of memory
containing information about the event. The PMU driver uses this page to expose
to the user the hardware counter's index and other necessary data. Using this
index enables the user to access the PMU registers using the `mrs` instruction.
Access to the PMU registers is only valid while the sequence lock is unchanged.
In particular, the PMSELR_EL0 register is zeroed each time the sequence lock is
changed.

The userspace access is supported in libperf using the perf_evsel__mmap()
and perf_evsel__read() functions. See `tools/lib/perf/tests/test-evsel.c`_ for
an example.

About heterogeneous systems
---------------------------
On heterogeneous systems such as big.LITTLE, userspace PMU counter access can
only be enabled when the tasks are pinned to a homogeneous subset of cores and
the corresponding PMU instance is opened by specifying the 'type' attribute.
The use of generic event types is not supported in this case.

Have a look at `tools/perf/arch/arm64/tests/user-events.c`_ for an example. It
can be run using the perf tool to check that the access to the registers works
correctly from userspace:

.. code-block:: sh

  perf test -v user

About chained events and counter sizes
--------------------------------------
The user can request either a 32-bit (config1:0 == 0) or 64-bit (config1:0 == 1)
counter along with userspace access. The sys_perf_event_open syscall will fail
if a 64-bit counter is requested and the hardware doesn't support 64-bit
counters. Chained events are not supported in conjunction with userspace counter
access. If a 32-bit counter is requested on hardware with 64-bit counters, then
userspace must treat the upper 32-bits read from the counter as UNKNOWN. The
'pmc_width' field in the user page will indicate the valid width of the counter
and should be used to mask the upper bits as needed.

.. Links
.. _tools/perf/arch/arm64/tests/user-events.c:
   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/arch/arm64/tests/user-events.c
.. _tools/lib/perf/tests/test-evsel.c:
   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/tests/test-evsel.c
Loading