Skip to content
  1. Oct 10, 2019
  2. Oct 01, 2019
  3. Sep 30, 2019
  4. Sep 26, 2019
  5. Sep 24, 2019
    • Kefeng Wang's avatar
      thp: update split_huge_page_pmd() comment · 9ef258ba
      Kefeng Wang authored
      According to 78ddc534 ("thp: rename split_huge_page_pmd() to
      split_huge_pmd()"), update related comment.
      
      Link: http://lkml.kernel.org/r/20190731033406.185285-1-wangkefeng.wang@huawei.com
      
      
      Signed-off-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9ef258ba
    • Mike Rapoport's avatar
      mm: consolidate pgtable_cache_init() and pgd_cache_init() · 782de70c
      Mike Rapoport authored
      Both pgtable_cache_init() and pgd_cache_init() are used to initialize kmem
      cache for page table allocations on several architectures that do not use
      PAGE_SIZE tables for one or more levels of the page table hierarchy.
      
      Most architectures do not implement these functions and use __weak default
      NOP implementation of pgd_cache_init().  Since there is no such default
      for pgtable_cache_init(), its empty stub is duplicated among most
      architectures.
      
      Rename the definitions of pgd_cache_init() to pgtable_cache_init() and
      drop empty stubs of pgtable_cache_init().
      
      Link: http://lkml.kernel.org/r/1566457046-22637-1-git-send-email-rppt@linux.ibm.com
      
      
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: Will Deacon <will@kernel.org>		[arm64]
      Acked-by: Thomas Gleixner <tglx@linutronix.de>	[x86]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      782de70c
    • Nicholas Piggin's avatar
      mm: remove quicklist page table caches · 13224794
      Nicholas Piggin authored
      Patch series "mm: remove quicklist page table caches".
      
      A while ago Nicholas proposed to remove quicklist page table caches [1].
      
      I've rebased his patch on the curren upstream and switched ia64 and sh to
      use generic versions of PTE allocation.
      
      [1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com
      
      This patch (of 3):
      
      Remove page table allocator "quicklists".  These have been around for a
      long time, but have not got much traction in the last decade and are only
      used on ia64 and sh architectures.
      
      The numbers in the initial commit look interesting but probably don't
      apply anymore.  If anybody wants to resurrect this it's in the git
      history, but it's unhelpful to have this code and divergent allocator
      behaviour for minor archs.
      
      Also it might be better to instead make more general improvements to page
      allocator if this is still so slow.
      
      Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
      
      
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      13224794
    • Matthew Wilcox (Oracle)'s avatar
      mm: introduce compound_nr() · d8c6546b
      Matthew Wilcox (Oracle) authored
      Replace 1 << compound_order(page) with compound_nr(page).  Minor
      improvements in readability.
      
      Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
      
      
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d8c6546b
    • Matthew Wilcox (Oracle)'s avatar
      mm: introduce page_shift() · 94ad9338
      Matthew Wilcox (Oracle) authored
      Replace PAGE_SHIFT + compound_order(page) with the new page_shift()
      function.  Minor improvements in readability.
      
      [akpm@linux-foundation.org: fix build in tce_page_is_contained()]
        Link: http://lkml.kernel.org/r/201907241853.yNQTrJWd%25lkp@intel.com
      Link: http://lkml.kernel.org/r/20190721104612.19120-3-willy@infradead.org
      
      
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      94ad9338
    • Aneesh Kumar K.V's avatar
      powerpc/nvdimm: use H_SCM_QUERY hcall on H_OVERLAP error · faa6d211
      Aneesh Kumar K.V authored
      
      
      Right now we force an unbind of SCM memory at drcindex on H_OVERLAP error.
      This really slows down operations like kexec where we get the H_OVERLAP
      error because we don't go through a full hypervisor re init.
      
      H_OVERLAP error for a H_SCM_BIND_MEM hcall indicates that SCM memory at
      drc index is already bound. Since we don't specify a logical memory
      address for bind hcall, we can use the H_SCM_QUERY hcall to query
      the already bound logical address.
      
      Boot time difference with and without patch is:
      
      [    5.583617] IOMMU table initialized, virtual merging enabled
      [    5.603041] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Retrying bind after unbinding
      [  301.514221] papr_scm ibm,persistent-memory:ibm,pmemory@44108001: Retrying bind after unbinding
      [  340.057238] hv-24x7: read 1530 catalog entries, created 537 event attrs (0 failures), 275 descs
      
      after fix
      
      [    5.101572] IOMMU table initialized, virtual merging enabled
      [    5.116984] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Querying SCM details
      [    5.117223] papr_scm ibm,persistent-memory:ibm,pmemory@44108001: Querying SCM details
      [    5.120530] hv-24x7: read 1530 catalog entries, created 537 event attrs (0 failures), 275 descs
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190903123452.28620-2-aneesh.kumar@linux.ibm.com
      faa6d211
    • Aneesh Kumar K.V's avatar
      powerpc/nvdimm: Use HCALL error as the return value · 4111cdef
      Aneesh Kumar K.V authored
      
      
      This simplifies the error handling and also enable us to switch to
      H_SCM_QUERY hcall in a later patch on H_OVERLAP error.
      
      We also do some kernel print formatting fixup in this patch.
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190903123452.28620-1-aneesh.kumar@linux.ibm.com
      4111cdef
    • Aneesh Kumar K.V's avatar
      libnvdimm/altmap: Track namespace boundaries in altmap · cf387d96
      Aneesh Kumar K.V authored
      
      
      With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
      area. Some architectures map the memmap area with large page size. On
      architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
      This maps a namespace size of 16G.
      
      When populating memmap region with 16MB page from the device area,
      make sure the allocated space is not used to map resources outside this
      namespace. Such usage of device area will prevent a namespace destroy.
      
      Add resource end pnf in altmap and use that to check if the memmap area
      allocation can map pfn outside the namespace. On ppc64 in such case we fallback
      to allocation from memory.
      
      This fix kernel crash reported below:
      
      [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
      [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
      [  133.464760] Faulting instruction address: 0xc00000000007580c
      [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
      [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
      .....
      [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
      [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
      [  133.464910] Call Trace:
      [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
      [  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
      [  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
      [  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
      [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
      [  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
      [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
      [  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
      [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
      [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
      [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
      [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
      [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
      [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
      
      djbw: Aneesh notes that this crash can likely be triggered in any kernel that
      supports 'papr_scm', so flagging that commit for -stable consideration.
      
      Fixes: b5beae5e ("powerpc/pseries: Add driver for PAPR SCM regions")
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarSachin Sant <sachinp@linux.vnet.ibm.com>
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: default avatarPankaj Gupta <pagupta@redhat.com>
      Tested-by: default avatarSantosh Sivaraj <santosh@fossix.org>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Link: https://lore.kernel.org/r/20190910062826.10041-1-aneesh.kumar@linux.ibm.com
      
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      cf387d96
    • Aneesh Kumar K.V's avatar
      powerpc/book3s64: Export has_transparent_hugepage() related functions. · a6f197f8
      Aneesh Kumar K.V authored
      
      
      In later patch, we want to use hash_transparent_hugepage() in a kernel module.
      Export two related functions.
      
      Acked-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Link: https://lore.kernel.org/r/20190924042440.27946-1-aneesh.kumar@linux.ibm.com
      
      
      Signed-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      a6f197f8
    • Aneesh Kumar K.V's avatar
      powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9 · 047e6575
      Aneesh Kumar K.V authored
      
      
      On POWER9, under some circumstances, a broadcast TLB invalidation will
      fail to invalidate the ERAT cache on some threads when there are
      parallel mtpidr/mtlpidr happening on other threads of the same core.
      This can cause stores to continue to go to a page after it's unmapped.
      
      The workaround is to force an ERAT flush using PID=0 or LPID=0 tlbie
      flush. This additional TLB flush will cause the ERAT cache
      invalidation. Since we are using PID=0 or LPID=0, we don't get
      filtered out by the TLB snoop filtering logic.
      
      We need to still follow this up with another tlbie to take care of
      store vs tlbie ordering issue explained in commit:
      a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on
      POWER9"). The presence of ERAT cache implies we can still get new
      stores and they may miss store queue marking flush.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com
      047e6575
    • Aneesh Kumar K.V's avatar
      powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag · 09ce98ca
      Aneesh Kumar K.V authored
      
      
      Rename the #define to indicate this is related to store vs tlbie
      ordering issue. In the next patch, we will be adding another feature
      flag that is used to handles ERAT flush vs tlbie ordering issue.
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
      09ce98ca
    • Aneesh Kumar K.V's avatar
      powerpc/book3s64/mm: Don't do tlbie fixup for some hardware revisions · 677733e2
      Aneesh Kumar K.V authored
      
      
      The store ordering vs tlbie issue mentioned in commit
      a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on
      POWER9") is fixed for Nimbus 2.3 and Cumulus 1.3 revisions. We don't
      need to apply the fixup if we are running on them
      
      We can only do this on PowerNV. On pseries guest with KVM we still
      don't support redoing the feature fixup after migration. So we should
      be enabling all the workarounds needed, because whe can possibly
      migrate between DD 2.3 and DD 2.2
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-1-aneesh.kumar@linux.ibm.com
      677733e2
    • Laurent Dufour's avatar
      powerpc/pseries: Call H_BLOCK_REMOVE when supported · 59545ebe
      Laurent Dufour authored
      
      
      Depending on the hardware and the hypervisor, the hcall H_BLOCK_REMOVE
      may not be able to process all the page sizes for a segment base page
      size, as reported by the TLB Invalidate Characteristics.
      
      For each pair of base segment page size and actual page size, this
      characteristic tells us the size of the block the hcall supports.
      
      In the case, the hcall is not supporting a pair of base segment page
      size, actual page size, it is returning H_PARAM which leads to a panic
      like this:
      
        kernel BUG at /home/srikar/work/linux.git/arch/powerpc/platforms/pseries/lpar.c:466!
        Oops: Exception in kernel mode, sig: 5 [#1]
        BE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 28 PID: 583 Comm: modprobe Not tainted 5.2.0-master #5
        NIP: c0000000000be8dc LR: c0000000000be880 CTR: 0000000000000000
        REGS: c0000007e77fb130 TRAP: 0700  Not tainted (5.2.0-master)
        MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 42224824 XER: 20000000
        CFAR: c0000000000be8fc IRQMASK: 0
        GPR00: 0000000022224828 c0000007e77fb3c0 c000000001434d00 0000000000000005
        GPR04: 9000000004fa8c00 0000000000000000 0000000000000003 0000000000000001
        GPR08: c0000007e77fb450 0000000000000000 0000000000000001 ffffffffffffffff
        GPR12: c0000007e77fb450 c00000000edfcb80 0000cd7d3ea30000 c0000000016022b0
        GPR16: 00000000000000b0 0000cd7d3ea30000 0000000000000001 c080001f04f00105
        GPR20: 0000000000000003 0000000000000004 c000000fbeb05f58 c000000001602200
        GPR24: 0000000000000000 0000000000000004 8800000000000000 c000000000c5d148
        GPR28: c000000000000000 8000000000000000 a000000000000000 c0000007e77fb580
        NIP [c0000000000be8dc] .call_block_remove+0x12c/0x220
        LR [c0000000000be880] .call_block_remove+0xd0/0x220
        Call Trace:
          0xc000000fb8c00240 (unreliable)
          .pSeries_lpar_flush_hash_range+0x578/0x670
          .flush_hash_range+0x44/0x100
          .__flush_tlb_pending+0x3c/0xc0
          .zap_pte_range+0x7ec/0x830
          .unmap_page_range+0x3f4/0x540
          .unmap_vmas+0x94/0x120
          .exit_mmap+0xac/0x1f0
          .mmput+0x9c/0x1f0
          .do_exit+0x388/0xd60
          .do_group_exit+0x54/0x100
          .__se_sys_exit_group+0x14/0x20
          system_call+0x5c/0x70
        Instruction dump:
        39400001 38a00000 4800003c 60000000 60420000 7fa9e800 38e00000 419e0014
        7d29d278 7d290074 7929d182 69270001 <0b070000> 7d495378 394a0001 7fa93040
      
      The call to H_BLOCK_REMOVE should only be made for the supported pair
      of base segment page size, actual page size and using the correct
      maximum block size.
      
      Due to the required complexity in do_block_remove() and
      call_block_remove(), and the fact that currently a block size of 8 is
      returned by the hypervisor, we are only supporting 8 size block to the
      H_BLOCK_REMOVE hcall.
      
      In order to identify this limitation easily in the code, a local
      define HBLKR_SUPPORTED_SIZE defining the currently supported block
      size, and a dedicated checking helper is_supported_hlbkr() are
      introduced.
      
      For regular pages and hugetlb, the assumption is made that the page
      size is equal to the base page size. For THP the page size is assumed
      to be 16M.
      
      Fixes: ba2dd8a2 ("powerpc/pseries/mm: call H_BLOCK_REMOVE")
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190920130523.20441-3-ldufour@linux.ibm.com
      59545ebe
    • Laurent Dufour's avatar
      powerpc/pseries: Read TLB Block Invalidate Characteristics · 1211ee61
      Laurent Dufour authored
      
      
      The PAPR document specifies the TLB Block Invalidate Characteristics
      which tells for each pair of segment base page size, actual page size,
      the size of the block the hcall H_BLOCK_REMOVE supports.
      
      These characteristics are loaded at boot time in a new table
      hblkr_size. The table is separate from the mmu_psize_def because this
      is specific to the pseries platform.
      
      A new init function, pseries_lpar_read_hblkrm_characteristics() is
      added to read the characteristics. It is called from
      pSeries_setup_arch().
      
      Fixes: ba2dd8a2 ("powerpc/pseries/mm: call H_BLOCK_REMOVE")
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190920130523.20441-2-ldufour@linux.ibm.com
      1211ee61
    • Michael Roth's avatar
      KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag · 3a83f677
      Michael Roth authored
      
      
      On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
      of memory running the following guest configs:
      
        guest A:
          - 224GB of memory
          - 56 VCPUs (sockets=1,cores=28,threads=2), where:
            VCPUs 0-1 are pinned to CPUs 0-3,
            VCPUs 2-3 are pinned to CPUs 4-7,
            ...
            VCPUs 54-55 are pinned to CPUs 108-111
      
        guest B:
          - 4GB of memory
          - 4 VCPUs (sockets=1,cores=4,threads=1)
      
      with the following workloads (with KSM and THP enabled in all):
      
        guest A:
          stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
      
        guest B:
          stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
      
        host:
          stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
      
      the below soft-lockup traces were observed after an hour or so and
      persisted until the host was reset (this was found to be reliably
      reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
      and 5.3-rc5):
      
        [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1253.183319] rcu:     124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
        [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
        [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
        [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
        [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
        [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
        [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1316.198032] rcu:     124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
        [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1379.212629] rcu:     124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
        [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1442.227115] rcu:     124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
        [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
        [ 1455.111822]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
        [ 1455.111905]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
        [ 1455.111986]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
        [ 1455.112068]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
        [ 1455.112159]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
        [ 1455.112231]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
        [ 1455.112303]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
        [ 1455.112392]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
      
      CPUs 45, 24, and 124 are stuck on spin locks, likely held by
      CPUs 105 and 31.
      
      CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
      target CPU 42. For instance:
      
        # CPU 105 registers (via xmon)
        R00 = c00000000020b20c   R16 = 00007d1bcd800000
        R01 = c00000363eaa7970   R17 = 0000000000000001
        R02 = c0000000019b3a00   R18 = 000000000000006b
        R03 = 000000000000002a   R19 = 00007d537d7aecf0
        R04 = 000000000000002a   R20 = 60000000000000e0
        R05 = 000000000000002a   R21 = 0801000000000080
        R06 = c0002073fb0caa08   R22 = 0000000000000d60
        R07 = c0000000019ddd78   R23 = 0000000000000001
        R08 = 000000000000002a   R24 = c00000000147a700
        R09 = 0000000000000001   R25 = c0002073fb0ca908
        R10 = c000008ffeb4e660   R26 = 0000000000000000
        R11 = c0002073fb0ca900   R27 = c0000000019e2464
        R12 = c000000000050790   R28 = c0000000000812b0
        R13 = c000207fff623e00   R29 = c0002073fb0ca808
        R14 = 00007d1bbee00000   R30 = c0002073fb0ca800
        R15 = 00007d1bcd600000   R31 = 0000000000000800
        pc  = c00000000020b260 smp_call_function_many+0x3d0/0x460
        cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
        lr  = c00000000020b20c smp_call_function_many+0x37c/0x460
        msr = 900000010288b033   cr  = 44024824
        ctr = c000000000050790   xer = 0000000000000000   trap =  100
      
      CPU 42 is running normally, doing VCPU work:
      
        # CPU 42 stack trace (via xmon)
        [link register   ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
        [c000008ed3343820] c000008ed3343850 (unreliable)
        [c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
        [c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
        [c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
        [c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
        [c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
        [c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
        [c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
        [c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
        [c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
        --- Exception: c00 (System Call) at 00007d545cfd7694
        SP (7d53ff7edf50) is in userspace
      
      It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
      was set for CPU 42 by at least 1 of the CPUs waiting in
      smp_call_function_many(), but somehow the corresponding
      call_single_queue entries were never processed by CPU 42, causing the
      callers to spin in csd_lock_wait() indefinitely.
      
      Nick Piggin suggested something similar to the following sequence as
      a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
      IPI messages seems to be most common, but any mix of CALL_FUNCTION and
      !CALL_FUNCTION messages could trigger it):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
          105: smp_muxed_ipi_set_message():
          105:   smb_mb()
               // STORE DEFERRED DUE TO RE-ORDERING
        --105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |  42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        |  42: // returns to executing guest
        |      // RE-ORDERED STORE COMPLETES
        ->105:   message[CALL_FUNCTION] = 1
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      This can be prevented with an smp_mb() at the beginning of
      kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
      state indicated by the host_ipi flag) are ordered vs. the store to
      to host_ipi.
      
      However, doing so might still allow for the following scenario (not
      yet observed):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
               // STORE DEFERRED DUE TO RE-ORDERING
        -- 42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        | 105: smp_muxed_ipi_set_message():
        | 105:   smb_mb()
        | 105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |      // RE-ORDERED STORE COMPLETES
        -> 42:   kvmppc_set_host_ipi(42, 0)
           42: // returns to executing guest
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      Fixing this scenario would require an smp_mb() *after* clearing
      host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
      subsequent processing of IPI messages.
      
      To handle both cases, this patch splits kvmppc_set_host_ipi() into
      separate set/clear functions, where we execute smp_mb() prior to
      setting host_ipi flag, and after clearing host_ipi flag. These
      functions pair with each other to synchronize the sender and receiver
      sides.
      
      With that change in place the above workload ran for 20 hours without
      triggering any lock-ups.
      
      Fixes: 755563bc ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
      Signed-off-by: default avatarMichael Roth <mdroth@linux.vnet.ibm.com>
      Acked-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
      3a83f677
  6. Sep 20, 2019
  7. Sep 19, 2019
  8. Sep 18, 2019
  9. Sep 17, 2019
  10. Sep 13, 2019
Loading