Skip to content
Snippets Groups Projects
  1. Jun 11, 2015
    • Alexey Kardashevskiy's avatar
      powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table · 00547193
      Alexey Kardashevskiy authored
      
      This adds a way for the IOMMU user to know how much a new table will
      use so it can be accounted in the locked_vm limit before allocation
      happens.
      
      This stores the allocated table size in pnv_pci_ioda2_get_table_size()
      so the locked_vm counter can be updated correctly when a table is
      being disposed.
      
      This defines an iommu_table_group_ops callback to let VFIO know
      how much memory will be locked if a table is created.
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      00547193
    • Alexey Kardashevskiy's avatar
      vfio: powerpc/spapr: powerpc/powernv/ioda: Define and implement DMA windows API · 4793d65d
      Alexey Kardashevskiy authored
      
      This extends iommu_table_group_ops by a set of callbacks to support
      dynamic DMA windows management.
      
      create_table() creates a TCE table with specific parameters.
      it receives iommu_table_group to know nodeid in order to allocate
      TCE table memory closer to the PHB. The exact format of allocated
      multi-level table might be also specific to the PHB model (not
      the case now though).
      This callback calculated the DMA window offset on a PCI bus from @num
      and stores it in a just created table.
      
      set_window() sets the window at specified TVT index + @num on PHB.
      
      unset_window() unsets the window from specified TVT.
      
      This adds a free() callback to iommu_table_ops to free the memory
      (potentially a tree of tables) allocated for the TCE table.
      
      create_table() and free() are supposed to be called once per
      VFIO container and set_window()/unset_window() are supposed to be
      called for every group in a container.
      
      This adds IOMMU capabilities to iommu_table_group such as default
      32bit window parameters and others. This makes use of new values in
      vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
      advertise pagemasks to the userspace.
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      4793d65d
    • Alexey Kardashevskiy's avatar
      powerpc/powernv: Implement multilevel TCE tables · bbb845c4
      Alexey Kardashevskiy authored
      
      TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
      on huge guests (hundreds of GB of RAM) so the kernel might be unable to
      allocate contiguous chunk of physical memory to store the TCE table.
      
      To address this, POWER8 CPU (actually, IODA2) supports multi-level
      TCE tables, up to 5 levels which splits the table into a tree of
      smaller subtables.
      
      This adds multi-level TCE tables support to
      pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages()
      helpers.
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      bbb845c4
    • Alexey Kardashevskiy's avatar
      powerpc/iommu/powernv: Release replaced TCE · 05c6cfb9
      Alexey Kardashevskiy authored
      
      At the moment writing new TCE value to the IOMMU table fails with EBUSY
      if there is a valid entry already. However PAPR specification allows
      the guest to write new TCE value without clearing it first.
      
      Another problem this patch is addressing is the use of pool locks for
      external IOMMU users such as VFIO. The pool locks are to protect
      DMA page allocator rather than entries and since the host kernel does
      not control what pages are in use, there is no point in pool locks and
      exchange()+put_page(oldtce) is sufficient to avoid possible races.
      
      This adds an exchange() callback to iommu_table_ops which does the same
      thing as set() plus it returns replaced TCE and DMA direction so
      the caller can release the pages afterwards. The exchange() receives
      a physical address unlike set() which receives linear mapping address;
      and returns a physical address as the clear() does.
      
      This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
      for a platform to have exchange() implemented in order to support VFIO.
      
      This replaces iommu_tce_build() and iommu_clear_tce() with
      a single iommu_tce_xchg().
      
      This makes sure that TCE permission bits are not set in TCE passed to
      IOMMU API as those are to be calculated by platform code from
      DMA direction.
      
      This moves SetPageDirty() to the IOMMU code to make it work for both
      VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
      available later).
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      05c6cfb9
    • Alexey Kardashevskiy's avatar
      vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control · f87a8864
      Alexey Kardashevskiy authored
      
      This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
      which call in a loop iommu_take_ownership()/iommu_release_ownership()
      for every table on the group. As there is just one now, no change in
      behaviour is expected.
      
      At the moment the iommu_table struct has a set_bypass() which enables/
      disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
      which calls this callback when external IOMMU users such as VFIO are
      about to get over a PHB.
      
      The set_bypass() callback is not really an iommu_table function but
      IOMMU/PE function. This introduces a iommu_table_group_ops struct and
      adds take_ownership()/release_ownership() callbacks to it which are
      called when an external user takes/releases control over the IOMMU.
      
      This replaces set_bypass() with ownership callbacks as it is not
      necessarily just bypass enabling, it can be something else/more
      so let's give it more generic name.
      
      The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
      IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
      The following patches will replace iommu_take_ownership/
      iommu_release_ownership calls in IODA2 with full IOMMU table release/
      create.
      
      As we here and touching bypass control, this removes
      pnv_pci_ioda2_setup_bypass_pe() as it does not do much
      more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
      initialization to pnv_pci_ioda2_setup_dma_pe.
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      f87a8864
    • Alexey Kardashevskiy's avatar
      powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group · 0eaf4def
      Alexey Kardashevskiy authored
      
      So far one TCE table could only be used by one IOMMU group. However
      IODA2 hardware allows programming the same TCE table address to
      multiple PE allowing sharing tables.
      
      This replaces a single pointer to a group in a iommu_table struct
      with a linked list of groups which provides the way of invalidating
      TCE cache for every PE when an actual TCE table is updated. This adds
      pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
      helpers to manage the list. However without VFIO, it is still going
      to be a single IOMMU group per iommu_table.
      
      This changes iommu_add_device() to add a device to a first group
      from the group list of a table as it is only called from the platform
      init code or PCI bus notifier and at these moments there is only
      one group per table.
      
      This does not change TCE invalidation code to loop through all
      attached groups in order to simplify this patch and because
      it is not really needed in most cases. IODA2 is fixed in a later
      patch.
      
      This should cause no behavioural change.
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      0eaf4def
    • Alexey Kardashevskiy's avatar
      powerpc/spapr: vfio: Replace iommu_table with iommu_table_group · b348aa65
      Alexey Kardashevskiy authored
      
      Modern IBM POWERPC systems support multiple (currently two) TCE tables
      per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
      for TCE tables. Right now just one table is supported.
      
      This defines iommu_table_group struct which stores pointers to
      iommu_group and iommu_table(s). This replaces iommu_table with
      iommu_table_group where iommu_table was used to identify a group:
      - iommu_register_group();
      - iommudata of generic iommu_group;
      
      This removes @data from iommu_table as it_table_group provides
      same access to pnv_ioda_pe.
      
      For IODA, instead of embedding iommu_table, the new iommu_table_group
      keeps pointers to those. The iommu_table structs are allocated
      dynamically.
      
      For P5IOC2, both iommu_table_group and iommu_table are embedded into
      PE struct. As there is no EEH and SRIOV support for P5IOC2,
      iommu_free_table() should not be called on iommu_table struct pointers
      so we can keep it embedded in pnv_phb::p5ioc2.
      
      For pSeries, this replaces multiple calls of kzalloc_node() with a new
      iommu_pseries_alloc_group() helper and stores the table group struct
      pointer into the pci_dn struct. For release, a iommu_table_free_group()
      helper is added.
      
      This moves iommu_table struct allocation from SR-IOV code to
      the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
      pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
      This change is here because those lines had to be changed anyway.
      
      This should cause no behavioural change.
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      b348aa65
    • Alexey Kardashevskiy's avatar
      powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table · da004c36
      Alexey Kardashevskiy authored
      
      This adds a iommu_table_ops struct and puts pointer to it into
      the iommu_table struct. This moves tce_build/tce_free/tce_get/tce_flush
      callbacks from ppc_md to the new struct where they really belong to.
      
      This adds the requirement for @it_ops to be initialized before calling
      iommu_init_table() to make sure that we do not leave any IOMMU table
      with iommu_table_ops uninitialized. This is not a parameter of
      iommu_init_table() though as there will be cases when iommu_init_table()
      will not be called on TCE tables, for example - VFIO.
      
      This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_"
      redundant prefixes.
      
      This removes tce_xxx_rm handlers from ppc_md but does not add
      them to iommu_table_ops as this will be done later if we decide to
      support TCE hypercalls in real mode. This removes _vm callbacks as
      only virtual mode is supported by now so this also removes @rm parameter.
      
      For pSeries, this always uses tce_buildmulti_pSeriesLP/
      tce_buildmulti_pSeriesLP. This changes multi callback to fall back to
      tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
      present. The reason for this is we still have to support "multitce=off"
      boot parameter in disable_multitce() and we do not want to walk through
      all IOMMU tables in the system and replace "multi" callbacks with single
      ones.
      
      For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2.
      This makes the callbacks for them public. Later patches will extend
      callbacks for IODA1/2.
      
      No change in behaviour is expected.
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      da004c36
    • Alexey Kardashevskiy's avatar
      powerpc/powernv: Do not set "read" flag if direction==DMA_NONE · 10b35b2b
      Alexey Kardashevskiy authored
      
      Normally a bitmap from the iommu_table is used to track what TCE entry
      is in use. Since we are going to use iommu_table without its locks and
      do xchg() instead, it becomes essential not to put bits which are not
      implied in the direction flag as the old TCE value (more precisely -
      the permission bits) will be used to decide whether to put the page or not.
      
      This adds iommu_direction_to_tce_perm() (its counterpart is there already)
      and uses it for powernv's pnv_tce_build().
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      10b35b2b
    • Alexey Kardashevskiy's avatar
      vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver · 9b14a1ff
      Alexey Kardashevskiy authored
      
      This moves page pinning (get_user_pages_fast()/put_page()) code out of
      the platform IOMMU code and puts it to VFIO IOMMU driver where it belongs
      to as the platform code does not deal with page pinning.
      
      This makes iommu_take_ownership()/iommu_release_ownership() deal with
      the IOMMU table bitmap only.
      
      This removes page unpinning from iommu_take_ownership() as the actual
      TCE table might contain garbage and doing put_page() on it is undefined
      behaviour.
      
      Besides the last part, the rest of the patch is mechanical.
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      [aw: for the vfio related changes]
      Acked-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9b14a1ff
    • Alexey Kardashevskiy's avatar
      powerpc/iommu/powernv: Get rid of set_iommu_table_base_and_group · 4617082e
      Alexey Kardashevskiy authored
      
      The set_iommu_table_base_and_group() name suggests that the function
      sets table base and add a device to an IOMMU group.
      
      The actual purpose for table base setting is to put some reference
      into a device so later iommu_add_device() can get the IOMMU group
      reference and the device to the group.
      
      At the moment a group cannot be explicitly passed to iommu_add_device()
      as we want it to work from the bus notifier, we can fix it later and
      remove confusing calls of set_iommu_table_base().
      
      This replaces set_iommu_table_base_and_group() with a couple of
      set_iommu_table_base() + iommu_add_device() which makes reading the code
      easier.
      
      This adds few comments why set_iommu_table_base() and iommu_add_device()
      are called where they are called.
      
      For IODA1/2, this essentially removes iommu_add_device() call from
      the pnv_pci_ioda_dma_dev_setup() as it will always fail at this particular
      place:
      - for physical PE, the device is already attached by iommu_add_device()
      in pnv_pci_ioda_setup_dma_pe();
      - for virtual PE, the sysfs entries are not ready to create all symlinks
      so actual adding is happening in tce_iommu_bus_notifier.
      
      Signed-off-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: default avatarGavin Shan <gwshan@linux.vnet.ibm.com>
      Reviewed-by: default avatarDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      4617082e
  2. Jun 10, 2015
    • Aneesh Kumar K.V's avatar
      powerpc/mm: Add trace point for tracking hash pte fault · cfcb3d80
      Aneesh Kumar K.V authored
      
      This enables us to understand how many hash fault we are taking
      when running benchmarks.
      
      For ex:
      -bash-4.2# ./perf stat -e  powerpc:hash_fault -e page-faults /tmp/ebizzy.ppc64 -S 30  -P -n 1000
      ...
      
       Performance counter stats for '/tmp/ebizzy.ppc64 -S 30 -P -n 1000':
      
             1,10,04,075      powerpc:hash_fault
             1,10,03,429      page-faults
      
            30.865978991 seconds time elapsed
      
      NOTE:
      The impact of the tracepoint was not noticeable when running test. It was
      within the run-time variance of the test. For ex:
      
      without-patch:
      --------------
      
       Performance counter stats for './a.out 3000 300':
      
      	       643      page-faults               #    0.089 M/sec
      	  7.236562      task-clock (msec)         #    0.928 CPUs utilized
      	 2,179,213      stalled-cycles-frontend   #    0.00% frontend cycles idle
      	17,174,367      stalled-cycles-backend    #    0.00% backend  cycles idle
      		 0      context-switches          #    0.000 K/sec
      
             0.007794658 seconds time elapsed
      
      And with-patch:
      ---------------
      
       Performance counter stats for './a.out 3000 300':
      
      	       643      page-faults               #    0.089 M/sec
      	  7.233746      task-clock (msec)         #    0.921 CPUs utilized
      		 0      context-switches          #    0.000 K/sec
      
             0.007854876 seconds time elapsed
      
       Performance counter stats for './a.out 3000 300':
      
      	       643      page-faults               #    0.087 M/sec
      	       649      powerpc:hash_fault        #    0.087 M/sec
      	  7.430376      task-clock (msec)         #    0.938 CPUs utilized
      	 2,347,174      stalled-cycles-frontend   #    0.00% frontend cycles idle
      	17,524,282      stalled-cycles-backend    #    0.00% backend  cycles idle
      		 0      context-switches          #    0.000 K/sec
      
             0.007920284 seconds time elapsed
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      cfcb3d80
  3. Jun 07, 2015
  4. Jun 04, 2015
  5. Jun 03, 2015
  6. Jun 02, 2015
  7. May 22, 2015
    • Daniel Axtens's avatar
      powerpc: Add MSI operations to pci_controller_ops struct · e059b105
      Daniel Axtens authored
      
      Add MSI setup and teardown functions to pci_controller_ops.
      
      Patch the callsites (arch_{setup,teardown}_msi_irqs) to prefer the
      controller ops version if it's available.
      
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      e059b105
    • Alistair Popple's avatar
      powerpc/powernv: Add a virtual irqchip for opal events · 9f0fd049
      Alistair Popple authored
      
      Whenever an interrupt is received for opal the linux kernel gets a
      bitfield indicating certain events that have occurred and need handling
      by the various device drivers. Currently this is handled using a
      notifier interface where we call every device driver that has
      registered to receive opal events.
      
      This approach has several drawbacks. For example each driver has to do
      its own checking to see if the event is relevant as well as event
      masking. There is also no easy method of recording the number of times
      we receive particular events.
      
      This patch solves these issues by exposing opal events via the
      standard interrupt APIs by adding a new interrupt chip and
      domain. Drivers can then register for the appropriate events using
      standard kernel calls such as irq_of_parse_and_map().
      
      Signed-off-by: default avatarAlistair Popple <alistair@popple.id.au>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      9f0fd049
    • Alistair Popple's avatar
      powerpc/powernv: Reorder OPAL subsystem initialisation · 96e023e7
      Alistair Popple authored
      
      Most of the OPAL subsystems are always compiled in for PowerNV and
      many of them need to be initialised before or after other OPAL
      subsystems. Rather than trying to control this ordering through
      machine initcalls it is clearer and easier to control initialisation
      order with explicit calls in opal_init.
      
      Signed-off-by: default avatarAlistair Popple <alistair@popple.id.au>
      Cc: Mahesh Jagannath Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      96e023e7
    • Shreyas B. Prabhu's avatar
      powerpc/powernv: Introduce sysfs control for fastsleep workaround behavior · 5703d2f4
      Shreyas B. Prabhu authored
      
      Fastsleep is one of the idle state which cpuidle subsystem currently
      uses on power8 machines. In this state L2 cache is brought down to a
      threshold voltage. Therefore when the core is in fastsleep, the
      communication between L2 and L3 needs to be fenced. But there is a bug
      in the current power8 chips surrounding this fencing.
      
      OPAL provides a workaround which precludes the possibility of hitting
      this bug. But running with this workaround applied causes checkstop
      if any correctable error in L2 cache directory is detected. Hence OPAL
      also provides a way to undo the workaround.
      
      In the existing implementation, workaround is applied by the last thread
      of the core entering fastsleep and undone by the first thread waking up.
      But this has a performance cost. These OPAL calls account for roughly
      4000 cycles everytime the core has to enter or wakeup from fastsleep.
      
      This patch introduces a sysfs attribute (fastsleep_workaround_applyonce)
      to choose the behavior of this workaround.
      
      By default, fastsleep_workaround_applyonce = 0. In this case, workaround
      is applied/undone everytime the core enters/exits fastsleep.
      
      fastsleep_workaround_applyonce = 1. In this case the workaround is
      applied once on all the cores and never undone. This can be triggered by
      echo 1 > /sys/devices/system/cpu/fastsleep_workaround_applyonce
      
      For simplicity this attribute can be modified only once. Implying, once
      fastsleep_workaround_applyonce is changed to 1, it cannot be reverted
      to the default state.
      
      Signed-off-by: default avatarShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Reviewed-by: default avatarPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      5703d2f4
    • Shreyas B. Prabhu's avatar
      powerpc: Fix cpu_online_cores_map to return only online threads mask · e602ffb2
      Shreyas B. Prabhu authored
      
      Currently, cpu_online_cores_map returns a mask, which for every core with
      at least one online thread, has the bit for thread 0 of the core set to 1,
      and the bits for all other threads of the core set to 0. But thread 0 of
      the core itself may not be online always. In such cases, if the returned
      mask is used for IPI, then it'll cause IPIs to be skipped on cores where
      the first thread is offline, because the IPI code refuses to send IPIs to
      offline threads.
      
      Fix this by setting the bit of the first online thread in the core.
      This is done by fixing this in the underlying function
      cpu_thread_mask_to_cores.
      
      The result has the property that for all cores with online threads, there
      is one bit set in the returned map. And further, all bits that are set in
      the returned map correspond to online threads.
      
      Signed-off-by: default avatarShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Reviewed-by: default avatarPreeti U Murthy <preeti@linux.vnet.ibm.com>
      [ Changelog from Michael Ellerman <mpe@ellerman.id.au> ]
      Reviewed-by: default avatarGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      e602ffb2
  8. May 20, 2015
    • Laurent Dufour's avatar
      powerpc: Enable sys_kcmp() for CRIU · 7978f76c
      Laurent Dufour authored
      
      The commit 8170a83f ("powerpc: Wireup the kcmp syscall to sys_ni") has
      disabled the kcmp syscall for powerpc.  This has been done due to the use
      of unsigned long parameters which may require a dedicated wrapper to handle
      32bit process on top of 64bit kernel.  However in the kcmp() case, the 2
      unsigned long parameters are currently only used to carry file descriptors
      from user space to the kernel.  Since such a parameter is passed through
      register, and file descriptor doesn't need to get extended, there is,
      today, no need for a wrapper.
      
      In the case there will be a need to pass address in or out of this system
      call, then a wrapper could be required, it will then be to care of it.
      
      As today this is not the case, it is safe to enable kcmp() on powerpc.
      
      Tested (by Laurent) on 64-bit, 32-bit, and 32-bit userspace on 64-bit
      kernel using tools/testing/selftests/kcmp [mpe].
      
      Signed-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      7978f76c
  9. May 12, 2015
  10. May 11, 2015
  11. Apr 30, 2015
    • Michael Ellerman's avatar
      Revert "powerpc/tm: Abort syscalls in active transactions" · 68fc378c
      Michael Ellerman authored
      
      This reverts commit feba4036.
      
      Although the principle of this change is good, the implementation has a
      few issues.
      
      Firstly we can sometimes fail to abort a syscall because r12 may have
      been clobbered by C code if we went down the virtual CPU accounting
      path, or if syscall tracing was enabled.
      
      Secondly we have decided that it is safer to abort the syscall even
      earlier in the syscall entry path, so that we avoid the syscall tracing
      path when we are transactional.
      
      So that we have time to thoroughly test those changes we have decided to
      revert this for this merge window and will merge the fixed version in
      the next window.
      
      NB. Rather than reverting the selftest we just drop tm-syscall from
      TEST_PROGS so that it's not run by default.
      
      Fixes: feba4036 ("powerpc/tm: Abort syscalls in active transactions")
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      68fc378c
  12. Apr 21, 2015
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to C · eddb60fb
      Paul Mackerras authored
      
      This replaces the assembler code for kvmhv_commence_exit() with C code
      in book3s_hv_builtin.c.  It also moves the IPI sending code that was
      in book3s_hv_rm_xics.c into a new kvmhv_rm_send_ipi() function so it
      can be used by kvmhv_commence_exit() as well as icp_rm_set_vcpu_irq().
      
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      eddb60fb
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Use bitmap of active threads rather than count · 7d6c40da
      Paul Mackerras authored
      
      Currently, the entry_exit_count field in the kvmppc_vcore struct
      contains two 8-bit counts, one of the threads that have started entering
      the guest, and one of the threads that have started exiting the guest.
      This changes it to an entry_exit_map field which contains two bitmaps
      of 8 bits each.  The advantage of doing this is that it gives us a
      bitmap of which threads need to be signalled when exiting the guest.
      That means that we no longer need to use the trick of setting the
      HDEC to 0 to pull the other threads out of the guest, which led in
      some cases to a spurious HDEC interrupt on the next guest entry.
      
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      7d6c40da
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken · 5d5b99cd
      Paul Mackerras authored
      
      We can tell when a secondary thread has finished running a guest by
      the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there
      is no real need for the nap_count field in the kvmppc_vcore struct.
      This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu
      pointers of the secondary threads rather than polling vc->nap_count.
      Besides reducing the size of the kvmppc_vcore struct by 8 bytes,
      this also means that we can tell which secondary threads have got
      stuck and thus print a more informative error message.
      
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      5d5b99cd
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu · 25fedfca
      Paul Mackerras authored
      
      Rather than calling cond_resched() in kvmppc_run_core() before doing
      the post-processing for the vcpus that we have just run (that is,
      calling kvmppc_handle_exit_hv(), kvmppc_set_timer(), etc.), we now do
      that post-processing before calling cond_resched(), and that post-
      processing is moved out into its own function, post_guest_process().
      
      The reschedule point is now in kvmppc_run_vcpu() and we define a new
      vcore state, VCORE_PREEMPT, to indicate that that the vcore's runner
      task is runnable but not running.  (Doing the reschedule with the
      vcore in VCORE_INACTIVE state would be bad because there are potentially
      other vcpus waiting for the runner in kvmppc_wait_for_exec() which
      then wouldn't get woken up.)
      
      Also, we make use of the handy cond_resched_lock() function, which
      unlocks and relocks vc->lock for us around the reschedule.
      
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      25fedfca
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Minor cleanups · 1f09c3ed
      Paul Mackerras authored
      
      * Remove unused kvmppc_vcore::n_busy field.
      * Remove setting of RMOR, since it was only used on PPC970 and the
        PPC970 KVM support has been removed.
      * Don't use r1 or r2 in setting the runlatch since they are
        conventionally reserved for other things; use r0 instead.
      * Streamline the code a little and remove the ext_interrupt_to_host
        label.
      * Add some comments about register usage.
      * hcall_try_real_mode doesn't need to be global, and can't be
        called from C code anyway.
      
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      1f09c3ed
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update · d911f0be
      Paul Mackerras authored
      
      Previously, if kvmppc_run_core() was running a VCPU that needed a VPA
      update (i.e. one of its 3 virtual processor areas needed to be pinned
      in memory so the host real mode code can update it on guest entry and
      exit), we would drop the vcore lock and do the update there and then.
      Future changes will make it inconvenient to drop the lock, so instead
      we now remove it from the list of runnable VCPUs and wake up its
      VCPU task.  This will have the effect that the VCPU task will exit
      kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and
      re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call
      to kvmppc_update_vpas() and then rejoin the vcore.
      
      The one complication is that the runner VCPU (whose VCPU task is the
      current task) might be one of the ones that gets removed from the
      runnable list.  In that case we just return from kvmppc_run_core()
      and let the code in kvmppc_run_vcpu() wake up another VCPU task to be
      the runner if necessary.
      
      This all means that the VCORE_STARTING state is no longer used, so we
      remove it.
      
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      d911f0be
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Accumulate timing information for real-mode code · b6c295df
      Paul Mackerras authored
      
      This reads the timebase at various points in the real-mode guest
      entry/exit code and uses that to accumulate total, minimum and
      maximum time spent in those parts of the code.  Currently these
      times are accumulated per vcpu in 5 parts of the code:
      
      * rm_entry - time taken from the start of kvmppc_hv_entry() until
        just before entering the guest.
      * rm_intr - time from when we take a hypervisor interrupt in the
        guest until we either re-enter the guest or decide to exit to the
        host.  This includes time spent handling hcalls in real mode.
      * rm_exit - time from when we decide to exit the guest until the
        return from kvmppc_hv_entry().
      * guest - time spend in the guest
      * cede - time spent napping in real mode due to an H_CEDE hcall
        while other threads in the same vcore are active.
      
      These times are exposed in debugfs in a directory per vcpu that
      contains a file called "timings".  This file contains one line for
      each of the 5 timings above, with the name followed by a colon and
      4 numbers, which are the count (number of times the code has been
      executed), the total time, the minimum time, and the maximum time,
      all in nanoseconds.
      
      The overhead of the extra code amounts to about 30ns for an hcall that
      is handled in real mode (e.g. H_SET_DABR), which is about 25%.  Since
      production environments may not wish to incur this overhead, the new
      code is conditional on a new config symbol,
      CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
      
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      b6c295df
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT · e23a808b
      Paul Mackerras authored
      
      This creates a debugfs directory for each HV guest (assuming debugfs
      is enabled in the kernel config), and within that directory, a file
      by which the contents of the guest's HPT (hashed page table) can be
      read.  The directory is named vmnnnn, where nnnn is the PID of the
      process that created the guest.  The file is named "htab".  This is
      intended to help in debugging problems in the host's management
      of guest memory.
      
      The contents of the file consist of a series of lines like this:
      
        3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196
      
      The first field is the index of the entry in the HPT, the second and
      third are the HPT entry, so the third entry contains the real page
      number that is mapped by the entry if the entry's valid bit is set.
      The fourth field is the guest's view of the second doubleword of the
      entry, so it contains the guest physical address.  (The format of the
      second through fourth fields are described in the Power ISA and also
      in arch/powerpc/include/asm/mmu-hash64.h.)
      
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      e23a808b
    • Aneesh Kumar K.V's avatar
      KVM: PPC: Book3S HV: Add helpers for lock/unlock hpte · a4bd6eb0
      Aneesh Kumar K.V authored
      
      This adds helper routines for locking and unlocking HPTEs, and uses
      them in the rest of the code.  We don't change any locking rules in
      this patch.
      
      Signed-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@samba.org>
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      a4bd6eb0
Loading