  1. Feb 07, 2022
    • drm/amdgpu: check the GART table before invalidating TLB · 29ba7b16
      Aaron Liu authored
      
      
      Bypass group programming (utcl2_harvest) is meant to prevent UTCL2 from
      sending invalidation commands to harvested SE/SA. If an invalidation
      command reaches a harvested SE/SA, the SE/SA does not respond and the
      system hangs.
      
      This patch checks whether the GART table is already allocated before
      invalidating the TLB. The new procedure is as follows:
      1. Call amdgpu_gtt_mgr_init() in amdgpu_ttm_init(). After this step GTT
         BOs can be allocated, but GART mappings are still ignored.
      2. Call amdgpu_gart_table_vram_alloc() from the GMC code. This allocates
         the GART backing store.
      3. Initialize the hardware and program the backing store into VMID0
         for all VMHUBs.
      4. Call amdgpu_gtt_mgr_recover() to make sure the table is updated with
         the GTT allocations made before it was allocated.
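
The ordering above can be modelled in user space. The following is only a sketch under stated assumptions: `struct gart_model` and every `gart_*` function are hypothetical stand-ins, not the driver's API, and hardware initialization (step 3) is left out of the model. It shows mappings being remembered but not written while the table is missing, the TLB invalidation being skipped in that state, and a recover pass replaying the remembered mappings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GART_ENTRIES 8

/* Hypothetical stand-in for the driver's GART state. */
struct gart_model {
    bool table_allocated;               /* set once the backing store exists */
    unsigned long wanted[GART_ENTRIES]; /* mappings requested by GTT BOs */
    unsigned long table[GART_ENTRIES];  /* contents of the backing store */
    unsigned int tlb_flushes;           /* number of invalidations issued */
};

/* Invalidate the TLB only when a table exists; report whether we did. */
static bool gart_invalidate_tlb(struct gart_model *g)
{
    if (!g->table_allocated)
        return false;
    g->tlb_flushes++;
    return true;
}

/* Record a mapping; write it through only if the table is allocated. */
static void gart_bind(struct gart_model *g, size_t idx, unsigned long addr)
{
    assert(idx < GART_ENTRIES);
    g->wanted[idx] = addr;
    if (!g->table_allocated)
        return; /* step 1: BO exists, mapping ignored for now */
    g->table[idx] = addr;
    gart_invalidate_tlb(g);
}

/* Step 2: the backing store becomes available. */
static void gart_table_alloc(struct gart_model *g)
{
    g->table_allocated = true;
}

/* Step 4: replay every mapping requested before the table existed. */
static void gart_recover(struct gart_model *g)
{
    assert(g->table_allocated);
    for (size_t i = 0; i < GART_ENTRIES; i++)
        g->table[i] = g->wanted[i];
    gart_invalidate_tlb(g);
}
```

In this model, a `gart_bind()` issued before `gart_table_alloc()` leaves `table` untouched and issues no flush; the mapping only reaches the table when `gart_recover()` runs.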
      
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Aaron Liu <aaron.liu@amd.com>
      Acked-by: Huang Rui <ray.huang@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: add utcl2_harvest to gc 10.3.1 · 6d53b115
      Aaron Liu authored
      
      
      The hardware team confirmed that gc 10.3.1 is subject to harvesting.
      
      Signed-off-by: Aaron Liu <aaron.liu@amd.com>
      Reviewed-by: Huang Rui <ray.huang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: fix list add issue in vram reserve · 4e781873
      Tao Zhou authored
      
      
      The parameter order in the list_add_tail() call is incorrect, which
      causes RAS reserved pages to be reused.
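
For readers unfamiliar with the helper, the kernel's `list_add_tail()` takes the new entry first and the list head second. The block below is a user-space imitation written for illustration only (the kernel's own implementation lives in its list header and is not reproduced here); `init_list_head` and `list_length` are helper names assumed for this sketch. It shows the correct call and one way swapped arguments can silently drop entries from a list.

```c
#include <assert.h>
#include <stddef.h>

/* User-space imitation of a circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Same argument order as the kernel helper: NEW entry first, list HEAD second. */
static void list_add_tail(struct list_head *new_entry, struct list_head *head)
{
    struct list_head *last = head->prev;

    assert(last != NULL); /* head must have been initialized */
    new_entry->prev = last;
    new_entry->next = head;
    last->next = new_entry;
    head->prev = new_entry;
}

/* Count the entries reachable from head. */
static size_t list_length(const struct list_head *head)
{
    size_t n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

With a list `reserved` already holding two entries, calling `list_add_tail(&reserved, &c)` (arguments swapped, `c` being a freshly initialized node) relinks `reserved` into `c`'s list, so walking `reserved` afterwards finds only `c` and the two earlier entries are no longer reachable.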
      
      Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • Revert "drm/amdgpu: Add judgement to avoid infinite loop" · a50b0482
      yipechai authored
      
      
      Commit d5e8ff5f ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop")
      fixed this defect.
      
      Revert the workaround,
      commit a2170b4a ("drm/amdgpu: Add judgement to avoid infinite loop").
      
      Signed-off-by: yipechai <YiPeng.Chai@amd.com>
      Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Fixed the defect of soft lock caused by infinite loop · d5e8ff5f
      yipechai authored
      
      
      1. The infinite loop only occurs with multiple cards that support
         RAS functions.
      2. For an explanation of the root cause, refer to commit 76641cbbf196
         ("drm/amdgpu: Add judgement to avoid infinite loop").
      3. Create a new node to manage each unique ras instance, to guarantee
         that each device's .ras_list is completely independent.
      4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block
         interface for each ras block").
      5. The soft lockup logs are as follows:
      [  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
      [  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
      [  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
      [  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
      [  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
      [  262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
      [  262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
      [  262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
      [  262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
      [  262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
      [  262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
      [  262.166258] FS:  0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
      [  262.166261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
      [  262.166267] Call Trace:
      [  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
      [  262.166529]  ? psi_task_switch+0xd2/0x250
      [  262.166537]  ? __switch_to+0x11d/0x460
      [  262.166542]  ? __switch_to_asm+0x36/0x70
      [  262.166549]  process_one_work+0x220/0x3c0
      [  262.166556]  worker_thread+0x4d/0x3f0
      [  262.166560]  ? process_one_work+0x3c0/0x3c0
      [  262.166563]  kthread+0x12b/0x150
      [  262.166568]  ? set_kthread_struct+0x40/0x40
      [  262.166571]  ret_from_fork+0x22/0x30
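
Item 3 above can be pictured as follows. This is a user-space sketch with hypothetical names (`ras_block_obj`, `ras_list_node`, `device_model` and the `device_*` helpers are assumptions, not the driver's types): the shared block description carries no list linkage of its own, and each device links the block through a node that belongs to that device only, so one device's list never shares link pointers with another's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail(struct list_head *new_entry, struct list_head *head)
{
    struct list_head *last = head->prev;

    new_entry->prev = last;
    new_entry->next = head;
    last->next = new_entry;
    head->prev = new_entry;
}

/* Shared, device-independent description of a RAS block (hypothetical). */
struct ras_block_obj {
    const char *name;
};

/* Per-device node: owns the list linkage, points at the shared object. */
struct ras_list_node {
    struct list_head node;
    struct ras_block_obj *obj;
};

struct device_model {
    struct list_head ras_list;
};

static struct ras_list_node *node_from_link(struct list_head *link)
{
    return (struct ras_list_node *)((char *)link -
                                    offsetof(struct ras_list_node, node));
}

static void device_init(struct device_model *dev)
{
    init_list_head(&dev->ras_list);
}

/* Link obj into dev's list through a node used by this device only. */
static void device_register_block(struct device_model *dev,
                                  struct ras_list_node *n,
                                  struct ras_block_obj *obj)
{
    assert(obj != NULL);
    n->obj = obj;
    list_add_tail(&n->node, &dev->ras_list);
}

/* Walk one device's list; the walk ends when it returns to that list's head. */
static struct ras_block_obj *device_find_block(struct device_model *dev,
                                               const char *name)
{
    for (struct list_head *p = dev->ras_list.next; p != &dev->ras_list; p = p->next) {
        struct ras_block_obj *obj = node_from_link(p)->obj;

        if (strcmp(obj->name, name) == 0)
            return obj;
    }
    return NULL;
}
```

Registering the same `ras_block_obj` on two devices then takes two separate `ras_list_node` instances, and a lookup on either device terminates at that device's own list head.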
      
      Signed-off-by: yipechai <YiPeng.Chai@amd.com>
      Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Set FRU bus for Aldebaran and Vega 20 · 00d6936d
      Luben Tuikov authored
      
      
      The FRU and RAS EEPROMs share the same I2C bus on Aldebaran and Vega 20
      ASICs. Set the FRU bus "pointer" to this single bus, as access to the FRU
      is sought through that bus "pointer" and not through the RAS bus "pointer".
      
      Cc: Roy Sun <Roy.Sun@amd.com>
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      Fixes: 2f60dd50 ("drm/amd: Expose the FRU SMU I2C bus")
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Fix recursive locking warning · 447c7997
      Rajneesh Bhardwaj authored
      
      
      Noticed the below warning while running a pytorch workload on vega10
      GPUs. Change to trylock to avoid conflicts with already held reservation
      locks.
      
      [  +0.000003] WARNING: possible recursive locking detected
      [  +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
      [  +0.000004] --------------------------------------------
      [  +0.000002] python/4822 is trying to acquire lock:
      [  +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
      at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000203]
                    but task is already holding lock:
      [  +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
      at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
      [  +0.000017]
                    other info that might help us debug this:
      [  +0.000002]  Possible unsafe locking scenario:
      
      [  +0.000003]        CPU0
      [  +0.000002]        ----
      [  +0.000002]   lock(reservation_ww_class_mutex);
      [  +0.000004]   lock(reservation_ww_class_mutex);
      [  +0.000003]
                     *** DEADLOCK ***
      
      [  +0.000002]  May be due to missing lock nesting notation
      
      [  +0.000003] 7 locks held by python/4822:
      [  +0.000003]  #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
      kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
      [  +0.000232]  #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
      amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
      [  +0.000241]  #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
      amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
      [  +0.000236]  #3: ffffb2b35606fd28
      (reservation_ww_class_acquire){+.+.}-{0:0}, at:
      amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
      [  +0.000235]  #4: ffff932cbb7181f8
      (reservation_ww_class_mutex){+.+.}-{3:3}, at:
      ttm_eu_reserve_buffers+0x270/0x470 [ttm]
      [  +0.000015]  #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
      drm_dev_enter+0x5/0xa0 [drm]
      [  +0.000038]  #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
      at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
      [  +0.000195]
                    stack backtrace:
      [  +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
      5.13.0-kfd-rajneesh #1030
      [  +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
      08/29/2018
      [  +0.000003] Call Trace:
      [  +0.000003]  dump_stack+0x6d/0x89
      [  +0.000010]  __lock_acquire+0xb93/0x1a90
      [  +0.000009]  lock_acquire+0x25d/0x2d0
      [  +0.000005]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000184]  ? lock_is_held_type+0xa2/0x110
      [  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000184]  __ww_mutex_lock.constprop.17+0xca/0x1060
      [  +0.000007]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000183]  ? lock_release+0x13f/0x270
      [  +0.000005]  ? lock_is_held_type+0xa2/0x110
      [  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000183]  amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000185]  ttm_bo_release+0x4c6/0x580 [ttm]
      [  +0.000010]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
      [  +0.000183]  amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
      [  +0.000189]  amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
      [  +0.000189]  amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
      [  +0.000191]  amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
      [  +0.000191]  amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
      [  +0.000191]  update_gpuvm_pte+0xcc/0x290 [amdgpu]
      [  +0.000229]  ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
      [  +0.000190]  amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
      [amdgpu]
      [  +0.000234]  kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
      [  +0.000218]  kfd_ioctl+0x2b9/0x600 [amdgpu]
      [  +0.000216]  ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
      [  +0.000216]  ? lock_release+0x13f/0x270
      [  +0.000006]  ? __fget_files+0x107/0x1e0
      [  +0.000007]  __x64_sys_ioctl+0x8b/0xd0
      [  +0.000007]  do_syscall_64+0x36/0x70
      [  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  +0.000007] RIP: 0033:0x7fbff90a7317
      [  +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
      48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
      05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
      [  +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
      0000000000000010
      [  +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
      00007fbff90a7317
      [  +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
      0000000000000004
      [  +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
      00007fbcc402d880
      [  +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
      00000000c0184b18
      [  +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
      00007fbcc402d820
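
The trylock idea can be sketched in user space. Everything here is an assumption made for illustration: `resv_model` stands in for a reservation lock and is built on a C11 `atomic_flag`, and `bo_release_notify()` is a hypothetical release hook, not the driver function of the same name. The point shown is only that a try-style acquire returns immediately when the lock is already held, so a hook that runs with a reservation lock held cannot block on it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reservation lock. */
struct resv_model {
    atomic_flag held;
};

/* Try to take the lock; never blocks. Returns true on success. */
static bool resv_trylock(struct resv_model *r)
{
    return !atomic_flag_test_and_set(&r->held);
}

static void resv_unlock(struct resv_model *r)
{
    atomic_flag_clear(&r->held);
}

/* Hypothetical buffer object guarded by a reservation lock. */
struct bo_model {
    struct resv_model *resv;
    bool cleaned;
};

/*
 * Release hook: do the optional cleanup only if the lock can be taken
 * without waiting. Returns true if the cleanup ran.
 */
static bool bo_release_notify(struct bo_model *bo)
{
    assert(bo->resv != NULL);
    if (!resv_trylock(bo->resv))
        return false; /* lock already held: skip rather than wait */
    bo->cleaned = true;
    resv_unlock(bo->resv);
    return true;
}
```

If the caller already holds the lock when the hook runs, the hook returns `false` and leaves `cleaned` unset; once the lock is free, the same call performs the cleanup.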
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Prevent random memory access in FRU code · 00b14ce0
      Luben Tuikov authored
      
      
      Prevent random memory access in the FRU EEPROM code by passing the size of
      the destination buffer to the reading routine, and reading no more than the
      size of the buffer.
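
A bounded read of this kind might look like the sketch below. `fru_copy_field()` is a hypothetical helper invented for this example, not the routine changed by the commit; it takes the destination size as a parameter, copies at most that much (leaving room for a terminator), and reports how many bytes were copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy up to src_len bytes of an EEPROM field into dst without ever
 * writing more than dst_size bytes. The result is always NUL-terminated
 * when dst_size is non-zero. Returns the number of payload bytes copied.
 */
static size_t fru_copy_field(char *dst, size_t dst_size,
                             const unsigned char *src, size_t src_len)
{
    size_t n;

    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return 0;

    /* Clamp the copy length to what the destination can hold. */
    n = src_len < dst_size - 1 ? src_len : dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

For a 4-byte destination and a 6-byte source field, the helper copies 3 bytes and writes the terminator into the fourth, so nothing past the destination buffer is touched.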
      
      Cc: Kent Russell <kent.russell@amd.com>
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Reviewed-by: Kent Russell <kent.russell@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Don't offset by 2 in FRU EEPROM · 3f3a24a0
      Luben Tuikov authored
      
      
      Read buffers no longer expose the I2C address, and so we don't need to
      offset by two when we get the read data.
      
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      Cc: Kent Russell <kent.russell@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Fixes: bd607166 ("drm/amdgpu: Enable reading FRU chip via I2C v3")
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Reviewed-by: Kent Russell <kent.russell@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Nerf "buff" to "buf" · 3f1e2e9d
      Luben Tuikov authored
      
      
      Buffer is abbreviated "buf" (buf-fer), not "buff" (buff-er).
      This is consistent with the rest of the kernel code.
      
      Cc: Kent Russell <kent.russell@amd.com>
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Reviewed-by: Kent Russell <kent.russell@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: CRIU Implement KFD resume ioctl · 011bbb03
      Rajneesh Bhardwaj authored
      
      
      This adds support for creating userptr BOs on restore and introduces a new
      ioctl op to restart memory notifiers for the restored userptr BOs.
      During a CRIU restore, MMU notifications can happen any time after we call
      amdgpu_mn_register. Prevent MMU notifications until we reach stage 4 of the
      restore process, i.e. the criu_resume ioctl op is received and the process
      is ready to be resumed. This ioctl differs from the other KFD CRIU ioctls
      in that it is called by the CRIU master restore process for all the target
      processes being resumed by CRIU.
      
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: David Yat Sin <david.yatsin@amd.com>
      Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: CRIU Implement KFD checkpoint ioctl · 5ccbb057
      Rajneesh Bhardwaj authored
      
      
      This adds support for discovering the buffer objects that belong to a
      process being checkpointed. The data corresponding to these buffer
      objects is returned to the user-space plugin running in the CRIU master
      context, which then stores this info to recreate these buffer objects
      during a restore operation.
      
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: David Yat Sin <david.yatsin@amd.com>
      Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Print once if RAS unsupported · afa37315
      Luben Tuikov authored
      
      
      MESA polls for errors every 2-3 seconds. Printing with dev_info() causes
      the dmesg log to fill up with the same message, e.g.,
      
      [18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function.
      
      Make it dev_dbg_once(), as it isn't something correctable during boot or
      thereafter, so printing just once is sufficient. Also sanitize the message.
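
The print-once pattern itself is small. The sketch below is a user-space illustration with a hypothetical function name and message text; it guards the message with a static flag so that repeated polling produces a single line. The kernel's `*_once` print helpers are macros that, as far as I am aware, keep such a flag per call site, whereas this sketch keeps one flag for the whole function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Report that a block has no RAS support, but only the first time the
 * function is called. Returns true if a message was actually printed.
 */
static bool report_ras_unsupported_once(const char *block_name)
{
    static bool already_reported; /* zero-initialized, survives across calls */

    assert(block_name != NULL);
    if (already_reported)
        return false;

    already_reported = true;
    fprintf(stderr, "%s: RAS is not supported\n", block_name);
    return true;
}
```

A caller that polls every few seconds can invoke this on each pass; only the first call prints and returns `true`, and every later call is silent.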
      
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Cc: John Clements <john.clements@amd.com>
      Cc: Tao Zhou <tao.zhou1@amd.com>
      Cc: yipechai <YiPeng.Chai@amd.com>
      Fixes: 8b0fb0e9 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops")
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: rename amdgpu_vm_bo_rmv to _del · e56694f7
      Christian König authored
      
      
      Some people complained about the name, and the new name better
      matches Linux naming conventions for object functions.
      
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: add some lockdep checks to the VM code · 2d022081
      Christian König authored
      
      
      Whenever a bo_va structure is added or removed, the VM and the BO
      being added, if any, should be locked.
      
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  2. Feb 02, 2022
  3. Jan 31, 2022
  4. Jan 27, 2022