| Age | Commit message (Collapse) | Author | Lines |
|
To avoid race condition and avoid UAF cases, implement kref
based queues and protect the below operations using xa lock
a. Getting a queue from xarray
b. Increment/Decrement it's refcount
Every time some one want to access a queue, always get via
amdgpu_userq_get to make sure we have locks in place and get
the object if active.
A userqueue is destroyed on the last refcount is dropped which
typically would be via IOCTL or during fini.
v2: Add the missing drop in one the condition in the signal ioclt [Alex]
v3: remove the queue from the xarray first in the free queue ioctl path
[Christian]
- Pass queue to the amdgpu_userq_put directly.
- make amdgpu_userq_put xa_lock free since we are doing put for each get
only and final put is done via destroy and we remove the queue from xa
with lock.
- use userq_put in fini too so cleanup is done fully.
v4: Use xa_erase directly rather than doing load and erase in free
ioctl. Also remove some of the error logs which could be exploited
by the user to flood the logs [Christian]
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4952189b284d4d847f92636bb42dd747747129c0)
Cc: <stable@vger.kernel.org> # 048c1c4e5171: drm/amdgpu/userq: Consolidate wait ioctl exit path
Cc: <stable@vger.kernel.org>
|
|
If we gate the fence destruction with a check telling us whether there are
valid pointers in there we can eliminate the need for dual, basically
identical, exit paths.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bea29bb0dd29012949cd44fdb122465a9fd5cf91)
|
|
The reason the RAP is not granting access to 0x58200 is that
a dedicated RSMU slot would have to be spent for this address range,
and MPASP is close to running out of RSMU slots.
This will help to fix PSP TOC load failure during secureboot.
GFX Driver Need to use indirect access for SMN address regs.
Signed-off-by: sguttula <suresh.guttula@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9b822e26eea3899003aa8a89d5e2c4408e066e20)
|
|
Replace non-atomic vm->process_info assignment with cmpxchg()
to prevent race when parent/child processes sharing a drm_file
both try to acquire the same VM after fork().
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit c7c573275ec20db05be769288a3e3bb2250ec618)
Cc: stable@vger.kernel.org
|
|
This will set DPG flags for enabling power gating on GFX11_5_4
Signed-off-by: sguttula <suresh.guttula@amd.com>
Reviewed-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a503c266d70d3363ba6bffb883cd6ecdb092670c)
|
|
A workaround was introduced in commit 1fb710793ce2 ("drm/amdgpu: Enable
MES lr_compute_wa by default") to help with some hangs observed in gfx1151.
This WA didn't fully fix the issue. It was actually fixed by adjusting
the VGPR size to the correct value that matched the hardware in commit
b42f3bf9536c ("drm/amdkfd: bump minimum vgpr size for gfx1151").
There are reports of instability on other products with newer GC microcode
versions, and I believe they're caused by this workaround. As we don't
need the workaround any more, remove it.
Fixes: b42f3bf9536c ("drm/amdkfd: bump minimum vgpr size for gfx1151")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9973e64bd6ee7642860a6f3b6958cbf14e89cabd)
Cc: stable@vger.kernel.org
|
|
If the device has not recovered after slot reset is called, it goes to
out label for error handling. There it could make decision based on
uninitialized hive pointer and could result in accessing an uninitialized
list.
Initialize the list and hive properly so that it handles the error
situation and also releases the reset domain lock which is acquired
during error_detected callback.
Fixes: 732c6cefc1ec ("drm/amdgpu: Replace tmp_adev with hive in amdgpu_pci_slot_reset")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Ce Sun <cesun102@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit bb71362182e59caa227e4192da5a612b09349696)
|
|
This will set AMDGPU_VCN_SMU_DPM_INTERFACE_* smu_type
based on soc type and fixing ring timeout issue seen
for DPM enabled case.
Signed-off-by: sguttula <suresh.guttula@amd.com>
Reviewed-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f0f23c315b38c55e8ce9484cf59b65811f350630)
|
|
Do not unlock psp->ras_context.mutex if it has not been locked. This has
been detected by the Clang thread-safety analyzer.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: YiPeng Chai <YiPeng.Chai@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Fixes: b3fb79cda568 ("drm/amdgpu: add mutex to protect ras shared memory")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 6fa01b4335978051d2cd80841728fd63cc597970)
|
|
Mutexes must be unlocked before these are destroyed. This has been detected
by the Clang thread-safety analyzer.
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Yang Wang <kevinyang.wang@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Fixes: f5e4cc8461c4 ("drm/amdgpu: implement RAS ACA driver framework")
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 270258ba320beb99648dceffb67e86ac76786e55)
|
|
Huge input values in amdgpu_userq_wait_ioctl can lead to a OOM and
could be exploited.
So check these input value against AMDGPU_USERQ_MAX_HANDLES
which is big enough value for genuine use cases and could
potentially avoid OOM.
v2: squash in Srini's fix
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fcec012c664247531aed3e662f4280ff804d1476)
Cc: stable@vger.kernel.org
|
|
Huge input values in amdgpu_userq_signal_ioctl can lead to a OOM and
could be exploited.
So check these input value against AMDGPU_USERQ_MAX_HANDLES
which is big enough value for genuine use cases and could
potentially avoid OOM.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit be267e15f99bc97cbe202cd556717797cdcf79a5)
Cc: stable@vger.kernel.org
|
|
Userspace can either deliberately pass in the too small num_fences, or the
required number can legitimately grow between the two calls to the userq
wait ioctl. In both cases we do not want the emit the kernel warning
backtrace since nothing is wrong with the kernel and userspace will simply
get an errno reported back. So lets simply drop the WARN_ONs.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: a292fdecd728 ("drm/amdgpu: Implement userqueue signal/wait IOCTL")
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2c333ea579de6cc20ea7bc50e9595ef72863e65c)
|
|
Drop reference to syncobj and timeline fence when aborting the ioctl due
output array being too small.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Fixes: a292fdecd728 ("drm/amdgpu: Implement userqueue signal/wait IOCTL")
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 68951e9c3e6bb22396bc42ef2359751c8315dd27)
Cc: <stable@vger.kernel.org> # v6.16+
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Pull drm fixes from Dave Airlie:
"This is the fixes and cleanups for the end of the merge window, it's
nearly all amdgpu, with some amdkfd, then a pagemap core fix, i915/xe
display fixes, and some xe driver fixes.
Nothing seems out of the ordinary, except amdgpu is a little more
volume than usual.
pagemap:
- drm/pagemap: pass pagemap_addr by reference
amdgpu:
- DML 2.1 fixes
- Panel replay fixes
- Display writeback fixes
- MES 11 old firmware compat fix
- DC CRC improvements
- DPIA fixes
- XGMI fixes
- ASPM fix
- SMU feature bit handling fixes
- DC LUT fixes
- RAS fixes
- Misc memory leak in error path fixes
- SDMA queue reset fixes
- PG handling fixes
- 5 level GPUVM page table fix
- SR-IOV fix
- Queue reset fix
- SMU 13.x fixes
- DC resume lag fix
- MPO fixes
- DCN 3.6 fix
- VSDB fixes
- HWSS clean up
- Replay fixes
- DCE cursor fixes
- DCN 3.5 SR DDR5 latency fixes
- HPD fixes
- Error path unwind fixes
- SMU13/14 mode1 reset fixes
- PSP 15 updates
- SMU 15 updates
- Sync fix in amdgpu_dma_buf_move_notify()
- HAINAN fix
- PSP 13.x fix
- GPUVM locking fix
- Fixes for DC analog support
- DC FAMS fixes
- DML 2.1 fixes
- eDP fixes
- Misc DC fixes
- Fastboot fix
- 3DLUT fixes
- GPUVM fixes
- 64bpp format fix
- Fix for MacBooks with switchable gfx
amdkfd:
- Fix possible double deletion of validate list
- Event setup fix
- Device disconnect regression fix
- APU GTT as VRAM fix
- Fix piority inversion with MQDs
- NULL check fix
radeon:
- HAINAN fix
i915/xe display:
- Regresion fix for HDR 4k displays (#15503)
- Fixup for Dell XPS 13 7390 eDP rate limit
- Memory leak fix on ACPI _DSM handling
- Add missing slice count check during DP mode validation
xe:
- drm/xe: Prevent VFs from exposing the CCS mode sysfs file
- SRIOV related fixes
- PAT cache fix
- MMIO read fix
- W/a fixes
- Adjust type of xe_modparam.force_vram_bar_size
- Wedge mode fix
- HWMon fix
* tag 'drm-next-2026-02-21' of https://gitlab.freedesktop.org/drm/kernel: (143 commits)
drm/amd/display: Remove unneeded DAC link encoder register
drm/amd/display: Enable DAC in DCE link encoder
drm/amd/display: Set CRTC source for DAC using registers
drm/amd/display: Initialize DAC in DCE link encoder using VBIOS
drm/amd/display: Turn off DAC in DCE link encoder using VBIOS
drm/amd/display: Don't call find_analog_engine() twice
drm/amdgpu: fix 4-level paging if GMC supports 57-bit VA v2
drm/amdgpu: keep vga memory on MacBooks with switchable graphics
drm/amdgpu: Set atomics to true for xgmi
drm/amdkfd: Check for NULL return values
drm/amd/display: Use same max plane scaling limits for all 64 bpp formats
drm/amdgpu: Set vmid0 PAGE_TABLE_DEPTH for GFX12.1
drm/amdkfd: Disable MQD queue priority
drm/amd/display: Remove conditional for shaper 3DLUT power-on
drm/amd/display: Check return of shaper curve to HW format
drm/amd/display: Correct logic check error for fastboot
drm/amd/display: Skip eDP detection when no sink
Revert "drm/amd/display: Add Gfx Base Case For Linear Tiling Handling"
Revert "drm/amd/display: Correct hubp GfxVersion verification"
Revert "drm/amd/display: Add Handling for gfxversion DcGfxBase"
...
|
|
It turned that using 4 level page tables on GMC generations which support
57bit VAs actually doesn't work at all.
Background is that the GMC actually can't switch between 4 and 5 levels,
but rather just uses a subset of address space when less than 5 levels are
selected.
Philip already removed the automatically switch to 4levels, now fix it as
well should it be enabled by module parameters.
v2: fix AMDGPU_GMC_HOLE_MASK as well, fix off by one issue pointed out
by Philip
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
On Intel MacBookPros with switchable graphics, when the iGPU
is enabled, the address of VRAM gets put at 0 in the dGPU's
virtual address space. This is non-standard and seems to cause
issues with the cursor if it ends up at 0. We have the framework
to reserve memory at 0 in the address space, so enable it here if
the vram start address is 0.
Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4302
Cc: stable@vger.kernel.org
Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
xgmi support atomics between links. Set them to true. This only set for
GFX12 onwards to avoid regression on older generations
v2: Use correct xgmi flag that indicates CPU connection
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This patch fixes issues when the code moves forward with a potential
NULL pointer, without checking.
Removed one redundant NULL check for a function parameter. This check
is already done in the only caller.
Signed-off-by: Andrew Martin <andrew.martin@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
GFX12.1 uses 2 level gart table. Set the context register appropriately
Signed-off-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The VM was not locked in the past since we initially only cleared the
linked list element and not added it to any VM state.
But this has changed quite some time ago, we just never realized this
problem because the VM state lock was masking it.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add support for psp v13_0_15 ip block
Signed-off-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Set the family for GC 11.5.4
Fixes: 47ae1f938d12 ("drm/amdgpu: add support for GC IP version 11.5.4")
Cc: Tim Huang <tim.huang@amd.com>
Cc: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Cc: Roman Li <Roman.Li@amd.com>
Reviewed-by: Tim Huang <tim.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Invalidating a dmabuf will impact other users of the shared BO.
In the scenario where process A moves the BO, it needs to inform
process B about the move and process B will need to update its
page table.
The commit fixes a synchronisation bug caused by the use of the
ticket: it made amdgpu_vm_handle_moved behave as if updating
the page table immediately was correct but in this case it's not.
An example is the following scenario, with 2 GPUs and glxgears
running on GPU0 and Xorg running on GPU1, on a system where P2P
PCI isn't supported:
glxgears:
export linear buffer from GPU0 and import using GPU1
submit frame rendering to GPU0
submit tiled->linear blit
Xorg:
copy of linear buffer
The sequence of jobs would be:
drm_sched_job_run # GPU0, frame rendering
drm_sched_job_queue # GPU0, blit
drm_sched_job_done # GPU0, frame rendering
drm_sched_job_run # GPU0, blit
move linear buffer for GPU1 access #
amdgpu_dma_buf_move_notify -> update pt # GPU0
It this point the blit job on GPU0 is still running and would
likely produce a page fault.
Cc: stable@vger.kernel.org
Fixes: a448cb003edc ("drm/amdgpu: implement amdgpu_gem_prime_move_notify v2")
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
These definitions are used by user APIs.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
end the function flow when ras table checksum is error
Signed-off-by: Gangliang Xie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Tune the sleep interval in the PSP fence wait loop from 10-100us to
60-100us.This adjustment results in an overall wait window of 1.2s
(60us * 20000 iterations) to 2 seconds (100us * 20000 iterations),
which guarantees that we can retrieve the correct fence value
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Enable GFXOff for GC 11.5.4
Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Set the default reset method to mode2 for SMU 15.0.0.
Signed-off-by: Kanala Ramalingeswara Reddy <Kanala.RamalingeswaraReddy@amd.com>
Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
TOC and TA both are required
Signed-off-by: Pratik Vishwakarma <Pratik.Vishwakarma@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Send unload command to smu during modprobe -r amdgpu for smu 13/14.
1. This can fix the high voltage/temperatue issue after driver is unloaded.
2. Reloading driver could fail but with the debug port based mode1 reset
during driver is reloaded, it is good and safe.
Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
amdgpu_ib_schedule() returns early after calling amdgpu_ring_undo().
This skips the common free_fence cleanup path. Other error paths were
already changed to use goto free_fence, but this one was missed.
Change the early return to goto free_fence so all error paths clean up
the same way.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c:232 amdgpu_ib_schedule()
warn: missing unwind goto?
drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
124 int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned int num_ibs,
125 struct amdgpu_ib *ibs, struct amdgpu_job *job,
126 struct dma_fence **f)
127 {
...
224
225 if (ring->funcs->insert_start)
226 ring->funcs->insert_start(ring);
227
228 if (job) {
229 r = amdgpu_vm_flush(ring, job, need_pipe_sync);
230 if (r) {
231 amdgpu_ring_undo(ring);
--> 232 return r;
The patch changed the other error paths to goto free_fence but
this one was accidentally skipped.
233 }
234 }
235
236 amdgpu_ring_ib_begin(ring);
...
338
339 free_fence:
340 if (!job)
341 kfree(af);
342 return r;
343 }
Fixes: f903b85ed0f1 ("drm/amdgpu: fix possible fence leaks from job structure")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
disk space by teaching ocfs2 to reclaim suballocator block group
space (Heming Zhao)
- "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
ARRAY_END() macro and uses it in various places (Alejandro Colomar)
- "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
page size (Pnina Feder)
- "kallsyms: Prevent invalid access when showing module buildid" cleans
up kallsyms code related to module buildid and fixes an invalid
access crash when printing backtraces (Petr Mladek)
- "Address page fault in ima_restore_measurement_list()" fixes a
kexec-related crash that can occur when booting the second-stage
kernel on x86 (Harshit Mogalapalli)
- "kho: ABI headers and Documentation updates" updates the kexec
handover ABI documentation (Mike Rapoport)
- "Align atomic storage" adds the __aligned attribute to atomic_t and
atomic64_t definitions to get natural alignment of both types on
csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)
- "kho: clean up page initialization logic" simplifies the page
initialization logic in kho_restore_page() (Pratyush Yadav)
- "Unload linux/kernel.h" moves several things out of kernel.h and into
more appropriate places (Yury Norov)
- "don't abuse task_struct.group_leader" removes the usage of
->group_leader when it is "obviously unnecessary" (Oleg Nesterov)
- "list private v2 & luo flb" adds some infrastructure improvements to
the live update orchestrator (Pasha Tatashin)
* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
procfs: fix missing RCU protection when reading real_parent in do_task_stat()
watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
kho: fix doc for kho_restore_pages()
tests/liveupdate: add in-kernel liveupdate test
liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
liveupdate: luo_file: Use private list
list: add kunit test for private list primitives
list: add primitives for private list manipulations
delayacct: fix uapi timespec64 definition
panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
netclassid: use thread_group_leader(p) in update_classid_task()
RDMA/umem: don't abuse current->group_leader
drm/pan*: don't abuse current->group_leader
drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
drm/amdgpu: don't abuse current->group_leader
android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
android/binder: don't abuse current->group_leader
kho: skip memoryless NUMA nodes when reserving scratch areas
...
|
|
Pull drm updates from Dave Airlie:
"Highlights:
- amdgpu support for lots of new IP blocks which means newer GPUs
- xe has a lot of SR-IOV and SVM improvements
- lots of intel display refactoring across i915/xe
- msm has more support for gen8 platforms
- Given up on kgdb/kms integration, it's too hard on modern hw
core:
- drop kgdb support
- replace system workqueue with percpu
- account for property blobs in memcg
- MAINTAINERS updates for xe + buddy
rust:
- Fix documentation for Registration constructors
- Use pin_init::zeroed() for fops initialization
- Annotate DRM helpers with __rust_helper
- Improve safety documentation for gem::Object::new()
- Update AlwaysRefCounted imports
- mm: Prevent integer overflow in page_align()
atomic:
- add drm_device pointer to drm_private_obj
- introduce gamma/degamma LUT size check
buddy:
- fix free_trees memory leak
- prevent BUG_ON
bridge:
- introduce drm_bridge_unplug/enter/exit
- add connector argument to .hpd_notify
- lots of recounting conversions
- convert rockchip inno hdmi to bridge
- lontium-lt9611uxc: switch to HDMI audio helpers
- dw-hdmi-qp: add support for HPD-less setups
- Algoltek AG6311 support
panels:
- edp: CSW MNE007QB3-1, AUO B140HAN06.4, AUO B140QAX01.H
- st75751: add SPI support
- Sitronix ST7920, Samsung LTL106HL02
- LG LH546WF1-ED01, HannStar HSD156J
- BOE NV130WUM-T08
- Innolux G150XGE-L05
- Anbernic RG-DS
dma-buf:
- improve sg_table debugging
- add tracepoints
- call clear_page instead of memset
- start to introduce cgroup memory accounting in heaps
- remove sysfs stats
dma-fence:
- add new helpers
dp:
- mst: avoid oob access with vcpi=0
hdmi:
- limit infoframes exposure to userspace
gem:
- reduce page table overhead with THP
- fix leak in drm_gem_get_unmapped_area
gpuvm:
- API sanitation for rust bindings
sched:
- introduce new helpers
panic:
- report invalid panic modes
- add kunit tests
i915/xe display:
- Expose sharpness only if num_scalers is >= 2
- Add initial Xe3P_LPD for NVL
- BMG FBC support
- Add MTL+ platforms to support dpll framework
_ fix DIMM_S DRM decoding on ICL
- Return to using AUX interrupts
- PSR/Panel replay refactoring
- use consolidation HDMI tables
- Xe3_LPD CD2X dividier changes
xe:
- vfio: add vfio_pci for intel GPU
- multi queue support
- dynamic pagemaps and multi-device SVM
- expose temp attribs in hwmon
- NO_COMPRESSION bo flag
- expose MERT OA unit
- sysfs survivability refactor
- SRIOV PF: add MERT support
- enable SR-IOV VF migration
- Enable I2C/NVM on Crescent Island
- Xe3p page reclaimation support
- introduce SRIOV scheduler groups
- add SoC remappt support in system controller
- insert compiler barriers in GuC code
- define NVL GuC firmware
- handle GT resume failure
- fix drm scheduler layering violations
- enable GSC loading and PXP for PTL
- disable GuC Power DCC strategy on PTL
- unregister drm device on probe error
i915:
- move to kernel standard fault injection
- bump recommended GuC version for DG2 and MTL
amdgpu:
- SMUIO 15.x, PSP 15.x support
- IH 6.1.1/7.1 support
- MMHUB 3.4/4.2 support
- GC 11.5.4/12.1 support
- SDMA 6.1.4/7.1/7.11.4 support
- JPEG 5.3 support
- UserQ updates
- GC 9 gfx queue reset support
- TTM memory ops parallelization
- convert legacy logging to new helpers
- DC analog fixes
amdkfd:
- GC 11.5.4/12.1 suppport
- SDMA 6.1.4/7.1 support
- per context support
- increase kfd process hash table
- Reserved SDMA rework
radeon:
- convert legacy logging to new helpers
- use devm for i2c adapters
msm:
- GPU
- Document a612/RGMU dt bindings
- UBWC 6.0 support (for A840 / Kaanapali)
- a225 support
- DPU:
- Switch to use virtual planes by default
- Fix DSI CMD panels on DPU 3.x
- Rewrite format handling to remove intermediate representation
- Fix watchdog on DPU 8.x+
- Fix TE / Vsync source setting on DPU 8.x+
- Add 3D_Mux on SC7280
- Kaanapali platform support
- Fix UBWC register programming
- Make RM reserve DSPP-enabled mixers for CRTCs with LMs
- Gamma correction support
- DP:
- Enable support for eDP 1.4+ link rate tables
- Fix MDSS1 DP indices on SA8775P, making them to work
- Fix msm_dp_ctrl_config_msa() to work with LLVM 20
- DSI:
- Document QCS8300 as compatible with SA8775P
- Kaanapali platform support
- DSI PHY:
- switch to divider_determine_rate()
- MDP5:
- Drop support for MSM8998, SDM660 and SDM630 (switch over to DPU)
- MDSS:
- Kaanapali platform support
- Fixed UBWC register programming
nova-core:
- Prepare for Turing support. This includes parsing and handling
Turing-specific firmware headers and sections as well as a Turing
Falcon HAL implementation
- Get rid of the Result<impl PinInit<T, E>> anti-pattern
- Relocate initializer-specific code into the appropriate initializer
- Use CStr::from_bytes_until_nul() to remove custom helpers
- Improve handling of unexpected firmware values
- Clean up redundant debug prints
- Replace c_str!() with native Rust C-string literals
- Update nova-core task list
nova:
- Align GEM object size to system page size
tyr:
- Use generated uAPI bindings for GpuInfo
- Replace manual sleeps with read_poll_timeout()
- Replace c_str!() with native Rust C-string literals
- Suppress warnings for unread fields
- Fix incorrect register name in print statement
nouveau:
- fix big page table support races in PTE management
- improve reclocking on tegra 186+
amdxdna:
- fix suspend race conditions
- improve handling of zero tail pointers
- fix cu_idx overwritten during command setup
- enable hardware context priority
- remove NPU2 support
- update message buffer allocation requirements
- update firmware version check
ast:
- support imported cursor buffers
- big endian fixes
etnaviv:
- add PPU flop reset support
imagination:
- add AM62P support
- introduce hw version checks
ivpu:
- implement warm boot flow
panfrost:
- add bo sync ioctl
- add GPU_PM_RT support for RZ/G3E SoC
panthor:
- add bo sync ioctl
- enable timestamp propagation
- scheduler robustness improvements
- VM termination fixes
- huge page support
rockchip:
- RK3368 HDMI Support
- get rid of atomic_check fixups
- RK3506 support
- RK3576/RK3588 improved HPD handling
rz-du:
- RZ/V2H(P) MIPI-DSI Support
v3d:
- fix DMA segment size
- convert to new logging helpers
mediatek:
- move DP training to hotplug thread
- convert logging to new helpers
- add support for HS speed DSI
- Genio 510/700/1200-EVK, Radxa NIO-12L HDMI support
atmel-hlcdc:
- switch to drmm resource
- support nomodeset
- use newer helpers
hisilicon:
- fix various DP bugs
renesas:
- fix kernel panic on reboot
exynos:
- fix vidi_connection_ioctl using wrong device
- fix vidi_connection deref user ptr
- fix concurrency regression with vidi_context
vkms:
- add configfs support for display configuration
* tag 'drm-next-2026-02-11' of https://gitlab.freedesktop.org/drm/kernel: (1610 commits)
drm/xe/pm: Disable D3Cold for BMG only on specific platforms
drm/xe: Fix kerneldoc for xe_tlb_inval_job_alloc_dep
drm/xe: Fix kerneldoc for xe_gt_tlb_inval_init_early
drm/xe: Fix kerneldoc for xe_migrate_exec_queue
drm/xe/query: Fix topology query pointer advance
drm/xe/guc: Fix kernel-doc warning in GuC scheduler ABI header
drm/xe/guc: Fix CFI violation in debugfs access.
accel/amdxdna: Move RPM resume into job run function
accel/amdxdna: Fix incorrect DPM level after suspend/resume
nouveau/vmm: start tracking if the LPT PTE is valid. (v6)
nouveau/vmm: increase size of vmm pte tracker struct to u32 (v2)
nouveau/vmm: rewrite pte tracker using a struct and bitfields.
accel/amdxdna: Fix incorrect error code returned for failed chain command
accel/amdxdna: Remove hardware context status
drm/bridge: imx8qxp-pixel-combiner: Fix bailout for imx8qxp_pc_bridge_probe()
drm/panel: ilitek-ili9882t: Remove duplicate initializers in tianma_il79900a_dsc
drm/i915/display: fix the pixel normalization handling for xe3p_lpd
drm/exynos: vidi: use ctx->lock to protect struct vidi_context member variables related to memory alloc/free
drm/exynos: vidi: fix to avoid directly dereferencing user pointer
drm/exynos: vidi: use priv->vidi_dev for ctx lookup in vidi_connection_ioctl()
...
|
|
Firmware and monitoring tools may not be ready to receive a CPER when we
read the bad pages, so send the CPERs at the end of RAS initialization
to ensure that the FW is ready to receive and process the CPER. This
removes the previous CPER submission that was added during bad page
load, and sends both in-band and out-of-band at the same time.
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The commit 6a23e7b4332c ("drm/amd: Clean up kfd node on surprise
disconnect") introduced early KFD cleanup when drm_dev_is_unplugged()
returns true. However, this causes hangs during normal module unload
(rmmod amdgpu).
The issue occurs because drm_dev_unplug() is called in amdgpu_pci_remove()
for all removal scenarios, not just surprise disconnects. This was done
intentionally in commit 39934d3ed572 ("Revert "drm/amdgpu: TA unload
messages are not actually sent to psp when amdgpu is uninstalled"") to
fix IGT PCI software unplug test failures. As a result,
drm_dev_is_unplugged() returns true even during normal module unload,
triggering the early KFD cleanup inappropriately.
The correct check should distinguish between:
- Actual surprise disconnect (eGPU unplugged): pci_dev_is_disconnected()
returns true
- Normal module unload (rmmod): pci_dev_is_disconnected() returns false
Replace drm_dev_is_unplugged() with pci_dev_is_disconnected() to ensure
the early cleanup only happens during true hardware disconnect events.
Cc: stable@vger.kernel.org
Reported-by: Cal Peake <cp@absolutedigital.net>
Closes: https://lore.kernel.org/all/b0c22deb-c0fa-3343-33cf-fd9a77d7db99@absolutedigital.net/
Fixes: 6a23e7b4332c ("drm/amd: Clean up kfd node on surprise disconnect")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Returning DRM_GPU_SCHED_STAT_NO_HANG causes the scheduler
to add the bad job back the pending list. We've already
set the errors on the fence and killed the bad job at this point
so it's the correct behavior.
Acked-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
sdma ring reset is not supported in SRIOV. kfd driver does not check
reset mask, and could queue sdma ring reset during unmap_queues_cpsch.
Avoid the ring reset for sriov.
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
In low memory conditions, kmalloc can fail. In such conditions
unlock the mutex for a clean exit.
We do not need to amdgpu_bo_list_put as it's been handled in the
amdgpu_cs_parser_fini.
Fixes: 737da5363cc0 ("drm/amdgpu: update the functions to use amdgpu version of hmm")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202602030017.7E0xShmH-lkp@intel.com/
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Regardless if CPU enable 5-level paging, GPU vm use 5-level paging if
gmc init with 57-bit address space support, because
ARM64 4-level paging support 48-bit VA, x86 and GPU 4-level paging
support 47-bit VA, require 5-level paging on GPU to support ARM64.
NPA address space 52-bit mapping on NPA GPU VM require 5-level paging.
Debugger trap get device snapshot expect LDS and Scratch base, limit
above 57-bit, which is set only for 5-level paging.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.19.x
|
|
Ungate GPU CG/PG in device_fini_hw and device_halt to protect GPU
register accesses, e.g. GC registers are accessed in amdgpu_irq_disable_all()
and amdgpu_fence_driver_hw_fini().
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|
|
There is no firmware version dependency. This also
enables sdma queue resets on all SDMA 6.x based
chips.
Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask")
Cc: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
There is no firmware version dependency. This also
enables sdma queue resets on all SDMA 5.2.x based
chips.
Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask")
Cc: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
There is no firmware version dependency.
Fixes: 59fd50b8663b ("drm/amdgpu: Add sysfs interface for sdma reset mask")
Cc: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Jesse.Zhang <Jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
When amdgpu_nbio_ras_sw_init() fails in amdgpu_ras_init(), the function
returns directly without freeing the allocated con structure, leading
to a memory leak.
Fix this by jumping to the release_con label to properly clean up the
allocated memory before returning the error code.
Compile tested only. Issue found using a prototype static analysis tool
and code review.
Fixes: fdc94d3a8c88 ("drm/amdgpu: Rework pcie_bif ras sw_init")
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|