summaryrefslogtreecommitdiffstats
path: root/kernel/events
AgeCommit message (Collapse)AuthorLines
11 daysMerge tag 'mm-stable-2026-04-13-21-45' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...
12 daysMerge tag 'perf-core-2026-04-13' of ↵Linus Torvalds-22/+21
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Core updates: - Try to allocate task_ctx_data quickly, to optimize O(N^2) algorithm on large systems with O(100k) threads (Namhyung Kim) AMD PMU driver IBS support updates and fixes, by Ravi Bangoria: - Fix interrupt accounting for discarded samples - Fix a Zen5-specific quirk - Fix PhyAddrVal handling - Fix NMI-safety with perf_allow_kernel() - Fix a race between event add and NMIs Intel PMU driver updates: - Only check GP counters for PEBS constraints validation (Dapeng Mi) MSR driver: - Turn SMI_COUNT and PPERF on by default, instead of a long list of CPU models to enable them on (Kan Liang) ... and misc cleanups and fixes by Aldf Conte, Anshuman Khandual, Namhyung Kim, Ravi Bangoria and Yen-Hsiang Hsu" * tag 'perf-core-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/events: Replace READ_ONCE() with standard pgtable accessors perf/x86/msr: Make SMI and PPERF on by default perf/x86/intel/p4: Fix unused variable warning in p4_pmu_init() perf/x86/intel: Only check GP counters for PEBS constraints validation perf/x86/amd/ibs: Fix comment typo in ibs_op_data perf/amd/ibs: Advertise remote socket capability perf/amd/ibs: Enable streaming store filter perf/amd/ibs: Enable RIP bit63 hardware filtering perf/amd/ibs: Enable fetch latency filtering perf/amd/ibs: Support IBS_{FETCH|OP}_CTL2[Dis] to eliminate RMW race perf/amd/ibs: Add new MSRs and CPUID bits definitions perf/amd/ibs: Define macro for ldlat mask and shift perf/amd/ibs: Avoid race between event add and NMI perf/amd/ibs: Avoid calling perf_allow_kernel() from the IBS NMI handler perf/amd/ibs: Preserve PhyAddrVal bit when clearing PhyAddr MSR perf/amd/ibs: Limit ldlat->l3missonly dependency to Zen5 perf/amd/ibs: Account interrupt for discarded samples perf/core: Simplify __detach_global_ctx_data() perf/core: Try to allocate task_ctx_data quickly perf/core: Pass GFP flags to attach_task_ctx_data()
13 daysMerge tag 'vfs-7.1-rc1.kino' of ↵Linus Torvalds-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs i_ino updates from Christian Brauner: "For historical reasons, the inode->i_ino field is an unsigned long, which means that it's 32 bits on 32 bit architectures. This has caused a number of filesystems to implement hacks to hash a 64-bit identifier into a 32-bit field, and deprives us of a universal identifier field for an inode. This changes the inode->i_ino field from an unsigned long to a u64. This shouldn't make any material difference on 64-bit hosts, but 32-bit hosts will see struct inode grow by at least 4 bytes. This could have effects on slabcache sizes and field alignment. The bulk of the changes are to format strings and tracepoints, since the kernel itself doesn't care that much about the i_ino field. The first patch changes some vfs function arguments, so check that one out carefully. With this change, we may be able to shrink some inode structures. For instance, struct nfs_inode has a fileid field that holds the 64-bit inode number. With this set of changes, that field could be eliminated. I'd rather leave that sort of cleanups for later just to keep this simple" * tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group() EVM: add comment describing why ino field is still unsigned long vfs: remove externs from fs.h on functions modified by i_ino widening treewide: fix missed i_ino format specifier conversions ext4: fix signed format specifier in ext4_load_inode trace event treewide: change inode->i_ino from unsigned long to u64 nilfs2: widen trace event i_ino fields to u64 f2fs: widen trace event i_ino fields to u64 ext4: widen trace event i_ino fields to u64 zonefs: widen trace event i_ino fields to u64 hugetlbfs: widen trace event i_ino fields to u64 ext2: widen trace event i_ino fields to u64 cachefiles: widen trace event i_ino fields to u64 vfs: widen trace event i_ino fields to u64 net: change sock.sk_ino and sock_i_ino() to u64 audit: widen ino fields to u64 vfs: widen inode hash/lookup functions to u64
2026-04-08perf/events: Replace READ_ONCE() with standard pgtable accessorsAnshuman Khandual-3/+3
Replace raw READ_ONCE() dereferences of pgtable entries with corresponding standard page table accessors pxdp_get() in perf_get_pgtable_size(). These accessors default to READ_ONCE() on platforms that don't override them. So there is no functional change on such platforms. However arm64 platform is being extended to support 128 bit page tables via a new architecture feature i.e FEAT_D128 in which case READ_ONCE() will not provide required single copy atomic access for 128 bit page table entries. Although pxdp_get() accessors can later be overridden on arm64 platform to extend required single copy atomicity support on 128 bit entries. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260227062744.2215491-1-anshuman.khandual@arm.com
2026-04-05mm: rename zap_page_range_single() to zap_vma_range()David Hildenbrand (Arm)-1/+1
Let's rename it to make it better match our new naming scheme. While at it, polish the kerneldoc. [akpm@linux-foundation.org: fix rustfmtcheck] Link: https://lkml.kernel.org/r/20260227200848.114019-15-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05mm/memory: remove "zap_details" parameter from zap_page_range_single()David Hildenbrand (Arm)-1/+1
Nobody except memory.c should really set that parameter to non-NULL. So let's just drop it and make unmap_mapping_range_vma() use zap_page_range_single_batched() instead. [david@kernel.org: format on a single line] Link: https://lkml.kernel.org/r/8a27e9ac-2025-4724-a46d-0a7c90894ba7@kernel.org Link: https://lkml.kernel.org/r/20260227200848.114019-3-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Puranjay Mohan <puranjay@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Arve <arve@android.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Airlie <airlied@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jann Horn <jannh@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Todd Kjos <tkjos@android.com> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-12perf: Make sure to use pmu_ctx->pmu for groupsPeter Zijlstra-11/+8
Oliver reported that x86_pmu_del() ended up doing an out-of-bound memory access when group_sched_in() fails and needs to roll back. This *should* be handled by the transaction callbacks, but he found that when the group leader is a software event, the transaction handlers of the wrong PMU are used. Despite the move_group case in perf_event_open() and group_sched_in() using pmu_ctx->pmu. Turns out, inherit uses event->pmu to clone the events, effectively undoing the move_group case for all inherited contexts. Fix this by also making inherit use pmu_ctx->pmu, ensuring all inherited counters end up in the same pmu context. Similarly, __perf_event_read() should use equally use pmu_ctx->pmu for the group case. Fixes: bd2756811766 ("perf: Rewrite core context handling") Reported-by: Oliver Rosenberg <olrose55@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/20260309133713.GB606826@noisy.programming.kicks-ass.net
2026-03-06treewide: change inode->i_ino from unsigned long to u64Jeff Layton-2/+2
On 32-bit architectures, unsigned long is only 32 bits wide, which causes 64-bit inode numbers to be silently truncated. Several filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that exceed 32 bits, and this truncation can lead to inode number collisions and other subtle bugs on 32-bit systems. Change the type of inode->i_ino from unsigned long to u64 to ensure that inode numbers are always represented as 64-bit values regardless of architecture. Update all format specifiers treewide from %lu/%lx to %llu/%llx to match the new type, along with corresponding local variable types. This is the bulk treewide conversion. Earlier patches in this series handled trace events separately to allow trace field reordering for better struct packing on 32-bit. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-27perf/core: Simplify __detach_global_ctx_data()Namhyung Kim-11/+4
Like in the attach_global_ctx_data() it has a O(N^2) loop to delete task context data for each thread. But perf_free_ctx_data_rcu() can be called under RCU read lock, so just calls it directly rather than iterating the whole thread list again. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260211223222.3119790-4-namhyung@kernel.org
2026-02-27perf/core: Try to allocate task_ctx_data quicklyNamhyung Kim-0/+6
The attach_global_ctx_data() has O(N^2) algorithm to allocate the context data for each thread. This caused perfomance problems on large systems with O(100k) threads. Because kmalloc(GFP_KERNEL) can go sleep it cannot be called under the RCU lock. So let's try with GFP_NOWAIT first so that it can proceed in normal cases. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260211223222.3119790-3-namhyung@kernel.org
2026-02-27perf/core: Pass GFP flags to attach_task_ctx_data()Namhyung Kim-8/+8
This is a preparation for the next change to reduce the computational complexity in the global context data handling for LBR callstacks. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260211223222.3119790-2-namhyung@kernel.org
2026-02-25perf: Fix __perf_event_overflow() vs perf_remove_from_context() racePeter Zijlstra-1/+41
Make sure that __perf_event_overflow() runs with IRQs disabled for all possible callchains. Specifically the software events can end up running it with only preemption disabled. This opens up a race vs perf_event_exit_event() and friends that will go and free various things the overflow path expects to be present, like the BPF program. Fixes: 592903cdcbf6 ("perf_counter: add an event_list") Reported-by: Simond Hu <cmdhh1767@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Simond Hu <cmdhh1767@gmail.com> Link: https://patch.msgid.link/20260224122909.GV1395416@noisy.programming.kicks-ass.net
2026-02-23perf/core: Fix refcount bug and potential UAF in perf_mmapHaocheng Yu-19/+19
Syzkaller reported a refcount_t: addition on 0; use-after-free warning in perf_mmap. The issue is caused by a race condition between a failing mmap() setup and a concurrent mmap() on a dependent event (e.g., using output redirection). In perf_mmap(), the ring_buffer (rb) is allocated and assigned to event->rb with the mmap_mutex held. The mutex is then released to perform map_range(). If map_range() fails, perf_mmap_close() is called to clean up. However, since the mutex was dropped, another thread attaching to this event (via inherited events or output redirection) can acquire the mutex, observe the valid event->rb pointer, and attempt to increment its reference count. If the cleanup path has already dropped the reference count to zero, this results in a use-after-free or refcount saturation warning. Fix this by extending the scope of mmap_mutex to cover the map_range() call. This ensures that the ring buffer initialization and mapping (or cleanup on failure) happens atomically effectively, preventing other threads from accessing a half-initialized or dying ring buffer. Closes: https://lore.kernel.org/oe-kbuild-all/202602020208.m7KIjdzW-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Haocheng Yu <yuhaocheng035@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260202162057.7237-1-yuhaocheng035@gmail.com
2026-02-23perf/core: Fix invalid wait context in ctx_sched_in()Namhyung Kim-1/+2
Lockdep found a bug in the event scheduling when a pinned event was failed and wakes up the threads in the ring buffer like below. It seems it should not grab a wait-queue lock under perf-context lock. Let's do it with irq_work. [ 39.913691] ============================= [ 39.914157] [ BUG: Invalid wait context ] [ 39.914623] 6.15.0-next-20250530-next-2025053 #1 Not tainted [ 39.915271] ----------------------------- [ 39.915731] repro/837 is trying to lock: [ 39.916191] ffff88801acfabd8 (&event->waitq){....}-{3:3}, at: __wake_up+0x26/0x60 [ 39.917182] other info that might help us debug this: [ 39.917761] context-{5:5} [ 39.918079] 4 locks held by repro/837: [ 39.918530] #0: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: __perf_event_task_sched_in+0xd1/0xbc0 [ 39.919612] #1: ffff88806ca3c6f8 (&cpuctx_lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1a7/0xbc0 [ 39.920748] #2: ffff88800d91fc18 (&ctx->lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1f9/0xbc0 [ 39.921819] #3: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: perf_event_wakeup+0x6c/0x470 Fixes: f4b07fd62d4d ("perf/core: Use POLLHUP for a pinned event in error") Closes: https://lore.kernel.org/lkml/aD2w50VDvGIH95Pf@ly-workstation Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com> Link: https://patch.msgid.link/20250603045105.1731451-1-namhyung@kernel.org
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook-1/+1
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds-12/+12
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook-15/+16
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-10Merge tag 'perf-core-2026-02-09' of ↵Linus Torvalds-132/+436
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance event updates from Ingo Molnar: "x86 PMU driver updates: - Add support for the core PMU for Intel Diamond Rapids (DMR) CPUs (Dapeng Mi) Compared to previous iterations of the Intel PMU code, there's been a lot of changes, which center around three main areas: - Introduce the OFF-MODULE RESPONSE (OMR) facility to replace the Off-Core Response (OCR) facility - New PEBS data source encoding layout - Support the new "RDPMC user disable" feature - Likewise, a large series adds uncore PMU support for Intel Diamond Rapids (DMR) CPUs (Zide Chen) This centers around these four main areas: - DMR may have two Integrated I/O and Memory Hub (IMH) dies, separate from the compute tile (CBB) dies. Each CBB and each IMH die has its own discovery domain. - Unlike prior CPUs that retrieve the global discovery table portal exclusively via PCI or MSR, DMR uses PCI for IMH PMON discovery and MSR for CBB PMON discovery. - DMR introduces several new PMON types: SCA, HAMVF, D2D_ULA, UBR, PCIE4, CRS, CPC, ITC, OTC, CMS, and PCIE6. - IIO free-running counters in DMR are MMIO-based, unlike SPR. - Also add support for Add missing PMON units for Intel Panther Lake, and support Nova Lake (NVL), which largely maps to Panther Lake. (Zide Chen) - KVM integration: Add support for mediated vPMUs (by Kan Liang and Sean Christopherson, with fixes and cleanups by Peter Zijlstra, Sandipan Das and Mingwei Zhang) - Add Intel cstate driver to support for Wildcat Lake (WCL) CPUs, which are a low-power variant of Panther Lake (Zide Chen) - Add core, cstate and MSR PMU support for the Airmont NP Intel CPU (aka MaxLinear Lightning Mountain), which maps to the existing Airmont code (Martin Schiller) Performance enhancements: - Speed up kexec shutdown by avoiding unnecessary cross CPU calls (Jan H. Schönherr) - Fix slow perf_event_task_exit() with LBR callstacks (Namhyung Kim) User-space stack unwinding support: - Various cleanups and refactorings in preparation to generalize the unwinding code for other architectures (Jens Remus) Uprobes updates: - Transition from kmap_atomic to kmap_local_page (Keke Ming) - Fix incorrect lockdep condition in filter_chain() (Breno Leitao) - Fix XOL allocation failure for 32-bit tasks (Oleg Nesterov) Misc fixes and cleanups: - s390: Remove kvm_types.h from Kbuild (Randy Dunlap) - x86/intel/uncore: Convert comma to semicolon (Chen Ni) - x86/uncore: Clean up const mismatch (Greg Kroah-Hartman) - x86/ibs: Fix typo in dc_l2tlb_miss comment (Xiang-Bin Shi)" * tag 'perf-core-2026-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) s390: remove kvm_types.h from Kbuild uprobes: Fix incorrect lockdep condition in filter_chain() x86/ibs: Fix typo in dc_l2tlb_miss comment x86/uprobes: Fix XOL allocation failure for 32-bit tasks perf/x86/intel/uncore: Convert comma to semicolon perf/x86/intel: Add support for rdpmc user disable feature perf/x86: Use macros to replace magic numbers in attr_rdpmc perf/x86/intel: Add core PMU support for Novalake perf/x86/intel: Add support for PEBS memory auxiliary info field in NVL perf/x86/intel: Add core PMU support for DMR perf/x86/intel: Add support for PEBS memory auxiliary info field in DMR perf/x86/intel: Support the 4 new OMR MSRs introduced in DMR and NVL perf/core: Fix slow perf_event_task_exit() with LBR callstacks perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU calls uprobes: use kmap_local_page() for temporary page mappings arm/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() mips/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() arm64/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() riscv/uprobes: use kmap_local_page() in arch_uprobe_copy_ixol() perf/x86/intel/uncore: Add Nova Lake support ...
2026-02-02uprobes: Fix incorrect lockdep condition in filter_chain()Breno Leitao-1/+1
The list_for_each_entry_rcu() in filter_chain() uses rcu_read_lock_trace_held() as the lockdep condition, but the function holds consumer_rwsem, not the RCU trace lock. This gives me the following output when running with some locking debug option enabled: kernel/events/uprobes.c:1141 RCU-list traversed in non-reader section!! filter_chain register_for_each_vma uprobe_unregister_nosync __probe_event_disable Remove the incorrect lockdep condition since the rwsem provides sufficient protection for the list traversal. Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260128-uprobe_rcu-v2-1-994ea6d32730@debian.org
2026-01-30perf: sched: Fix perf crash with new is_user_task() helperSteven Rostedt-4/+4
In order to do a user space stacktrace the current task needs to be a user task that has executed in user space. It use to be possible to test if a task is a user task or not by simply checking the task_struct mm field. If it was non NULL, it was a user task and if not it was a kernel task. But things have changed over time, and some kernel tasks now have their own mm field. An idea was made to instead test PF_KTHREAD and two functions were used to wrap this check in case it became more complex to test if a task was a user task or not[1]. But this was rejected and the C code simply checked the PF_KTHREAD directly. It was later found that not all kernel threads set PF_KTHREAD. The io-uring helpers instead set PF_USER_WORKER and this needed to be added as well. But checking the flags is still not enough. There's a very small window when a task exits that it frees its mm field and it is set back to NULL. If perf were to trigger at this moment, the flags test would say its a user space task but when perf would read the mm field it would crash with at NULL pointer dereference. Now there are flags that can be used to test if a task is exiting, but they are set in areas that perf may still want to profile the user space task (to see where it exited). The only real test is to check both the flags and the mm field. Instead of making this modification in every location, create a new is_user_task() helper function that does all the tests needed to know if it is safe to read the user space memory or not. [1] https://lore.kernel.org/all/20250425204120.639530125@goodmis.org/ Fixes: 90942f9fac05 ("perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL") Closes: https://lore.kernel.org/all/0d877e6f-41a7-4724-875d-0b0a27b8a545@roeck-us.net/ Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260129102821.46484722@gandalf.local.home
2026-01-21perf: Fix refcount warning on event->mmap_count incrementWill Rosenberg-0/+9
When calling refcount_inc(&event->mmap_count) inside perf_mmap_rb(), the following warning is triggered: refcount_t: addition on 0; use-after-free. WARNING: lib/refcount.c:25 PoC: struct perf_event_attr attr = {0}; int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0); mmap(NULL, 0x3000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); int victim = syscall(__NR_perf_event_open, &attr, 0, -1, fd, PERF_FLAG_FD_OUTPUT); mmap(NULL, 0x3000, PROT_READ | PROT_WRITE, MAP_SHARED, victim, 0); This occurs when creating a group member event with the flag PERF_FLAG_FD_OUTPUT. The group leader should be mmap-ed and then mmap-ing the event triggers the warning. Since the event has copied the output_event in perf_event_set_output(), event->rb is set. As a result, perf_mmap_rb() calls refcount_inc(&event->mmap_count) when event->mmap_count = 0. Disallow the case when event->mmap_count = 0. This also prevents two events from updating the same user_page. Fixes: 448f97fba901 ("perf: Convert mmap() refcounts to refcount_t") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Will Rosenberg <whrosenb@asu.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260119184956.801238-1-whrosenb@asu.edu
2026-01-16x86/uprobes: Fix XOL allocation failure for 32-bit tasksOleg Nesterov-3/+7
This script #!/usr/bin/bash echo 0 > /proc/sys/kernel/randomize_va_space echo 'void main(void) {}' > TEST.c # -fcf-protection to ensure that the 1st endbr32 insn can't be emulated gcc -m32 -fcf-protection=branch TEST.c -o test bpftrace -e 'uprobe:./test:main {}' -c ./test "hangs", the probed ./test task enters an endless loop. The problem is that with randomize_va_space == 0 get_unmapped_area(TASK_SIZE - PAGE_SIZE) called by xol_add_vma() can not just return the "addr == TASK_SIZE - PAGE_SIZE" hint, this addr is used by the stack vma. arch_get_unmapped_area_topdown() doesn't take TIF_ADDR32 into account and in_32bit_syscall() is false, this leads to info.high_limit > TASK_SIZE. vm_unmapped_area() happily returns the high address > TASK_SIZE and then get_unmapped_area() returns -ENOMEM after the "if (addr > TASK_SIZE - len)" check. handle_swbp() doesn't report this failure (probably it should) and silently restarts the probed insn. Endless loop. I think that the right fix should change the x86 get_unmapped_area() paths to rely on TIF_ADDR32 rather than in_32bit_syscall(). Note also that if CONFIG_X86_X32_ABI=y, in_x32_syscall() falsely returns true in this case because ->orig_ax = -1. But we need a simple fix for -stable, so this patch just sets TS_COMPAT if the probed task is 32-bit to make in_ia32_syscall() true. Fixes: 1b028f784e8c ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Reported-by: Paulo Andrade <pandrade@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/aV5uldEvV7pb4RA8@redhat.com/ Cc: stable@vger.kernel.org Link: https://patch.msgid.link/aWO7Fdxn39piQnxu@redhat.com
2026-01-15perf/core: Fix slow perf_event_task_exit() with LBR callstacksNamhyung Kim-2/+18
I got a report that a task is stuck in perf_event_exit_task() waiting for global_ctx_data_rwsem. On large systems with lots threads, it'd have performance issues when it grabs the lock to iterate all threads in the system to allocate the context data. And it'd block task exit path which is problematic especially under memory pressure. perf_event_open perf_event_alloc attach_perf_ctx_data attach_global_ctx_data percpu_down_write (global_ctx_data_rwsem) for_each_process_thread alloc_task_ctx_data do_exit perf_event_exit_task percpu_down_read (global_ctx_data_rwsem) It should not hold the global_ctx_data_rwsem on the exit path. Let's skip allocation for exiting tasks and free the data carefully. Reported-by: Rosalie Fang <rosaliefang@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260112165157.1919624-1-namhyung@kernel.org
2026-01-13perf/core: Speed up kexec shutdown by avoiding unnecessary cross CPU callsJan H. Schönherr-1/+2
There are typically a lot of PMUs registered, but in many cases only few of them have an event registered (like the "cpu" PMU in the presence of the watchdog). As the mutex is already held, it's safe to just check for existing events before doing the cross CPU call. This change saves tens of milliseconds from kexec time (perceived as steal time during a hypervisor host update), with <2ms remaining for this step in the shutdown. There might be additional potential for parallelization or we could just disable performance monitoring during the actual shutdown and be less graceful about it. Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2026-01-11Merge tag 'perf-urgent-2026-01-11' of ↵Linus Torvalds-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fix from Ingo Molnar: "Fix perf swevent hrtimer deinit regression" * tag 'perf-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Ensure swevent hrtimer is properly destroyed
2026-01-11treewide: Update email addressThomas Gleixner-3/+3
In a vain attempt to consolidate the email zoo switch everything to the kernel.org account. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-01-06uprobes: use kmap_local_page() for temporary page mappingsKeke Ming-6/+6
Replace deprecated kmap_atomic() with kmap_local_page(). Signed-off-by: Keke Ming <ming.jvle@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://patch.msgid.link/20260103084243.195125-6-ming.jvle@gmail.com
2026-01-05perf: Ensure swevent hrtimer is properly destroyedPeter Zijlstra-0/+6
With the change to hrtimer_try_to_cancel() in perf_swevent_cancel_hrtimer() it appears possible for the hrtimer to still be active by the time the event gets freed. Make sure the event does a full hrtimer_cancel() on the free path by installing a perf_event::destroy handler. Fixes: eb3182ef0405 ("perf/core: Fix system hang caused by cpu-clock usage") Reported-by: CyberUnicorns <a101e_iotvul@163.com> Tested-by: CyberUnicorns <a101e_iotvul@163.com> Debugged-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-12-19perf: Use EXPORT_SYMBOL_FOR_KVM() for the mediated APIsPeter Zijlstra-2/+3
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net
2025-12-17perf: Clean up mediated vPMU accountingPeter Zijlstra-2/+7
The mediated_pmu_account_event() and perf_create_mediated_pmu() functions implement the exclusion between '!exclude_guest' counters and mediated vPMUs. Their implementation is basically identical, except mirrored in what they count/check. Make sure the actual implementations reflect this similarity. Notably: - while perf_release_mediated_pmu() has an underflow check; mediated_pmu_unaccount_event() did not. - while perf_create_mediated_pmu() has an inc_not_zero() path; mediated_pmu_account_event() did not. Also, the inc_not_zero() path can be outsite of perf_mediated_pmu_mutex. The mutex must guard the 0->1 (of either nr_include_guest_events or nr_mediated_pmu_vms) transition, but once a counter is already non-zero, it can safely be incremented further. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net
2025-12-17perf/x86/core: Register a new vector for handling mediated guest PMIsSean Christopherson-2/+7
Wire up system vector 0xf5 for handling PMIs (i.e. interrupts delivered through the LVTPC) while running KVM guests with a mediated PMU. Perf currently delivers all PMIs as NMIs, e.g. so that events that trigger while IRQs are disabled aren't delayed and generate useless records, but due to the multiplexing of NMIs throughout the system, correctly identifying NMIs for a mediated PMU is practically infeasible. To (greatly) simplify identifying guest mediated PMU PMIs, perf will switch the CPU's LVTPC between PERF_GUEST_MEDIATED_PMI_VECTOR and NMI when guest PMU context is loaded/put. I.e. PMIs that are generated by the CPU while the guest is active will be identified purely based on the IRQ vector. Route the vector through perf, e.g. as opposed to letting KVM attach a handler directly a la posted interrupt notification vectors, as perf owns the LVTPC and thus is the rightful owner of PERF_GUEST_MEDIATED_PMI_VECTOR. Functionally, having KVM directly own the vector would be fine (both KVM and perf will be completely aware of when a mediated PMU is active), but would lead to an undesirable split in ownership: perf would be responsible for installing the vector, but not handling the resulting IRQs. Add a new perf_guest_info_callbacks hook (and static call) to allow KVM to register its handler with perf when running guests with mediated PMUs. Note, because KVM always runs guests with host IRQs enabled, there is no danger of a PMI being delayed from the guest's perspective due to using a regular IRQ instead of an NMI. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-9-seanjc@google.com
2025-12-17perf: Add APIs to load/put guest mediated PMU contextKan Liang-0/+61
Add exported APIs to load/put a guest mediated PMU context. KVM will load the guest PMU shortly before VM-Enter, and put the guest PMU shortly after VM-Exit. On the perf side of things, schedule out all exclude_guest events when the guest context is loaded, and schedule them back in when the guest context is put. I.e. yield the hardware PMU resources to the guest, by way of KVM. Note, perf is only responsible for managing host context. KVM is responsible for loading/storing guest state to/from hardware. [sean: shuffle patches around, write changelog] Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-8-seanjc@google.com
2025-12-17perf: Add a EVENT_GUEST flagKan Liang-51/+179
Current perf doesn't explicitly schedule out all exclude_guest events while the guest is running. There is no problem with the current emulated vPMU. Because perf owns all the PMU counters. It can mask the counter which is assigned to an exclude_guest event when a guest is running (Intel way), or set the corresponding HOSTONLY bit in evsentsel (AMD way). The counter doesn't count when a guest is running. However, either way doesn't work with the introduced mediated vPMU. A guest owns all the PMU counters when it's running. The host should not mask any counters. The counter may be used by the guest. The evsentsel may be overwritten. Perf should explicitly schedule out all exclude_guest events to release the PMU resources when entering a guest, and resume the counting when exiting the guest. It's possible that an exclude_guest event is created when a guest is running. The new event should not be scheduled in as well. The ctx time is shared among different PMUs. The time cannot be stopped when a guest is running. It is required to calculate the time for events from other PMUs, e.g., uncore events. Add timeguest to track the guest run time. For an exclude_guest event, the elapsed time equals the ctx time - guest time. Cgroup has dedicated times. Use the same method to deduct the guest time from the cgroup time as well. [sean: massage comments] Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-7-seanjc@google.com
2025-12-17perf: Clean up perf ctx timeKan Liang-38/+32
The current perf tracks two timestamps for the normal ctx and cgroup. The same type of variables and similar codes are used to track the timestamps. In the following patch, the third timestamp to track the guest time will be introduced. To avoid the code duplication, add a new struct perf_time_ctx and factor out a generic function update_perf_time_ctx(). No functional change. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-6-seanjc@google.com
2025-12-17perf: Add APIs to create/release mediated guest vPMUsKan Liang-0/+82
Currently, exposing PMU capabilities to a KVM guest is done by emulating guest PMCs via host perf events, i.e. by having KVM be "just" another user of perf. As a result, the guest and host are effectively competing for resources, and emulating guest accesses to vPMU resources requires expensive actions (expensive relative to the native instruction). The overhead and resource competition results in degraded guest performance and ultimately very poor vPMU accuracy. To address the issues with the perf-emulated vPMU, introduce a "mediated vPMU", where the data plane (PMCs and enable/disable knobs) is exposed directly to the guest, but the control plane (event selectors and access to fixed counters) is managed by KVM (via MSR interceptions). To allow host perf usage of the PMU to (partially) co-exist with KVM/guest usage of the PMU, KVM and perf will coordinate to a world switch between host perf context and guest vPMU context near VM-Enter/VM-Exit. Add two exported APIs, perf_{create,release}_mediated_pmu(), to allow KVM to create and release a mediated PMU instance (per VM). Because host perf context will be deactivated while the guest is running, mediated PMU usage will be mutually exclusive with perf analysis of the guest, i.e. perf events that do NOT exclude the guest will not behave as expected. To avoid silent failure of !exclude_guest perf events, disallow creating a mediated PMU if there are active !exclude_guest events, and on the perf side, disallowing creating new !exclude_guest perf events while there is at least one active mediated PMU. Exempt PMU resources that do not support mediated PMU usage, i.e. that are outside the scope/view of KVM's vPMU and will not be swapped out while the guest is running. Guard mediated PMU with a new kconfig to help readers identify code paths that are unique to mediated PMU support, and to allow for adding arch- specific hooks without stubs. KVM x86 is expected to be the only KVM architecture to support a mediated PMU in the near future (e.g. arm64 is trending toward a partitioned PMU implementation), and KVM x86 will select PERF_GUEST_MEDIATED_PMU unconditionally, i.e. won't need stubs. Immediately select PERF_GUEST_MEDIATED_PMU when KVM x86 is enabled so that all paths are compile tested. Full KVM support is on its way... [sean: add kconfig and WARNing, rewrite changelog, swizzle patch ordering] Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-5-seanjc@google.com
2025-12-17perf: Move security_perf_event_free() call to __free_event()Sean Christopherson-2/+2
Move the freeing of any security state associated with a perf event from _free_event() to __free_event(), i.e. invoke security_perf_event_free() in the error paths for perf_event_alloc(). This will allow adding potential error paths in perf_event_alloc() that can occur after allocating security state. Note, kfree() and thus security_perf_event_free() is a nop if event->security is NULL, i.e. calling security_perf_event_free() even if security_perf_event_alloc() fails or is never reached is functionality ok. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-4-seanjc@google.com
2025-12-17perf: Add generic exclude_guest supportKan Liang-7/+8
Only KVM knows the exact time when a guest is entering/exiting. Expose two interfaces to KVM to switch the ownership of the PMU resources. All the pinned events must be scheduled in first. Extend the perf_event_sched_in() helper to support extra flag, e.g., EVENT_GUEST. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-3-seanjc@google.com
2025-12-17perf: Skip pmu_ctx based on event_typeKan Liang-34/+40
To optimize the cgroup context switch, the perf_event_pmu_context iteration skips the PMUs without cgroup events. A bool cgroup was introduced to indicate the case. It can work, but this way is hard to extend for other cases, e.g. skipping non-mediated PMUs. It doesn't make sense to keep adding bool variables. Pass the event_type instead of the specific bool variable. Check both the event_type and related pmu_ctx variables to decide whether skipping a PMU. Event flags, e.g., EVENT_CGROUP, should be cleard in the ctx->is_active. Add EVENT_FLAGS to indicate such event flags. No functional change. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://patch.msgid.link/20251206001720.468579-2-seanjc@google.com
2025-12-09perf/core: Fix missing read event generation on task exitThaumy Cheng-10/+12
For events with inherit_stat enabled, a "read" event will be generated to collect per task event counts on task exit. The call chain is as follows: do_exit -> perf_event_exit_task -> perf_event_exit_task_context -> perf_event_exit_event -> perf_remove_from_context -> perf_child_detach -> sync_child_event -> perf_event_read_event However, the child event context detaches the task too early in perf_event_exit_task_context, which causes sync_child_event to never generate the read event in this case, since child_event->ctx->task is always set to TASK_TOMBSTONE. Fix that by moving context lock section backward to ensure ctx->task is not set to TASK_TOMBSTONE before generating the read event. Because perf_event_free_task calls perf_event_exit_task_context with exit = false to tear down all child events from the context, and the task never lived, accessing the task PID can lead to a use-after-free. To fix that, let sync_child_event read task from argument and move the call to the only place it should be triggered to avoid the effect of setting ctx->task to TASK_TOMESTONE, and add a task parameter to perf_event_exit_event to trigger the sync_child_event properly when needed. This bug can be reproduced by running "perf record -s" and attaching to any program that generates perf events in its child tasks. If we check the result with "perf report -T", the last line of the report will leave an empty table like "# PID TID", which is expected to contain the per-task event counts by design. Fixes: ef54c1a476ae ("perf: Rework perf_event_exit_event()") Signed-off-by: Thaumy Cheng <thaumy.love@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: linux-perf-users@vger.kernel.org Link: https://patch.msgid.link/20251209041600.963586-1-thaumy.love@gmail.com
2025-12-03perf/uprobes: Remove <space><Tab> whitespace noiseIngo Molnar-4/+4
A few cases of space-Tab noise snuck in. Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/176478594889.498.15611228524880763978.tip-bot2@tip-bot2
2025-12-01Merge tag 'perf-core-2025-12-01' of ↵Linus Torvalds-3/+89
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Callchain support: - Add support for deferred user-space stack unwinding for perf, enabled on x86. (Peter Zijlstra, Steven Rostedt) - unwind_user/x86: Enable frame pointer unwinding on x86 (Josh Poimboeuf) x86 PMU support and infrastructure: - x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra) - x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra) Intel PMU driver: - Large series to prepare for and implement architectural PEBS support for Intel platforms such as Clearwater Forest (CWF) and Panther Lake (PTL). (Dapeng Mi, Kan Liang) - Check dynamic constraints (Kan Liang) - Optimize PEBS extended config (Peter Zijlstra) - cstates: - Remove PC3 support from LunarLake (Zhang Rui) - Add Pantherlake support (Zhang Rui) - Clearwater Forest support (Zide Chen) AMD PMU driver: - x86/amd: Check event before enable to avoid GPF (George Kennedy) Fixes and cleanups: - task_work: Fix NMI race condition (Peter Zijlstra) - perf/x86: Fix NULL event access and potential PEBS record loss (Dapeng Mi) - Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter Zijlstra)" * tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use perf/x86/intel: Optimize PEBS extended config perf/x86/intel: Check PEBS dyn_constraints perf/x86/intel: Add a check for dynamic constraints perf/x86/intel: Add counter group support for arch-PEBS perf/x86/intel: Setup PEBS data configuration and enable legacy groups perf/x86/intel: Update dyn_constraint base on PEBS event precise level perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR perf/x86/intel: Process arch-PEBS records or record fragments perf/x86/intel/ds: Factor out PEBS group processing code to functions perf/x86/intel/ds: Factor out PEBS record processing code to functions perf/x86/intel: Initialize architectural PEBS perf/x86/intel: Correct large PEBS flag check perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call perf/x86: Fix NULL event access and potential PEBS record loss perf/x86: Remove redundant is_x86_event() prototype entry,unwind/deferred: Fix unwind_reset_info() placement unwind_user/x86: Fix arch=um build perf: Support deferred user unwind unwind_user/x86: Teach FP unwind about start of function ...
2025-11-20perf: Fix 0 count issue of cpu-clockDapeng Mi-1/+1
Currently cpu-clock event always returns 0 count, e.g., perf stat -e cpu-clock -- sleep 1 Performance counter stats for 'sleep 1': 0 cpu-clock # 0.000 CPUs utilized 1.002308394 seconds time elapsed The root cause is the commit 'bc4394e5e79c ("perf: Fix the throttle error of some clock events")' adds PERF_EF_UPDATE flag check before calling cpu_clock_event_update() to update the count, however the PERF_EF_UPDATE flag is never set when the cpu-clock event is stopped in counting mode (pmu->dev() -> cpu_clock_event_del() -> cpu_clock_event_stop()). This leads to the cpu-clock event count is never updated. To fix this issue, force to set PERF_EF_UPDATE flag for cpu-clock event just like what task-clock does. Fixes: bc4394e5e79c ("perf: Fix the throttle error of some clock events") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20251112080526.3971392-1-dapeng1.mi@linux.intel.com
2025-11-03perf/core: Fix system hang caused by cpu-clock usageDapeng Mi-5/+15
cpu-clock usage by the async-profiler tool can trigger a system hang, which got bisected back to the following commit by Octavia Togami: 18dbcbfabfff ("perf: Fix the POLL_HUP delivery breakage") causes this issue The root cause of the hang is that cpu-clock is a special type of SW event which relies on hrtimers. The __perf_event_overflow() callback is invoked from the hrtimer handler for cpu-clock events, and __perf_event_overflow() tries to call cpu_clock_event_stop() to stop the event, which calls htimer_cancel() to cancel the hrtimer. But that's a recursion into the hrtimer code from a hrtimer handler, which (unsurprisingly) deadlocks. To fix this bug, use hrtimer_try_to_cancel() instead, and set the PERF_HES_STOPPED flag, which causes perf_swevent_hrtimer() to stop the event once it sees the PERF_HES_STOPPED flag. [ mingo: Fixed the comments and improved the changelog. ] Closes: https://lore.kernel.org/all/CAHPNGSQpXEopYreir+uDDEbtXTBvBvi8c6fYXJvceqtgTPao3Q@mail.gmail.com/ Fixes: 18dbcbfabfff ("perf: Fix the POLL_HUP delivery breakage") Reported-by: Octavia Togami <octavia.togami@gmail.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Octavia Togami <octavia.togami@gmail.com> Cc: stable@vger.kernel.org Link: https://github.com/lucko/spark/issues/530 Link: https://patch.msgid.link/20251015051828.12809-1-dapeng1.mi@linux.intel.com
2025-10-29perf: Support deferred user unwindPeter Zijlstra-3/+89
Add support for deferred userspace unwind to perf. Where perf currently relies on in-place stack unwinding; from NMI context and all that. This moves the userspace part of the unwind to right before the return-to-userspace. This has two distinct benefits, the biggest is that it moves the unwind to a faultable context. It becomes possible to fault in debug info (.eh_frame, SFrame etc.) that might not otherwise be readily available. And secondly, it de-duplicates the user callchain where multiple samples happen during the same kernel entry. To facilitate this the perf interface is extended with a new record type: PERF_RECORD_CALLCHAIN_DEFERRED and two new attribute flags: perf_event_attr::defer_callchain - to request the user unwind be deferred perf_event_attr::defer_output - to request PERF_RECORD_CALLCHAIN_DEFERRED records The existing PERF_RECORD_SAMPLE callchain section gets a new context type: PERF_CONTEXT_USER_DEFERRED After which will come a single entry, denoting the 'cookie' of the deferred callchain that should be attached here, matching the 'cookie' field of the above mentioned PERF_RECORD_CALLCHAIN_DEFERRED. The 'defer_callchain' flag is expected on all events with PERF_SAMPLE_CALLCHAIN. The 'defer_output' flag is expect on the event responsible for collecting side-band events (like mmap, comm etc.). Setting 'defer_output' on multiple events will get you duplicated PERF_RECORD_CALLCHAIN_DEFERRED records. Based on earlier patches by Josh and Steven. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251023150002.GR4067720@noisy.programming.kicks-ass.net
2025-10-14perf/core: Fix MMAP2 event device with backing filesAdrian Hunter-2/+2
Some file systems like FUSE-based ones or overlayfs may record the backing file in struct vm_area_struct vm_file, instead of the user file that the user mmapped. That causes perf to misreport the device major/minor numbers of the file system of the file, and the generation of the file, and potentially other inode details. There is an existing helper file_user_inode() for that situation. Use file_user_inode() instead of file_inode() to get the inode for MMAP2 events. Example: Setup: # cd /root # mkdir test ; cd test ; mkdir lower upper work merged # cp `which cat` lower # mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged # perf record -e cycles:u -- /root/test/merged/cat /proc/self/maps ... 55b2c91d0000-55b2c926b000 r-xp 00018000 00:1a 3419 /root/test/merged/cat ... [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.004 MB perf.data (5 samples) ] # # stat /root/test/merged/cat File: /root/test/merged/cat Size: 1127792 Blocks: 2208 IO Block: 4096 regular file Device: 0,26 Inode: 3419 Links: 1 Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2025-09-08 12:23:59.453309624 +0000 Modify: 2025-09-08 12:23:59.454309624 +0000 Change: 2025-09-08 12:23:59.454309624 +0000 Birth: 2025-09-08 12:23:59.453309624 +0000 Before: Device reported 00:02 differs from stat output and /proc/self/maps # perf script --show-mmap-events | grep /root/test/merged/cat cat 377 [-01] 243.078558: PERF_RECORD_MMAP2 377/377: [0x55b2c91d0000(0x9b000) @ 0x18000 00:02 3419 2068525940]: r-xp /root/test/merged/cat After: Device reported 00:1a is the same as stat output and /proc/self/maps # perf script --show-mmap-events | grep /root/test/merged/cat cat 362 [-01] 127.755167: PERF_RECORD_MMAP2 362/362: [0x55ba6e781000(0x9b000) @ 0x18000 00:1a 3419 0]: r-xp /root/test/merged/cat With respect to stable kernels, overlayfs mmap function ovl_mmap() was added in v4.19 but file_user_inode() was not added until v6.8 and never back-ported to stable kernels. FMODE_BACKING that it depends on was added in v6.5. This issue has gone largely unnoticed, so back-porting before v6.8 is probably not worth it, so put 6.8 as the stable kernel prerequisite version, although in practice the next long term kernel is 6.12. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@vger.kernel.org # 6.8
2025-10-14perf/core: Fix MMAP event path names with backing filesAdrian Hunter-1/+1
Some file systems like FUSE-based ones or overlayfs may record the backing file in struct vm_area_struct vm_file, instead of the user file that the user mmapped. Since commit def3ae83da02f ("fs: store real path instead of fake path in backing file f_path"), file_path() no longer returns the user file path when applied to a backing file. There is an existing helper file_user_path() for that situation. Use file_user_path() instead of file_path() to get the path for MMAP and MMAP2 events. Example: Setup: # cd /root # mkdir test ; cd test ; mkdir lower upper work merged # cp `which cat` lower # mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged # perf record -e intel_pt//u -- /root/test/merged/cat /proc/self/maps ... 55b0ba399000-55b0ba434000 r-xp 00018000 00:1a 3419 /root/test/merged/cat ... [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.060 MB perf.data ] # Before: File name is wrong (/cat), so decoding fails: # perf script --no-itrace --show-mmap-events cat 367 [016] 100.491492: PERF_RECORD_MMAP2 367/367: [0x55b0ba399000(0x9b000) @ 0x18000 00:02 3419 489959280]: r-xp /cat ... # perf script --itrace=e | wc -l Warning: 19 instruction trace errors 19 # After: File name is correct (/root/test/merged/cat), so decoding is ok: # perf script --no-itrace --show-mmap-events cat 364 [016] 72.153006: PERF_RECORD_MMAP2 364/364: [0x55ce4003d000(0x9b000) @ 0x18000 00:02 3419 3132534314]: r-xp /root/test/merged/cat # perf script --itrace=e # perf script --itrace=e | wc -l 0 # Fixes: def3ae83da02f ("fs: store real path instead of fake path in backing file f_path") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@vger.kernel.org
2025-10-14perf/core: Fix address filter match with backing filesAdrian Hunter-1/+1
It was reported that Intel PT address filters do not work in Docker containers. That relates to the use of overlayfs. overlayfs records the backing file in struct vm_area_struct vm_file, instead of the user file that the user mmapped. In order for an address filter to match, it must compare to the user file inode. There is an existing helper file_user_inode() for that situation. Use file_user_inode() instead of file_inode() to get the inode for address filter matching. Example: Setup: # cd /root # mkdir test ; cd test ; mkdir lower upper work merged # cp `which cat` lower # mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged # perf record --buildid-mmap -e intel_pt//u --filter 'filter * @ /root/test/merged/cat' -- /root/test/merged/cat /proc/self/maps ... 55d61d246000-55d61d2e1000 r-xp 00018000 00:1a 3418 /root/test/merged/cat ... [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.015 MB perf.data ] # perf buildid-cache --add /root/test/merged/cat Before: Address filter does not match so there are no control flow packets # perf script --itrace=e # perf script --itrace=b | wc -l 0 # perf script -D | grep 'TIP.PGE' | wc -l 0 # After: Address filter does match so there are control flow packets # perf script --itrace=e # perf script --itrace=b | wc -l 235 # perf script -D | grep 'TIP.PGE' | wc -l 57 # With respect to stable kernels, overlayfs mmap function ovl_mmap() was added in v4.19 but file_user_inode() was not added until v6.8 and never back-ported to stable kernels. FMODE_BACKING that it depends on was added in v6.5. This issue has gone largely unnoticed, so back-porting before v6.8 is probably not worth it, so put 6.8 as the stable kernel prerequisite version, although in practice the next long term kernel is 6.12. Closes: https://lore.kernel.org/linux-perf-users/aBCwoq7w8ohBRQCh@fremen.lan Reported-by: Edd Barrett <edd@theunixzoo.co.uk> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@vger.kernel.org # 6.8
2025-10-14uprobe: Move arch_uprobe_optimize right after handlers executionJiri Olsa-3/+3
It's less confusing to optimize uprobe right after handlers execution and before we do the check for changed ip register to avoid situations where changed ip register would skip uprobe optimization. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of ↵Linus Torvalds-16/+16
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-10-01Merge tag 'driver-core-6.18-rc1' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Danilo Krummrich: "Auxiliary: - Drop call to dev_pm_domain_detach() in auxiliary_bus_probe() - Optimize logic of auxiliary_match_id() Rust: - Auxiliary: - Use primitive C types from prelude - DebugFs: - Add debugfs support for simple read/write files and custom callbacks through a File-type-based and directory-scope-based API - Sample driver code for the File-type-based API - Sample module code for the directory-scope-based API - I/O: - Add io::poll module and implement Rust specific read_poll_timeout() helper - IRQ: - Implement support for threaded and non-threaded device IRQs based on (&Device<Bound>, IRQ number) tuples (IrqRequest) - Provide &Device<Bound> cookie in IRQ handlers - PCI: - Support IRQ requests from IRQ vectors for a specific pci::Device<Bound> - Implement accessors for subsystem IDs, revision, devid and resource start - Provide dedicated pci::Vendor and pci::Class types for vendor and class ID numbers - Implement Display to print actual vendor and class names; Debug to print the raw ID numbers - Add pci::DeviceId::from_class_and_vendor() helper - Use primitive C types from prelude - Various minor inline and (safety) comment improvements - Platform: - Support IRQ requests from IRQ vectors for a specific platform::Device<Bound> - Nova: - Use pci::DeviceId::from_class_and_vendor() to avoid probing non-display/compute PCI functions - Misc: - Add helper for cpu_relax() - Update ARef import from sync::aref sysfs: - Remove bin_attrs_new field from struct attribute_group - Remove read_new() and write_new() from struct bin_attribute Misc: - Document potential race condition in get_dev_from_fwnode() - Constify node_group argument in software node registration functions - Fix order of kernel-doc parameters in various functions - Set power.no_pm flag for faux devices - Set power.no_callbacks flag along with the power.no_pm flag - Constify the pmu_bus bus type - Minor spelling fixes" * tag 'driver-core-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (43 commits) rust: pci: display symbolic PCI vendor names rust: pci: display symbolic PCI class names rust: pci: fix incorrect platform reference in PCI driver probe doc comment rust: pci: fix incorrect platform reference in PCI driver unbind doc comment perf: make pmu_bus const samples: rust: Add scoped debugfs sample driver rust: debugfs: Add support for scoped directories samples: rust: Add debugfs sample driver rust: debugfs: Add support for callback-based files rust: debugfs: Add support for writable files rust: debugfs: Add support for read-only files rust: debugfs: Add initial support for directories driver core: auxiliary bus: Optimize logic of auxiliary_match_id() driver core: auxiliary bus: Drop dev_pm_domain_detach() call driver core: Fix order of the kernel-doc parameters driver core: get_dev_from_fwnode(): document potential race drivers: base: fix "publically"->"publicly" driver core/PM: Set power.no_callbacks along with power.no_pm driver core: faux: Set power.no_pm for faux devices rust: pci: inline several tiny functions ...