summaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorLines
12 daysMerge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfHEADmasterLinus Torvalds-66/+321
Pull bpf fixes from Alexei Starovoitov: - Fix alignment of arm64 JIT buffer to prevent atomic tearing (Fuad Tabba) - Fix invariant violation for single value tnums in the verifier (Harishankar Vishwanathan, Paul Chaignon) - Fix a bunch of issues found by ASAN in selftests/bpf (Ihor Solodrai) - Fix race in devmpa and cpumap on PREEMPT_RT (Jiayuan Chen) - Fix show_fdinfo of kprobe_multi when cookies are not present (Jiri Olsa) - Fix race in freeing special fields in BPF maps to prevent memory leaks (Kumar Kartikeya Dwivedi) - Fix OOB read in dmabuf_collector (T.J. Mercier) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (36 commits) selftests/bpf: Avoid simplification of crafted bounds test selftests/bpf: Test refinement of single-value tnum bpf: Improve bounds when tnum has a single possible value bpf: Introduce tnum_step to step through tnum's members bpf: Fix race in devmap on PREEMPT_RT bpf: Fix race in cpumap on PREEMPT_RT selftests/bpf: Add tests for special fields races bpf: Retire rcu_trace_implies_rcu_gp() from local storage bpf: Delay freeing fields in local storage bpf: Lose const-ness of map in map_check_btf() bpf: Register dtor for freeing special fields selftests/bpf: Fix OOB read in dmabuf_collector selftests/bpf: Fix a memory leak in xdp_flowtable test bpf: Fix stack-out-of-bounds write in devmap bpf: Fix kprobe_multi cookies access in show_fdinfo callback bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing selftests/bpf: Don't override SIGSEGV handler with ASAN selftests/bpf: Check BPFTOOL env var in detect_bpftool_path() selftests/bpf: Fix out-of-bounds array access bugs reported by ASAN selftests/bpf: Fix array bounds warning in jit_disasm_helpers ...
13 daysbpf: Improve bounds when tnum has a single possible valuePaul Chaignon-0/+30
We're hitting an invariant violation in Cilium that sometimes leads to BPF programs being rejected and Cilium failing to start [1]. The following extract from verifier logs shows what's happening: from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 ; if (magic == MARK_MAGIC_HOST || magic == MARK_MAGIC_OVERLAY || magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337 236: (16) if w9 == 0xe00 goto pc+45 ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) 237: (16) if w9 == 0xf00 goto pc+1 verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0) We reach instruction 236 with two possible values for R9, 0xe00 and 0xf00. This is perfectly reflected in the tnum, but of course the ranges are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path at instruction 236 allows the verifier to reduce the range to [0xe01; 0xf00]. The tnum is however not updated. With these ranges, at instruction 237, the verifier is not able to deduce that R9 is always equal to 0xf00. Hence the fallthrough pass is explored first, the verifier refines the bounds using the assumption that R9 != 0xf00, and ends up with an invariant violation. This pattern of impossible branch + bounds refinement is common to all invariant violations seen so far. The long-term solution is likely to rely on the refinement + invariant violation check to detect dead branches, as started by Eduard. To fix the current issue, we need something with less refactoring that we can backport. This patch uses the tnum_step helper introduced in the previous patch to detect the above situation. In particular, three cases are now detected in the bounds refinement: 1. The u64 range and the tnum only overlap in umin. u64: ---[xxxxxx]----- tnum: --xx----------x- 2. The u64 range and the tnum only overlap in the maximum value represented by the tnum, called tmax. u64: ---[xxxxxx]----- tnum: xx-----x-------- 3. The u64 range and the tnum only overlap in between umin (excluded) and umax. u64: ---[xxxxxx]----- tnum: xx----x-------x- To detect these three cases, we call tnum_step(tnum, umin), which returns the smallest member of the tnum greater than umin, called tnum_next here. We're in case (1) if umin is part of the tnum and tnum_next is greater than umax. We're in case (2) if umin is not part of the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if umin is not part of the tnum, tnum_next is inferior or equal to umax, and calling tnum_step a second time gives us a value past umax. This change implements these three cases. With it, the above bytecode looks as follows: 0: (85) call bpf_get_prandom_u32#7 ; R0=scalar() 1: (47) r0 |= 3584 ; R0=scalar(smin=0x8000000000000e00,umin=umin32=3584,smin32=0x80000e00,var_off=(0xe00; 0xfffffffffffff1ff)) 2: (57) r0 &= 3840 ; R0=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) 3: (15) if r0 == 0xe00 goto pc+2 ; R0=3840 4: (15) if r0 == 0xf00 goto pc+1 4: R0=3840 6: (95) exit In addition to the new selftests, this change was also verified with Agni [3]. For the record, the raw SMT is available at [4]. The property it verifies is that: If a concrete value x is contained in all input abstract values, after __update_reg_bounds, it will continue to be contained in all output abstract values. Link: https://github.com/cilium/cilium/issues/44216 [1] Link: https://pchaigno.github.io/test-verifier-complexity.html [2] Link: https://github.com/bpfverif/agni [3] Link: https://pastebin.com/raw/naCfaqNx [4] Fixes: 0df1a55afa83 ("bpf: Warn on internal verifier errors") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Marco Schirrmeister <mschirrmeister@gmail.com> Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/ef254c4f68be19bd393d450188946821c588565d.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: Introduce tnum_step to step through tnum's membersHarishankar Vishwanathan-0/+56
This commit introduces tnum_step(), a function that, when given t, and a number z returns the smallest member of t larger than z. The number z must be greater or equal to the smallest member of t and less than the largest member of t. The first step is to compute j, a number that keeps all of t's known bits, and matches all unknown bits to z's bits. Since j is a member of the t, it is already a candidate for result. However, we want our result to be (minimally) greater than z. There are only two possible cases: (1) Case j <= z. In this case, we want to increase the value of j and make it > z. (2) Case j > z. In this case, we want to decrease the value of j while keeping it > z. (Case 1) j <= z t = xx11x0x0 z = 10111101 (189) j = 10111000 (184) ^ k (Case 1.1) Let's first consider the case where j < z. We will address j == z later. Since z > j, there had to be a bit position that was 1 in z and a 0 in j, beyond which all positions of higher significance are equal in j and z. Further, this position could not have been unknown in a, because the unknown positions of a match z. This position had to be a 1 in z and known 0 in t. Let k be position of the most significant 1-to-0 flip. In our example, k = 3 (starting the count at 1 at the least significant bit). Setting (to 1) the unknown bits of t in positions of significance smaller than k will not produce a result > z. Hence, we must set/unset the unknown bits at positions of significance higher than k. Specifically, we look for the next larger combination of 1s and 0s to place in those positions, relative to the combination that exists in z. We can achieve this by concatenating bits at unknown positions of t into an integer, adding 1, and writing the bits of that result back into the corresponding bit positions previously extracted from z. >From our example, considering only positions of significance greater than k: t = xx..x z = 10..1 + 1 ----- 11..0 This is the exact combination 1s and 0s we need at the unknown bits of t in positions of significance greater than k. Further, our result must only increase the value minimally above z. Hence, unknown bits in positions of significance smaller than k should remain 0. We finally have, result = 11110000 (240) (Case 1.2) Now consider the case when j = z, for example t = 1x1x0xxx z = 10110100 (180) j = 10110100 (180) Matching the unknown bits of the t to the bits of z yielded exactly z. To produce a number greater than z, we must set/unset the unknown bits in t, and *all* the unknown bits of t candidates for being set/unset. We can do this similar to Case 1.1, by adding 1 to the bits extracted from the masked bit positions of z. Essentially, this case is equivalent to Case 1.1, with k = 0. t = 1x1x0xxx z = .0.1.100 + 1 --------- .0.1.101 This is the exact combination of bits needed in the unknown positions of t. After recalling the known positions of t, we get result = 10110101 (181) (Case 2) j > z t = x00010x1 z = 10000010 (130) j = 10001011 (139) ^ k Since j > z, there had to be a bit position which was 0 in z, and a 1 in j, beyond which all positions of higher significance are equal in j and z. This position had to be a 0 in z and known 1 in t. Let k be the position of the most significant 0-to-1 flip. In our example, k = 4. Because of the 0-to-1 flip at position k, a member of t can become greater than z if the bits in positions greater than k are themselves >= to z. To make that member *minimally* greater than z, the bits in positions greater than k must be exactly = z. Hence, we simply match all of t's unknown bits in positions more significant than k to z's bits. In positions less significant than k, we set all t's unknown bits to 0 to retain minimality. In our example, in positions of greater significance than k (=4), t=x000. These positions are matched with z (1000) to produce 1000. In positions of lower significance than k, t=10x1. All unknown bits are set to 0 to produce 1001. The final result is: result = 10001001 (137) This concludes the computation for a result > z that is a member of t. The procedure for tnum_step() in this commit implements the idea described above. As a proof of correctness, we verified the algorithm against a logical specification of tnum_step. The specification asserts the following about the inputs t, z and output res that: 1. res is a member of t, and 2. res is strictly greater than z, and 3. there does not exist another value res2 such that 3a. res2 is also a member of t, and 3b. res2 is greater than z 3c. res2 is smaller than res We checked the implementation against this logical specification using an SMT solver. The verification formula in SMTLIB format is available at [1]. The verification returned an "unsat": indicating that no input assignment exists for which the implementation and the specification produce different outputs. In addition, we also automatically generated the logical encoding of the C implementation using Agni [2] and verified it against the same specification. This verification also returned an "unsat", confirming that the implementation is equivalent to the specification. The formula for this check is also available at [3]. Link: https://pastebin.com/raw/2eRWbiit [1] Link: https://github.com/bpfverif/agni [2] Link: https://pastebin.com/raw/EztVbBJ2 [3] Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Link: https://lore.kernel.org/r/93fdf71910411c0f19e282ba6d03b4c65f9c5d73.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: Fix race in devmap on PREEMPT_RTJiayuan Chen-4/+21
On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be accessed concurrently by multiple preemptible tasks on the same CPU. The original code assumes bq_enqueue() and __dev_flush() run atomically with respect to each other on the same CPU, relying on local_bh_disable() to prevent preemption. However, on PREEMPT_RT, local_bh_disable() only calls migrate_disable() (when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption, which allows CFS scheduling to preempt a task during bq_xmit_all(), enabling another task on the same CPU to enter bq_enqueue() and operate on the same per-CPU bq concurrently. This leads to several races: 1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames. If preempted after the snapshot, a second task can call bq_enqueue() -> bq_xmit_all() on the same bq, transmitting (and freeing) the same frames. When the first task resumes, it operates on stale pointers in bq->q[], causing use-after-free. 2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying bq->count and bq->q[] while bq_xmit_all() is reading them. 3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and bq->xdp_prog after bq_xmit_all(). If preempted between bq_xmit_all() return and bq->dev_rx = NULL, a preempting bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to the flush_list, and enqueues a frame. When __dev_flush() resumes, it clears dev_rx and removes bq from the flush_list, orphaning the newly enqueued frame. 4. __list_del_clearprev() on flush_node: similar to the cpumap race, both tasks can call __list_del_clearprev() on the same flush_node, the second dereferences the prev pointer already set to NULL. The race between task A (__dev_flush -> bq_xmit_all) and task B (bq_enqueue -> bq_xmit_all) on the same CPU: Task A (xdp_do_flush) Task B (ndo_xdp_xmit redirect) ---------------------- -------------------------------- __dev_flush(flush_list) bq_xmit_all(bq) cnt = bq->count /* e.g. 16 */ /* start iterating bq->q[] */ <-- CFS preempts Task A --> bq_enqueue(dev, xdpf) bq->count == DEV_MAP_BULK_SIZE bq_xmit_all(bq, 0) cnt = bq->count /* same 16! */ ndo_xdp_xmit(bq->q[]) /* frames freed by driver */ bq->count = 0 <-- Task A resumes --> ndo_xdp_xmit(bq->q[]) /* use-after-free: frames already freed! */ Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring it in bq_enqueue() and __dev_flush(). These paths already run under local_bh_disable(), so use local_lock_nested_bh() which on non-RT is a pure annotation with no overhead, and on PREEMPT_RT provides a per-CPU sleeping lock that serializes access to the bq. Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: Fix race in cpumap on PREEMPT_RTJiayuan Chen-2/+15
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed concurrently by multiple preemptible tasks on the same CPU. The original code assumes bq_enqueue() and __cpu_map_flush() run atomically with respect to each other on the same CPU, relying on local_bh_disable() to prevent preemption. However, on PREEMPT_RT, local_bh_disable() only calls migrate_disable() (when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption, which allows CFS scheduling to preempt a task during bq_flush_to_queue(), enabling another task on the same CPU to enter bq_enqueue() and operate on the same per-CPU bq concurrently. This leads to several races: 1. Double __list_del_clearprev(): after bq->count is reset in bq_flush_to_queue(), a preempting task can call bq_enqueue() -> bq_flush_to_queue() on the same bq when bq->count reaches CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev() on the same bq->flush_node, the second call dereferences the prev pointer that was already set to NULL by the first. 2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt the packet queue while bq_flush_to_queue() is processing it. The race between task A (__cpu_map_flush -> bq_flush_to_queue) and task B (bq_enqueue -> bq_flush_to_queue) on the same CPU: Task A (xdp_do_flush) Task B (cpu_map_enqueue) ---------------------- ------------------------ bq_flush_to_queue(bq) spin_lock(&q->producer_lock) /* flush bq->q[] to ptr_ring */ bq->count = 0 spin_unlock(&q->producer_lock) bq_enqueue(rcpu, xdpf) <-- CFS preempts Task A --> bq->q[bq->count++] = xdpf /* ... more enqueues until full ... */ bq_flush_to_queue(bq) spin_lock(&q->producer_lock) /* flush to ptr_ring */ spin_unlock(&q->producer_lock) __list_del_clearprev(flush_node) /* sets flush_node.prev = NULL */ <-- Task A resumes --> __list_del_clearprev(flush_node) flush_node.prev->next = ... /* prev is NULL -> kernel oops */ Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it in bq_enqueue() and __cpu_map_flush(). These paths already run under local_bh_disable(), so use local_lock_nested_bh() which on non-RT is a pure annotation with no overhead, and on PREEMPT_RT provides a per-CPU sleeping lock that serializes access to the bq. To reproduce, insert an mdelay(100) between bq->count = 0 and __list_del_clearprev() in bq_flush_to_queue(), then run reproducer provided by syzkaller. Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/ Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: Retire rcu_trace_implies_rcu_gp() from local storageKumar Kartikeya Dwivedi-18/+19
This assumption will always hold going forward, hence just remove the various checks and assume it is true with a comment for the uninformed reader. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: Delay freeing fields in local storageKumar Kartikeya Dwivedi-19/+21
Currently, when use_kmalloc_nolock is false, the freeing of fields for a local storage selem is done eagerly before waiting for the RCU or RCU tasks trace grace period to elapse. This opens up a window where the program which has access to the selem can recreate the fields after the freeing of fields is done eagerly, causing memory leaks when the element is finally freed and returned to the kernel. Make a few changes to address this. First, delay the freeing of fields until after the grace periods have expired using a __bpf_selem_free_rcu wrapper which is eventually invoked after transitioning through the necessary number of grace period waits. Replace usage of the kfree_rcu with call_rcu to be able to take a custom callback. Finally, care needs to be taken to extend the rcu barriers for all cases, and not just when use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be in flight for either case and access the smap field, which is used to obtain the BTF record to walk over special fields in the map value. While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since migration should be disabled for RCU callbacks already. Fixes: 9bac675e6368 ("bpf: Postpone bpf_obj_free_fields to the rcu callback") Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: Lose const-ness of map in map_check_btf()Kumar Kartikeya Dwivedi-13/+12
BPF hash map may now use the map_check_btf() callback to decide whether to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can opt out of const-ness using mutable, we must lose the const qualifier on the callback such that we can avoid the ugly cast. Make the change and adjust all existing users, and lose the comment in hashtab.c. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: Register dtor for freeing special fieldsKumar Kartikeya Dwivedi-11/+134
There is a race window where BPF hash map elements can leak special fields if the program with access to the map value recreates these special fields between the check_and_free_fields done on the map value and its eventual return to the memory allocator. Several ways were explored prior to this patch, most notably [0] tried to use a poison value to reject attempts to recreate special fields for map values that have been logically deleted but still accessible to BPF programs (either while sitting in the free list or when reused). While this approach works well for task work, timers, wq, etc., it is harder to apply the idea to kptrs, which have a similar race and failure mode. Instead, we change bpf_mem_alloc to allow registering destructor for allocated elements, such that when they are returned to the allocator, any special fields created while they were accessible to programs in the mean time will be freed. If these values get reused, we do not free the fields again before handing the element back. The special fields thus may remain initialized while the map value sits in a free list. When bpf_mem_alloc is retired in the future, a similar concept can be introduced to kmalloc_nolock-backed kmem_cache, paired with the existing idea of a constructor. Note that the destructor registration happens in map_check_btf, after the BTF record is populated and (at that point) avaiable for inspection and duplication. Duplication is necessary since the freeing of embedded bpf_mem_alloc can be decoupled from actual map lifetime due to logic introduced to reduce the cost of rcu_barrier()s in mem alloc free path in 9f2c6e96c65e ("bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc."). As such, once all callbacks are done, we must also free the duplicated record. To remove dependency on the bpf_map itself, also stash the key size of the map to obtain value from htab_elem long after the map is gone. [0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Reported-by: Alexei Starovoitov <ast@kernel.org> Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-26Merge tag 'mm-hotfixes-stable-2026-02-26-14-14' of ↵Linus Torvalds-16/+25
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "12 hotfixes. 7 are cc:stable. 8 are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: update Yosry Ahmed's email address mailmap: add entry for Daniele Alessandrelli mm: fix NULL NODE_DATA dereference for memoryless nodes on boot mm/tracing: rss_stat: ensure curr is false from kthread context mm/kfence: fix KASAN hardware tag faults during late enablement mm/damon/core: disallow non-power of two min_region_sz Squashfs: check metadata block offset is within range MAINTAINERS, mailmap: update e-mail address for Vlastimil Babka liveupdate: luo_file: remember retrieve() status mm: thp: deny THP for files on anonymous inodes mm: change vma_alloc_folio_noprof() macro to inline function mm/kfence: disable KFENCE upon KASAN HW tags enablement
2026-02-26Merge tag 'dma-mapping-7.0-2026-02-26' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two DMA-mapping fixes for the recently merged API rework (Jiri Pirko and Stian Halseth)" * tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: sparc: Fix page alignment in dma mapping dma-mapping: avoid random addr value print out on error path
2026-02-26bpf: Fix stack-out-of-bounds write in devmapKohei Enju-5/+17
get_upper_ifindexes() iterates over all upper devices and writes their indices into an array without checking bounds. Also the callers assume that the max number of upper devices is MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack, but that assumption is not correct and the number of upper devices could be larger than MAX_NEST_DEV (e.g., many macvlans), causing a stack-out-of-bounds write. Add a max parameter to get_upper_ifindexes() to avoid the issue. When there are too many upper devices, return -EOVERFLOW and abort the redirect. To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with an XDP program attached using BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS. Then send a packet to the device to trigger the XDP redirect path. Reported-by: syzbot+10cc7f13760b31bd2e61@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/698c4ce3.050a0220.340abe.000b.GAE@google.com/T/ Fixes: aeea1b86f936 ("bpf, devmap: Exclude XDP broadcast to master device") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Kohei Enju <kohei@enjuk.jp> Link: https://lore.kernel.org/r/20260225053506.4738-1-kohei@enjuk.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-26bpf: Fix kprobe_multi cookies access in show_fdinfo callbackJiri Olsa-1/+3
We don't check if cookies are available on the kprobe_multi link before accessing them in show_fdinfo callback, we should. Cc: stable@vger.kernel.org Fixes: da7e9c0a7fbc ("bpf: Add show_fdinfo for kprobe_multi") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260225111249.186230-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-26Merge tag 'kmalloc_obj-v7.0-rc2' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kmalloc_obj fixes from Kees Cook: - Fix pointer-to-array allocation types for ubd and kcsan - Force size overflow helpers to __always_inline - Bump __builtin_counted_by_ref to Clang 22.1 from 22.0 (Nathan Chancellor) * tag 'kmalloc_obj-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kcsan: test: Adjust "expect" allocation type for kmalloc_obj overflow: Make sure size helpers are always inlined init/Kconfig: Adjust fixed clang version for __builtin_counted_by_ref ubd: Use pointer-to-pointers for io_thread_req arrays
2026-02-26kcsan: test: Adjust "expect" allocation type for kmalloc_objKees Cook-1/+1
The call to kmalloc_obj(observed.lines) returns "char (*)[3][512]", a pointer to the whole 2D array. But "expect" wants to be "char (*)[512]", the decayed pointer type, as if it were observed.lines itself (though without the "3" bounds). This produces the following build error: ../kernel/kcsan/kcsan_test.c: In function '__report_matches': ../kernel/kcsan/kcsan_test.c:171:16: error: assignment to 'char (*)[512]' from incompatible pointer type 'char (*)[3][512]' [-Wincompatible-pointer-types] 171 | expect = kmalloc_obj(observed.lines); | ^ Instead of changing the "expect" type to "char (*)[3][512]" and requiring a dereference at each use (e.g. "(expect*)[0]"), just explicitly cast the return to the desired type. Note that I'm intentionally not switching back to byte-based "kmalloc" here because I cannot find a way for the Coccinelle script (which will be used going forward to catch future conversions) to exclude this case. Tested with: $ ./tools/testing/kunit/kunit.py run \ --kconfig_add CONFIG_DEBUG_KERNEL=y \ --kconfig_add CONFIG_KCSAN=y \ --kconfig_add CONFIG_KCSAN_KUNIT_TEST=y \ --arch=x86_64 --qemu_args '-smp 2' kcsan Reported-by: Nathan Chancellor <nathan@kernel.org> Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types") Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-25Merge tag 'vfs-7.0-rc2.fixes' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix an uninitialized variable in file_getattr(). The flags_valid field wasn't initialized before calling vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse - Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is not enabled. sysctl_hung_task_timeout_secs is 0 in that case causing spurious "waiting for writeback completion for more than 1 seconds" warnings - Fix a null-ptr-deref in do_statmount() when the mount is internal - Add missing kernel-doc description for the @private parameter in iomap_readahead() - Fix mount namespace creation to hold namespace_sem across the mount copy in create_new_namespace(). The previous drop-and-reacquire pattern was fragile and failed to clean up mount propagation links if the real rootfs was a shared or dependent mount - Fix /proc mount iteration where m->index wasn't updated when m->show() overflows, causing a restart to repeatedly show the same mount entry in a rapidly expanding mount table - Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the inode number is out of range - Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared. copy_mnt_ns() received the live fs_struct so if a subsequent namespace creation failed the rollback would leave pwd and root pointing to detached mounts. Always allocate a new fs_struct when CLONE_NEWNS is requested - fserror bug fixes: - Remove the unused fsnotify_sb_error() helper now that all callers have been converted to fserror_report_metadata - Fix a lockdep splat in fserror_report() where igrab() takes inode::i_lock which can be held in IRQ context. Replace igrab() with a direct i_count bump since filesystems should not report inodes that are about to be freed or not yet exposed - Handle error pointer in procfs for try_lookup_noperm() - Fix an integer overflow in ep_loop_check_proc() where recursive calls returning INT_MAX would overflow when +1 is added, breaking the recursion depth check - Fix a misleading break in pidfs * tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: avoid misleading break eventpoll: Fix integer overflow in ep_loop_check_proc() proc: Fix pointer error dereference fserror: fix lockdep complaint when igrabbing inode fsnotify: drop unused helper unshare: fix unshare_fs() handling minix: Correct errno in minix_new_inode namespace: fix proc mount iteration mount: hold namespace_sem across copy in create_new_namespace() iomap: Describe @private in iomap_readahead() statmount: Fix the null-ptr-deref in do_statmount() writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK fs: init flags_valid before calling vfs_fileattr_get
2026-02-24liveupdate: luo_file: remember retrieve() statusPratyush Yadav (Google)-16/+25
LUO keeps track of successful retrieve attempts on a LUO file. It does so to avoid multiple retrievals of the same file. Multiple retrievals cause problems because once the file is retrieved, the serialized data structures are likely freed and the file is likely in a very different state from what the code expects. The retrieve boolean in struct luo_file keeps track of this, and is passed to the finish callback so it knows what work was already done and what it has left to do. All this works well when retrieve succeeds. When it fails, luo_retrieve_file() returns the error immediately, without ever storing anywhere that a retrieve was attempted or what its error code was. This results in an errored LIVEUPDATE_SESSION_RETRIEVE_FD ioctl to userspace, but nothing prevents it from trying this again. The retry is problematic for much of the same reasons listed above. The file is likely in a very different state than what the retrieve logic normally expects, and it might even have freed some serialization data structures. Attempting to access them or free them again is going to break things. For example, if memfd managed to restore 8 of its 10 folios, but fails on the 9th, a subsequent retrieve attempt will try to call kho_restore_folio() on the first folio again, and that will fail with a warning since it is an invalid operation. Apart from the retry, finish() also breaks. Since on failure the retrieved bool in luo_file is never touched, the finish() call on session close will tell the file handler that retrieve was never attempted, and it will try to access or free the data structures that might not exist, much in the same way as the retry attempt. There is no sane way of attempting the retrieve again. Remember the error retrieve returned and directly return it on a retry. Also pass this status code to finish() so it can make the right decision on the work it needs to do. This is done by changing the bool to an integer. A value of 0 means retrieve was never attempted, a positive value means it succeeded, and a negative value means it failed and the error code is the value. Link: https://lkml.kernel.org/r/20260216132221.987987-1-pratyush@kernel.org Fixes: 7c722a7f44e0 ("liveupdate: luo_file: implement file systems callbacks") Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-23Remove WARN_ALL_UNSEEDED_RANDOM kernel config optionLinus Torvalds-1/+0
This config option goes way back - it used to be an internal debug option to random.c (at that point called DEBUG_RANDOM_BOOT), then was renamed and exposed as a config option as CONFIG_WARN_UNSEEDED_RANDOM, and then further renamed to the current CONFIG_WARN_ALL_UNSEEDED_RANDOM. It was all done with the best of intentions: the more limited rate-limited reports were reporting some cases, but if you wanted to see all the gory details, you'd enable this "ALL" option. However, it turns out - perhaps not surprisingly - that when people don't care about and fix the first rate-limited cases, they most certainly don't care about any others either, and so warning about all of them isn't actually helping anything. And the non-ratelimited reporting causes problems, where well-meaning people enable debug options, but the excessive flood of messages that nobody cares about will hide actual real information when things go wrong. I just got a kernel bug report (which had nothing to do with randomness) where two thirds of the the truncated dmesg was just variations of random: get_random_u32 called from __get_random_u32_below+0x10/0x70 with crng_init=0 and in the process early boot messages had been lost (in addition to making the messages that _hadn't_ been lost harder to read). The proper way to find these things for the hypothetical developer that cares - if such a person exists - is almost certainly with boot time tracing. That gives you the option to get call graphs etc too, which is likely a requirement for fixing any problems anyway. See Documentation/trace/boottime-trace.rst for that option. And if we for some reason do want to re-introduce actual printing of these things, it will need to have some uniqueness filtering rather than this "just print it all" model. Fixes: cc1e127bfa95 ("random: remove ratelimiting for in-kernel unseeded randomness") Acked-by: Jason Donenfeld <Jason@zx2c4.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-23dma-mapping: avoid random addr value print out on error pathJiri Pirko-1/+1
dma_addr is unitialized in dma_direct_map_phys() when swiotlb is forced and DMA_ATTR_MMIO is set which leads to random value print out in warning. Fix that by just returning DMA_MAPPING_ERROR. Fixes: e53d29f957b3 ("dma-mapping: convert dma_direct_*map_page to be phys_addr_t based") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260209153809.250835-2-jiri@resnulli.us
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook-21/+18
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds-32/+16
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds-20/+20
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds-337/+337
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook-551/+543
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-20Merge tag 'trace-v7.0-2' of ↵Linus Torvalds-5/+23
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix possible dereference of uninitialized pointer When validating the persistent ring buffer on boot up, if the first validation fails, a reference to "head_page" is performed in the error path, but it skips over the initialization of that variable. Move the initialization before the first validation check. - Fix use of event length in validation of persistent ring buffer On boot up, the persistent ring buffer is checked to see if it is valid by several methods. One being to walk all the events in the memory location to make sure they are all valid. The length of the event is used to move to the next event. This length is determined by the data in the buffer. If that length is corrupted, it could possibly make the next event to check located at a bad memory location. Validate the length field of the event when doing the event walk. - Fix function graph on archs that do not support use of ftrace_ops When an architecture defines HAVE_DYNAMIC_FTRACE_WITH_ARGS, it means that its function graph tracer uses the ftrace_ops of the function tracer to call its callbacks. This allows a single registered callback to be called directly instead of checking the callback's meta data's hash entries against the function being traced. For architectures that do not support this feature, it must always call the loop function that tests each registered callback (even if there's only one). The loop function tests each callback's meta data against its hash of functions and will call its callback if the function being traced is in its hash map. The issue was that there was no check against this and the direct function was being called even if the architecture didn't support it. This meant that if function tracing was enabled at the same time as a callback was registered with the function graph tracer, its callback would be called for every function that the function tracer also traced, even if the callback's meta data only wanted to be called back for a small subset of functions. Prevent the direct calling for those architectures that do not support it. - Fix references to trace_event_file for hist files The hist files used event_file_data() to get a reference to the associated trace_event_file the histogram was attached to. This would return a pointer even if the trace_event_file is about to be freed (via RCU). Instead it should use the event_file_file() helper that returns NULL if the trace_event_file is marked to be freed so that no new references are added to it. - Wake up hist poll readers when an event is being freed When polling on a hist file, the task is only awoken when a hist trigger is triggered. This means that if an event is being freed while there's a task waiting on its hist file, it will need to wait until the hist trigger occurs to wake it up and allow the freeing to happen. Note, the event will not be completely freed until all references are removed, and a hist poller keeps a reference. But it should still be woken when the event is being freed. * tag 'trace-v7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Wake up poll waiters for hist files when removing an event tracing: Fix checking of freed trace_event_file for hist files fgraph: Do not call handlers direct when not using ftrace_ops tracing: ring-buffer: Fix to check event length before using ring-buffer: Fix possible dereference of uninitialized pointer
2026-02-19tracing: Wake up poll waiters for hist files when removing an eventPetr Pavlu-0/+3
The event_hist_poll() function attempts to verify whether an event file is being removed, but this check may not occur or could be unnecessarily delayed. This happens because hist_poll_wakeup() is currently invoked only from event_hist_trigger() when a hist command is triggered. If the event file is being removed, no associated hist command will be triggered and a waiter will be woken up only after an unrelated hist command is triggered. Fix the issue by adding a call to hist_poll_wakeup() in remove_event_file_dir() after setting the EVENT_FILE_FL_FREED flag. This ensures that a task polling on a hist file is woken up and receives EPOLLERR. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-3-petr.pavlu@suse.com Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19tracing: Fix checking of freed trace_event_file for hist filesPetr Pavlu-2/+2
The event_hist_open() and event_hist_poll() functions currently retrieve a trace_event_file pointer from a file struct by invoking event_file_data(), which simply returns file->f_inode->i_private. The functions then check if the pointer is NULL to determine whether the event is still valid. This approach is flawed because i_private is assigned when an eventfs inode is allocated and remains set throughout its lifetime. Instead, the code should call event_file_file(), which checks for EVENT_FILE_FL_FREED. Using the incorrect access function may result in the code potentially opening a hist file for an event that is being removed or becoming stuck while polling on this file. Correct the access method to event_file_file() in both functions. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-2-petr.pavlu@suse.com Fixes: 1bd13edbbed6 ("tracing/hist: Add poll(POLLIN) support on hist file") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19fgraph: Do not call handlers direct when not using ftrace_opsSteven Rostedt-1/+11
The function graph tracer was modified to us the ftrace_ops of the function tracer. This simplified the code as well as allowed more features of the function graph tracer. Not all architectures were converted over as it required the implementation of HAVE_DYNAMIC_FTRACE_WITH_ARGS to implement. For those architectures, it still did it the old way where the function graph tracer handle was called by the function tracer trampoline. The handler then had to check the hash to see if the registered handlers wanted to be called by that function or not. In order to speed up the function graph tracer that used ftrace_ops, if only one callback was registered with function graph, it would call its function directly via a static call. Now, if the architecture does not support the use of using ftrace_ops and still has the ftrace function trampoline calling the function graph handler, then by doing a direct call it removes the check against the handler's hash (list of functions it wants callbacks to), and it may call that handler for functions that the handler did not request calls for. On 32bit x86, which does not support the ftrace_ops use with function graph tracer, it shows the issue: ~# trace-cmd start -p function -l schedule ~# trace-cmd show # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 2) * 11898.94 us | schedule(); 3) # 1783.041 us | schedule(); 1) | schedule() { ------------------------------------------ 1) bash-8369 => kworker-7669 ------------------------------------------ 1) | schedule() { ------------------------------------------ 1) kworker-7669 => bash-8369 ------------------------------------------ 1) + 97.004 us | } 1) | schedule() { [..] Now by starting the function tracer is another instance: ~# trace-cmd start -B foo -p function This causes the function graph tracer to trace all functions (because the function trace calls the function graph tracer for each on, and the function graph trace is doing a direct call): ~# trace-cmd show # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 1) 1.669 us | } /* preempt_count_sub */ 1) + 10.443 us | } /* _raw_spin_unlock_irqrestore */ 1) | tick_program_event() { 1) | clockevents_program_event() { 1) 1.044 us | ktime_get(); 1) 6.481 us | lapic_next_event(); 1) + 10.114 us | } 1) + 11.790 us | } 1) ! 181.223 us | } /* hrtimer_interrupt */ 1) ! 184.624 us | } /* __sysvec_apic_timer_interrupt */ 1) | irq_exit_rcu() { 1) 0.678 us | preempt_count_sub(); When it should still only be tracing the schedule() function. To fix this, add a macro FGRAPH_NO_DIRECT to be set to 0 when the architecture does not support function graph use of ftrace_ops, and set to 1 otherwise. Then use this macro to know to allow function graph tracer to call the handlers directly or not. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://patch.msgid.link/20260218104244.5f14dade@gandalf.local.home Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19tracing: ring-buffer: Fix to check event length before usingMasami Hiramatsu (Google)-1/+5
Check the event length before adding it for accessing next index in rb_read_data_buffer(). Since this function is used for validating possibly broken ring buffers, the length of the event could be broken. In that case, the new event (e + len) can point a wrong address. To avoid invalid memory access at boot, check whether the length of each event is in the possible range before using it. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events") Link: https://patch.msgid.link/177123421541.142205.9414352170164678966.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19ring-buffer: Fix possible dereference of uninitialized pointerDaniil Dulov-1/+2
There is a pointer head_page in rb_meta_validate_events() which is not initialized at the beginning of a function. This pointer can be dereferenced if there is a failure during reader page validation. In this case the control is passed to "invalid" label where the pointer is dereferenced in a loop. To fix the issue initialize orig_head and head_page before calling rb_validate_buffer. Found by Linux Verification Center (linuxtesting.org) with SVACE. Cc: stable@vger.kernel.org Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260213100130.2013839-1-d.dulov@aladdin.ru Closes: https://lore.kernel.org/r/202406130130.JtTGRf7W-lkp@intel.com/ Fixes: 5f3b6e839f3c ("ring-buffer: Validate boot range memory events") Signed-off-by: Daniil Dulov <d.dulov@aladdin.ru> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-19Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds-23/+25
Pull bpf fixes from Alexei Starovoitov: - Fix invalid write loop logic in libbpf's bpf_linker__add_buf() (Amery Hung) - Fix a potential use-after-free of BTF object (Anton Protopopov) - Add feature detection to libbpf and avoid moving arena global variables on older kernels (Emil Tsalapatis) - Remove extern declaration of bpf_stream_vprintk() from libbpf headers (Ihor Solodrai) - Fix truncated netlink dumps in bpftool (Jakub Kicinski) - Fix map_kptr grace period wait in bpf selftests (Kumar Kartikeya Dwivedi) - Remove hexdump dependency while building bpf selftests (Matthieu Baerts) - Complete fsession support in BPF trampolines on riscv (Menglong Dong) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Remove hexdump dependency libbpf: Remove extern declaration of bpf_stream_vprintk() selftests/bpf: Use vmlinux.h in test_xdp_meta bpftool: Fix truncated netlink dumps libbpf: Delay feature gate check until object prepare time libbpf: Do not use PROG_TYPE_TRACEPOINT program for feature gating bpf: Add a map/btf from a fd array more consistently selftests/bpf: Fix map_kptr grace period wait selftests/bpf: enable fsession_test on riscv64 selftests/bpf: Adjust selftest due to function rename bpf, riscv: add fsession support for trampolines bpf: Fix a potential use-after-free of BTF object bpf, riscv: introduce emit_store_stack_imm64() for trampoline libbpf: Fix invalid write loop logic in bpf_linker__add_buf() libbpf: Add gating for arena globals relocation feature
2026-02-18Merge tag 'mm-nonmm-stable-2026-02-18-19-56' of ↵Linus Torvalds-13/+15
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more non-MM updates from Andrew Morton: - "two fixes in kho_populate()" fixes a couple of not-major issues in the kexec handover code (Ran Xiaokai) - misc singletons * tag 'mm-nonmm-stable-2026-02-18-19-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: lib/group_cpus: handle const qualifier from clusters allocation type kho: remove unnecessary WARN_ON(err) in kho_populate() kho: fix missing early_memunmap() call in kho_populate() scripts/gdb: implement x86_page_ops in mm.py objpool: fix the overestimation of object pooling metadata size selftests/memfd: use IPC semaphore instead of SIGSTOP/SIGCONT delayacct: fix build regression on accounting tool
2026-02-18Merge tag 'mm-stable-2026-02-18-19-48' of ↵Linus Torvalds-19/+37
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes (Bing Jiao) - "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare mapledtree race and performs a number of cleanups (Liam Howlett) - "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap (Lorenzo Stoakes) - "support batch checking of references and unmapping for large folios" implements batching to greatly improve the performance of reclaiming clean file-backed large folios (Baolin Wang) - "selftests/mm: add memory failure selftests" does as claimed (Miaohe Lin) * tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits) mm/page_alloc: clear page->private in free_pages_prepare() selftests/mm: add memory failure dirty pagecache test selftests/mm: add memory failure clean pagecache test selftests/mm: add memory failure anonymous page test mm: rmap: support batched unmapping for file large folios arm64: mm: implement the architecture-specific clear_flush_young_ptes() arm64: mm: support batch clearing of the young flag for large folios arm64: mm: factor out the address and ptep alignment into a new helper mm: rmap: support batched checks of the references for large folios tools/testing/vma: add VMA userland tests for VMA flag functions tools/testing/vma: separate out vma_internal.h into logical headers tools/testing/vma: separate VMA userland tests into separate files mm: make vm_area_desc utilise vma_flags_t only mm: update all remaining mmap_prepare users to use vma_flags_t mm: update shmem_[kernel]_file_*() functions to use vma_flags_t mm: update secretmem to use VMA flags on mmap_prepare mm: update hugetlbfs to use VMA flags on mmap_prepare mm: add basic VMA flag operation helper functions tools: bitmap: add missing bitmap_[subset(), andnot()] mm: add mk_vma_flags() bitmap flag macro helper ...
2026-02-18Merge tag 'sysctl-7.00-rc1' of ↵Linus Torvalds-42/+385
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Remove macros from proc handler converters Replace the proc converter macros with "regular" functions. Though it is more verbose than the macro version, it helps when debugging and better aligns with coding-style.rst. - General cleanup Remove superfluous ctl_table forward declarations. Const qualify the memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays. Add missing kernel doc to proc_dointvec_conv. - Testing This series was run through sysctl selftests/kunit test suite in x86_64. And went into linux-next after rc4, giving it a good 3 weeks of testing * tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions sysctl: Replace unidirectional INT converter macros with functions sysctl: Add kernel doc to proc_douintvec_conv sysctl: Replace UINT converter macros with functions sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros sysctl: clarify proc_douintvec_minmax doc sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n sysctl: Remove unused ctl_table forward declarations loadpin: Implement custom proc_handler for enforce alloc_tag: move memory_allocation_profiling_sysctls into .rodata sysctl: Add missing kernel-doc for proc_dointvec_conv
2026-02-18unshare: fix unshare_fs() handlingAl Viro-1/+1
There's an unpleasant corner case in unshare(2), when we have a CLONE_NEWNS in flags and current->fs hadn't been shared at all; in that case copy_mnt_ns() gets passed current->fs instead of a private copy, which causes interesting warts in proof of correctness] > I guess if private means fs->users == 1, the condition could still be true. Unfortunately, it's worse than just a convoluted proof of correctness. Consider the case when we have CLONE_NEWCGROUP in addition to CLONE_NEWNS (and current->fs->users == 1). We pass current->fs to copy_mnt_ns(), all right. Suppose it succeeds and flips current->fs->{pwd,root} to corresponding locations in the new namespace. Now we proceed to copy_cgroup_ns(), which fails (e.g. with -ENOMEM). We call put_mnt_ns() on the namespace created by copy_mnt_ns(), it's destroyed and its mount tree is dissolved, but... current->fs->root and current->fs->pwd are both left pointing to now detached mounts. They are pinning those, so it's not a UAF, but it leaves the calling process with unshare(2) failing with -ENOMEM _and_ leaving it with pwd and root on detached isolated mounts. The last part is clearly a bug. There is other fun related to that mess (races with pivot_root(), including the one between pivot_root() and fork(), of all things), but this one is easy to isolate and fix - treat CLONE_NEWNS as "allocate a new fs_struct even if it hadn't been shared in the first place". Sure, we could go for something like "if both CLONE_NEWNS *and* one of the things that might end up failing after copy_mnt_ns() call in create_new_namespaces() are set, force allocation of new fs_struct", but let's keep it simple - the cost of copy_fs_struct() is trivial. Another benefit is that copy_mnt_ns() with CLONE_NEWNS *always* gets a freshly allocated fs_struct, yet to be attached to anything. That seriously simplifies the analysis... FWIW, that bug had been there since the introduction of unshare(2) ;-/ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://patch.msgid.link/20260207082524.GE3183987@ZenIV Tested-by: Waiman Long <longman@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-17Merge tag 'spdx-7.0-rc1' of ↵Linus Torvalds-31/+11
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull SPDX updates from Greg KH: "Here are two small changes that add some missing SPDX license lines to some core kernel files. These are: - adding SPDX license lines to kdb files - adding SPDX license lines to the remaining kernel/ files Both of these have been in linux-next for a while with no reported issues" * tag 'spdx-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: kernel: debug: Add SPDX license ids to kdb files kernel: add SPDX-License-Identifier lines
2026-02-17Merge tag 'block-7.0-20260216' of ↵Linus Torvalds-17/+21
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more block updates from Jens Axboe: - Fix partial IOVA mapping cleanup in error handling - Minor prep series ignoring discard return value, as the inline value is always known - Ensure BLK_FEAT_STABLE_WRITES is set for drbd - Fix leak of folio in bio_iov_iter_bounce_read() - Allow IOC_PR_READ_* for read-only open - Another debugfs deadlock fix - A few doc updates * tag 'block-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq: use NOIO context to prevent deadlock during debugfs creation blk-stat: convert struct blk_stat_callback to kernel-doc block: fix enum descriptions kernel-doc block: update docs for bio and bvec_iter block: change return type to void nvmet: ignore discard return value md: ignore discard return value block: fix partial IOVA mapping cleanup in blk_rq_dma_map_iova block: fix folio leak in bio_iov_iter_bounce_read() block: allow IOC_PR_READ_* ioctls with BLK_OPEN_READ drbd: always set BLK_FEAT_STABLE_WRITES
2026-02-16Merge tag 'kernel-7.0-rc1.misc' of ↵Linus Torvalds-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: - pid: introduce task_ppid_vnr() helper - pidfs: convert rb-tree to rhashtable Mateusz reported performance penalties during task creation because pidfs uses pidmap_lock to add elements into the rbtree. Switch to an rhashtable to have separate fine-grained locking and to decouple from pidmap_lock moving all heavy manipulations outside of it Also move inode allocation outside of pidmap_lock. With this there's nothing happening for pidfs under pidmap_lock - pid: reorder fields in pid_namespace to reduce false sharing - Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers" - ipc: Add SPDX license id to mqueue.c * tag 'kernel-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pid: introduce task_ppid_vnr() helper pidfs: implement ino allocation without the pidmap lock Revert "pid: make __task_pid_nr_ns(ns => NULL) safe for zombie callers" pid: reorder fields in pid_namespace to reduce false sharing pidfs: convert rb-tree to rhashtable ipc: Add SPDX license id to mqueue.c
2026-02-16blk-mq: use NOIO context to prevent deadlock during debugfs creationYu Kuai-17/+21
Creating debugfs entries can trigger fs reclaim, which can enter back into the block layer request_queue. This can cause deadlock if the queue is frozen. Previously, a WARN_ON_ONCE check was used in debugfs_create_files() to detect this condition, but it was racy since the queue can be frozen from another context at any time. Introduce blk_debugfs_lock()/blk_debugfs_unlock() helpers that combine the debugfs_mutex with memalloc_noio_save()/restore() to prevent fs reclaim from triggering block I/O. Also add blk_debugfs_lock_nomemsave() and blk_debugfs_unlock_nomemrestore() variants for callers that don't need NOIO protection (e.g., debugfs removal or read-only operations). Replace all raw debugfs_mutex lock/unlock pairs with these helpers, using the _nomemsave/_nomemrestore variants where appropriate. Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs9gNKEYAPagD9JADfO5UH+OiCr4P7OO2wjpfOYeM-RV=A@mail.gmail.com/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/all/aYWQR7CtYdk3K39g@shinmob/ Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-16Merge tag 'probes-v7.0' of ↵Linus Torvalds-26/+102
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull kprobes updates from Masami Hiramatsu: - Use a dedicated kernel thread to optimize the kprobes instead of using workqueue thread. Since the kprobe optimizer waits a long time for synchronize_rcu_task(), it can block other workers in the same queue if it uses a workqueue. - kprobe-events: return immediately if no new probe events are specified on the kernel command line at boot time. This shortens the kernel boot time. - When a kprobe is fully removed from the kernel code, retry optimizing another kprobe which is blocked by that kprobe. * tag 'probes-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: Use dedicated kthread for kprobe optimizer tracing: kprobe-event: Return directly when trace kprobes is empty kprobes: retry blocked optprobe in do_free_cleaned_kprobes
2026-02-14printk, vt, fbcon: Remove console_conditional_schedule()Sebastian Andrzej Siewior-16/+0
do_con_write(), fbcon_redraw.*() invoke console_conditional_schedule() which is a conditional scheduling point based on printk's internal variables console_may_schedule. It may only be used if the console lock is acquired for instance via console_lock() or console_trylock(). Prinkt sets the internal variable to 1 (and allows to schedule) if the console lock has been acquired via console_lock(). The trylock does not allow it. The console_conditional_schedule() invocation in do_con_write() is invoked shortly before console_unlock(). The console_conditional_schedule() invocation in fbcon_redraw.*() original from fbcon_scroll() / vt's con_scroll() which originate from a line feed. In console_unlock() the variable is set to 0 (forbids to schedule) and it tries to schedule while making progress printing. This is brand new compared to when console_conditional_schedule() was added in v2.4.9.11. In v2.6.38-rc3, console_unlock() (started its existence) iterated over all consoles and flushed them with disabled interrupts. A scheduling attempt here was not possible, it relied that a long print scheduled before console_unlock(). Since commit 8d91f8b15361d ("printk: do cond_resched() between lines while outputting to consoles"), which appeared in v4.5-rc1, console_unlock() attempts to schedule if it was allowed to schedule while during console_lock(). Each record is idealy one line so after every line feed. This console_conditional_schedule() is also only relevant on PREEMPT_NONE and PREEMPT_VOLUNTARY builds. In other configurations cond_resched() becomes a nop and has no impact. I'm bringing this all up just proof that it is not required anymore. It becomes a problem on a PREEMPT_RT build with debug code enabled because that might_sleep() in cond_resched() remains and triggers a warnings. This is due to legacy_kthread_func-> console_flush_one_record -> vt_console_print-> lf -> con_scroll -> fbcon_scroll and vt_console_print() acquires a spinlock_t which does not allow a voluntary schedule. There is no need to fb_scroll() to schedule since console_flush_one_record() attempts to schedule after each line. !PREEMPT_RT is not affected because the legacy printing thread is only enabled on PREEMPT_RT builds. Therefore I suggest to remove console_conditional_schedule(). Cc: Simona Vetter <simona@ffwll.ch> Cc: Helge Deller <deller@gmx.de> Cc: linux-fbdev@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Fixes: 5f53ca3ff83b4 ("printk: Implement legacy printer kthread for PREEMPT_RT") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Petr Mladek <pmladek@suse.com> # from printk() POV Signed-off-by: Helge Deller <deller@gmx.de>
2026-02-13Merge tag 'trace-v7.0' of ↵Linus Torvalds-1058/+1322
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "User visible changes: - Add an entry into MAINTAINERS file for RUST versions of code There's now RUST code for tracing and static branches. To differentiate that code from the C code, add entries in for the RUST version (with "[RUST]" around it) so that the right maintainers get notified on changes. - New bitmask-list option added to tracefs When this is set, bitmasks in trace event are not displayed as hex numbers, but instead as lists: e.g. 0-5,7,9 instead of 0000015f - New show_event_filters file in tracefs Instead of having to search all events/*/*/filter for any active filters enabled in the trace instance, the file show_event_filters will list them so that there's only one file that needs to be examined to see if any filters are active. - New show_event_triggers file in tracefs Instead of having to search all events/*/*/trigger for any active triggers enabled in the trace instance, the file show_event_triggers will list them so that there's only one file that needs to be examined to see if any triggers are active. - Have traceoff_on_warning disable trace pintk buffer too Recently recording of trace_printk() could go to other trace instances instead of the top level instance. But if traceoff_on_warning triggers, it doesn't stop the buffer with trace_printk() and that data can easily be lost by being overwritten. Have traceoff_on_warning also disable the instance that has trace_printk() being written to it. - Update the hist_debug file to show what function the field uses When CONFIG_HIST_TRIGGERS_DEBUG is enabled, a hist_debug file exists for every event. This displays the internal data of any histogram enabled for that event. But it is lacking the function that is called to process one of its fields. This is very useful information that was missing when debugging histograms. - Up the histogram stack size from 16 to 31 Stack traces can be used as keys for event histograms. Currently the size of the stack that is stored is limited to just 16 entries. But the storage space in the histogram is 256 bytes, meaning that it can store up to 31 entries (plus one for the count of entries). Instead of letting that space go to waste, up the limit from 16 to 31. This makes the keys much more useful. - Fix permissions of per CPU file buffer_size_kb The per CPU file of buffer_size_kb was incorrectly set to read only in a previous cleanup. It should be writable. - Reset "last_boot_info" if the persistent buffer is cleared The last_boot_info shows address information of a persistent ring buffer if it contains data from a previous boot. It is cleared when recording starts again, but it is not cleared when the buffer is reset. The data is useless after a reset so clear it on reset too. Internal changes: - A change was made to allow tracepoint callbacks to have preemption enabled, and instead be protected by SRCU. This required some updates to the callbacks for perf and BPF. perf needed to disable preemption directly in its callback because it expects preemption disabled in the later code. BPF needed to disable migration, as its code expects to run completely on the same CPU. - Have irq_work wake up other CPU if current CPU is "isolated" When there's a waiter waiting on ring buffer data and a new event happens, an irq work is triggered to wake up that waiter. This is noisy on isolated CPUs (running NO_HZ_FULL). Trigger an IPI to a house keeping CPU instead. - Use proper free of trigger_data instead of open coding it in. - Remove redundant call of event_trigger_reset_filter() It was called immediately in a function that was called right after it. - Workqueue cleanups - Report errors if tracing_update_buffers() were to fail. - Make the enum update workqueue generic for other parts of tracing On boot up, a work queue is created to convert enum names into their numbers in the trace event format files. This work queue can also be used for other aspects of tracing that takes some time and shouldn't be called by the init call code. The blk_trace initialization takes a bit of time. Have the initialization code moved to the new tracing generic work queue function. - Skip kprobe boot event creation call if there's no kprobes defined on cmdline The kprobe initialization to set up kprobes if they are defined on the cmdline requires taking the event_mutex lock. This can be held by other tracing code doing initialization for a long time. Since kprobes added to the kernel command line need to be setup immediately, as they may be tracing early initialization code, they cannot be postponed in a work queue and must be setup in the initcall code. If there's no kprobe on the kernel cmdline, there's no reason to take the mutex and slow down the boot up code waiting to get the lock only to find out there's nothing to do. Simply exit out early if there's no kprobes on the kernel cmdline. If there are kprobes on the cmdline, then someone cares more about tracing over the speed of boot up. - Clean up the trigger code a bit - Move code out of trace.c and into their own files trace.c is now over 11,000 lines of code and has become more difficult to maintain. Start splitting it up so that related code is in their own files. Move all the trace_printk() related code into trace_printk.c. Move the __always_inline stack functions into trace.h. Move the pid filtering code into a new trace_pid.c file. - Better define the max latency and snapshot code The latency tracers have a "max latency" buffer that is a copy of the main buffer and gets swapped with it when a new high latency is detected. This keeps the trace up to the highest latency around where this max_latency buffer is never written to. It is only used to save the last max latency trace. A while ago a snapshot feature was added to tracefs to allow user space to perform the same logic. It could also enable events to trigger a "snapshot" if one of their fields hit a new high. This was built on top of the latency max_latency buffer logic. Because snapshots came later, they were dependent on the latency tracers to be enabled. In reality, the latency tracers depend on the snapshot code and not the other way around. It was just that they came first. Restructure the code and the kconfigs to have the latency tracers depend on snapshot code instead. This actually simplifies the logic a bit and allows to disable more when the latency tracers are not defined and the snapshot code is. - Fix a "false sharing" in the hwlat tracer code The loop to search for latency in hardware was using a variable that could be changed by user space for each sample. If the user change this variable, it could cause a bus contention, and reading that variable can show up as a large latency in the trace causing a false positive. Read this variable at the start of the sample with a READ_ONCE() into a local variable and keep the code from sharing cache lines with readers. - Fix function graph tracer static branch optimization code When only one tracer is defined for function graph tracing, it uses a static branch to call that tracer directly. When another tracer is added, it goes into loop logic to call all the registered callbacks. The code was incorrect when going back to one tracer and never re-enabled the static branch again to do the optimization code. - And other small fixes and cleanups" * tag 'trace-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (46 commits) function_graph: Restore direct mode when callbacks drop to one tracing: Fix indentation of return statement in print_trace_fmt() tracing: Reset last_boot_info if ring buffer is reset tracing: Fix to set write permission to per-cpu buffer_size_kb tracing: Fix false sharing in hwlat get_sample() tracing: Move d_max_latency out of CONFIG_FSNOTIFY protection tracing: Better separate SNAPSHOT and MAX_TRACE options tracing: Add tracer_uses_snapshot() helper to remove #ifdefs tracing: Rename trace_array field max_buffer to snapshot_buffer tracing: Move pid filtering into trace_pid.c tracing: Move trace_printk functions out of trace.c and into trace_printk.c tracing: Use system_state in trace_printk_init_buffers() tracing: Have trace_printk functions use flags instead of using global_trace tracing: Make tracing_update_buffers() take NULL for global_trace tracing: Make printk_trace global for tracing system tracing: Move ftrace_trace_stack() out of trace.c and into trace.h tracing: Move __trace_buffer_{un}lock_*() functions to trace.h tracing: Make tracing_selftest_running global to the tracing subsystem tracing: Make tracing_disabled global for tracing system tracing: Clean up use of trace_create_maxlat_file() ...
2026-02-13Merge tag 'dma-mapping-7.0-2026-02-13' of ↵Linus Torvalds-12/+0
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping update from Marek Szyprowski: "A small code cleanup for the DMA-mapping subsystem: removal of unused hooks (Robin Murphy)" * tag 'dma-mapping-7.0-2026-02-13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: Remove dma_mark_clean (again)
2026-02-13bpf: Add a map/btf from a fd array more consistentlyAnton Protopopov-2/+4
The add_fd_from_fd_array() function takes a file descriptor as a parameter and tries to add either map or btf to the corresponding list of used objects. As was reported by Dan Carpenter, since the commit c81e4322acf0 ("bpf: Fix a potential use-after-free of BTF object"), the fdget() is called twice on the file descriptor, and thus userspace, potentially, can replace the file pointed to by the file descriptor in between the two calls. On practice, this shouldn't break anything on the kernel side, but for consistency fix the code such that only one fdget() is executed. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/aY689z7gHNv8rgVO@stanley.mountain/ Fixes: ccd2d799ed44 ("bpf: Fix a potential use-after-free of BTF object") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260213212949.759321-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-13bpf: Fix a potential use-after-free of BTF objectAnton Protopopov-26/+26
Refcounting in the check_pseudo_btf_id() function is incorrect: the __check_pseudo_btf_id() function might get called with a zero refcounted btf. Fix this, and patch related code accordingly. v3: rephrase a comment (AI) v2: fix a refcount leak introduced in v1 (AI) Reported-by: syzbot+5a0f1995634f7c1dadbf@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5a0f1995634f7c1dadbf Fixes: 76145f725532 ("bpf: Refactor check_pseudo_btf_id") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260209132904.63908-1-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-13Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds-5/+23
Pull virtio updates from Michael Tsirkin: - in-order support in virtio core - multiple address space support in vduse - fixes, cleanups all over the place, notably dma alignment fixes for non-cache-coherent systems * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (59 commits) vduse: avoid adding implicit padding vhost: fix caching attributes of MMIO regions by setting them explicitly vdpa/mlx5: update MAC address handling in mlx5_vdpa_set_attr() vdpa/mlx5: reuse common function for MAC address updates vdpa/mlx5: update mlx_features with driver state check crypto: virtio: Replace package id with numa node id crypto: virtio: Remove duplicated virtqueue_kick in virtio_crypto_skcipher_crypt_req crypto: virtio: Add spinlock protection with virtqueue notification Documentation: Add documentation for VDUSE Address Space IDs vduse: bump version number vduse: add vq group asid support vduse: merge tree search logic of IOTLB_GET_FD and IOTLB_GET_INFO ioctls vduse: take out allocations from vduse_dev_alloc_coherent vduse: remove unused vaddr parameter of vduse_domain_free_coherent vduse: refactor vdpa_dev_add for goto err handling vhost: forbid change vq groups ASID if DRIVER_OK is set vdpa: document set_group_asid thread safety vduse: return internal vq group struct as map token vduse: add vq group support vduse: add v1 API definition ...
2026-02-13function_graph: Restore direct mode when callbacks drop to oneShengming Hu-1/+1
When registering a second fgraph callback, direct path is disabled and array loop is used instead. When ftrace_graph_active falls back to one, we try to re-enable direct mode via ftrace_graph_enable_direct(true, ...). But ftrace_graph_enable_direct() incorrectly disables the static key rather than enabling it. This leaves fgraph_do_direct permanently off after first multi-callback transition, so direct fast mode is never restored. Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260213142932519cuWSpEXeS4-UnCvNXnK2P@zte.com.cn Fixes: cc60ee813b503 ("function_graph: Use static_call and branch to optimize entry function") Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-02-12Merge tag 'riscv-for-linus-7.0-mw1' of ↵Linus Torvalds-0/+30
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Paul Walmsley: - Add support for control flow integrity for userspace processes. This is based on the standard RISC-V ISA extensions Zicfiss and Zicfilp - Improve ptrace behavior regarding vector registers, and add some selftests - Optimize our strlen() assembly - Enable the ISO-8859-1 code page as built-in, similar to ARM64, for EFI volume mounting - Clean up some code slightly, including defining copy_user_page() as copy_page() rather than memcpy(), aligning us with other architectures; and using max3() to slightly simplify an expression in riscv_iommu_init_check() * tag 'riscv-for-linus-7.0-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (42 commits) riscv: lib: optimize strlen loop efficiency selftests: riscv: vstate_exec_nolibc: Use the regular prctl() function selftests: riscv: verify ptrace accepts valid vector csr values selftests: riscv: verify ptrace rejects invalid vector csr inputs selftests: riscv: verify syscalls discard vector context selftests: riscv: verify initial vector state with ptrace selftests: riscv: test ptrace vector interface riscv: ptrace: validate input vector csr registers riscv: csr: define vtype register elements riscv: vector: init vector context with proper vlenb riscv: ptrace: return ENODATA for inactive vector extension kselftest/riscv: add kselftest for user mode CFI riscv: add documentation for shadow stack riscv: add documentation for landing pad / indirect branch tracking riscv: create a Kconfig fragment for shadow stack and landing pad support arch/riscv: add dual vdso creation logic and select vdso based on hw arch/riscv: compile vdso with landing pad and shadow stack note riscv: enable kernel access to shadow stack memory via the FWFT SBI call riscv: add kernel command line option to opt out of user CFI riscv/hwprobe: add zicfilp / zicfiss enumeration in hwprobe ...
2026-02-12Merge tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxlLinus Torvalds-11/+60
Pull CXL updates from Dave Jiang: - Introduce cxl_memdev_attach and pave way for soft reserved handling, type2 accelerator enabling, and LSA 2.0 enabling. All these series require the endpoint driver to settle before continuing the memdev driver probe. - Address CXL port error protocol handling and reporting. The large patch series was split into three parts. The first two parts are included here with the final part coming later. The first part consists of a series of code refactoring to PCI AER sub-system that addresses CXL and also CXL RAS code to prepare for port error handling. The second part refactors the CXL code to move management of component registers to cxl_port objects to allow all CXL AER errors to be handled through the cxl_port hierarchy. - Provide AMD Zen5 platform address translation for CXL using ACPI PRMT. This includes a conventions document to explain why this is needed and how it's implemented. - Misc CXL patches of fixes, cleanups, and updates. Including CXL address translation for unaligned MOD3 regions. [ TLA service: CXL is "Compute Express Link" ] * tag 'cxl-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (59 commits) cxl: Disable HPA/SPA translation handlers for Normalized Addressing cxl/region: Factor out code into cxl_region_setup_poison() cxl/atl: Lock decoders that need address translation cxl: Enable AMD Zen5 address translation using ACPI PRMT cxl/acpi: Prepare use of EFI runtime services cxl: Introduce callback for HPA address ranges translation cxl/region: Use region data to get the root decoder cxl/region: Add @hpa_range argument to function cxl_calc_interleave_pos() cxl/region: Separate region parameter setup and region construction cxl: Simplify cxl_root_ops allocation and handling cxl/region: Store HPA range in struct cxl_region cxl/region: Store root decoder in struct cxl_region cxl/region: Rename misleading variable name @hpa to @hpa_range Documentation/driver-api/cxl: ACPI PRM Address Translation Support and AMD Zen5 enablement cxl, doc: Moving conventions in separate files cxl, doc: Remove isonum.txt inclusion cxl/port: Unify endpoint and switch port lookup cxl/port: Move endpoint component register management to cxl_port cxl/port: Map Port RAS registers cxl/port: Move dport RAS setup to dport add time ...
2026-02-12kho: remove unnecessary WARN_ON(err) in kho_populate()Ran Xiaokai-1/+1
The following pr_warn() provides detailed error and location information, WARN_ON(err) adds no additional debugging value, so remove the redundant WARN_ON() call. Link: https://lkml.kernel.org/r/20260212111146.210086-3-ranxiaokai627@163.com Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>