summaryrefslogtreecommitdiffstats
path: root/net
AgeCommit message (Collapse)AuthorLines
2026-05-15Merge tag 'ceph-for-7.1-rc4' of https://github.com/ceph/ceph-clientLinus Torvalds-10/+18
Pull ceph fixes from Ilya Dryomov: "An important patch from Hristo that squashes a folio reference leak that could lead to OOM kills in CephFS and a number of miscellaneous fixes from Raphael and Slava. All but two are marked for stable" * tag 'ceph-for-7.1-rc4' of https://github.com/ceph/ceph-client: libceph: Fix potential null-ptr-deref in decode_choose_args() libceph: handle rbtree insertion error in decode_choose_args() libceph: Fix potential out-of-bounds access in osdmap_decode() ceph: put folios not suitable for writeback ceph: add ceph_has_realms_with_quotas() check to ceph_quota_update_statfs() libceph: Fix potential out-of-bounds access in __ceph_x_decrypt() ceph: fix BUG_ON in __ceph_build_xattrs_blob() due to stale blob size ceph: fix a buffer leak in __ceph_setxattr() libceph: Fix unnecessarily high ceph_decode_need() for uniform bucket libceph: Fix potential out-of-bounds access in crush_decode()
2026-05-15Merge tag 'nfsd-7.1-1' of ↵Linus Torvalds-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Fixes for this release: - Correctness fix for the new sunrpc cache netlink protocol Marked for stable: - Correctness fixes for delegated attributes - Prevent an infinite loop when revoking layouts" * tag 'nfsd-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix infinite loop in layout state revocation sunrpc: start cache request seqno at 1 to fix netlink GET_REQS nfsd: update mtime/ctime on COPY in presence of delegated attributes nfsd: update mtime/ctime on CLONE in presense of delegated attributes nfsd: fix file change detection in CB_GETATTR nfsd: fix GET_DIR_DELEGATION when VFS leases are disabled
2026-05-14Merge tag 'net-7.1-rc4' of ↵Linus Torvalds-542/+881
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Previous releases - regressions: - ethtool: fix NULL pointer dereference in phy_reply_size - netfilter: - allocate hook ops while under mutex - close dangling table module init race - restore nf_conntrack helper propagation via expectation - tcp: - fix potential UAF in reqsk_timer_handler(). - fix out-of-bounds access for twsk in tcp_ao_established_key(). - vsock: fix empty payload in tap skb for non-linear buffers - hsr: fix NULL pointer dereference in hsr_get_node_data() - eth: - cortina: fix RX drop accounting - ice: fix locking in ice_dcb_rebuild() Previous releases - always broken: - napi: avoid gro timer misfiring at end of busypoll - sched: - dualpi2: initialize timer earlier in dualpi2_init() - sch_cbs: Call qdisc_reset for child qdisc - shaper: - fix ordering issue in net_shaper_commit() - reject handle IDs exceeding internal bit-width - ipv6: flowlabel: enforce per-netns limit for unprivileged callers - tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring - smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint - sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL - batman-adv: - reject new tp_meter sessions during teardown - purge non-released claims - eth: - i40e: cleanup PTP registration on probe failure - idpf: fix double free and use-after-free in aux device error paths - ena: fix potential use-after-free in get_timestamp" * tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) net: phy: DP83TC811: add reading of abilities net: tls: prevent chain-after-chain in plain text SG net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot macsec: use rcu_work to defer TX SA crypto cleanup out of softirq macsec: use rcu_work to defer RX SA crypto cleanup out of softirq macsec: introduce dedicated workqueue for SA crypto cleanup net: net_failover: Fix the deadlock in slave register MAINTAINERS: update atlantic driver maintainer selftests/tc-testing: Add QFQ/CBS qlen underflow test net/sched: sch_cbs: Call qdisc_reset for child qdisc FDDI: defza: Sanitise the reset safety timer net: ethernet: ravb: Do not check URAM suspension when WoL is active ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS net: atm: fix skb leak in sigd_send() default branch net: ethtool: phy: avoid NULL deref when PHY driver is unbound net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled net: shaper: reject QUEUE scope handle with missing id ...
2026-05-14net: tls: prevent chain-after-chain in plain text SGJakub Kicinski-6/+18
Sashiko points out that if end = 0 (start != 0) the current code will create a chain link to content type right after the wrap link: This would create a chain where the wrap link points directly to another chain link. The scatterlist API sg_next iterator does not recursively resolve consecutive chain links. meaning this is illegal input to crypto. The wrapping link is unnecessary if end = 0. end is the entry after the last one used so end = 0 means there's nothing pushed after the wrap: end start i v v v [ ]...[ ][ d ][ d ][ d ][ d ][rsv for wrap] Skip the wrapping in this case. TLS 1.3 can use the "wrapping slot" for it's chaining if end = 0. This avoids the chain-after-chain. Move the wrap chaining before marking END and chaining off content type, that feels like more logical ordering to me, but should not matter from functional perspective. Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-14net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ringJakub Kicinski-4/+2
When an sk_msg scatterlist ring wraps (sg.end < sg.start), tls_push_record() chains the tail portion of the ring to the head using sg_chain(). An extra entry in the sg array is reserved for this: struct sk_msg_sg { [...] /* The extra two elements: * 1) used for chaining the front and sections when the list becomes * partitioned (e.g. end < start). The crypto APIs require the * chaining; * 2) to chain tailer SG entries after the message. */ struct scatterlist data[MAX_MSG_FRAGS + 2]; The current code uses MAX_SKB_FRAGS + 1 as the ring size: sg_chain(&msg_pl->sg.data[msg_pl->sg.start], MAX_SKB_FRAGS - msg_pl->sg.start + 1, msg_pl->sg.data); This places the chain pointer at sg_chain(data[start], (MAX_SKB_FRAGS - msg_start + 1) .. = &data[start] + (MAX_SKB_FRAGS - msg_start + 1) - 1 = data[start + (MAX_SKB_FRAGS - start + 1) - 1] = data[MAX_SKB_FRAGS] instead of the true last entry. This is likely due to a "race" of the commit under Fixes landing close to commit 031097d9e079 ("bpf: sk_msg, zap ingress queue on psock down") Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested by Sabrina). Reported-by: 钱一铭 <yimingqian591@gmail.com> Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-14net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slotXiang Mei-1/+2
On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt() populates V2 entries starting at index 1, so when no V1 device is selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] == NULL and ism_chid[0] == 0. smc_v2_determine_accepted_chid() then matches the peer's CHID against the array starting from index 0 using the CHID alone. A malicious peer replying to a SMC-Dv2-only proposal with d1.chid == 0 matches the empty slot, ini->ism_selected becomes 0, and the subsequent ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at offsetof(struct smcd_dev, lgr_lock) == 0x68: BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0 Write of size 4 at addr 0000000000000068 by task exploit/144 Call Trace: _raw_spin_lock_bh smc_conn_create (net/smc/smc_core.c:1997) __smc_connect (net/smc/af_smc.c:1447) smc_connect (net/smc/af_smc.c:1720) __sys_connect __x64_sys_connect do_syscall_64 Require ism_dev[i] to be non-NULL before accepting a CHID match. Fixes: a7c9c5f4af7f ("net/smc: CLC accept / confirm V2") Reported-by: Weiming Shi <bestswngs@gmail.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Xiang Mei <xmei5@asu.edu> Link: https://patch.msgid.link/20260511062138.2839584-1-xmei5@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-13net: net_failover: Fix the deadlock in slave registerFaicker Mo-1/+5
There is netdev_lock_ops() before the NETDEV_REGISTER notifier in register_netdevice(), so use the non-locking functions in net_failover_slave_register(). failover_slave_register() in failover_existing_slave_register() adds lock and unlock ops too. Call Trace: <TASK> __schedule+0x30d/0x7a0 schedule+0x27/0x90 schedule_preempt_disabled+0x15/0x30 __mutex_lock.constprop.0+0x538/0x9e0 __mutex_lock_slowpath+0x13/0x20 mutex_lock+0x3b/0x50 dev_set_mtu+0x40/0xe0 net_failover_slave_register+0x24/0x280 failover_slave_register+0x103/0x1b0 failover_event+0x15e/0x210 ? dropmon_net_event+0xac/0xe0 notifier_call_chain+0x5e/0xe0 raw_notifier_call_chain+0x16/0x30 call_netdevice_notifiers_info+0x52/0xa0 register_netdevice+0x5f4/0x7c0 register_netdev+0x1e/0x40 _mlx5e_probe+0xe2/0x370 [mlx5_core] mlx5e_probe+0x59/0x70 [mlx5_core] ? __pfx_mlx5e_probe+0x10/0x10 [mlx5_core] Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP") Signed-off-by: Faicker Mo <faicker.mo@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-13net/sched: sch_cbs: Call qdisc_reset for child qdiscJamal Hadi Salim-1/+15
During a reset, CBS is not calling reset on its child qdisc, which might cause qlen/backlog accounting issues. For example, if we have CBS with a QFQ parent and a netem child with delay, we can create a scenario where the parent's qlen underflows. QFQ, specifically, uses qlen to check whether it should deference a pointer, so this scenario may cause a null-ptr deref in QFQ: [ 43.875639][ T319] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] SMP KASAN NOPTI [ 43.876124][ T319] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f] [ 43.876417][ T319] CPU: 10 UID: 0 PID: 319 Comm: ping Not tainted 7.0.0-13039-ge728258debd5 #773 PREEMPT(full) [ 43.876751][ T319] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 43.876949][ T319] RIP: 0010:qfq_dequeue+0x35c/0x1650 [ 43.877123][ T319] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b [ 43.877648][ T319] RSP: 0018:ffff8881017ef4f0 EFLAGS: 00010216 [ 43.877845][ T319] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000 [ 43.878073][ T319] RDX: 0000000000000009 RSI: 0000000c40000000 RDI: ffff88810eef02b0 [ 43.878306][ T319] RBP: ffff88810eef0000 R08: ffff88810eef0280 R09: 1ffff1102120fd63 [ 43.878523][ T319] R10: 1ffff1102120fd66 R11: 1ffff1102120fd67 R12: 0000000c40000000 [ 43.878742][ T319] R13: ffff88810eef02b8 R14: 0000000000000048 R15: 0000000020000000 [ 43.878959][ T319] FS: 00007f9c51c47c40(0000) GS:ffff88817a0be000(0000) knlGS:0000000000000000 [ 43.879214][ T319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 43.879403][ T319] CR2: 000055e69a2230a8 CR3: 000000010c07a000 CR4: 0000000000750ef0 [ 43.879621][ T319] PKRU: 55555554 [ 43.879735][ T319] Call Trace: [ 43.879844][ T319] <TASK> [ 43.879924][ T319] __qdisc_run+0x169/0x1900 [ 43.880075][ T319] ? dev_qdisc_enqueue+0x8b/0x210 [ 43.880222][ T319] __dev_queue_xmit+0x2346/0x37a0 [ 43.880376][ T319] ? register_lock_class+0x3f/0x800 [ 43.880531][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.880684][ T319] ? __pfx___dev_queue_xmit+0x10/0x10 [ 43.880834][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.880977][ T319] ? __lock_acquire+0x819/0x1df0 [ 43.881124][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.881275][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.881418][ T319] ? __asan_memcpy+0x3c/0x60 [ 43.881563][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.881708][ T319] ? eth_header+0x165/0x1a0 [ 43.881853][ T319] ? lockdep_hardirqs_on_prepare+0xdb/0x1a0 [ 43.882031][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.882174][ T319] ? neigh_resolve_output+0x3cc/0x7e0 [ 43.882325][ T319] ? srso_alias_return_thunk+0x5/0xfbef5 [ 43.882471][ T319] ip_finish_output2+0x6b6/0x1e10 Fix this by calling qdisc_reset for CBS' child qdisc. Sashiko caught an issue which could result in a null ptr deref if qdisc_create_dflt() is invoked on an unitialised cbs qdisc which is exposed by this patch. We add an early return if the qdisc is null to address this. This is a similar approach used by two other fixes[1][2]. The proper fix for this specific issue elucidated by sashiko is to remove the call to qdisc_reset when qdisc_create_dflt fails. Since the dflt qdisc isn't attached anywhere yet at that point, calling the reset callback doesn't make much sense (and as stated has been a source of two other bugs). We plan on submitting this fix in a later patch. [1] https://lore.kernel.org/netdev/20221018063201.306474-2-shaozhengchao@huawei.com/ [2] https://lore.kernel.org/netdev/20221018063201.306474-4-shaozhengchao@huawei.com/ Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc") Reported-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Junyoung Jang <graypanda.inzag@gmail.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-12ethtool: fix ethnl_bitmap32_not_zero() bit interval semanticsChenguang Zhao-4/+4
ethnl_bitmap32_not_zero() should return true if some bit in [start, end) is set: - Fix inverted memchr_inv() sense: return true when the scan finds a non-zero byte, not when the middle words are all zero. - Return false for an empty interval (end <= start). - When end is 32-bit aligned, indices in [start, end) do not include any bits from map[end_word]; return false after earlier checks found no non-zero data. Fixes: 10b518d4e6dd ("ethtool: netlink bitset handling") Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-12net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepointXiang Mei-1/+1
The smc_msg_event tracepoint class, shared by smc_tx_sendmsg and smc_rx_recvmsg, unconditionally dereferences smc->conn.lnk: __string(name, smc->conn.lnk->ibname) conn->lnk is only set for SMC-R; for SMC-D it is NULL. Other code on these paths already handles this (e.g. !conn->lnk in SMC_STAT_RMB_TX_SIZE_SMALL()). With the tracepoint enabled, the first sendmsg()/recvmsg() on an SMC-D socket crashes: Oops: general protection fault, probably for non-canonical address KASAN: null-ptr-deref in range [...] RIP: 0010:strlen+0x1e/0xa0 Call Trace: trace_event_raw_event_smc_msg_event (net/smc/smc_tracepoint.h:44) smc_rx_recvmsg (net/smc/smc_rx.c:515) smc_recvmsg (net/smc/af_smc.c:2859) __sys_recvfrom (net/socket.c:2315) __x64_sys_recvfrom (net/socket.c:2326) do_syscall_64 The faulting address 0x3e0 is offsetof(struct smc_link, ibname), confirming the NULL ->lnk deref. Enabling the tracepoint requires root, but the trigger itself is unprivileged: socket(AF_SMC, ...) has no capability check, and SMC-D negotiation needs no admin step on s390 or on x86 with the loopback ISM device loaded. Log an empty device name for SMC-D instead of dereferencing NULL. Fixes: aff3083f10bf ("net/smc: Introduce tracepoints for tx and rx msg") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-12net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoSNicolò Coccia-9/+8
A logic flaw in __smc_setsockopt() allows a local unprivileged user to cause a Denial of Service (DoS) by holding the socket lock indefinitely. The function __smc_setsockopt() calls copy_from_sockptr() while holding lock_sock(sk). By passing a userfaultfd-monitored memory page (or FUSE-backed memory on systems where unprivileged userfaultfd is disabled) as the optval, an attacker can halt execution during the copy operation, keeping the lock held. Combined with asynchronous tear-down operations like shutdown(), this exhausts the kernel wq (kworkers) and triggers the hung task watchdog. [ 240.123456] INFO: task kworker/u8:2 blocked for more than 120 seconds. [ 240.123489] Call Trace: [ 240.123501] smc_shutdown+... [ 240.123512] lock_sock_nested+... This patch moves the user-space copy outside the lock_sock() critical section to prevent the issue. Fixes: a6a6fe27bab4 ("net/smc: Dynamic control handshake limitation by socket options") Signed-off-by: Nicolò Coccia <n.coccia96@gmail.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-12net: atm: fix skb leak in sigd_send() default branchWei Yang-0/+1
The default branch in sigd_send() calls sock_put() and returns -EINVAL without freeing the skb, while all other exit paths do so. Add the missing dev_kfree_skb() before sock_put() to fix the leak. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Wei Yang <albinwyang@tencent.com> Link: https://patch.msgid.link/20260509122358.1102997-1-albin_yang@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-12net: ethtool: phy: avoid NULL deref when PHY driver is unboundDavid Carlier-4/+6
phydev->drv can become NULL while the phy_device is still attached to its net_device, namely after the PHY driver is unbound via sysfs: echo <mdio_id> > /sys/bus/mdio_bus/drivers/<phy_drv>/unbind phy_remove() clears phydev->drv but doesn't call phy_detach(), so the phy_device stays in the link topology xarray and ethnl_req_get_phydev() still hands it back. ETHTOOL_MSG_PHY_GET then oopses on: rep_data->drvname = kstrdup(phydev->drv->name, GFP_KERNEL); drvname is already treated as optional by phy_reply_size(), phy_fill_reply() and phy_cleanup_data(), so just skip the allocation when there is no driver bound. Fixes: 9dd2ad5e92b9 ("net: ethtool: phy: Convert the PHY_GET command to generic phy dump") Cc: stable@vger.kernel.org # 6.13.x Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260509215046.107157-1-devnexen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-12libceph: Fix potential null-ptr-deref in decode_choose_args()Raphael Zimmer-1/+2
A message of type CEPH_MSG_OSD_MAP contains an OSD map that itself contains a CRUSH map. When decoding this CRUSH map in crush_decode(), an array of max_buckets CRUSH buckets is decoded, where some indices may not refer to actual buckets and are therefore set to NULL. The received CRUSH map may optionally contain choose_args that get decoded in decode_choose_args(). When decoding a crush_choose_arg_map, a series of choose_args for different buckets is decoded, with the bucket_index being read from the incoming message. It is only checked that the bucket index does not exceed max_buckets, but not that it doesn't point to an index with a NULL bucket. If a (potentially corrupted) message contains a crush_choose_arg_map including such a bucket_index, a null pointer dereference may occur in the subsequent processing when attempting to access the bucket with the given index. This patch fixes the issue by extending the affected check. Now, it is only attempted to access the bucket if it is not NULL. Cc: stable@vger.kernel.org Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-05-12libceph: handle rbtree insertion error in decode_choose_args()Raphael Zimmer-1/+4
A message of type CEPH_MSG_OSD_MAP contains an OSD map that itself contains a CRUSH map. The received CRUSH map may optionally contain choose_args that get decoded in decode_choose_args(). In this function, num_choose_arg_maps is read from the message, and a corresponding number of crush_choose_arg_maps gets decoded afterwards. Each crush_choose_arg_map has a choose_args_index, which serves as the key when inserting it into the choose_args rbtree of the decoded crush_map. If a (potentially corrupted) message contains two crush_choose_arg_maps with the same index, the assertion in insert_choose_arg_map() triggers a kernel BUG when trying to insert the second crush_choose_arg_map. This patch fixes the issue by switching to the non-asserting rbtree insertion function and rejecting the message if the insertion fails. [ idryomov: changelog ] Cc: stable@vger.kernel.org Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-05-12net: shaper: reject QUEUE scope handle with missing idJakub Kicinski-2/+7
net_shaper_parse_handle() does not enforce that the user provides the handle ID. For NODE the ID defaults to UNSPEC for all other cases it defaults to 0. For NETDEV 0 is the only option. For QUEUE defaulting to 0 makes less intuitive sense. Specifically because the behavior should (IMHO) be the same for all cases where there may be more than one ID (QUEUE and NODE). We should either document this as intentional or reject. I picked the latter with no strong conviction. Fixes: 4b623f9f0f59 ("net-shapers: implement NL get operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-11-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12net: shaper: enforce singleton NETDEV scope with id 0Jakub Kicinski-0/+6
The NETDEV scope represents a singleton root shaper in the per-device hierarchy. All code assumes NETDEV shapers have id 0: net_shaper_default_parent() hardcodes parent->id = 0 when returning the NETDEV parent for QUEUE/NODE children, and the UAPI documentation describes NETDEV scope as "the main shaper" (singular, not plural). Make sure we reject non-0 IDs. Fixes: 4b623f9f0f59 ("net-shapers: implement NL get operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-10-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12net: shaper: reject handle IDs exceeding internal bit-widthJakub Kicinski-2/+11
net_shaper_parse_handle() reads the user-supplied handle ID via nla_get_u32(), accepting the full u32 range. However, the xarray key is built by net_shaper_handle_to_index() using FIELD_PREP(NET_SHAPER_ID_MASK, handle->id), where NET_SHAPER_ID_MASK is GENMASK(25, 0) - only 26 bits wide. FIELD_PREP silently masks off the upper bits at runtime. A user-supplied NODE id like 0x04000123 becomes id 0x123. Additionally, a user-supplied id equal to NET_SHAPER_ID_UNSPEC (0x03FFFFFF, which is NET_SHAPER_ID_MASK itself) would collide with the sentinel used internally by the group operation to signal "allocate a new NODE id". Reject user-supplied IDs >= NET_SHAPER_ID_MASK (i.e., >= 0x03FFFFFF) in the policy. Fixes: 4b623f9f0f59 ("net-shapers: implement NL get operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-9-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12net: shaper: fix undersized reply skb allocation in GROUP commandJakub Kicinski-2/+8
net_shaper_group_send_reply() writes both the NET_SHAPER_A_IFINDEX attribute (via net_shaper_fill_binding()) and the nested NET_SHAPER_A_HANDLE attribute (via net_shaper_fill_handle()), but the reply skb at the call site in net_shaper_nl_group_doit() is allocated using net_shaper_handle_size(), which only accounts for the nested handle. The allocation is therefore short by nla_total_size(sizeof(u32)) (8 bytes) for the IFINDEX attribute. In practice the slab allocator rounds up the small allocation so the bug is latent, but the size accounting is wrong and could bite if the reply grew further. Introduce net_shaper_group_reply_size() that accounts for the full reply payload and use it both at the genlmsg_new() call site and in the defensive WARN_ONCE message. Fixes: 5d5d4700e75d ("net-shapers: implement NL group operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-7-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12net: shaper: set ret to -ENOMEM when genlmsg_new() fails in group_doitJakub Kicinski-1/+3
genlmsg_new() alloc failure path in net_shaper_nl_group_doit() forgets to set ret before jumping to error handling. Fixes: 5d5d4700e75d ("net-shapers: implement NL group operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-6-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12net: shaper: reject duplicate leaves in GROUP requestJakub Kicinski-15/+45
net_shaper_nl_group_doit() does not deduplicate NET_SHAPER_A_LEAVES entries. When userspace supplies the same leaf handle twice, the same old-parent pointer lands twice in old_nodes[]. The cleanup loop double frees the parent. Of course the same parent may still be in old_nodes[] twice if we are moving multiple of its leaves. Note that this patch also implicitly fixes the fact that the i >= leaves_count path forgets to set ret. Fixes: 5d5d4700e75d ("net-shapers: implement NL group operation") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-4-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12net: shaper: fix trivial ordering issue in net_shaper_commit()Jakub Kicinski-1/+10
We should update the entry before we mark it as valid. Fixes: 93954b40f6a4 ("net-shapers: implement NL set and delete operations") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12net: shaper: flip the polarity of the valid flagJakub Kicinski-17/+17
The usual way of inserting entries which are not yet fully ready into XArray is to have a VALID flag. The shaper code has a NOT_VALID flag. Since XArray code does not let us create entries with marks already set - the creation of entries is currently not atomic. Flip the polarity of the VALID flag. This closes the tiny race in net_shaper_pre_insert() of entries being created without the NOT_VALID flag. Fixes: 93954b40f6a4 ("net-shapers: implement NL set and delete operations") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20260510192904.3987113-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12vsock/virtio: fix empty payload in tap skb for non-linear buffersStefano Garzarella-28/+12
For non-linear skbs, virtio_transport_build_skb() goes through virtio_transport_copy_nonlinear_skb() to copy the original payload in the new skb to be delivered to the vsockmon tap device. This manually initializes an iov_iter but does not set iov_iter.count. Since the iov_iter is zero-initialized, the copy length is zero and no payload is actually copied to the monitor interface, leaving data un-initialized. Fix this by removing the linear vs non-linear split and using skb_copy_datagram_iter() with iov_iter_kvec() for all cases, as vhost-vsock already does. This handles both linear and non-linear skbs, properly initializes the iov_iter, and removes the now unused virtio_transport_copy_nonlinear_skb(). While touching this code, let's also check the return value of skb_copy_datagram_iter(), even though it's unlikely to fail. Fixes: 4b0bf10eb077 ("vsock/virtio: non-linear skb handling for tap") Reported-by: Yiqi Sun <sunyiqixm@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-3-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12vsock/virtio: fix length and offset in tap skb for split packetsStefano Garzarella-5/+6
virtio_transport_build_skb() builds a new skb to be delivered to the vsockmon tap device. To build the new skb, it uses the original skb data length as payload length, but as the comment notes, the original packet stored in the skb may have been split in multiple packets, so we need to use the length in the header, which is correctly updated before the packet is delivered to the tap, and the offset for the data. This was also similar to what we did before commit 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") where we probably missed something during the skb conversion. Also update the comment above, which was left stale by the skb conversion and still mentioned a buffer pointer that no longer exists. Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-2-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-12net: hsr: fix NULL pointer dereference in hsr_get_node_data()Quan Sun-1/+4
In the HSR (High-availability Seamless Redundancy) protocol, node information is maintained in the node_db. When a supervision frame is received, node->addr_B_port is updated to track the receiving port type (e.g., HSR_PT_SLAVE_B). If the underlying physical interface associated with this slave port is removed (e.g., via `ip link del`), hsr_del_port() frees the hsr_port object. However, the stale node->addr_B_port reference is kept in the node_db until the node ages out. Subsequently, if userspace queries the node status via the Netlink command HSR_C_GET_NODE_STATUS, the kernel calls hsr_get_node_data(). This function unconditionally dereferences the pointer returned by hsr_port_get_hsr(): if (node->addr_B_port != HSR_PT_NONE) { port = hsr_port_get_hsr(hsr, node->addr_B_port); *addr_b_ifindex = port->dev->ifindex; // <-- NULL deref } If the slave port has been deleted, hsr_port_get_hsr() returns NULL, resulting in a kernel panic. Oops: general protection fault, probably for non-canonical address KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] RIP: 0010:hsr_get_node_data+0x7b6/0x9e0 Call Trace: <TASK> hsr_get_node_status+0x445/0xa40 Fix this by adding a proper NULL pointer check. If the port lookup fails due to a stale port type, gracefully treat it as if no valid port exists and assign -1 to the interface index. Steps to reproduce: 1. Create an HSR interface with two slave devices. 2. Receive a supervision frame to populate node_db with addr_B_port assigned to SLAVE_B. 3. Delete the underlying slave device B. 4. Send an HSR_C_GET_NODE_STATUS Netlink message. Fixes: c5a759117210 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.") Signed-off-by: Quan Sun <2022090917019@std.uestc.edu.cn> Link: https://patch.msgid.link/20260508124636.1462346-1-2022090917019@std.uestc.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-11net/sched: dualpi2: initialize timer earlier in dualpi2_init()Davide Caratti-2/+2
'pi2_timer' needs to be initialized in all error paths of dualpi2_init(): otherwise, a failure in qdisc_create_dflt() causes the following crash in dualpi2_destroy(): # tc qdisc add dev crash0 handle 1: root dualpi2 BUG: kernel NULL pointer dereference, address: 0000000000000010 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 4 UID: 0 PID: 471 Comm: tc Tainted: G E 7.1.0-rc1-virtme #2 PREEMPT(full) Tainted: [E]=UNSIGNED_MODULE Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: 0010:hrtimer_active+0x39/0x60 Code: f9 eb 23 0f b6 41 38 3c 01 0f 87 87 64 c0 ff 83 e0 01 75 33 48 39 4a 50 74 28 44 3b 42 10 75 06 48 3b 51 30 74 21 48 8b 51 30 <44> 8b 42 10 41 f6 c0 01 74 cf f3 90 44 8b 42 10 41 f6 c0 01 74 c3 RSP: 0018:ffffd0db80b93620 EFLAGS: 00010282 RAX: ffffffffc0400320 RBX: ffff8cf24a4c86b8 RCX: ffff8cf24a4c86b8 RDX: 0000000000000000 RSI: ffff8cf2429c2ab0 RDI: ffff8cf24a4c86b8 RBP: 00000000fffffff4 R08: 0000000000000003 R09: 0000000000000000 R10: 0000000000000001 R11: ffff8cf24a39c500 R12: ffff8cf24822c000 R13: ffffd0db80b936c0 R14: ffffffffc02cf360 R15: 00000000ffffffff FS: 00007fbc01706580(0000) GS:ffff8cf2dc759000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000000008e02003 CR4: 0000000000172ef0 Call Trace: <TASK> hrtimer_cancel+0x15/0x40 dualpi2_destroy+0x20/0x40 [sch_dualpi2] qdisc_create+0x230/0x570 tc_modify_qdisc+0x716/0xc10 rtnetlink_rcv_msg+0x188/0x780 netlink_rcv_skb+0xcd/0x150 netlink_unicast+0x1ba/0x290 netlink_sendmsg+0x242/0x4d0 ____sys_sendmsg+0x39e/0x3e0 ___sys_sendmsg+0xe1/0x130 __sys_sendmsg+0xad/0x110 do_syscall_64+0x14f/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fbc0188b08e Code: 4d 89 d8 e8 94 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 03 ff ff ff 0f 1f 00 f3 0f 1e fa RSP: 002b:00007fff593260e0 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc0188b08e RDX: 0000000000000000 RSI: 00007fff59326190 RDI: 0000000000000003 RBP: 00007fff593260f0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 000055f06124f260 R13: 0000000069fca043 R14: 000055f061255640 R15: 000055f06124d3f8 </TASK> Modules linked in: sch_dualpi2(E) CR2: 0000000000000010 [1] https://lore.kernel.org/netdev/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com/ [2] https://lore.kernel.org/netdev/20200725201707.16909-1-xiyou.wangcong@gmail.com/ v2: - rebased on top of latest net.git Fixes: 320d031ad6e4 ("sched: Struct definition and parsing of dualpi2 qdisc") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/1faca91179702b31da5d87653e1e036543e32722.1778259798.git.dcaratti@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-11tcp: Fix out-of-bounds access for twsk in tcp_ao_established_key().Kuniyuki Iwashima-1/+2
lockdep_sock_is_held() was added in tcp_ao_established_key() by the cited commit. It can be called from tcp_v[46]_timewait_ack() with twsk. Since it does not have sk->sk_lock, the lockdep annotation results in out-of-bound access. $ pahole -C tcp_timewait_sock vmlinux | grep size /* size: 288, cachelines: 5, members: 8 */ $ pahole -C sock vmlinux | grep sk_lock socket_lock_t sk_lock; /* 440 192 */ Let's not use lockdep_sock_is_held() for TCP_TIME_WAIT. Fixes: 6b2d11e2d8fc ("net/tcp: Add missing lockdep annotations for TCP-AO hlist traversals") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260508120853.4098365-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-11net/rds: reset op_nents when zerocopy page pin failsAllison Henderson-0/+1
When iov_iter_get_pages2() fails in rds_message_zcopy_from_user(), the pinned pages are released with put_page(), and rm->data.op_mmp_znotifier is cleared. But we fail to properly clear rm->data.op_nents. Later when rds_message_purge() is called from rds_sendmsg() the cleanup loop iterates over the incorrectly non zero number of op_nents and frees them again. Fix this by properly resetting op_nents when it should be in rds_message_zcopy_from_user(). Fixes: 0cebaccef3ac ("rds: zerocopy Tx support.") Signed-off-by: Allison Henderson <achender@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260505234336.2132721-1-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-11libceph: Fix potential out-of-bounds access in osdmap_decode()Raphael Zimmer-1/+1
When decoding osd_state and osd_weight from an incoming osdmap in osdmap_decode(), both are decoded for each osd, i.e., map->max_osd times. The ceph_decode_need() check only accounts for sizeof(*map->osd_weight) once. This can potentially result in an out-of-bounds memory access if the incoming message is corrupted such that the max_osd value exceeds the actual content of the osdmap message. This patch fixes the issue by changing the corresponding part in the ceph_decode_need() check to account for map->max_osd*sizeof(*map->osd_weight). Cc: stable@vger.kernel.org Fixes: dcbc919a5dc8 ("libceph: switch osdmap decoding to use ceph_decode_entity_addr") Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-05-11libceph: Fix potential out-of-bounds access in __ceph_x_decrypt()Raphael Zimmer-0/+5
In __ceph_x_decrypt(), a part of the buffer p is interpreted as a ceph_x_encrypt_header, and the magic field of this struct is accessed. This happens without any guarantee that the buffer is large enough to hold this struct. The function parameter ciphertext_len represents the length of the ciphertext to decrypt and is guaranteed to be at most the remaining size of the allocated buffer p. However, this value is not necessarily greater than sizeof(ceph_x_encrypt_header). E.g., a message frame of type FRAME_TAG_AUTH_REPLY_MORE, that is just as long to hold the ciphertext at its end with a ciphertext_len of 8 or less, can trigger an out-of-bounds memory access when accessing hdr->magic. This patch fixes the issue by adding a check to ensure that the decrypted plaintext in the buffer is large enough to represent at least the ceph_x_encrypt_header. Cc: stable@vger.kernel.org Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-05-11libceph: Fix unnecessarily high ceph_decode_need() for uniform bucketRaphael Zimmer-2/+1
In crush_decode_uniform_bucket(), the item_weight field of the bucket is set. This is a single field of type u32 since the uniform bucket uses the same weight for all items. The value in ceph_decode_need() is set to (1+b->h.size) * sizeof(u32), which is higher than actually needed. This patch removes the call to ceph_decode_need() with the unnecessarily high value and switches the subsequent operation from ceph_decode_32() to ceph_decode_32_safe(), which already includes the correct bounds check. Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-05-11libceph: Fix potential out-of-bounds access in crush_decode()Raphael Zimmer-5/+5
A message of type CEPH_MSG_OSD_MAP containing a crush map with at least one bucket has two fields holding the bucket algorithm. If the values in these two fields differ, an out-of-bounds access can occur. This is the case because the first algorithm field (alg) is used to allocate the correct amount of memory for a bucket of this type, while the second algorithm field inside the bucket (b->alg) is used in the subsequent processing. This patch fixes the issue by adding a check that compares alg and b->alg and aborts the processing in case they differ. Furthermore, b->alg is set to 0 in this case, because the destruction of the crush map also uses this field to determine the bucket type, which can again result in an out-of-bounds access when trying to free the memory pointed to by the fields of the bucket. To correctly free the memory allocated for the bucket in such a case, the corresponding call to kfree is moved from the algorithm-specific crush_destroy_bucket functions to the generic crush_destroy_bucket(). Cc: stable@vger.kernel.org Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2026-05-10Merge tag 'batadv-net-pullrequest-20260508' of https://git.open-mesh.org/batadvJakub Kicinski-46/+172
Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - fix integer overflow on buff_pos, by Lyes Bourennani - fix invalid tp_meter access during teardown, by Jiexun Wang (2 patches) - stop caching unowned originator pointers in BAT IV, by Jiexun Wang - tp_meter: fix tp_num leak on kmalloc failure, by Sven Eckelmann - fix BLA refcounting issues, by Sven Eckelmann (3 patches) * tag 'batadv-net-pullrequest-20260508' of https://git.open-mesh.org/batadv: batman-adv: bla: put backbone reference on failed claim hash insert batman-adv: bla: only purge non-released claims batman-adv: bla: prevent use-after-free when deleting claims batman-adv: tp_meter: fix tp_num leak on kmalloc failure batman-adv: stop caching unowned originator pointers in BAT IV batman-adv: stop tp_meter sessions during mesh teardown batman-adv: reject new tp_meter sessions during teardown batman-adv: fix integer overflow on buff_pos ==================== Link: https://patch.msgid.link/20260508154314.12817-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-10sunrpc: start cache request seqno at 1 to fix netlink GET_REQSJeff Layton-1/+1
sunrpc_cache_requests_snapshot() filters requests with crq->seqno <= min_seqno. The min_seqno for the first netlink dump call is cb->args[0] which is 0. Since next_seqno was initialized to 0, the very first cache request got seqno=0 and was silently skipped by the snapshot (0 <= 0 is true). This caused netlink-based GET_REQS to return 0 pending requests even when a request was queued, preventing mountd from resolving cache entries (particularly expkey/nfsd.fh). The unresolved CACHE_PENDING state blocked all further notifications for the entry, leading to permanent NFS4ERR_DELAY hangs. Start next_seqno at 1 so all requests have seqno >= 1 and pass the snapshot filter when min_seqno is 0. Fixes: facc4e3c8042 ("sunrpc: split cache_detail queue into request and reader lists") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-05-10rxrpc: Also unshare DATA/RESPONSE packets when paged frags are presentHyunwoo Kim-2/+5
The DATA-packet handler in rxrpc_input_call_event() and the RESPONSE handler in rxrpc_verify_response() copy the skb to a linear one before calling into the security ops only when skb_cloned() is true. An skb that is not cloned but still carries externally-owned paged fragments (e.g. SKBFL_SHARED_FRAG set by splice() into a UDP socket via __ip_append_data, or a chained skb_has_frag_list()) falls through to the in-place decryption path, which binds the frag pages directly into the AEAD/skcipher SGL via skb_to_sgvec(). Extend the gate to also unshare when skb_has_frag_list() or skb_has_shared_frag() is true. This catches the splice-loopback vector and other externally-shared frag sources while preserving the zero-copy fast path for skbs whose frags are kernel-private (e.g. NIC page_pool RX, GRO). The OOM/trace handling already in place is reused. Fixes: d0d5c0cd1e71 ("rxrpc: Use skb_unshare() rather than skb_cow_data()") Cc: stable@vger.kernel.org Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-09Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds-29/+87
Pull bpf fixes from Alexei Starovoitov: - Fix sk_local_storage diag dump via netlink (Amery Hung) - Fix off-by-one in arena direct-value access (Junyoung Jang) - Reject TCP_NODELAY in bpf-tcp congestion control (KaFai Wan) - Fix type confusion in bpf_*_sock() (Kuniyuki Iwashima) - Reject TX-only AF_XDP sockets (Linpu Yu) - Don't run arg-tracking analysis twice on main subprog (Paul Chaignon) - Fix NULL pointer dereference in bpf_sk_storage_clone and fib lookup (Weiming Shi) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix off-by-one boundary validation in arena direct-value access xskmap: reject TX-only AF_XDP sockets bpf: Don't run arg-tracking analysis twice on main subprog bpf: Free reuseport cBPF prog after RCU grace period. bpf: tcp: Fix type confusion in sol_tcp_sockopt(). bpf: tcp: Fix type confusion in bpf_skc_to_tcp6_sock(). bpf: tcp: Fix type confusion in bpf_skc_to_tcp_sock(). mptcp: bpf: Fix type confusion in bpf_mptcp_sock_from_subflow() selftest: bpf: Add test for bpf_tcp_sock() and RAW socket. bpf: tcp: Fix type confusion in bpf_tcp_sock(). tools/headers: Regenerate stddef.h to fix BPF selftests bpf: Fix sk_local_storage diag dumping uninitialized special fields bpf: Fix NULL pointer dereference in bpf_skb_fib_lookup() sockmap: Fix sk_psock_drop() race vs sock_map_{unhash,close,destroy}(). bpf: Fix NULL pointer dereference in bpf_sk_storage_clone and diag paths selftests/bpf: Verify bpf-tcp-cc rejects TCP_NODELAY selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks bpf: Reject TCP_NODELAY in bpf-tcp-cc bpf: Reject TCP_NODELAY in TCP header option callbacks
2026-05-09xskmap: reject TX-only AF_XDP socketsLinpu Yu-0/+4
XSKMAP entries are used as redirect targets for incoming XDP frames. A TX-only AF_XDP socket lacks an Rx ring and cannot handle redirected traffic, but xsk_map_update_elem() currently allows such sockets to be inserted into the map. Redirecting packets to such a socket on the veth generic-XDP path causes a kernel crash in xsk_generic_rcv(). This became possible after xsk_is_setup_for_bpf_map() was removed from the XSKMAP update path, which allowed bound TX-only sockets to be inserted into the map. Reject TX-only sockets during XSKMAP updates to avoid the crash. They remain fully operational for pure Tx purposes outside XSKMAP. Fixes: 968be23ceaca ("xsk: Fix possible segfault at xskmap entry insertion") Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Yifan Wu <yifanwucs@gmail.com> Signed-off-by: Linpu Yu <linpu5433@gmail.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20260508144344.694-1-linpu5433@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-08Merge tag 'nf-26-05-08' of ↵Jakub Kicinski-359/+408
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net: 1) Allow initial x_tables table replacement without emitting an audit log message. Delay the register message until after hooks are wired up to avoid unnecessary unregister logs during error unwinding. 2) Fix a NULL dereference by allocating hook ops before adding the table to the per-netns list. Use `synchronize_rcu()` during error unwinding to ensure the table stops processing packets before teardown. Defer audit log register message until all operations succeed. 3) Refactor xtables to use a single `xt_unregister_table_pre_exit` function. Eliminate code duplication by centralizing table unregistration logic within the xtables core. ebtables cannot be changed due to incompatibility. 4) Unregister xtables templates before module removal. This prevents a race condition where userspace instantiates a new table after the pernet unreg removed the current table. 5) Add `xtables_unregister_table_exit` to fully unregister netfilter tables during module removal. Unlink the table from dying lists, then free hook operations. 6) Implement a two-stage removal scheme for ebtables following the x_tables pattern. Assign table->ops while holding the ebt mutex to prevent exposing partially-filled structures. 7) Fix ebtables module initialization race. Register the template last in table initialization functions. Prevent table instantiation before pernet operations are available. 8) Fix a race condition in x_tables module initialization. Ensure pernet ops are fully set up before exposing the table to userspace. 9) Fix a race condition in ebtables module initialization, similar to previous patch. 10) Restore propagation of helper to expected connection, this is a fix-for-recent-fix. 11) Validate that the expectation tuple and mask netlink attributes are present when adding expectation via nfqueue, this fixes a possible null-ptr-deref. 12) Fix possible rare memleak in the SIP helper in case helper has been detached from conntrack entry, from Li Xiasong. 13) Fix refcount leak in nft_ct when creating custom expectation, also from Li Xiason. Patches 1-9 from Florian Westphal. 10) Restore propagation of helper to expected connection, this is a fix-for-recent-fix. 11) Check that tuple and mask netlink attributes are set when creating an expectation via nfqueue. * tag 'nf-26-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_ct: fix missing expect put in obj eval netfilter: nf_conntrack_sip: get helper before allocating expectation netfilter: ctnetlink: check tuple and mask in expectations created via nfqueue netfilter: nf_conntrack_expect: restore helper propagation via expectation netfilter: bridge: eb_tables: close module init race netfilter: x_tables: close dangling table module init race netfilter: ebtables: close dangling table module init race netfilter: ebtables: move to two-stage removal scheme netfilter: x_tables: add and use xtables_unregister_table_exit netfilter: x_tables: unregister the templates first netfilter: x_tables: add and use xt_unregister_table_pre_exit netfilter: x_tables: allocate hook ops while under mutex netfilter: x_tables: allow initial table replace without emitting audit log message ==================== Link: https://patch.msgid.link/20260507234509.603182-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALLBen Morris-0/+9
The SCTP_SENDALL path in sctp_sendmsg() iterates ep->asocs with list_for_each_entry_safe(), which caches the next entry in @tmp before the loop body runs. The body calls sctp_sendmsg_to_asoc(), which may drop the socket lock inside sctp_wait_for_sndbuf(). While the lock is dropped, another thread can SCTP_SOCKOPT_PEELOFF the association cached in @tmp, migrating it to a new endpoint via sctp_sock_migrate() (list_del_init() + list_add_tail() to newep->asocs), and optionally close the new socket which frees the association via kfree_rcu(). The cached @tmp can also be freed by a network ABORT for that association, processed in softirq while the lock is dropped. sctp_wait_for_sndbuf() revalidates @asoc (the current entry) on re-lock via the "sk != asoc->base.sk" and "asoc->base.dead" checks, but nothing revalidates @tmp. After a successful return, the iterator advances to the stale @tmp, yielding either a use-after-free (if the peeled socket was closed) or a list-walk onto the new endpoint's list head (type confusion of &newep->asocs as a struct sctp_association *). Both are reachable from CapEff=0; the type-confusion path gives controlled indirect call via the outqueue.sched->init_sid pointer. Fix by re-deriving @tmp from @asoc after sctp_sendmsg_to_asoc() returns. @asoc is known to still be on ep->asocs at that point: the only callers that list_del an association from ep->asocs are sctp_association_free() (which sets asoc->base.dead) and sctp_assoc_migrate() (which changes asoc->base.sk), and sctp_wait_for_sndbuf() checks both under the lock before any successful return; a tripped check propagates as err < 0 and the loop bails before the re-derive. The SCTP_ABORT path in sctp_sendmsg_check_sflags() returns 0 and the loop hits 'continue' before sctp_sendmsg_to_asoc() is ever called, so the @tmp cached by list_for_each_entry_safe() still covers the lock-held free that ba59fb027307 ("sctp: walk the list of asoc safely") was added for. Fixes: 4910280503f3 ("sctp: add support for snd flag SCTP_SENDALL process in sendmsg") Cc: stable@vger.kernel.org Signed-off-by: Ben Morris <bmorris@anthropic.com> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20260508001455.3137-1-joycathacker@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08net: shaper: Reject reparenting of existing nodesMohsin Bashir-7/+23
When an existing node-scope shaper is moved to a different parent via the group operation, the framework fails to update the leaves count on both the old and new parent shapers. Only newly created nodes (handle.id == NET_SHAPER_ID_UNSPEC) trigger the parent leaves increment at line 1039. This causes the parent's leaves counter to diverge from the actual number of children in the xarray. When the node is later deleted, pre_del_node() allocates an array sized by the stale leaves count, but the xarray iteration finds more children than expected, hitting the WARN_ON_ONCE guard and returning -EINVAL. Rather than adding reparenting support with complex leaves count bookkeeping, reject group calls that attempt to change an existing node's parent. Updates to an existing node's rate or leaves under the same parent remain permitted. We expect that for any modification of the topology user should always create new groups and let the kernel garbage collect the leaf-less nodes. Fixes: 5d5d4700e75d ("net-shapers: implement NL group operation") Signed-off-by: Mohsin Bashir <hmohsin@meta.com> Link: https://patch.msgid.link/20260506233745.111895-1-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08genetlink: free the skb on 'group >= family->n_mcgrps'Alice Ryhl-2/+6
These methods generally consume ownership of the provided skb, so even if an error path is encountered, the skb is freed. This is because the very first thing they do after some initial setup is to unconditionally consume the skb via consume_skb(skb). Any subsequent errors lead to the core netlink layer freeing the skb. However, there is one check that occurs before ownership is passed, which is the check for the group index. So if this error condition is encountered, then the skb is leaked. This error condition is generally considered a violation of the netlink API, so it's not expected to occur under normal circumstances. For the same reason, no callers check for this error condition, and no callers need to be adjusted. However, we should still follow the same ownership semantics of the rest of the function. Thus, free the skb in this codepath. Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Matthew Maurer <mmaurer@google.com> Fixes: 2a94fe48f32c ("genetlink: make multicast groups const, prevent abuse") Link: https://lore.kernel.org/r/845b36ba-7b3a-41f2-acb2-b284f253e2ca@lunn.ch Signed-off-by: Alice Ryhl <aliceryhl@google.com> Link: https://patch.msgid.link/20260506-genlmsg-return-v2-1-a63ee2a055d6@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08net: ethtool: fix NULL pointer dereference in phy_reply_sizeQuan Sun-2/+30
In phy_prepare_data(), several strings such as 'name', 'drvname', 'upstream_sfp_name', and 'downstream_sfp_name' are allocated using kstrdup(). However, these allocations were not checked for failure. If kstrdup() fails for 'name', it returns NULL while the function continues. This leads to a kernel NULL pointer dereference and panic later in phy_reply_size() when it unconditionally calls strlen() on the NULL pointer. While other strings like 'upstream_sfp_name' might be checked before access in certain code paths, failing to handle these allocations consistently can lead to incomplete data reporting or hidden bugs. Fix this by adding proper NULL checks for all kstrdup() calls in phy_prepare_data() and implement a centralized error handling path using goto labels to ensure all previously allocated resources are freed on failure. Fixes: 9dd2ad5e92b9 ("net: ethtool: phy: Convert the PHY_GET command to generic phy dump") Signed-off-by: Quan Sun <2022090917019@std.uestc.edu.cn> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260507131738.1173835-1-2022090917019@std.uestc.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08net: napi: Avoid gro timer misfiring at end of busypollDragos Tatulea-9/+12
When in irq deferral mode (defer-hard-irqs > 0), a short enough gro-flush timeout can trigger before NAPI_STATE_SCHED is cleared if the last poll in busy_poll_stop() takes too long. This can have the effect of leaving the queue stuck with interrupts disabled and no timer armed which results in a tx timeout if there is no subsequent busypoll cycle. To prevent this, defer the gro-flush timer arm after the last poll. Fixes: 7fd3253a7de6 ("net: Introduce preferred busy-polling") Co-developed-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Martin Karsten <mkarsten@uwaterloo.ca> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20260506090808.820559-2-dtatulea@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08ipv6: flowlabel: enforce per-netns limit for unprivileged callersMaoyi Xie-3/+11
fl_size, fl_ht and ip6_fl_lock in net/ipv6/ip6_flowlabel.c are file scope and shared across netns. mem_check() reads fl_size to decide whether to deny non-CAP_NET_ADMIN callers. capable() runs against init_user_ns, so an unprivileged user in any non-init userns can push fl_size past FL_MAX_SIZE - FL_MAX_SIZE / 4 and starve every other unprivileged userns on the host. Add struct netns_ipv6::flowlabel_count, bumped and decremented next to fl_size in fl_intern, ip6_fl_gc and ip6_fl_purge. The new field fills the existing 4-byte hole after ipmr_seq, so struct netns_ipv6 stays the same size on 64-bit builds. Bump FL_MAX_SIZE from 4096 to 8192. It has been 4096 since the file was added. Machines and connection counts have grown. mem_check() folds an extra per-netns ceiling into the existing non-CAP_NET_ADMIN conditional. The ceiling is half of the total budget that unprivileged callers have ever been able to use, i.e. (FL_MAX_SIZE - FL_MAX_SIZE / 4) / 2 = 3072 entries. With FL_MAX_SIZE doubled, this preserves the original per-user reach of 3K (what an unprivileged caller could already obtain before this change), while forcing an attacker to spread allocations across at least two netns to exhaust the global non-CAP_NET_ADMIN budget. CAP_NET_ADMIN against init_user_ns still bypasses both caps. The previous patch took ip6_fl_lock across mem_check and fl_intern, so the new flowlabel_count read in mem_check and the new flowlabel_count++ in fl_intern run under the same critical section. flowlabel_count is therefore plain int, like fl_size. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260506082416.2259567-3-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08ipv6: flowlabel: take ip6_fl_lock across mem_check and fl_internMaoyi Xie-13/+21
mem_check() in net/ipv6/ip6_flowlabel.c reads fl_size without holding ip6_fl_lock. fl_intern() takes the lock immediately afterwards. The two checks therefore race against concurrent fl_intern, ip6_fl_gc and ip6_fl_purge writers, which makes the mem_check budget check approximate. Move spin_lock_bh(&ip6_fl_lock) and the matching unlock from fl_intern() into its only caller ipv6_flowlabel_get(). The mem_check() call now runs under the same critical section as the fl_intern() insert, so the budget check is exact. With all writers and the read of fl_size under ip6_fl_lock, convert fl_size from atomic_t to plain int. The four sites that update or read fl_size are fl_intern (insert path), ip6_fl_gc (garbage collector, the !sched check and the per-entry decrement), ip6_fl_purge (per-netns purge), and mem_check (budget check), and all four now run under ip6_fl_lock. This is a prerequisite for adding a per-netns budget alongside fl_size. The follow-up patch adds netns_ipv6::flowlabel_count and folds it into mem_check(). Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260506082416.2259567-2-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08tcp: Fix imbalanced icsk_accept_queue count.Kuniyuki Iwashima-1/+1
When TCP socket migration happens in reqsk_timer_handler(), @sk_listener will be updated with the new listener. When we call __inet_csk_reqsk_queue_drop(), the listener must be the one stored in req->rsk_listener. The cited commit accidentally replaced oreq->rsk_listener with sk_listener, leading to imbalanced icsk_accept_queue count. Let's pass the correct listener to __inet_csk_reqsk_queue_drop(). Fixes: e8c526f2bdf1 ("tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink().") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260506035954.1563147-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08tcp: Fix potential UAF in reqsk_timer_handler().Kuniyuki Iwashima-1/+1
When TCP socket migration fails at inet_ehash_insert() in reqsk_timer_handler(), we jump to the no_ownership: label and free the new reqsk immediately with __reqsk_free(). Thus, we must stop the new reqsk's timer before jumping to the label, but the timer might be missed since the cited commit, resulting in UAF. As we are in the original reqsk's timer context, we can safely call timer_delete_sync() for the new reqsk. Let's pass false to __inet_csk_reqsk_queue_drop() to stop the new reqsk's timer. Fixes: 83fccfc3940c ("inet: fix potential deadlock in reqsk_queue_unlink()") Reported-by: Damiano Melotti <melotti@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260506035954.1563147-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08bpf: Free reuseport cBPF prog after RCU grace period.Kuniyuki Iwashima-3/+12
Eulgyu Kim reported the splat below with a repro. [0] The repro sets up a UDP reuseport group with a cBPF prog and replaces it with a new one while another thread is sending a UDP packet to the group. The reuseport prog is freed by sk_reuseport_prog_free(). bpf_prog_put() is called for "e"BPF prog to destruct through multiple stages while cBPF prog is freed immediately by bpf_release_orig_filter() and bpf_prog_free(). If a reuseport prog is detached from the setsockopt() path (reuseport_attach_prog() or reuseport_detach_prog()), sk_reuseport_prog_free() is called without waiting for RCU readers to complete, resulting in various bugs. Let's defer freeing the reuseport cBPF prog after one RCU grace period. Note "e"BPF prog is safe as is unless the fast path starts to touch fields destroyed in bpf_prog_put_deferred() and __bpf_prog_put_noref(). [0]: BUG: KASAN: vmalloc-out-of-bounds in reuseport_select_sock+0xedc/0x1220 net/core/sock_reuseport.c:596 Read of size 4 at addr ffffc9000051e004 by task slowme/10208 CPU: 6 UID: 1000 PID: 10208 Comm: slowme Not tainted 7.0.0-geb7ac95ff75e #32 PREEMPT(full) Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <IRQ> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x240 mm/kasan/report.c:482 kasan_report+0x118/0x150 mm/kasan/report.c:595 reuseport_select_sock+0xedc/0x1220 net/core/sock_reuseport.c:596 udp4_lib_lookup2+0x3bc/0x950 net/ipv4/udp.c:495 __udp4_lib_lookup+0x768/0xe20 net/ipv4/udp.c:723 __udp4_lib_lookup_skb+0x297/0x390 net/ipv4/udp.c:752 __udp4_lib_rcv+0x1312/0x2620 net/ipv4/udp.c:2752 ip_protocol_deliver_rcu+0x282/0x440 net/ipv4/ip_input.c:207 ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:241 NF_HOOK+0x30c/0x3a0 include/linux/netfilter.h:318 NF_HOOK+0x30c/0x3a0 include/linux/netfilter.h:318 __netif_receive_skb_one_core net/core/dev.c:6181 [inline] __netif_receive_skb net/core/dev.c:6294 [inline] process_backlog+0xaa4/0x1960 net/core/dev.c:6645 __napi_poll+0xae/0x340 net/core/dev.c:7709 napi_poll net/core/dev.c:7772 [inline] net_rx_action+0x5d7/0xf50 net/core/dev.c:7929 handle_softirqs+0x22b/0x870 kernel/softirq.c:622 do_softirq+0x76/0xd0 kernel/softirq.c:523 </IRQ> <TASK> __local_bh_enable_ip+0xf8/0x130 kernel/softirq.c:450 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:924 [inline] __dev_queue_xmit+0x1dd7/0x3710 net/core/dev.c:4890 neigh_output include/net/neighbour.h:556 [inline] ip_finish_output2+0xca9/0x1070 net/ipv4/ip_output.c:237 NF_HOOK_COND include/linux/netfilter.h:307 [inline] ip_output+0x29f/0x450 net/ipv4/ip_output.c:438 ip_send_skb+0x45/0xc0 net/ipv4/ip_output.c:1508 udp_send_skb+0xb04/0x1510 net/ipv4/udp.c:1195 udp_sendmsg+0x1a71/0x2350 net/ipv4/udp.c:1485 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] __sys_sendto+0x554/0x680 net/socket.c:2206 __do_sys_sendto net/socket.c:2213 [inline] __se_sys_sendto net/socket.c:2209 [inline] __x64_sys_sendto+0xde/0x100 net/socket.c:2209 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x160/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x415a2d Code: b3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f6bc31e41e8 EFLAGS: 00000212 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007f6bc31e4cdc RCX: 0000000000415a2d RDX: 0000000000000001 RSI: 00007f6bc31e421f RDI: 0000000000000003 RBP: 00007f6bc31e4240 R08: 00007f6bc31e4220 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000212 R12: 00007f6bc31e46c0 R13: ffffffffffffffb8 R14: 0000000000000000 R15: 00007ffc9b0d70b0 </TASK> Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Reported-by: Eulgyu Kim <eulgyukim@snu.ac.kr> Reported-by: Taeyang Lee <0wn@theori.io> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20260426012647.3233119-1-kuniyu@google.com
2026-05-08bpf: tcp: Fix type confusion in sol_tcp_sockopt().Kuniyuki Iwashima-1/+1
sol_tcp_sockopt() only checks if sk->sk_protocol is IPPROTO_TCP, but RAW socket can bypass it: socket(AF_INET, SOCK_RAW, IPPROTO_TCP) Let's use sk_is_tcp(). Note that initially sol_tcp_sockopt() checked sk->sk_prot->setsockopt. Fixes: 2ab42c7b871f ("bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt().") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260504210610.180150-7-kuniyu@google.com