aboutsummaryrefslogtreecommitdiffstats
path: root/io_uring (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-11-07io_uring: fix regbuf vector size truncationPavel Begunkov1-2/+9
There is a report of io_estimate_bvec_size() truncating the calculated number of segments that leads to corruption issues. Check it doesn't overflow "int"s used later. Rough but simple, can be improved on top. Cc: stable@vger.kernel.org Fixes: 9ef4cbbcb4ac3 ("io_uring: add infra for importing vectored reg buffers") Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-458654612@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Günther Noack <gnoack@google.com> Tested-by: Günther Noack <gnoack@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-05io_uring: fix types for region size calulationPavel Begunkov1-1/+1
->nr_pages is int, it needs type extension before calculating the region size. Fixes: a90558b36ccee ("io_uring/memmap: helper for pinning region pages") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: style fixup] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03io_uring/zcrx: remove sync refill uapiPavel Begunkov3-78/+0
There is a better way to handle the problem IORING_REGISTER_ZCRX_REFILL solves. The uapi can also be slightly adjusted to accommodate future extensions. Remove the feature for now, it'll be reworked for the next release. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-23io_uring: fix buffer auto-commit for multishot uring_cmdMing Lei1-11/+22
Commit 620a50c92700 ("io_uring: uring_cmd: add multishot support") added multishot uring_cmd support with explicit buffer upfront commit via io_uring_mshot_cmd_post_cqe(). However, the buffer selection path in io_ring_buffer_select() was auto-committing buffers for non-pollable files, which conflicts with uring_cmd's explicit upfront commit model. This way consumes the whole selected buffer immediately, and causes failure on the following buffer selection. Fix this by checking uring_cmd to identify operations that handle buffer commit explicitly, and skip auto-commit for these operations. Cc: Caleb Sander Mateos <csander@purestorage.com> Fixes: 620a50c92700 ("io_uring: uring_cmd: add multishot support") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-23io_uring: correct __must_hold annotation in io_install_fixed_fileAlok Tiwari1-1/+1
The __must_hold annotation references &req->ctx->uring_lock, but req is not in scope in io_install_fixed_file. This change updates the annotation to reference the correct ctx->uring_lock. improving code clarity. Fixes: f110ed8498af ("io_uring: split out fixed file installation and removal") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22io_uring: Fix code indentation errorRanganath V N2-2/+2
Fix the indentation to ensure consistent code style and improve readability and to fix the errors: ERROR: code indent should use tabs where possible + return io_net_import_vec(req, kmsg, sr->buf, sr->len, ITER_SOURCE);$ ERROR: code indent should use tabs where possible +^I^I^I struct io_big_cqe *big_cqe)$ Tested by running the /scripts/checkpatch.pl Signed-off-by: Ranganath V N <vnranganath.20@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22io_uring/sqpoll: be smarter on when to update the stime usageJens Axboe1-11/+32
The current approach is a bit naive, and hence calls the time querying way too often. Only start the "doing work" timer when there's actual work to do, and then use that information to terminate (and account) the work time once done. This greatly reduces the frequency of these calls, when they cannot have changed anyway. Running a basic random reader that is setup to use SQPOLL, a profile before this change shows these as the top cycle consumers: + 32.60% iou-sqp-1074 [kernel.kallsyms] [k] thread_group_cputime_adjusted + 19.97% iou-sqp-1074 [kernel.kallsyms] [k] thread_group_cputime + 12.20% io_uring io_uring [.] submitter_uring_fn + 4.13% iou-sqp-1074 [kernel.kallsyms] [k] getrusage + 2.45% iou-sqp-1074 [kernel.kallsyms] [k] io_submit_sqes + 2.18% iou-sqp-1074 [kernel.kallsyms] [k] __pi_memset_generic + 2.09% iou-sqp-1074 [kernel.kallsyms] [k] cputime_adjust and after this change, top of profile looks as follows: + 36.23% io_uring io_uring [.] submitter_uring_fn + 23.26% iou-sqp-819 [kernel.kallsyms] [k] io_sq_thread + 10.14% iou-sqp-819 [kernel.kallsyms] [k] io_sq_tw + 6.52% iou-sqp-819 [kernel.kallsyms] [k] tctx_task_work_run + 4.82% iou-sqp-819 [kernel.kallsyms] [k] nvme_submit_cmds.part.0 + 2.91% iou-sqp-819 [kernel.kallsyms] [k] io_submit_sqes [...] 0.02% iou-sqp-819 [kernel.kallsyms] [k] cputime_adjust where it's spending the cycles on things that actually matter. Reported-by: Fengnan Chang <changfengnan@bytedance.com> Cc: stable@vger.kernel.org Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22io_uring/sqpoll: switch away from getrusage() for CPU accountingJens Axboe3-18/+23
getrusage() does a lot more than what the SQPOLL accounting needs, the latter only cares about (and uses) the stime. Rather than do a full RUSAGE_SELF summation, just query the used stime instead. Cc: stable@vger.kernel.org Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-20io_uring: fix incorrect unlikely() usage in io_waitid_prep()Alok Tiwari1-1/+1
The negation operator is incorrectly placed outside the unlikely() macro: if (!unlikely(iwa)) This inverts the compiler branch prediction hint, marking the NULL case as likely instead of unlikely. The intent is to indicate that allocation failures are rare, consistent with common kernel patterns. Moving the negation inside unlikely(): if (unlikely(!iwa)) Fixes: 2b4fc4cd43f2 ("io_uring/waitid: setup async data in the prep handler") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-15io_uring/rw: check for NULL io_br_sel when putting a bufferJens Axboe1-1/+5
Both the read and write side use kiocb_done() to finish a request, and kiocb_done() will call io_put_kbuf() in case a provided buffer was used for the request. Provided buffers are not supported for writes, hence NULL is being passed in. This normally works fine, as io_put_kbuf() won't actually use the value unless REQ_F_BUFFER_RING or REQ_F_BUFFER_SELECTED is set in the request flags. But depending on compiler (or whether or not CONFIG_CC_OPTIMIZE_FOR_SIZE is set), that may be done even though the value is never used. This will then cause a NULL pointer dereference. Make it a bit more obvious and check for a NULL io_br_sel, and don't even bother calling io_put_kbuf() for that case. Fixes: 5fda51255439 ("io_uring/kbuf: switch to storing struct io_buffer_list locally") Reported-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-15io_uring: fix unexpected placement on same size resizingPavel Begunkov1-7/+0
There might be many reasons why a user is resizing a ring, e.g. moving to huge pages or for some memory compaction using IORING_SETUP_NO_MMAP. Don't bypass resizing, the user will definitely be surprised seeing 0 while the rings weren't actually moved to a new place. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-15io_uring: protect mem region deregistrationPavel Begunkov1-0/+1
io_create_region_mmap_safe() protects publishing of a region against concurrent mmap calls, however we should also protect against it when removing a region. There is a gap io_register_mem_region() where it safely publishes a region, but then copy_to_user goes wrong and it unsafely frees the region. Cc: stable@vger.kernel.org Fixes: 087f997870a94 ("io_uring/memmap: implement mmap for regions") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-13Revert "io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()"Jens Axboe1-1/+1
This reverts commit 90bfb28d5fa8127a113a140c9791ea0b40ab156a. Kevin reports that this commit causes an issue for him with LVM snapshots, most likely because of turning off NOWAIT support while a snapshot is being created. This makes -EOPNOTSUPP bubble back through the completion handler, where io_uring read/write handling should just retry it. Reinstate the previous check removed by the referenced commit. Cc: stable@vger.kernel.org Fixes: 90bfb28d5fa8 ("io_uring/rw: drop -EOPNOTSUPP check in __io_complete_rw_common()") Reported-by: Salvatore Bonaccorso <carnil@debian.org> Reported-by: Kevin Lumik <kevin@xf.ee> Link: https://lore.kernel.org/io-uring/cceb723c-051b-4de2-9a4c-4aa82e1619ee@kernel.dk/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-08io_uring/zcrx: increment fallback loop src offsetPavel Begunkov1-0/+1
Don't forget to adjust the source offset in io_copy_page(), otherwise it'll be copying into the same location in some cases for highmem setups. Fixes: e67645bb7f3f4 ("io_uring/zcrx: prepare fallback for larger pages") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-08io_uring/zcrx: fix overshooting recv limitPavel Begunkov1-0/+4
It's reported that sometimes a zcrx request can receive more than was requested. It's caused by io_zcrx_recv_skb() adjusting desc->count for all received buffers including frag lists, but then doing recursive calls to process frag list skbs, which leads to desc->count double accounting and underflow. Reported-and-tested-by: Matthias Jasny <matthiasjasny@gmail.com> Fixes: 6699ec9a23f85 ("io_uring/zcrx: add a read limit to recvzc requests") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-07io_uring/waitid: always prune wait queue entry in io_waitid_wait()Jens Axboe1-1/+2
For a successful return, always remove our entry from the wait queue entry list. Previously this was skipped if a cancelation was in progress, but this can race with another invocation of the wait queue entry callback. Cc: stable@vger.kernel.org Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support") Reported-by: syzbot+b9e83021d9c642a33d8c@syzkaller.appspotmail.com Tested-by: syzbot+b9e83021d9c642a33d8c@syzkaller.appspotmail.com Link: https://lore.kernel.org/io-uring/68e5195e.050a0220.256323.001f.GAE@google.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-03io_uring: update liburing git URLJens Axboe1-1/+1
Change the liburing git URL to point to the git.kernel.org servers, rather than my private git.kernel.dk server. Due to continued AI scraping of cgit etc, it's becoming quite the chore to maintain a private git server. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-10-02Merge tag 'net-next-6.18' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core & protocols: - Improve drop account scalability on NUMA hosts for RAW and UDP sockets and the backlog, almost doubling the Pps capacity under DoS - Optimize the UDP RX performance under stress, reducing contention, revisiting the binary layout of the involved data structs and implementing NUMA-aware locking. This improves UDP RX performance by an additional 50%, even more under extreme conditions - Add support for PSP encryption of TCP connections; this mechanism has some similarities with IPsec and TLS, but offers superior HW offloads capabilities - Ongoing work to support Accurate ECN for TCP. AccECN allows more than one congestion notification signal per RTT and is a building block for Low Latency, Low Loss, and Scalable Throughput (L4S) - Reorganize the TCP socket binary layout for data locality, reducing the number of touched cachelines in the fastpath - Refactor skb deferral free to better scale on large multi-NUMA hosts, this improves TCP and UDP RX performances significantly on such HW - Increase the default socket memory buffer limits from 256K to 4M to better fit modern link speeds - Improve handling of setups with a large number of nexthop, making dump operating scaling linearly and avoiding unneeded synchronize_rcu() on delete - Improve bridge handling of VLAN FDB, storing a single entry per bridge instead of one entry per port; this makes the dump order of magnitude faster on large switches - Restore IP ID correctly for encapsulated packets at GSO segmentation time, allowing GRO to merge packets in more scenarios - Improve netfilter matching performance on large sets - Improve MPTCP receive path performance by leveraging recently introduced core infrastructure (skb deferral free) and adopting recent TCP autotuning changes - Allow bridges to redirect to a backup port when the bridge port is administratively down - Introduce MPTCP 'laminar' endpoint that con be used only once per connection and simplify common MPTCP setups - Add RCU safety to dst->dev, closing a lot of possible races - A significant crypto library API for SCTP, MPTCP and IPv6 SR, reducing code duplication - Supports pulling data from an skb frag into the linear area of an XDP buffer Things we sprinkled into general kernel code: - Generate netlink documentation from YAML using an integrated YAML parser Driver API: - Support using IPv6 Flow Label in Rx hash computation and RSS queue selection - Introduce API for fetching the DMA device for a given queue, allowing TCP zerocopy RX on more H/W setups - Make XDP helpers compatible with unreadable memory, allowing more easily building DevMem-enabled drivers with a unified XDP/skbs datapath - Add a new dedicated ethtool callback enabling drivers to provide the number of RX rings directly, improving efficiency and clarity in RX ring queries and RSS configuration - Introduce a burst period for the health reporter, allowing better handling of multiple errors due to the same root cause - Support for DPLL phase offset exponential moving average, controlling the average smoothing factor Device drivers: - Add a new Huawei driver for 3rd gen NIC (hinic3) - Add a new SpacemiT driver for K1 ethernet MAC - Add a generic abstraction for shared memory communication devices (dibps) - Ethernet high-speed NICs: - nVidia/Mellanox: - Use multiple per-queue doorbell, to avoid MMIO contention issues - support adjacent functions, allowing them to delegate their SR-IOV VFs to sibling PFs - support RSS for IPSec offload - support exposing raw cycle counters in PTP and mlx5 - support for disabling host PFs. - Intel (100G, ice, idpf): - ice: support for SRIOV VFs over an Active-Active link aggregate - ice: support for firmware logging via debugfs - ice: support for Earliest TxTime First (ETF) hardware offload - idpf: support basic XDP functionalities and XSk - Broadcom (bnxt): - support Hyper-V VF ID - dynamic SRIOV resource allocations for RoCE - Meta (fbnic): - support queue API, zero-copy Rx and Tx - support basic XDP functionalities - devlink health support for FW crashes and OTP mem corruptions - expand hardware stats coverage to FEC, PHY, and Pause - Wangxun: - support ethtool coalesce options - support for multiple RSS contexts - Ethernet virtual: - Macsec: - replace custom netlink attribute checks with policy-level checks - Bonding: - support aggregator selection based on port priority - Microsoft vNIC: - use page pool fragments for RX buffers instead of full pages to improve memory efficiency - Ethernet NICs consumer, and embedded: - Qualcomm: support Ethernet function for IPQ9574 SoC - Airoha: implement wlan offloading via NPU - Freescale - enetc: add NETC timer PTP driver and add PTP support - fec: enable the Jumbo frame support for i.MX8QM - Renesas (R-Car S4): - support HW offloading for layer 2 switching - support for RZ/{T2H, N2H} SoCs - Cadence (macb): support TAPRIO traffic scheduling - TI: - support for Gigabit ICSS ethernet SoC (icssm-prueth) - Synopsys (stmmac): a lot of cleanups - Ethernet PHYs: - Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS driver - Support bcm63268 GPHY power control - Support for Micrel lan8842 PHY and PTP - Support for Aquantia AQR412 and AQR115 - CAN: - a large CAN-XL preparation work - reorganize raw_sock and uniqframe struct to minimize memory usage - rcar_canfd: update the CAN-FD handling - WiFi: - extended Neighbor Awareness Networking (NAN) support - S1G channel representation cleanup - improve S1G support - WiFi drivers: - Intel (iwlwifi): - major refactor and cleanup - Broadcom (brcm80211): - support for AP isolation - RealTek (rtw88/89) rtw88/89: - preparation work for RTL8922DE support - MediaTek (mt76): - HW restart improvements - MLO support - Qualcomm/Atheros (ath10k): - GTK rekey fixes - Bluetooth drivers: - btusb: support for several new IDs for MT7925 - btintel: support for BlazarIW core - btintel_pcie: support for _suspend() / _resume() - btintel_pcie: support for Scorpious, Panther Lake-H484 IDs" * tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits) net: stmmac: Add support for Allwinner A523 GMAC200 dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible Revert "Documentation: net: add flow control guide and document ethtool API" octeontx2-pf: fix bitmap leak octeontx2-vf: fix bitmap leak net/mlx5e: Use extack in set rxfh callback net/mlx5e: Introduce mlx5e_rss_params for RSS configuration net/mlx5e: Introduce mlx5e_rss_init_params net/mlx5e: Remove unused mdev param from RSS indir init net/mlx5: Improve QoS error messages with actual depth values net/mlx5e: Prevent entering switchdev mode with inconsistent netns net/mlx5: HWS, Generalize complex matchers net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs selftests/net: add tcp_port_share to .gitignore Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set" net: add NUMA awareness to skb_attempt_defer_free() net: use llist for sd->defer_list net: make softnet_data.defer_count an atomic selftests: drv-net: psp: add tests for destroying devices selftests: drv-net: psp: add test for auto-adjusting TCP MSS ...
2025-10-02Merge tag 'for-6.18/io_uring-20250929' of ↵Linus Torvalds25-406/+839
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Store ring provided buffers locally for the users, rather than stuff them into struct io_kiocb. These types of buffers must always be fully consumed or recycled in the current context, and leaving them in struct io_kiocb is hence not a good ideas as that struct has a vastly different life time. Basically just an architecture cleanup that can help prevent issues with ring provided buffers in the future. - Support for mixed CQE sizes in the same ring. Before this change, a CQ ring either used the default 16b CQEs, or it was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where a few 32b CQEs were needed, this caused everything else to use big CQEs. This is wasteful both in terms of memory usage, but also memory bandwidth for the posted CQEs. With IORING_SETUP_CQE_MIXED, applications may use request types that post both normal 16b and big 32b CQEs on the same ring. - Add helpers for async data management, to make it harder for opcode handlers to mess it up. - Add support for multishot for uring_cmd, which ublk can use. This helps improve efficiency, by providing a persistent request type that can trigger multiple CQEs. - Add initial support for ring feature querying. We had basic support for probe operations, but the API isn't great. Rather than expand that, add support for QUERY which is easily expandable and can cover a lot more cases than the existing probe support. This will help applications get a better idea of what operations are supported on a given host. - zcrx improvements from Pavel: - Improve refill entry alignment for better caching - Various cleanups, especially around deduplicating normal memory vs dmabuf setup. - Generalisation of the niov size (Patch 12). It's still hard coded to PAGE_SIZE on init, but will let the user to specify the rx buffer length on setup. - Syscall / synchronous bufer return. It'll be used as a slow fallback path for returning buffers when the refill queue is full. Useful for tolerating slight queue size misconfiguration or with inconsistent load. - Accounting more memory to cgroups. - Additional independent cleanups that will also be useful for mutli-area support. - Various fixes and cleanups * tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring/cmd: drop unused res2 param from io_uring_cmd_done() io_uring: fix nvme's 32b cqes on mixed cq io_uring/query: cap number of queries io_uring/query: prevent infinite loops io_uring/zcrx: account niov arrays to cgroup io_uring/zcrx: allow synchronous buffer return io_uring/zcrx: introduce io_parse_rqe() io_uring/zcrx: don't adjust free cache space io_uring/zcrx: use guards for the refill lock io_uring/zcrx: reduce netmem scope in refill io_uring/zcrx: protect netdev with pp_lock io_uring/zcrx: rename dma lock io_uring/zcrx: make niov size variable io_uring/zcrx: set sgt for umem area io_uring/zcrx: remove dmabuf_offset io_uring/zcrx: deduplicate area mapping io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback() io_uring/zcrx: check all niovs filled with dma addresses io_uring/zcrx: move area reg checks into io_import_area io_uring/zcrx: don't pass slot to io_zcrx_create_area ...
2025-09-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski8-37/+15
Cross-merge networking fixes after downstream PR (net-6.17-rc8). Conflicts: drivers/net/can/spi/hi311x.c 6b6968084721 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled") 27ce71e1ce81 ("net: WQ_PERCPU added to alloc_workqueue users") https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org net/smc/smc_loopback.c drivers/dibs/dibs_loopback.c a35c04de2565 ("net/smc: fix warning in smc_rx_splice() when calling get_page()") cc21191b584c ("dibs: Move data path to dibs layer") https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org Adjacent changes: drivers/net/dsa/lantiq/lantiq_gswip.c c0054b25e2f1 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()") 7a1eaef0a791 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-21io_uring/zcrx: remove nth_page() usage within folioDavid Hildenbrand1-2/+2
Within a folio/compound page, nth_page() is no longer required. Given that we call folio_test_partial_kmap()+kmap_local_page(), the code would already be problematic if the pages would span multiple folios. So let's just assume that all src pages belong to a single folio/compound page and can be iterated ordinarily. The dst page is currently always a single page, so we're not actually iterating anything. Link: https://lkml.kernel.org/r/20250901150359.867252-21-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-20io_uring: fix nvme's 32b cqes on mixed cqKeith Busch2-5/+8
The nvme uring_cmd only uses 32b CQEs. If the ring uses a mixed CQ, then we need to make sure we flag the completion as a 32b CQE. On the other hand, if nvme uring_cmd was using a dedicated 32b CQE, the posting was missing the extra memcpy because it only applied to bit CQEs on a mixed CQ. Fixes: e26dca67fde1943 ("io_uring: add support for IORING_SETUP_CQE_MIXED") Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19Merge tag 'io_uring-6.17-20250919' of git://git.kernel.dk/linuxLinus Torvalds8-37/+15
Pull io_uring fixes from Jens Axboe: - Fix for a regression introduced in the io-wq worker creation logic. - Remove the allocation cache for the msg_ring io_kiocb allocations. I have a suspicion that there's a bug there, and since we just fixed one in that area, let's just yank the use of that cache entirely. It's not that important, and it kills some code. - Treat a closed ring like task exiting in that any requests that trigger post that condition should just get canceled. Doesn't fix any real issues, outside of having tasks being able to rely on that guarantee. - Fix for a bug in the network zero-copy notification mechanism, where a comparison for matching tctx/ctx for notifications was buggy in that it didn't correctly compare with the previous notification. * tag 'io_uring-6.17-20250919' of git://git.kernel.dk/linux: io_uring: fix incorrect io_kiocb reference in io_link_skb io_uring/msg_ring: kill alloc_cache for io_kiocb allocations io_uring: include dying ring in task_work "should cancel" state io_uring/io-wq: fix `max_workers` breakage and `nr_workers` underflow
2025-09-19io_uring/query: cap number of queriesPavel Begunkov1-1/+5
If a query chain forms a cycle, it'll be looping in the kernel until the process is killed. It might be fine as any such mistake can be easily uncovered during testing, but it's still nicer to let it break out of the syscall if it executed too many queries. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19io_uring/query: prevent infinite loopsPavel Begunkov1-0/+4
If the query chain forms a cycle, the interface will loop indefinitely. Make sure it handles fatal signals, so the user can kill the process and hence break out of the infinite loop. Fixes: c265ae75f900 ("io_uring: introduce io_uring querying") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19io_uring: fix incorrect io_kiocb reference in io_link_skbYang Xiuwei1-1/+1
In io_link_skb function, there is a bug where prev_notif is incorrectly assigned using 'nd' instead of 'prev_nd'. This causes the context validation check to compare the current notification with itself instead of comparing it with the previous notification. Fix by using the correct prev_nd parameter when obtaining prev_notif. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 6fe4220912d19 ("io_uring/notif: implement notification stacking") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-18io_uring/msg_ring: kill alloc_cache for io_kiocb allocationsJens Axboe2-26/+2
A recent commit: fc582cd26e88 ("io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU") fixed an issue with not deferring freeing of io_kiocb structs that msg_ring allocates to after the current RCU grace period. But this only covers requests that don't end up in the allocation cache. If a request goes into the alloc cache, it can get reused before it is sane to do so. A recent syzbot report would seem to indicate that there's something there, however it may very well just be because of the KASAN poisoning that the alloc_cache handles manually. Rather than attempt to make the alloc_cache sane for that use case, just drop the usage of the alloc_cache for msg_ring request payload data. Fixes: 50cf5f3842af ("io_uring/msg_ring: add an alloc cache for io_kiocb entries") Link: https://lore.kernel.org/io-uring/68cc2687.050a0220.139b6.0005.GAE@google.com/ Reported-by: syzbot+baa2e0f4e02df602583e@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-18io_uring: include dying ring in task_work "should cancel" stateJens Axboe5-7/+9
When running task_work for an exiting task, rather than perform the issue retry attempt, the task_work is canceled. However, this isn't done for a ring that has been closed. This can lead to requests being successfully completed post the ring being closed, which is somewhat confusing and surprising to an application. Rather than just check the task exit state, also include the ring ref state in deciding whether or not to terminate a given request when run from task_work. Cc: stable@vger.kernel.org # 6.1+ Link: https://github.com/axboe/liburing/discussions/1459 Reported-by: Benedek Thaler <thaler@thaler.hu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: account niov arrays to cgroupPavel Begunkov1-3/+3
net_iov / freelist / etc. arrays can be quite long, make sure they're accounted. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: allow synchronous buffer returnPavel Begunkov3-0/+78
Returning buffers via a ring is performant and convenient, but it becomes a problem when/if the user misconfigured the ring size and it becomes full. Add a synchronous way to return buffers back to the page pool via a new register opcode. It's supposed to be a reliable slow path for refilling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: introduce io_parse_rqe()Pavel Begunkov1-13/+23
Add a helper for verifying a rqe and extracting a niov out of it. It'll be reused in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: don't adjust free cache spacePavel Begunkov1-1/+1
The cache should be empty when io_pp_zc_alloc_netmems() is called, that's promised by page pool and further checked, so there is no need to recalculate the available space in io_zcrx_ring_refill(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: use guards for the refill lockPavel Begunkov1-5/+2
Use guards for rq_lock in io_zcrx_ring_refill(), makes it a tad simpler. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: reduce netmem scope in refillPavel Begunkov1-1/+1
Reduce the scope of a local var netmem in io_zcrx_ring_refill. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: protect netdev with pp_lockPavel Begunkov2-13/+11
Remove ifq->lock and reuse pp_lock to protect the netdev pointer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: rename dma lockPavel Begunkov2-5/+10
In preparation for reusing the lock for other purposes, rename it to "pp_lock". As before, it can be taken deeper inside the networking stack by page pool, and so the syscall io_uring must avoid holding it while doing queue reconfiguration or anything that can result in immediate pp init/destruction. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: make niov size variablePavel Begunkov2-10/+21
Instead of using PAGE_SIZE for the niov size add a niov_shift field to ifq, and patch up all important places. Copy fallback still assumes PAGE_SIZE, so it'll be wasting some memory for now. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: set sgt for umem areaPavel Begunkov2-9/+7
Set struct io_zcrx_mem::sgt for umem areas as well to simplify looking up the current sg table. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: remove dmabuf_offsetPavel Begunkov2-12/+2
It was removed from uapi, so now it's always 0 and can be removed together with offset handling in io_populate_area_dma(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: deduplicate area mappingPavel Begunkov1-23/+14
With a common type for storing dma addresses and io_populate_area_dma(), type-specific area mapping helpers are trivial, so open code them and deduplicate the call to io_populate_area_dma(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()Pavel Begunkov1-6/+6
io_zcrx_copy_chunk() doesn't and shouldn't care from which area the buffer is allocated, don't try to resolve the area in it but pass the ifq to io_zcrx_alloc_fallback() and let it handle it. Also rename it for more clarity. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: check all niovs filled with dma addressesPavel Begunkov1-0/+3
Add a warning if io_populate_area_dma() can't fill in all net_iovs, it should never happen. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: move area reg checks into io_import_areaPavel Begunkov1-9/+9
io_import_area() is responsible for importing memory and parsing io_uring_zcrx_area_reg, so move all area reg structure checks into the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: don't pass slot to io_zcrx_create_areaPavel Begunkov1-4/+14
Don't pass a pointer to a pointer where an area should be stored to io_zcrx_create_area(), and let it handle finding the right place for a new area. It's more straightforward and will be needed to support multiple areas. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: remove extra io_zcrx_drop_netdevPavel Begunkov1-1/+0
io_close_queue() already detaches the netdev, don't unnecessary call io_zcrx_drop_netdev() right after. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: use page_pool_unref_and_test()Pavel Begunkov1-1/+1
page_pool_unref_and_test() tries to better follow usuall refcount semantics, use it instead of page_pool_unref_netmem(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: replace memchar_inv with is_zeroPavel Begunkov1-1/+1
memchr_inv() is more ambiguous than mem_is_zero(), so use the latter for zero checks. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16io_uring/zcrx: improve rqe cache alignmentPavel Begunkov1-4/+5
Refill queue entries are 16B structures, but because of the ring header placement, they're 8B aligned but not naturally / 16B aligned, which means some of them span across 2 cache lines. Push rqes to a new cache line. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-15io_uring/zcrx: fix ifq->if_rxq is -1, get dma_dev is NULLFeng Zhou1-1/+1
ifq->if_rxq has not been assigned, is -1, the correct value is in reg.if_rxq. Fixes: 59b8b32ac8d469958936fcea781c7f58e3d64742 ("io_uring/zcrx: add support for custom DMA devices") Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://patch.msgid.link/20250912140133.97741-1-zhoufeng.zf@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>