path: root/fs
Age  Commit message  Author  Files  Lines
2025-11-08Merge tag 'v6.18rc4-SMB-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds3-9/+16
Pull smb client fixes from Steve French: - Fix change notify packet validation check - Refcount fix (e.g. rename error paths) - Fix potential UAF due to missing locks on directory lease refcount * tag 'v6.18rc4-SMB-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: validate change notify buffer before copy smb: client: fix refcount leak in smb2_set_path_attr smb: client: fix potential UAF in smb2_close_cached_fid()
2025-11-08Merge tag 'xfs-fixes-6.18-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds3-16/+76
Pull xfs fixes from Carlos Maiolino: "This contains fixes for the RT and zoned allocator, and a few fixes for atomic writes" * tag 'xfs-fixes-6.18-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: free xfs_busy_extents structure when no RT extents are queued xfs: fix zone selection in xfs_select_open_zone_mru xfs: fix a rtgroup leak when xfs_init_zone fails xfs: fix various problems in xfs_atomic_write_cow_iomap_begin xfs: fix delalloc write failures in software-provided atomic writes
2025-11-07smb: client: validate change notify buffer before copyJoshua Rogers1-2/+5
SMB2_change_notify called smb2_validate_iov() but ignored the return code, then kmemdup()ed using server provided OutputBufferOffset/Length. Check the return of smb2_validate_iov() and bail out on error. Discovered with help from the ZeroPath security tooling. Signed-off-by: Joshua Rogers <linux@joshua.hu> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Cc: stable@vger.kernel.org Fixes: e3e9463414f61 ("smb3: improve SMB3 change notification support") Signed-off-by: Steve French <stfrench@microsoft.com>
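The pattern this fix restores (check the validation result, and only then copy from a server-supplied offset and length) can be sketched in userspace C. This is an illustration only: `struct resp`, `validate_range()` and `copy_out()` are hypothetical names, not the kernel's smb2_validate_iov() or kmemdup().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a response buffer received from a server. */
struct resp {
	const uint8_t *data;
	size_t len;
};

/* Return 0 if [off, off + len) lies inside the response, -1 otherwise. */
int validate_range(const struct resp *r, uint32_t off, uint32_t len)
{
	assert(r != NULL);
	if (off > r->len)
		return -1;
	if (len > r->len - off)	/* written this way to avoid overflow */
		return -1;
	return 0;
}

/* Copy the range out only after validation succeeded; NULL on error. */
uint8_t *copy_out(const struct resp *r, uint32_t off, uint32_t len)
{
	uint8_t *out;

	if (validate_range(r, off, len) != 0)
		return NULL;	/* bail out instead of ignoring the result */
	out = malloc(len ? len : 1);
	if (out == NULL)
		return NULL;
	memcpy(out, r->data + off, len);
	return out;
}
```

The point of the sketch is that the return value of the validation step gates the copy; ignoring it would let an untrusted offset or length reach memcpy().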
2025-11-07Merge tag 'v6.18-rc4-smb-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds1-1/+23
Pull smb server fixes from Steve French: - More safely detect RDMA capable devices correctly * tag 'v6.18-rc4-smb-server-fixes' of git://git.samba.org/ksmbd: ksmbd: detect RDMA capable netdevs include IPoIB ksmbd: detect RDMA capable lower devices when bridge and vlan netdev is used
2025-11-06Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linuxLinus Torvalds1-2/+1
Pull fscrypt fix from Eric Biggers: "Fix an UBSAN warning that started occurring when the block layer started supporting logical_block_size > PAGE_SIZE" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: fix left shift underflow when inode->i_blkbits > PAGE_SHIFT
2025-11-06xfs: free xfs_busy_extents structure when no RT extents are queuedChristoph Hellwig1-1/+3
kmemleak occasionally reports leaking xfs_busy_extents structure from xfs_scrub calls after running xfs/528 (but attributed to following tests), which seems to be caused by not freeing the xfs_busy_extents structure when tr.queued is 0 and xfs_trim_rtgroup_extents breaks out of the main loop. Free the structure in this case. Fixes: a3315d11305f ("xfs: use rtgroup busy extent list for FITRIM") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05xfs: fix zone selection in xfs_select_open_zone_mruChristoph Hellwig1-1/+1
xfs_select_open_zone_mru needs to pass XFS_ZONE_ALLOC_OK to xfs_try_use_zone because we only want to tightly pack into zones of the same or a compatible temperature instead of any available zone. This got broken in commit 0301dae732a5 ("xfs: refactor hint based zone allocation"), which failed to update this particular caller when switching to an enum. xfs/638 sometimes, but not reliably, fails due to this change. Fixes: 0301dae732a5 ("xfs: refactor hint based zone allocation") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05xfs: fix a rtgroup leak when xfs_init_zone failsChristoph Hellwig1-1/+3
Drop the rtgroup reference when xfs_init_zone fails for a conventional device. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05xfs: fix various problems in xfs_atomic_write_cow_iomap_beginDarrick J. Wong1-11/+50
I think there are several things wrong with this function:

A) xfs_bmapi_write can return a much larger unwritten mapping than what the caller asked for. We convert part of that range to written, but return the entire written mapping to iomap even though that's inaccurate.

B) The arguments to xfs_reflink_convert_cow_locked are wrong -- an unwritten mapping could be *smaller* than the write range (or even the hole range). In this case, we convert too much file range to written state because we then return a smaller mapping to iomap.

C) It doesn't handle delalloc mappings. This I covered in the patch that I already sent to the list.

D) Reassigning count_fsb to handle the hole means that if the second cmap lookup attempt succeeds (due to racing with someone else) we trim the mapping more than is strictly necessary. The changing meaning of count_fsb makes this harder to notice.

E) The tracepoint is kinda wrong because @length is mutated. That makes it harder to chase the data flows through this function because you can't just grep on the pos/bytecount strings.

F) We don't actually check that the br_state = XFS_EXT_NORM assignment is accurate, i.e. that the cow fork actually contains a written mapping for the range we're interested in.

G) Somewhat inadequate documentation of why we need to xfs_trim_extent so aggressively in this function.

H) Not sure why xfs_iomap_end_fsb is used here, the vfs already clamped the write range to s_maxbytes.

Fix these issues, and then the atomic writes regressions in generic/760, generic/617, generic/091, generic/263, and generic/521 all go away for me.

Cc: stable@vger.kernel.org # v6.16
Fixes: bd1d2c21d5d249 ("xfs: add xfs_atomic_write_cow_iomap_begin()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-05xfs: fix delalloc write failures in software-provided atomic writesDarrick J. Wong1-2/+19
With the 20 Oct 2025 release of fstests, generic/521 fails for me on regular (aka non-block-atomic-writes) storage:

    QA output created by 521
    dowrite: write: Input/output error
    LOG DUMP (8553 total operations):
    1( 1 mod 256): SKIPPED (no operation)
    2( 2 mod 256): WRITE 0x7e000 thru 0x8dfff (0x10000 bytes) HOLE
    3( 3 mod 256): READ 0x69000 thru 0x79fff (0x11000 bytes)
    4( 4 mod 256): FALLOC 0x53c38 thru 0x5e853 (0xac1b bytes) INTERIOR
    5( 5 mod 256): COPY 0x55000 thru 0x59fff (0x5000 bytes) to 0x25000 thru 0x29fff
    6( 6 mod 256): WRITE 0x74000 thru 0x88fff (0x15000 bytes)
    7( 7 mod 256): ZERO 0xedb1 thru 0x11693 (0x28e3 bytes)

with a warning in dmesg from iomap about XFS trying to give it a delalloc mapping for a directio write. Fix the software atomic write iomap_begin code to convert the reservation into a written mapping. This doesn't fix the data corruption problems reported by generic/760, but it's a start.

Cc: stable@vger.kernel.org # v6.16
Fixes: bd1d2c21d5d249 ("xfs: add xfs_atomic_write_cow_iomap_begin()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-11-04fscrypt: fix left shift underflow when inode->i_blkbits > PAGE_SHIFTYongpeng Yang1-2/+1
When simulating an nvme device on qemu with both logical_block_size and physical_block_size set to 8 KiB, an error trace appears during partition table reading at boot time. The issue is caused by inode->i_blkbits being larger than PAGE_SHIFT, which leads to a left shift of -1 and triggers a UBSAN warning.

    [ 2.697306] ------------[ cut here ]------------
    [ 2.697309] UBSAN: shift-out-of-bounds in fs/crypto/inline_crypt.c:336:37
    [ 2.697311] shift exponent -1 is negative
    [ 2.697315] CPU: 3 UID: 0 PID: 274 Comm: (udev-worker) Not tainted 6.18.0-rc2+ #34 PREEMPT(voluntary)
    [ 2.697317] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [ 2.697320] Call Trace:
    [ 2.697324] <TASK>
    [ 2.697325] dump_stack_lvl+0x76/0xa0
    [ 2.697340] dump_stack+0x10/0x20
    [ 2.697342] __ubsan_handle_shift_out_of_bounds+0x1e3/0x390
    [ 2.697351] bh_get_inode_and_lblk_num.cold+0x12/0x94
    [ 2.697359] fscrypt_set_bio_crypt_ctx_bh+0x44/0x90
    [ 2.697365] submit_bh_wbc+0xb6/0x190
    [ 2.697370] block_read_full_folio+0x194/0x270
    [ 2.697371] ? __pfx_blkdev_get_block+0x10/0x10
    [ 2.697375] ? __pfx_blkdev_read_folio+0x10/0x10
    [ 2.697377] blkdev_read_folio+0x18/0x30
    [ 2.697379] filemap_read_folio+0x40/0xe0
    [ 2.697382] filemap_get_pages+0x5ef/0x7a0
    [ 2.697385] ? mmap_region+0x63/0xd0
    [ 2.697389] filemap_read+0x11d/0x520
    [ 2.697392] blkdev_read_iter+0x7c/0x180
    [ 2.697393] vfs_read+0x261/0x390
    [ 2.697397] ksys_read+0x71/0xf0
    [ 2.697398] __x64_sys_read+0x19/0x30
    [ 2.697399] x64_sys_call+0x1e88/0x26a0
    [ 2.697405] do_syscall_64+0x80/0x670
    [ 2.697410] ? __x64_sys_newfstat+0x15/0x20
    [ 2.697414] ? x64_sys_call+0x204a/0x26a0
    [ 2.697415] ? do_syscall_64+0xb8/0x670
    [ 2.697417] ? irqentry_exit_to_user_mode+0x2e/0x2a0
    [ 2.697420] ? irqentry_exit+0x43/0x50
    [ 2.697421] ? exc_page_fault+0x90/0x1b0
    [ 2.697422] entry_SYSCALL_64_after_hwframe+0x76/0x7e
    [ 2.697425] RIP: 0033:0x75054cba4a06
    [ 2.697426] Code: 5d e8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 75 19 83 e2 39 83 fa 08 75 11 e8 26 ff ff ff 66 0f 1f 44 00 00 48 8b 45 10 0f 05 <48> 8b 5d f8 c9 c3 0f 1f 40 00 f3 0f 1e fa 55 48 89 e5 48 83 ec 08
    [ 2.697427] RSP: 002b:00007fff973723a0 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
    [ 2.697430] RAX: ffffffffffffffda RBX: 00005ea9a2c02760 RCX: 000075054cba4a06
    [ 2.697432] RDX: 0000000000002000 RSI: 000075054c190000 RDI: 000000000000001b
    [ 2.697433] RBP: 00007fff973723c0 R08: 0000000000000000 R09: 0000000000000000
    [ 2.697434] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
    [ 2.697434] R13: 00005ea9a2c027c0 R14: 00005ea9a2be5608 R15: 00005ea9a2be55f0
    [ 2.697436] </TASK>
    [ 2.697436] ---[ end trace ]---

This situation can happen for block devices because when CONFIG_TRANSPARENT_HUGEPAGE is enabled, the maximum logical_block_size is 64 KiB. set_init_blocksize() then sets the block device inode->i_blkbits to 13, which is within this limit.

File I/O does not trigger this problem because for filesystems that do not support the FS_LBS feature, sb_set_blocksize() prevents sb->s_blocksize_bits from being larger than PAGE_SHIFT. During inode allocation, alloc_inode()->inode_init_always() assigns inode->i_blkbits from sb->s_blocksize_bits. Currently, only xfs_fs_type has the FS_LBS flag, and since xfs I/O paths do not reach submit_bh_wbc(), it does not hit the left-shift underflow issue.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Fixes: 47dd67532303 ("block/bdev: lift block size restrictions to 64k")
Cc: stable@vger.kernel.org
[EB: use folio_pos() and consolidate the two shifts by i_blkbits]
Link: https://lore.kernel.org/r/20251105003642.42796-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
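The "[EB: ...]" note describes forming a byte position first and shifting once by i_blkbits, so that PAGE_SHIFT - i_blkbits is never computed. A minimal userspace sketch of that idea follows; `lblk_num()` and its parameters are hypothetical and this is not the actual kernel diff.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: logical block number of the data located at
 * offset_in_page within page number page_index.
 */
uint64_t lblk_num(uint64_t page_index, uint32_t offset_in_page,
		  unsigned int page_shift, unsigned int blkbits)
{
	uint64_t byte_pos;

	assert(page_shift < 64 && blkbits < 64);
	/* Build the byte position, then shift right once by blkbits. */
	byte_pos = (page_index << page_shift) + offset_in_page;
	return byte_pos >> blkbits;
}
```

With page_shift = 12 and blkbits = 13 (the 8 KiB case from the report) every shift count stays non-negative, whereas an expression that shifts by page_shift - blkbits would shift by -1.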
2025-11-04smb: client: fix refcount leak in smb2_set_path_attrShuhao Fu1-0/+2
Fix refcount leak in `smb2_set_path_attr` when path conversion fails. Function `cifs_get_writable_path` returns `cfile` with its reference counter `cfile->count` increased on success. Function `smb2_compound_op` would decrease the reference counter for `cfile`, as stated in its comment. By calling `smb2_rename_path`, the reference counter of `cfile` would leak if `cifs_convert_path_to_utf16` fails in `smb2_set_path_attr`. Fixes: 8de9e86c67ba ("cifs: create a helper to find a writeable handle by path name") Acked-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Shuhao Fu <sfual@cse.ust.hk> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-11-04smb: client: fix potential UAF in smb2_close_cached_fid()Henrique Carvalho1-7/+9
find_or_create_cached_dir() could grab a new reference after kref_put() had seen the refcount drop to zero but before cfid_list_lock is acquired in smb2_close_cached_fid(), leading to use-after-free. Switch to kref_put_lock() so cfid_release() is called with cfid_list_lock held, closing that gap. Fixes: ebe98f1447bb ("cifs: enable caching of directories for which a lease is held") Cc: stable@vger.kernel.org Reported-by: Jay Shin <jaeshin@redhat.com> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-11-04ksmbd: detect RDMA capable netdevs include IPoIBNamjae Jeon1-0/+4
The current ksmbd_rdma_capable_netdev fails to mark certain RDMA-capable interfaces such as IPoIB as RDMA capable, because the GUID matching code was reverted due to a layer violation. This patch checks for the ARPHRD_INFINIBAND type, which safely identifies an IPoIB interface without introducing a layer violation, ensuring RDMA functionality is correctly enabled for these interfaces. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-11-04ksmbd: detect RDMA capable lower devices when bridge and vlan netdev is usedNamjae Jeon1-1/+19
If the user configures a bridge interface whose lower devices are the actual RDMA-capable NICs, ksmbd cannot detect the interface as RDMA capable. This patch detects the RDMA-capable lower devices of a bridge master or VLAN. With this change, ksmbd can accept both TCP and RDMA connections through the same bridge IP address, allowing mixed transport operation without requiring separate interfaces. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-11-04Merge tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds5-2/+24
Pull btrfs fixes from David Sterba: - fix memory leak in qgroup relation ioctl when qgroup levels are invalid - don't write back dirty metadata on filesystem with errors - properly log renamed links - properly mark prealloc extent range beyond inode size as dirty (when no-holes is not enabled) * tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: mark dirty extent range for out of bound prealloc extents btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation btrfs: ensure no dirty metadata is written back for an fs with errors
2025-11-01Merge tag 'xfs-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds3-0/+41
Pull xfs fixes from Carlos Maiolino: "Just a single bug fix (and documentation for the issue)" * tag 'xfs-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: document another racy GC case in xfs_zoned_map_extent xfs: prevent gc from picking the same zone twice
2025-10-31Merge tag '6.18-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds5-37/+71
Pull smb client fixes from Steve French: - fix potential UAF in statfs - DFS fix for expired referrals - fix minor modinfo typo - small improvement to reconnect for smbdirect * tag '6.18-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: call smbd_destroy() in the same splace as kernel_sock_shutdown()/sock_release() smb: client: handle lack of IPC in dfs_cache_refresh() smb: client: fix potential cfid UAF in smb2_query_info_compound cifs: fix typo in enable_gcm_256 module parameter
2025-10-31xfs: document another racy GC case in xfs_zoned_map_extentChristoph Hellwig1-0/+8
Besides blocks being invalidated, there is another case when the original mapping could have changed between querying the rmap for GC and calling xfs_zoned_map_extent. Document it there as it took us quite some time to figure out what is going on while developing the multiple-GC protection fix. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-10-31xfs: prevent gc from picking the same zone twiceChristoph Hellwig2-0/+33
When we are picking a zone for gc it might already be in the pipeline, which can lead to us moving the same data twice, resulting in write amplification and a very unfortunate case where we keep on garbage collecting the zone we just filled with migrated data, stopping all forward progress. Fix this by introducing a count of on-going GC operations on a zone, and skip any zone with ongoing GC when picking a new victim. Fixes: 080d01c41 ("xfs: implement zoned garbage collection") Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Co-developed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
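The idea of skipping zones that already have GC in flight can be sketched as follows. This is a simplified userspace model, not the XFS code: `struct zone`, `gc_in_flight` and `pick_victim()` are hypothetical, and the "fewest used blocks wins" policy is an assumption made only so the example has something to select on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone state. */
struct zone {
	unsigned int used;		/* blocks still holding live data */
	unsigned int gc_in_flight;	/* count of ongoing GC operations */
};

/* Return the index of the chosen victim zone, or -1 if none is eligible. */
int pick_victim(const struct zone *zones, size_t nr)
{
	int best = -1;
	size_t i;

	assert(zones != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		/* A zone already being collected must not be picked again. */
		if (zones[i].gc_in_flight > 0)
			continue;
		if (best < 0 || zones[i].used < zones[best].used)
			best = (int)i;
	}
	return best;
}
```

In this model the counter would be incremented when a GC operation on the zone starts and decremented when it completes, so a zone becomes eligible again only once its earlier GC has drained.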
2025-10-30btrfs: mark dirty extent range for out of bound prealloc extentsaustinchang1-0/+10
In btrfs_fallocate(), when the allocated range overlaps with a prealloc extent and the extent starts after i_size, the range doesn't get marked dirty in file_extent_tree. This results in persisting an incorrect disk_i_size for the inode when not using the no-holes feature.

This is reproducible since commit 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree"), then became hidden since commit 3d7db6e8bd22 ("btrfs: don't allocate file extent tree for non regular files") and then visible again after commit 8679d2687c35 ("btrfs: initialize inode::file_extent_tree after i_mode has been set"), which fixes the previous commit.

The following reproducer triggers the problem:

    $ cat test.sh
    MNT=/mnt/test
    DEV=/dev/vdb
    mkdir -p $MNT
    mkfs.btrfs -f -O ^no-holes $DEV
    mount $DEV $MNT
    touch $MNT/file1
    fallocate -n -o 1M -l 2M $MNT/file1
    umount $MNT
    mount $DEV $MNT
    len=$((1 * 1024 * 1024))
    fallocate -o 1M -l $len $MNT/file1
    du --bytes $MNT/file1
    umount $MNT
    mount $DEV $MNT
    du --bytes $MNT/file1
    umount $MNT

Running the reproducer gives the following result:

    $ ./test.sh
    (...)
    2097152 /mnt/test/file1
    1048576 /mnt/test/file1

The difference is exactly 1048576 as we assigned. Fix by adding a call to btrfs_inode_set_file_extent_range() in btrfs_fallocate_update_isize().

Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree")
Signed-off-by: austinchang <austinchang@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new nameFilipe Manana2-1/+3
If we are logging a new name make sure our inode has the runtime flag BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find new inode refs/extrefs in the subvolume tree and copy them into the log tree. We are currently doing it when adding a new link but we are missing it when renaming.

An example where this makes a new name not persisted:

1) create symlink with name foo in directory A
2) fsync directory A, which persists the symlink
3) rename the symlink from foo to bar
4) fsync directory A to persist the new symlink name

Step 4 isn't working correctly as it's not logging the new name and also leaving the old inode ref in the log tree, so after a power failure the symlink still has the old name of "foo".

This is because when we first fsync directory A we log the symlink's inode (as it's a new entry) and at btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because we are using that mode and the inode has the runtime flag BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode, during the rename through the call to btrfs_log_new_name() (calling btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search the subvolume tree for new refs/extrefs and jump directly to the 'log_extents' label.

Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode when we are about to log a new name.

A test case for fstests will follow soon.

Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relationShardul Bankar1-1/+3
When btrfs_add_qgroup_relation() is called with invalid qgroup levels (src >= dst), the function returns -EINVAL directly without freeing the preallocated qgroup_list structure passed by the caller. This causes a memory leak because the caller unconditionally sets the pointer to NULL after the call, preventing any cleanup.

The issue occurs because the level validation check happens before the mutex is acquired and before any error handling path that would free the prealloc pointer. On this early return, the cleanup code at the 'out' label (which includes kfree(prealloc)) is never reached.

In btrfs_ioctl_qgroup_assign(), the code pattern is:

    prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL);
    ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst, prealloc);
    prealloc = NULL; // Always set to NULL regardless of return value
    ...
    kfree(prealloc); // This becomes kfree(NULL), does nothing

When the level check fails, 'prealloc' is never freed by either the callee or the caller, resulting in a 64-byte memory leak per failed operation. This can be triggered repeatedly by an unprivileged user with access to a writable btrfs mount, potentially exhausting kernel memory.

Fix this by freeing prealloc before the early return, ensuring prealloc is always freed on all error paths.

Fixes: 4addc1ffd67a ("btrfs: qgroup: preallocate memory before adding a relation")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
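The ownership rule behind this fix (a callee that takes ownership of a preallocated object must release it on every return path, including early ones) can be modelled in userspace C. Everything here is hypothetical: `add_relation()`, `release_prealloc()` and the `freed_count` counter exist only for illustration, and the success path simply releases the object where real code would consume it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the preallocated list entry. */
struct qgroup_list {
	int placeholder;
};

/* Counts releases so callers can observe that nothing leaked. */
int freed_count;

void release_prealloc(struct qgroup_list *p)
{
	if (p != NULL)
		freed_count++;
	free(p);
}

/* Takes ownership of prealloc: it is released on every path. */
int add_relation(unsigned int src_level, unsigned int dst_level,
		 struct qgroup_list *prealloc)
{
	assert(prealloc != NULL);
	if (src_level >= dst_level) {
		/* Early return: release before bailing out. */
		release_prealloc(prealloc);
		return -EINVAL;
	}
	/* Sketch only: a real implementation would link prealloc in here. */
	release_prealloc(prealloc);
	return 0;
}
```

Because the caller clears its pointer unconditionally after the call, the callee is the only place where the early-return path can release the allocation.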
2025-10-30btrfs: ensure no dirty metadata is written back for an fs with errorsQu Wenruo1-0/+8
[BUG]
During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers().

It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free.

[CAUSE]
If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state.

But there are some metadata modifications before that error, and they are still in the btree inode page cache.

Since there will be no real transaction commit, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty.

And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata.

And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free.

[FIX]
Add a special handling for write_one_eb(), that if the fs is already in an error state, immediately mark the bbio as failure, instead of really submitting them.

Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, thus no more new jobs for already stopped-and-freed workqueues.

The extra discard in write_one_eb() also acts as an extra safety net. E.g. the transaction abort is triggered by some extent/free space tree corruptions, and since extent/free space tree is already corrupted some tree blocks may be allocated where they shouldn't be (overwriting existing tree blocks). In that case writing them back will further corrupt the fs.

CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-29smb: client: call smbd_destroy() in the same splace as kernel_sock_shutdown()/sock_release()Stefan Metzmacher1-6/+2
With commit b0432201a11b ("smb: client: let destroy_mr_list() keep smbdirect_mr_io memory if registered") the changes from commit 214bab448476 ("cifs: Call MID callback before destroying transport") and commit 1d2a4f57cebd ("cifs:smbd When reconnecting to server, call smbd_destroy() after all MIDs have been called") are no longer needed. And it's better to use the same logic flow, so that the chance of smbdirect related problems is smaller. Fixes: 214bab448476 ("cifs: Call MID callback before destroying transport") Fixes: 1d2a4f57cebd ("cifs:smbd When reconnecting to server, call smbd_destroy() after all MIDs have been called") Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-29smb: client: handle lack of IPC in dfs_cache_refresh()Paulo Alcantara3-29/+66
In very rare cases, DFS mounts could end up with SMB sessions without any IPC connections. These mounts are only possible when having unexpired cached DFS referrals, hence not requiring any IPC connections during the mount process. Try to establish those missing IPC connections when refreshing DFS referrals. If the server is still rejecting it, then simply ignore and leave expired cached DFS referral for any potential DFS failovers. Reported-by: Jay Shin <jaeshin@redhat.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Cc: David Howells <dhowells@redhat.com> Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-28Merge tag 'v6.18-rc3-smb-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds2-12/+43
Pull smb server fixes from Steve French: - Improve check for malformed payload - Fix free transport smbdirect potential race - Fix potential race in credit allocation during smbdirect negotiation * tag 'v6.18-rc3-smb-server-fixes' of git://git.samba.org/ksmbd: smb: server: let smb_direct_cm_handler() call ib_drain_qp() after smb_direct_disconnect_rdma_work() smb: server: call smb_direct_post_recv_credits() when the negotiation is done ksmbd: transport_ipc: validate payload size before reading handle
2025-10-28Merge tag 'nfsd-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds5-12/+35
Pull nfsd fixes from Chuck Lever: "Regression fixes: - Revert the patch that removed the cap on MAX_OPS_PER_COMPOUND - Address a kernel build issue Stable fixes: - Fix crash when a client queries new attributes on forechannel - Fix rare NFSD crash when tracing is enabled" * tag 'nfsd-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: Revert "NFSD: Remove the cap on number of operations per NFSv4 COMPOUND" nfsd: Avoid strlen conflict in nfsd4_encode_components_esc() NFSD: Fix crash in nfsd4_read_release() NFSD: Define actions for the new time_deleg FATTR4 attributes
2025-10-28smb: client: fix potential cfid UAF in smb2_query_info_compoundHenrique Carvalho1-1/+2
When smb2_query_info_compound() retries, a previously allocated cfid may have been freed in the first attempt. Because cfid wasn't reset on replay, later cleanup could act on a stale pointer, leading to a potential use-after-free.

Reinitialize cfid to NULL under the replay label.

Example trace (trimmed):

    refcount_t: underflow; use-after-free.
    WARNING: CPU: 1 PID: 11224 at ../lib/refcount.c:28 refcount_warn_saturate+0x9c/0x110
    [...]
    RIP: 0010:refcount_warn_saturate+0x9c/0x110
    [...]
    Call Trace:
    <TASK>
    smb2_query_info_compound+0x29c/0x5c0 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
    ? step_into+0x10d/0x690
    ? __legitimize_path+0x28/0x60
    smb2_queryfs+0x6a/0xf0 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
    smb311_queryfs+0x12d/0x140 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
    ? kmem_cache_alloc+0x18a/0x340
    ? getname_flags+0x46/0x1e0
    cifs_statfs+0x9f/0x2b0 [cifs f90b72658819bd21c94769b6a652029a07a7172f]
    statfs_by_dentry+0x67/0x90
    vfs_statfs+0x16/0xd0
    user_statfs+0x54/0xa0
    __do_sys_statfs+0x20/0x50
    do_syscall_64+0x58/0x80

Cc: stable@kernel.org
Fixes: 4f1fffa237692 ("cifs: commands that are retried should have replay flag set")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Acked-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-26smb: server: let smb_direct_cm_handler() call ib_drain_qp() after smb_direct_disconnect_rdma_work()Stefan Metzmacher1-3/+8
All handlers triggered by ib_drain_qp() should already see the broken connection. smb_direct_cm_handler() is called under a mutex of the rdma_cm, we should make sure ib_drain_qp() and all rdma layer logic completes and unlocks the mutex. It means free_transport() will also already see the connection as SMBDIRECT_SOCKET_DISCONNECTED, so we need to call rdma_[un]lock_handler(sc->rdma.cm_id) around ib_drain_qp(), rdma_destroy_qp(), ib_free_cq() and ib_dealloc_pd(). Otherwise we free resources while the ib_drain_qp() within smb_direct_cm_handler() is still running. We have to unlock before rdma_destroy_id() as it locks again. Fixes: 141fa9824c0f ("ksmbd: call ib_drain_qp when disconnected") Fixes: 4c564f03e23b ("smb: server: make use of common smbdirect_socket") Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-26smb: server: call smb_direct_post_recv_credits() when the negotiation is doneStefan Metzmacher1-8/+28
We now activate sc->recv_io.posted.refill_work and sc->idle.immediate_work only after a successful negotiation, before sending the negotiation response. It means the queue_work(sc->workqueue, &sc->recv_io.posted.refill_work) in put_recvmsg() of the negotiate request, is a no-op now. It also means our explicit smb_direct_post_recv_credits() will have queue_work(sc->workqueue, &sc->idle.immediate_work) as no-op. This should make sure we don't have races and post any immediate data_transfer message that tries to grant credits to the peer, before we send the negotiation response, as that will grant the initial credits to the peer. Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers") Fixes: 1cde0a74a7a8 ("smb: server: don't use delayed_work for post_recv_credits_work") Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-26ksmbd: transport_ipc: validate payload size before reading handleQianchang Zhao1-1/+7
handle_response() dereferences the payload as a 4-byte handle without verifying that the declared payload size is at least 4 bytes. A malformed or truncated message from ksmbd.mountd can lead to a 4-byte read past the declared payload size. Validate the size before dereferencing. This is a minimal fix to guard the initial handle read. Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers") Cc: stable@vger.kernel.org Reported-by: Qianchang Zhao <pioooooooooip@gmail.com> Signed-off-by: Qianchang Zhao <pioooooooooip@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
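The guard described here amounts to checking the declared payload size before reading a fixed-width field from it. A minimal userspace sketch, with the hypothetical name `read_handle()` (this is not the ksmbd function):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read a 4-byte handle from the start of a payload.
 * Returns 0 on success, -1 if the payload is too short to hold a handle.
 */
int read_handle(const void *payload, size_t payload_sz, uint32_t *handle)
{
	assert(payload != NULL && handle != NULL);
	if (payload_sz < sizeof(*handle))
		return -1;	/* truncated or malformed message */
	/* memcpy avoids alignment assumptions about the payload buffer. */
	memcpy(handle, payload, sizeof(*handle));
	return 0;
}
```

On the error path the output is left untouched, so a caller cannot mistake stale data for a valid handle.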
2025-10-26cifs: fix typo in enable_gcm_256 module parameterSteve French1-1/+1
Fix typo in description of enable_gcm_256 module parameter Suggested-by: Thomas Spear <speeddymon@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-26Merge tag 'x86_urgent_for_v6.18_rc3' of ↵Linus Torvalds1-9/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Remove dead code leftovers after a recent mitigations cleanup which fail a Clang build - Make sure a Retbleed mitigation message is printed only when necessary - Correct the last Zen1 microcode revision for which Entrysign sha256 check is needed - Fix a NULL ptr deref when mounting the resctrl fs on a system which supports assignable counters but where L3 total and local bandwidth monitoring has been disabled at boot * tag 'x86_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bugs: Remove dead code which might prevent from building x86/bugs: Qualify RETBLEED_INTEL_MSG x86/microcode: Fix Entrysign revision check for Zen1/Naples x86,fs/resctrl: Fix NULL pointer dereference with events force-disabled in mbm_event mode
2025-10-25Merge tag 'driver-core-6.18-rc3' of ↵Linus Torvalds1-5/+21
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core fixes from Danilo Krummrich: - In Device::parent(), do not make any assumptions on the device context of the parent device - Check visibility before changing ownership of a sysfs attribute group - In topology_parse_cpu_capacity(), replace an incorrect usage of PTR_ERR_OR_ZERO() with IS_ERR_OR_NULL() - In devcoredump, fix a circular locking dependency between struct devcd_entry::mutex and kernfs - Do not warn about a pending fw_devlink sync state * tag 'driver-core-6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: arch_topology: Fix incorrect error check in topology_parse_cpu_capacity() rust: device: fix device context of Device::parent() sysfs: check visibility before changing group attribute ownership devcoredump: Fix circular locking dependency with devcd->mutex. driver core: fw_devlink: Don't warn about sync_state() pending
2025-10-25Merge tag 'xfs-fixes-6.18-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds9-140/+193
Pull xfs fixes from Carlos Maiolino: "The main highlight here is a fix for a bug brought in by the removal of the attr2 mount option, where some installations might actually have 'attr2' explicitly configured in fstab, preventing the system from booting because the rootfs cannot be remounted as RW. Besides that there are a couple of fixes to the zonefs implementation, changing XFS_ONLINE_SCRUB_STATS to depend on DEBUG_FS (was select before), and some other minor changes" * tag 'xfs-fixes-6.18-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix locking in xchk_nlinks_collect_dir xfs: loudly complain about defunct mount options xfs: always warn about deprecated mount options xfs: don't set bt_nr_sectors to a negative number xfs: don't use __GFP_NOFAIL in xfs_init_fs_context xfs: cache open zone in inode->i_private xfs: avoid busy loops in GCD xfs: XFS_ONLINE_SCRUB_STATS should depend on DEBUG_FS xfs: do not tightly pack-write large files xfs: Improve CONFIG_XFS_RT Kconfig help
2025-10-24Merge tag 'v6.18-rc2-smb-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds3-151/+273
Pull smb server fixes from Steve French: "smbdirect (RDMA) fixes in order avoid potential submission queue overflows: - free transport teardown fix - credit related fixes (five server related, one client related)" * tag 'v6.18-rc2-smb-server-fixes' of git://git.samba.org/ksmbd: smb: server: let free_transport() wait for SMBDIRECT_SOCKET_DISCONNECTED smb: client: make use of smbdirect_socket.send_io.lcredits.* smb: server: make use of smbdirect_socket.send_io.lcredits.* smb: server: simplify sibling_list handling in smb_direct_flush_send_list/send_done smb: server: smb_direct_disconnect_rdma_connection() already wakes all waiters on error smb: smbdirect: introduce smbdirect_socket.send_io.lcredits.* smb: server: allocate enough space for RW WRs and ib_drain_qp()
2025-10-24Merge tag '6.18-rc2-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds9-39/+44
Pull smb client fixes from Steve French: - add missing tracepoints - smbdirect (RDMA) fix - fix potential issue with credits underflow - rename fix - improvement to calc_signature and additional cleanup patch * tag '6.18-rc2-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: #include cifsglob.h before trace.h to allow structs in tracepoints cifs: Call the calc_signature functions directly smb: client: get rid of d_drop() in cifs_do_rename() cifs: Fix TCP_Server_Info::credits to be signed cifs: Add a couple of missing smb3_rw_credits tracepoints smb: client: allocate enough space for MR WRs and ib_drain_qp()
2025-10-23smb: server: let free_transport() wait for SMBDIRECT_SOCKET_DISCONNECTEDStefan Metzmacher1-4/+3
We should wait for the rdma_cm to become SMBDIRECT_SOCKET_DISCONNECTED! At least on the client side (with similar code) wait_event_interruptible() often returns with -ERESTARTSYS instead of waiting for SMBDIRECT_SOCKET_DISCONNECTED. We should use wait_event() here too, which makes the code identical in client and server and will help when moving to common functions. Fixes: b31606097de8 ("smb: server: move smb_direct_disconnect_rdma_work() into free_transport()") Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-23Merge tag 'for-6.18-rc2-tag' of ↵Linus Torvalds5-11/+64
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - in send, fix duplicated rmdir operations when using extrefs (hardlinks), receive can fail with ENOENT - fixup of error check when reading extent root in ref-verify and damaged roots are allowed by mount option (found by smatch) - fix freeing partially initialized fs info (found by syzkaller) - fix use-after-free when printing ref_tracking status of delayed inodes * tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: ref-verify: fix IS_ERR() vs NULL check in btrfs_build_ref_tree() btrfs: fix delayed_node ref_tracker use after free btrfs: send: fix duplicated rmdir operations when using extrefs btrfs: directly free partially initialized fs_info in btrfs_check_leaked_roots()
2025-10-23cifs: #include cifsglob.h before trace.h to allow structs in tracepointsDavid Howells2-0/+2
Make cifs #include cifsglob.h in advance of #including trace.h so that the structures defined in cifsglob.h can be accessed directly by the cifs tracepoints rather than the callers having to manually pass in the bits and pieces. This should allow the tracepoints to be made more efficient to use as well as easier to read in the code. Signed-off-by: David Howells <dhowells@redhat.com> cc: Paulo Alcantara <pc@manguebit.org> cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-23cifs: Call the calc_signature functions directlyDavid Howells4-21/+9
As the SMB1 and SMB2/3 calc_signature functions are called from separate sign and verify paths, just call them directly rather than using a function pointer. The SMB3 calc_signature then jumps to the SMB2 variant if necessary. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Enzo Matsumiya <ematsumiya@suse.de> cc: Paulo Alcantara <pc@manguebit.org> cc: Shyam Prasad N <sprasad@microsoft.com> cc: Tom Talpey <tom@talpey.com> cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-23smb: client: get rid of d_drop() in cifs_do_rename()Paulo Alcantara1-4/+1
There is no need to force a lookup by unhashing the moved dentry after successfully renaming the file on server. The file metadata will be re-fetched from server, if necessary, in the next call to ->d_revalidate() anyways. Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Reviewed-by: David Howells <dhowells@redhat.com> Cc: stable@vger.kernel.org Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-22cifs: Fix TCP_Server_Info::credits to be signedDavid Howells1-1/+1
Fix TCP_Server_Info::credits to be signed, just as echo_credits and oplock_credits are. This also fixes what ought to get at least a compilation warning if not an outright error in *get_credits_field() as a pointer to the unsigned server->credits field is passed back as a pointer to a signed int. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-cifs@vger.kernel.org Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Acked-by: Pavel Shilovskiy <pshilovskiy@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
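Why signedness matters for a credit counter can be shown with a small user-space sketch. The helper names are hypothetical and this is not the cifs code. It only shows that a deficit stays visible as a negative value in a signed counter, while an unsigned counter wraps to a large positive value.

```c
#include <assert.h>

/* Hypothetical helpers: subtract a debit from a credit counter. */
static int demo_debit_signed(int credits, int debit)
{
	return credits - debit;
}

static unsigned int demo_debit_unsigned(unsigned int credits,
					unsigned int debit)
{
	/* Unsigned arithmetic wraps modulo 2^N instead of going negative. */
	return credits - debit;
}

/* Small self-check illustrating the difference. */
static void demo_selfcheck(void)
{
	/* Signed: the deficit is detectable with a "< 0" test. */
	assert(demo_debit_signed(1, 2) < 0);

	/* Unsigned: the same debit looks like a huge credit balance. */
	assert(demo_debit_unsigned(1u, 2u) > 0u);
}
```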
2025-10-22smb: client: make use of smbdirect_socket.send_io.lcredits.*Stefan Metzmacher1-25/+42
This makes the logic to prevent an overflow of the send submission queue with ib_post_send() easier, as we first get a local credit and then a remote credit before we mark ourselves as pending. For now we'll keep the logic around smbdirect_socket.send_io.pending.*, but that will likely change or be removed completely. The server will get similar logic soon, so we'll be able to share the send code in the future. Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-22smb: server: make use of smbdirect_socket.send_io.lcredits.*Stefan Metzmacher1-5/+37
This introduces logic to prevent an overflow of the send submission queue with ib_post_send(), as we first get a local credit and then a remote credit before we mark ourselves as pending. From reading the git history of the linux smbdirect implementations (in client and server) it was seen that a peer granted more credits than we requested. I guess that only happened because of bugs in our implementation which was active as client and server. I guess Windows won't do that. So the local credits make sure we only use the amount of credits we asked for. Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers") Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-22smb: server: simplify sibling_list handling in ↵Stefan Metzmacher1-22/+38
smb_direct_flush_send_list/send_done We have a list handling that is much easier to understand: 1. Before smb_direct_flush_send_list() is called all struct smbdirect_send_io messages are part of send_ctx->msg_list 2. Before smb_direct_flush_send_list() calls smb_direct_post_send() we remove the last element in send_ctx->msg_list and move all others into last->sibling_list, as only last has IB_SEND_SIGNALED and gets a completion via send_done(). 3. send_done() has an easy way to free all others in sendmsg->sibling_list (if there are any), and uses list_for_each_entry_safe() instead of a complex custom logic. This will help us to share send_done() in common code soon, as it will work fine for the client too, where last->sibling_list is currently always an empty list. Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-22smb: server: smb_direct_disconnect_rdma_connection() already wakes all ↵Stefan Metzmacher1-4/+0
waiters on error There's no need to care about pending or credit counters when we are already disconnecting. And all related wait_event conditions already check for broken connections too. This simplifies the code and makes the following changes simpler. Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-10-22smb: smbdirect: introduce smbdirect_socket.send_io.lcredits.*Stefan Metzmacher1-1/+12
This will be used to implement logic that makes sure we don't overflow the send submission queue for ib_post_send(). We will initialize the local credits with the fixed sp->send_credit_target value, which matches the reserved slots in the submission queue for ib_post_send(). We will get a local credit first and then wait for a remote credit; if we managed to get both we are allowed to post an IB_WR_SEND[_WITH_INV]. The local credit is given back to the pool when we get the local ib_post_send() completion, while remote credits are granted by the peer. From reading the git history of the linux smbdirect implementations (in client and server) it was seen that a peer granted more credits than we requested. I guess that only happened because of bugs in our implementation which was active as client and server. I guess Windows won't do that. So the local credits make sure we only use the amount of credits we asked for. The client already has some logic for this based on smbdirect_socket.send_io.pending.count, but that counts in the other direction and makes it complex to share common logic for various credits classes. That logic will be replaced soon. Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
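The two-stage credit scheme described above can be modeled with a minimal user-space sketch. All names here (struct demo_send_credits and the demo_* functions) are hypothetical and not the smbdirect_socket fields. The kernel code waits for credits, whereas this model simply fails. Returning the local credit when no remote credit is available is a simplification assumed for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two credit pools. */
struct demo_send_credits {
	int local;   /* free slots in our own send submission queue */
	int remote;  /* send credits granted by the peer */
};

/* Local credits start at the fixed target; remote ones come from the peer. */
static void demo_init(struct demo_send_credits *c, int send_credit_target)
{
	c->local = send_credit_target;
	c->remote = 0;
}

/* Take a local credit first, then a remote one; both are needed to post. */
static bool demo_try_get_send_slot(struct demo_send_credits *c)
{
	if (c->local <= 0)
		return false;       /* submission queue would overflow */
	c->local--;

	if (c->remote <= 0) {
		c->local++;         /* sketch only: hand the local credit back */
		return false;
	}
	c->remote--;
	return true;
}

/* The local send completion returns the slot to the local pool. */
static void demo_send_completed(struct demo_send_credits *c)
{
	c->local++;
}

/* The peer grants additional remote credits. */
static void demo_peer_granted(struct demo_send_credits *c, int n)
{
	assert(n >= 0);
	c->remote += n;
}
```

With a target of 2 and a peer that grants 5 credits, only two sends can be outstanding at once. The third attempt fails on the local pool even though remote credits remain, which is the overflow protection the commit message describes.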
2025-10-22smb: server: allocate enough space for RW WRs and ib_drain_qp()Stefan Metzmacher1-91/+142
Make use of rdma_rw_mr_factor() to calculate the number of rw credits and the number of pages per RDMA RW operation. We get the same numbers for iWarp connections, tested with siw.ko and irdma.ko (in iWarp mode). siw: CIFS: max_qp_rd_atom=128, max_fast_reg_page_list_len = 256 CIFS: max_sgl_rd=0, max_sge_rd=1 CIFS: responder_resources=32 max_frmr_depth=256 mr_io.type=0 CIFS: max_send_wr 384, device reporting max_cqe 3276800 max_qp_wr 32768 ksmbd: max_fast_reg_page_list_len = 256, max_sgl_rd=0, max_sge_rd=1 ksmbd: device reporting max_cqe 3276800 max_qp_wr 32768 ksmbd: Old sc->rw_io.credits: max = 9, num_pages = 256 ksmbd: New sc->rw_io.credits: max = 9, num_pages = 256, maxpages=2048 ksmbd: Info: rdma_send_wr 27 + max_send_wr 256 = 283 irdma (in iWarp mode): CIFS: max_qp_rd_atom=127, max_fast_reg_page_list_len = 262144 CIFS: max_sgl_rd=0, max_sge_rd=13 CIFS: responder_resources=32 max_frmr_depth=2048 mr_io.type=0 CIFS: max_send_wr 384, device reporting max_cqe 1048574 max_qp_wr 4063 ksmbd: max_fast_reg_page_list_len = 262144, max_sgl_rd=0, max_sge_rd=13 ksmbd: device reporting max_cqe 1048574 max_qp_wr 4063 ksmbd: Old sc->rw_io.credits: max = 9, num_pages = 256 ksmbd: New sc->rw_io.credits: max = 9, num_pages = 256, maxpages=2048 ksmbd: rdma_send_wr 27 + max_send_wr 256 = 283 This means that we get the different correct numbers for ROCE, tested with rdma_rxe.ko and irdma.ko (in RoCEv2 mode). 
rxe: CIFS: max_qp_rd_atom=128, max_fast_reg_page_list_len = 512 CIFS: max_sgl_rd=0, max_sge_rd=32 CIFS: responder_resources=32 max_frmr_depth=512 mr_io.type=0 CIFS: max_send_wr 384, device reporting max_cqe 32767 max_qp_wr 1048576 ksmbd: max_fast_reg_page_list_len = 512, max_sgl_rd=0, max_sge_rd=32 ksmbd: device reporting max_cqe 32767 max_qp_wr 1048576 ksmbd: Old sc->rw_io.credits: max = 9, num_pages = 256 ksmbd: New sc->rw_io.credits: max = 65, num_pages = 32, maxpages=2048 ksmbd: rdma_send_wr 65 + max_send_wr 256 = 321 irdma (in RoCEv2 mode): CIFS: max_qp_rd_atom=127, max_fast_reg_page_list_len = 262144, CIFS: max_sgl_rd=0, max_sge_rd=13 CIFS: responder_resources=32 max_frmr_depth=2048 mr_io.type=0 CIFS: max_send_wr 384, device reporting max_cqe 1048574 max_qp_wr 4063 ksmbd: max_fast_reg_page_list_len = 262144, max_sgl_rd=0, max_sge_rd=13 ksmbd: device reporting max_cqe 1048574 max_qp_wr 4063 ksmbd: Old sc->rw_io.credits: max = 9, num_pages = 256, ksmbd: New sc->rw_io.credits: max = 159, num_pages = 13, maxpages=2048 ksmbd: rdma_send_wr 159 + max_send_wr 256 = 415 And rely on rdma_rw_init_qp() to setup ib_mr_pool_init() for RW MRs. ib_mr_pool_destroy() will be called by rdma_rw_cleanup_mrs(). It seems the code was implemented before the rdma_rw_* layer was fully established in the kernel. While there also add additional space for ib_drain_qp(). This should make sure ib_post_send() will never fail because the submission queue is full. 
Fixes: ddbdc861e37c ("ksmbd: smbd: introduce read/write credits for RDMA read/write") Fixes: 4c564f03e23b ("smb: server: make use of common smbdirect_socket") Fixes: 177368b99243 ("smb: server: make use of common smbdirect_socket_parameters") Fixes: 95475d8886bd ("smb: server: make use smbdirect_socket.rw_io.credits") Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>