aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-11-04Merge tag 'for-6.18-rc4-tag' of ↵Linus Torvalds5-2/+24
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix memory leak in qgroup relation ioctl when qgroup levels are invalid - don't write back dirty metadata on filesystem with errors - properly log renamed links - properly mark prealloc extent range beyond inode size as dirty (when no-noles is not enabled) * tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: mark dirty extent range for out of bound prealloc extents btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation btrfs: ensure no dirty metadata is written back for an fs with errors
2025-10-30btrfs: mark dirty extent range for out of bound prealloc extentsaustinchang1-0/+10
In btrfs_fallocate(), when the allocated range overlaps with a prealloc extent and the extent starts after i_size, the range doesn't get marked dirty in file_extent_tree. This results in persisting an incorrect disk_i_size for the inode when not using the no-holes feature. This is reproducible since commit 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree"), then became hidden since commit 3d7db6e8bd22 ("btrfs: don't allocate file extent tree for non regular files") and then visible again after commit 8679d2687c35 ("btrfs: initialize inode::file_extent_tree after i_mode has been set"), which fixes the previous commit. The following reproducer triggers the problem: $ cat test.sh MNT=/mnt/test DEV=/dev/vdb mkdir -p $MNT mkfs.btrfs -f -O ^no-holes $DEV mount $DEV $MNT touch $MNT/file1 fallocate -n -o 1M -l 2M $MNT/file1 umount $MNT mount $DEV $MNT len=$((1 * 1024 * 1024)) fallocate -o 1M -l $len $MNT/file1 du --bytes $MNT/file1 umount $MNT mount $DEV $MNT du --bytes $MNT/file1 umount $MNT Running the reproducer gives the following result: $ ./test.sh (...) 2097152 /mnt/test/file1 1048576 /mnt/test/file1 The difference is exactly 1048576 as we assigned. Fix by adding a call to btrfs_inode_set_file_extent_range() in btrfs_fallocate_update_isize(). Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree") Signed-off-by: austinchang <austinchang@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new nameFilipe Manana2-1/+3
If we are logging a new name make sure our inode has the runtime flag BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find new inode refs/extrefs in the subvolume tree and copy them into the log tree. We are currently doing it when adding a new link but we are missing it when renaming. An example where this makes a new name not persisted: 1) create symlink with name foo in directory A 2) fsync directory A, which persists the symlink 3) rename the symlink from foo to bar 4) fsync directory A to persist the new symlink name Step 4 isn't working correctly as it's not logging the new name and also leaving the old inode ref in the log tree, so after a power failure the symlink still has the old name of "foo". This is because when we first fsync directoy A we log the symlink's inode (as it's a new entry) and at btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because we are using that mode and the inode has the runtime flag BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode, during the rename through the call to btrfs_log_new_name() (calling btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search the subvolume tree for new refs/extrefs and jump directory to the 'log_extents' label. Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode when we are about to log a new name. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com> Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relationShardul Bankar1-1/+3
When btrfs_add_qgroup_relation() is called with invalid qgroup levels (src >= dst), the function returns -EINVAL directly without freeing the preallocated qgroup_list structure passed by the caller. This causes a memory leak because the caller unconditionally sets the pointer to NULL after the call, preventing any cleanup. The issue occurs because the level validation check happens before the mutex is acquired and before any error handling path that would free the prealloc pointer. On this early return, the cleanup code at the 'out' label (which includes kfree(prealloc)) is never reached. In btrfs_ioctl_qgroup_assign(), the code pattern is: prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL); ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst, prealloc); prealloc = NULL; // Always set to NULL regardless of return value ... kfree(prealloc); // This becomes kfree(NULL), does nothing When the level check fails, 'prealloc' is never freed by either the callee or the caller, resulting in a 64-byte memory leak per failed operation. This can be triggered repeatedly by an unprivileged user with access to a writable btrfs mount, potentially exhausting kernel memory. Fix this by freeing prealloc before the early return, ensuring prealloc is always freed on all error paths. Fixes: 4addc1ffd67a ("btrfs: qgroup: preallocate memory before adding a relation") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-30btrfs: ensure no dirty metadata is written back for an fs with errorsQu Wenruo1-0/+8
[BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers(). It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free. [CAUSE] If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state. But there are some metadata modifications before that error, and they are still in the btree inode page cache. Since there will be no real transaction commit, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty. And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata. And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free. [FIX] Add a special handling for write_one_eb(), that if the fs is already in an error state, immediately mark the bbio as failure, instead of really submitting them. Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, thus no more new jobs for already stopped-and-freed workqueues. The extra discard in write_one_eb() also acts as an extra safenet. E.g. the transaction abort is triggered by some extent/free space tree corruptions, and since extent/free space tree is already corrupted some tree blocks may be allocated where they shouldn't be (overwriting existing tree blocks). In that case writing them back will further corrupting the fs. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-23Merge tag 'for-6.18-rc2-tag' of ↵Linus Torvalds5-11/+64
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - in send, fix duplicated rmdir operations when using extrefs (hardlinks), receive can fail with ENOENT - fixup of error check when reading extent root in ref-verify and damaged roots are allowed by mount option (found by smatch) - fix freeing partially initialized fs info (found by syzkaller) - fix use-after-free when printing ref_tracking status of delayed inodes * tag 'for-6.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: ref-verify: fix IS_ERR() vs NULL check in btrfs_build_ref_tree() btrfs: fix delayed_node ref_tracker use after free btrfs: send: fix duplicated rmdir operations when using extrefs btrfs: directly free partially initialized fs_info in btrfs_check_leaked_roots()
2025-10-22btrfs: ref-verify: fix IS_ERR() vs NULL check in btrfs_build_ref_tree()Amit Dhingra1-1/+1
btrfs_extent_root()/btrfs_global_root() does not return error pointers, it returns NULL on error. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/all/aNJfvxj0anEnk9Dm@stanley.mountain/ Fixes : ed4e6b5d644c ("btrfs: ref-verify: handle damaged extent root tree") CC: stable@vger.kernel.org # 6.17+ Signed-off-by: Amit Dhingra <mechanicalamit@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-22btrfs: fix delayed_node ref_tracker use after freeLeo Martins2-1/+8
Move the print before releasing the delayed node. In my initial testing there was a bug that was causing delayed_nodes to not get freed which is why I put the print after the release. This obviously neglects the case where the delayed node is properly freed. Add condition to make sure we only print if we have more than one reference to the delayed_node to prevent printing when we only have the reference taken in btrfs_kill_all_delayed_nodes(). Fixes: b767a28d6154 ("btrfs: print leaked references in kill_all_delayed_nodes()") Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-17btrfs: send: fix duplicated rmdir operations when using extrefsTing-Chang Hou1-8/+48
Commit 29d6d30f5c8a ("Btrfs: send, don't send rmdir for same target multiple times") has fixed an issue that a send stream contained a rmdir operation for the same directory multiple times. After that fix we keep track of the last directory for which we sent a rmdir operation and compare with it before sending a rmdir for the parent inode of a deleted hardlink we are processing. But there is still a corner case that in between rmdir dir operations for the same inode we find deleted hardlinks for other parent inodes, so tracking just the last inode for which we sent a rmdir operation is not enough. Hardlinks of a file in the same directory are stored in the same INODE_REF item, but if the number of hardlinks is too large and can not fit in a leaf, we use INODE_EXTREF items to store them. The key of an INODE_EXTREF item is (inode_id, INODE_EXTREF, hash[name, parent ino]), so between two hardlinks for the same parent directory, we can find others for other parent directories. For example for the reproducer below we get the following (from a btrfs inspect-internal dump-tree output): item 0 key (259 INODE_EXTREF 2309449) itemoff 16257 itemsize 26 index 6925 parent 257 namelen 8 name: foo.6923 item 1 key (259 INODE_EXTREF 2311350) itemoff 16231 itemsize 26 index 6588 parent 258 namelen 8 name: foo.6587 item 2 key (259 INODE_EXTREF 2457395) itemoff 16205 itemsize 26 index 6611 parent 257 namelen 8 name: foo.6609 (...) So tracking the last directory's inode number does not work in this case since we process a link for parent inode 257, then for 258 and then back again for 257, and that second time we process a deleted link for 257 we think we have not yet sent a rmdir operation. Fix this by using a rbtree to keep track of all the directories for which we have already sent rmdir operations, and add those directories to the 'check_dirs' ref list in process_recorded_refs() only if the directory is not yet in the rbtree, otherwise skip it since it means we have already sent a rmdir operation for that directory. The following test script reproduces the problem: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/a $MNT/b echo 123 > $MNT/a/foo for ((i = 1; i <= 1000; i++)); do ln $MNT/a/foo $MNT/a/foo.$i ln $MNT/a/foo $MNT/b/foo.$i done btrfs subvolume snapshot -r $MNT $MNT/snap1 btrfs send $MNT/snap1 -f /tmp/base.send rm -r $MNT/a $MNT/b btrfs subvolume snapshot -r $MNT $MNT/snap2 btrfs send -p $MNT/snap1 $MNT/snap2 -f /tmp/incremental.send umount $MNT mkfs.btrfs -f $DEV mount $DEV $MNT btrfs receive $MNT -f /tmp/base.send btrfs receive $MNT -f /tmp/incremental.send rm -f /tmp/base.send /tmp/incremental.send umount $MNT When running it, it fails like this: $ ./test.sh (...) At subvol snap1 At snapshot snap2 ERROR: rmdir o257-9-0 failed: No such file or directory CC: <stable@vger.kernel.org> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Ting-Chang Hou <tchou@synology.com> [ Updated changelog ] Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-17btrfs: directly free partially initialized fs_info in btrfs_check_leaked_roots()Dewei Meng1-1/+7
If fs_info->super_copy or fs_info->super_for_commit allocated failed in btrfs_get_tree_subvol(), then no need to call btrfs_free_fs_info(). Otherwise btrfs_check_leaked_roots() would access NULL pointer because fs_info->allocated_roots had not been initialised. syzkaller reported the following information: ------------[ cut here ]------------ BUG: unable to handle page fault for address: fffffffffffffbb0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 64c9067 P4D 64c9067 PUD 64cb067 PMD 0 Oops: Oops: 0000 [#1] SMP KASAN PTI CPU: 0 UID: 0 PID: 1402 Comm: syz.1.35 Not tainted 6.15.8 #4 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...) RIP: 0010:arch_atomic_read arch/x86/include/asm/atomic.h:23 [inline] RIP: 0010:raw_atomic_read include/linux/atomic/atomic-arch-fallback.h:457 [inline] RIP: 0010:atomic_read include/linux/atomic/atomic-instrumented.h:33 [inline] RIP: 0010:refcount_read include/linux/refcount.h:170 [inline] RIP: 0010:btrfs_check_leaked_roots+0x18f/0x2c0 fs/btrfs/disk-io.c:1230 [...] Call Trace: <TASK> btrfs_free_fs_info+0x310/0x410 fs/btrfs/disk-io.c:1280 btrfs_get_tree_subvol+0x592/0x6b0 fs/btrfs/super.c:2029 btrfs_get_tree+0x63/0x80 fs/btrfs/super.c:2097 vfs_get_tree+0x98/0x320 fs/super.c:1759 do_new_mount+0x357/0x660 fs/namespace.c:3899 path_mount+0x716/0x19c0 fs/namespace.c:4226 do_mount fs/namespace.c:4239 [inline] __do_sys_mount fs/namespace.c:4450 [inline] __se_sys_mount fs/namespace.c:4427 [inline] __x64_sys_mount+0x28c/0x310 fs/namespace.c:4427 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x92/0x180 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f032eaffa8d [...] Fixes: 3bb17a25bcb0 ("btrfs: add get_tree callback for new mount API") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Daniel Vacek <neelx@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dewei Meng <mengdewei@cqsoftware.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-16Merge tag 'for-6.18-rc1-tag' of ↵Linus Torvalds9-22/+25
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - in tree-checker fix extref bounds check - reorder send context structure to avoid -Wflex-array-member-not-at-end warning - fix extent readahead length for compressed extents - fix memory leaks on error paths (qgroup assign ioctl, zone loading with raid stripe tree enabled) - fix how device specific mount options are applied, in particular the 'ssd' option will be set unexpectedly - fix tracking of relocation state when tasks are running and cancellation is attempted - adjust assertion condition for folios allocated for scrub - remove incorrect assertion checking for block group when populating free space tree * tag 'for-6.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: send: fix -Wflex-array-member-not-at-end warning in struct send_ctx btrfs: tree-checker: fix bounds check in check_inode_extref() btrfs: fix memory leaks when rejecting a non SINGLE data profile without an RST btrfs: fix incorrect readahead expansion length btrfs: do not assert we found block group item when creating free space tree btrfs: do not use folio_test_partial_kmap() in ASSERT()s btrfs: only set the device specific options after devices are opened btrfs: fix memory leak on duplicated memory in the qgroup assign ioctl btrfs: fix clearing of BTRFS_FS_RELOC_RUNNING if relocation already running
2025-10-13btrfs: send: fix -Wflex-array-member-not-at-end warning in struct send_ctxGustavo A. R. Silva1-1/+3
The warning -Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Fix the following warning: fs/btrfs/send.c:181:24: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] and move the declaration of send_ctx::cur_inode_path to the end. Notice that struct fs_path contains a flexible array member inline_buf, but also a padding array and a limit calculated for the usable space of inline_buf (FS_PATH_INLINE_SIZE). It is not the pattern where flexible array is in the middle of a structure and could potentially overwrite other members. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13btrfs: tree-checker: fix bounds check in check_inode_extref()Dan Carpenter1-1/+1
The parentheses for the unlikely() annotation were put in the wrong place so it means that the condition is basically never true and the bounds checking is skipped. Fixes: aab9458b9f00 ("btrfs: tree-checker: add inode extref checks") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13btrfs: fix memory leaks when rejecting a non SINGLE data profile without an RSTMiquel Sabaté Solà1-1/+1
At the end of btrfs_load_block_group_zone_info() the first thing we do is to ensure that if the mapping type is not a SINGLE one and there is no RAID stripe tree, then we return early with an error. Doing that, though, prevents the code from running the last calls from this function which are about freeing memory allocated during its run. Hence, in this case, instead of returning early, we set the ret value and fall through the rest of the cleanup code. Fixes: 5906333cc4af ("btrfs: zoned: don't skip block group profile checks on conventional zones") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13btrfs: fix incorrect readahead expansion lengthBoris Burkov1-1/+1
The intent of btrfs_readahead_expand() was to expand to the length of the current compressed extent being read. However, "ram_bytes" is *not* that, in the case where a single physical compressed extent is used for multiple file extents. Consider this case with a large compressed extent C and then later two non-compressed extents N1 and N2 written over C, leaving C1 and C2 pointing to offset/len pairs of C: [ C ] [ N1 ][ C1 ][ N2 ][ C2 ] In such a case, ram_bytes for both C1 and C2 is the full uncompressed length of C. So starting readahead in C1 will expand the readahead past the end of C1, past N2, and into C2. This will then expand readahead again, to C2_start + ram_bytes, way past EOF. First of all, this is totally undesirable, we don't want to read the whole file in arbitrary chunks of the large underlying extent if it happens to exist. Secondly, it results in zeroing the range past the end of C2 up to ram_bytes. This is particularly unpleasant with fs-verity as it can zero and set uptodate pages in the verity virtual space past EOF. This incorrect readahead behavior can lead to verity verification errors, if we iterate in a way that happens to do the wrong readahead. Fix this by using em->len for readahead expansion, not em->ram_bytes, resulting in the expected behavior of stopping readahead at the extent boundary. Reported-by: Max Chernoff <git@maxchernoff.ca> Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898 Fixes: 9e9ff875e417 ("btrfs: use readahead_expand() on compressed extents") CC: stable@vger.kernel.org # 6.17 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13btrfs: do not assert we found block group item when creating free space treeFilipe Manana1-7/+8
Currently, when building a free space tree at populate_free_space_tree(), if we are not using the block group tree feature, we always expect to find block group items (either extent items or a block group item with key type BTRFS_BLOCK_GROUP_ITEM_KEY) when we search the extent tree with btrfs_search_slot_for_read(), so we assert that we found an item. However this expectation is wrong since we can have a new block group created in the current transaction which is still empty and for which we still have not added the block group's item to the extent tree, in which case we do not have any items in the extent tree associated to the block group. The insertion of a new block group's block group item in the extent tree happens at btrfs_create_pending_block_groups() when it calls the helper insert_block_group_item(). This typically is done when a transaction handle is released, committed or when running delayed refs (either as part of a transaction commit or when serving tickets for space reservation if we are low on free space). So remove the assertion at populate_free_space_tree() even when the block group tree feature is not enabled and update the comment to mention this case. Syzbot reported this with the following stack trace: BTRFS info (device loop3 state M): rebuilding free space tree assertion failed: ret == 0 :: 0, in fs/btrfs/free-space-tree.c:1115 ------------[ cut here ]------------ kernel BUG at fs/btrfs/free-space-tree.c:1115! Oops: invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 1 UID: 0 PID: 6352 Comm: syz.3.25 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 RIP: 0010:populate_free_space_tree+0x700/0x710 fs/btrfs/free-space-tree.c:1115 Code: ff ff e8 d3 (...) RSP: 0018:ffffc9000430f780 EFLAGS: 00010246 RAX: 0000000000000043 RBX: ffff88805b709630 RCX: fea61d0e2e79d000 RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000 RBP: ffffc9000430f8b0 R08: ffffc9000430f4a7 R09: 1ffff92000861e94 R10: dffffc0000000000 R11: fffff52000861e95 R12: 0000000000000001 R13: 1ffff92000861f00 R14: dffffc0000000000 R15: 0000000000000000 FS: 00007f424d9fe6c0(0000) GS:ffff888125afc000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fd78ad212c0 CR3: 0000000076d68000 CR4: 00000000003526f0 Call Trace: <TASK> btrfs_rebuild_free_space_tree+0x1ba/0x6d0 fs/btrfs/free-space-tree.c:1364 btrfs_start_pre_rw_mount+0x128f/0x1bf0 fs/btrfs/disk-io.c:3062 btrfs_remount_rw fs/btrfs/super.c:1334 [inline] btrfs_reconfigure+0xaed/0x2160 fs/btrfs/super.c:1559 reconfigure_super+0x227/0x890 fs/super.c:1076 do_remount fs/namespace.c:3279 [inline] path_mount+0xd1a/0xfe0 fs/namespace.c:4027 do_mount fs/namespace.c:4048 [inline] __do_sys_mount fs/namespace.c:4236 [inline] __se_sys_mount+0x313/0x410 fs/namespace.c:4213 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f424e39066a Code: d8 64 89 02 (...) RSP: 002b:00007f424d9fde68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007f424d9fdef0 RCX: 00007f424e39066a RDX: 0000200000000180 RSI: 0000200000000380 RDI: 0000000000000000 RBP: 0000200000000180 R08: 00007f424d9fdef0 R09: 0000000000000020 R10: 0000000000000020 R11: 0000000000000246 R12: 0000200000000380 R13: 00007f424d9fdeb0 R14: 0000000000000000 R15: 00002000000002c0 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- Reported-by: syzbot+884dc4621377ba579a6f@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/68dc3dab.a00a0220.102ee.004e.GAE@google.com/ Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree") CC: <stable@vger.kernel.org> # 6.1.x: 1961d20f6fa8: btrfs: fix assertion when building free space tree CC: <stable@vger.kernel.org> # 6.1.x Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13btrfs: do not use folio_test_partial_kmap() in ASSERT()sQu Wenruo1-2/+2
[BUG] Syzbot reported an ASSERT() triggered inside scrub: BTRFS info (device loop0): scrub: started on devid 1 assertion failed: !folio_test_partial_kmap(folio) :: 0, in fs/btrfs/scrub.c:697 ------------[ cut here ]------------ kernel BUG at fs/btrfs/scrub.c:697! Oops: invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 UID: 0 PID: 6077 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 RIP: 0010:scrub_stripe_get_kaddr+0x1bb/0x1c0 fs/btrfs/scrub.c:697 Call Trace: <TASK> scrub_bio_add_sector fs/btrfs/scrub.c:932 [inline] scrub_submit_initial_read+0xf21/0x1120 fs/btrfs/scrub.c:1897 submit_initial_group_read+0x423/0x5b0 fs/btrfs/scrub.c:1952 flush_scrub_stripes+0x18f/0x1150 fs/btrfs/scrub.c:1973 scrub_stripe+0xbea/0x2a30 fs/btrfs/scrub.c:2516 scrub_chunk+0x2a3/0x430 fs/btrfs/scrub.c:2575 scrub_enumerate_chunks+0xa70/0x1350 fs/btrfs/scrub.c:2839 btrfs_scrub_dev+0x6e7/0x10e0 fs/btrfs/scrub.c:3153 btrfs_ioctl_scrub+0x249/0x4b0 fs/btrfs/ioctl.c:3163 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:597 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> ---[ end trace 0000000000000000 ]--- Which doesn't make much sense, as all the folios we allocated for scrub should not be highmem. [CAUSE] Thankfully syzbot has a detailed kernel config file, showing that CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is set to y. And that debug option will force all folio_test_partial_kmap() to return true, to improve coverage on highmem tests. But in our case we really just want to make sure the folios we allocated are not highmem (and they are indeed not). Such incorrect result from folio_test_partial_kmap() is just screwing up everything. [FIX] Replace folio_test_partial_kmap() to folio_test_highmem() so that we won't bother those highmem specific debuging options. Fixes: 5fbaae4b8567 ("btrfs: prepare scrub to support bs > ps cases") Reported-by: syzbot+bde59221318c592e6346@syzkaller.appspotmail.com Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13btrfs: only set the device specific options after devices are openedQu Wenruo1-2/+1
[BUG] With v6.17-rc kernels, btrfs will always set 'ssd' mount option even if the block device is not a rotating one: # cat /sys/block/sdd/queue/rotational 1 # cat /etc/fstab: LABEL=DATA2 /data2 btrfs rw,relatime,space_cache=v2,subvolid=5,subvol=/,nofail,nosuid,nodev 0 0 # mount [...] /dev/sdd on /data2 type btrfs (rw,nosuid,nodev,relatime,ssd,space_cache=v2,subvolid=5,subvol=/) [CAUSE] The 'ssd' mount option is set by set_device_specific_options(), and it expects that if there is any rotating device in the btrfs, it will set fs_devices::rotating. However after commit bddf57a70781 ("btrfs: delay btrfs_open_devices() until super block is created"), the device opening is delayed until the super block is created. But the timing of set_device_specific_options() is still left as is, this makes the function be called without any device opened. Since no device is opened, thus fs_devices::rotating will never be set, making btrfs incorrectly set 'ssd' mount option. [FIX] Only call set_device_specific_options() after btrfs_open_devices(). Also only call set_device_specific_options() after a new mount, if we're mounting a mounted btrfs, there is no need to set the device specific mount options again. Reported-by: HAN Yuwei <hrx@bupt.moe> Link: https://lore.kernel.org/linux-btrfs/C8FF75669DFFC3C5+5f93bf8a-80a0-48a6-81bf-4ec890abc99a@bupt.moe/ Fixes: bddf57a70781 ("btrfs: delay btrfs_open_devices() until super block is created") CC: stable@vger.kernel.org # 6.17 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13btrfs: fix memory leak on duplicated memory in the qgroup assign ioctlMiquel Sabaté Solà1-1/+1
On 'btrfs_ioctl_qgroup_assign' we first duplicate the argument as provided by the user, which is kfree'd in the end. But this was not the case when allocating memory for 'prealloc'. In this case, if it somehow failed, then the previous code would go directly into calling 'mnt_drop_write_file', without freeing the string duplicated from the user space. Fixes: 4addc1ffd67a ("btrfs: qgroup: preallocate memory before adding a relation") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13btrfs: fix clearing of BTRFS_FS_RELOC_RUNNING if relocation already runningFilipe Manana1-6/+7
When starting relocation, at reloc_chunk_start(), if we happen to find the flag BTRFS_FS_RELOC_RUNNING is already set we return an error (-EINPROGRESS) to the callers, however the callers call reloc_chunk_end() which will clear the flag BTRFS_FS_RELOC_RUNNING, which is wrong since relocation was started by another task and still running. Finding the BTRFS_FS_RELOC_RUNNING flag already set is an unexpected scenario, but still our current behaviour is not correct. Fix this by never calling reloc_chunk_end() if reloc_chunk_start() has returned an error, which is what logically makes sense, since the general widespread pattern is to have end functions called only if the counterpart start functions succeeded. This requires changing reloc_chunk_start() to clear BTRFS_FS_RELOC_RUNNING if there's a pending cancel request. Fixes: 907d2710d727 ("btrfs: add cancellable chunk relocation support") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-06Merge tag 'for-6.18-tag' of ↵Linus Torvalds2-2/+8
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two short fixes that would be good to have before rc1" * tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix PAGE_SIZE format specifier in open_ctree() btrfs: avoid potential out-of-bounds in btrfs_encode_fh()
2025-10-03Merge tag 'docs-6.18' of git://git.lwn.net/linuxLinus Torvalds1-1/+1
Pull documentation updates from Jonathan Corbet: "It has been a relatively busy cycle in docsland, with changes all over: - Bring the kernel memory-model docs into the Sphinx build in the "literal include" mode. - Lots of build-infrastructure work, further cleaning up long-term kernel-doc technical debt. The sphinx-pre-install tool has been converted to Python and updated for current systems. - A new tool to detect when documents have been moved and generate HTML redirects; this can be used on kernel.org (or any other site hosting the rendered docs) to avoid breaking links. - Automated processing of the YAML files describing the netlink protocol. - A significant update of the maintainer's PGP guide. ... and a seemingly endless series of typo fixes, build-problem fixes, etc" * tag 'docs-6.18' of git://git.lwn.net/linux: (193 commits) Documentation/features: Update feature lists for 6.17-rc7 docs: remove cdomain.py Documentation/process: submitting-patches: fix typo in "were do" docs: dev-tools/lkmm: Fix typo of missing file extension Documentation: trace: histogram: Convert ftrace docs cross-reference Documentation: trace: histogram-design: Wrap introductory note in note:: directive Documentation: trace: historgram-design: Separate sched_waking histogram section heading and the following diagram Documentation: trace: histogram-design: Trim trailing vertices in diagram explanation text Documentation: trace: histogram: Fix histogram trigger subsection number order docs: driver-api: fix spelling of "buses". Documentation: fbcon: Use admonition directives Documentation: fbcon: Reindent 8th step of attach/detach/unload Documentation: fbcon: Add boot options and attach/detach/unload section headings docs: filesystems: sysfs: add remaining top level sysfs directory descriptions docs: filesystems: sysfs: clarify symlink destinations in dev and bus/devices descriptions docs: filesystems: sysfs: remove top level sysfs net directory docs: maintainer: Fix ambiguous subheading formatting docs: kdoc: a few more dump_typedef() tweaks docs: kdoc: remove redundant comment stripping in dump_typedef() docs: kdoc: remove some dead code in dump_typedef() ...
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of ↵Linus Torvalds2-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-10-02Merge tag 'for-6.18/io_uring-20250929' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Store ring provided buffers locally for the users, rather than stuff them into struct io_kiocb. These types of buffers must always be fully consumed or recycled in the current context, and leaving them in struct io_kiocb is hence not a good ideas as that struct has a vastly different life time. Basically just an architecture cleanup that can help prevent issues with ring provided buffers in the future. - Support for mixed CQE sizes in the same ring. Before this change, a CQ ring either used the default 16b CQEs, or it was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where a few 32b CQEs were needed, this caused everything else to use big CQEs. This is wasteful both in terms of memory usage, but also memory bandwidth for the posted CQEs. With IORING_SETUP_CQE_MIXED, applications may use request types that post both normal 16b and big 32b CQEs on the same ring. - Add helpers for async data management, to make it harder for opcode handlers to mess it up. - Add support for multishot for uring_cmd, which ublk can use. This helps improve efficiency, by providing a persistent request type that can trigger multiple CQEs. - Add initial support for ring feature querying. We had basic support for probe operations, but the API isn't great. Rather than expand that, add support for QUERY which is easily expandable and can cover a lot more cases than the existing probe support. This will help applications get a better idea of what operations are supported on a given host. - zcrx improvements from Pavel: - Improve refill entry alignment for better caching - Various cleanups, especially around deduplicating normal memory vs dmabuf setup. - Generalisation of the niov size (Patch 12). It's still hard coded to PAGE_SIZE on init, but will let the user to specify the rx buffer length on setup. - Syscall / synchronous bufer return. It'll be used as a slow fallback path for returning buffers when the refill queue is full. Useful for tolerating slight queue size misconfiguration or with inconsistent load. - Accounting more memory to cgroups. - Additional independent cleanups that will also be useful for mutli-area support. - Various fixes and cleanups * tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring/cmd: drop unused res2 param from io_uring_cmd_done() io_uring: fix nvme's 32b cqes on mixed cq io_uring/query: cap number of queries io_uring/query: prevent infinite loops io_uring/zcrx: account niov arrays to cgroup io_uring/zcrx: allow synchronous buffer return io_uring/zcrx: introduce io_parse_rqe() io_uring/zcrx: don't adjust free cache space io_uring/zcrx: use guards for the refill lock io_uring/zcrx: reduce netmem scope in refill io_uring/zcrx: protect netdev with pp_lock io_uring/zcrx: rename dma lock io_uring/zcrx: make niov size variable io_uring/zcrx: set sgt for umem area io_uring/zcrx: remove dmabuf_offset io_uring/zcrx: deduplicate area mapping io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback() io_uring/zcrx: check all niovs filled with dma addresses io_uring/zcrx: move area reg checks into io_import_area io_uring/zcrx: don't pass slot to io_zcrx_create_area ...
2025-10-01btrfs: fix PAGE_SIZE format specifier in open_ctree()Nathan Chancellor1-1/+1
There is an instance of -Wformat when targeting 32-bit architectures due to using a 'size_t' specifier (which is 'unsigned int' for 32-bit platforms) to print PAGE_SIZE: In file included from fs/btrfs/compression.h:17, from fs/btrfs/extent_io.h:15, from fs/btrfs/locking.h:13, from fs/btrfs/ctree.h:19, from fs/btrfs/disk-io.c:22: fs/btrfs/disk-io.c: In function 'open_ctree': include/linux/kern_levels.h:5:25: error: format '%zu' expects argument of type 'size_t', but argument 4 has type 'long unsigned int' [-Werror=format=] ... fs/btrfs/disk-io.c:3398:17: note: in expansion of macro 'btrfs_warn' 3398 | btrfs_warn(fs_info, | ^~~~~~~~~~ PAGE_SIZE is consistently defined as an 'unsigned long' in include/vsdo/page.h so use '%lu' to clear up the warning. Fixes: 98077f7f2180 ("btrfs: enable experimental bs > ps support") Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-30Merge tag 'for-6.18-tag' of ↵Linus Torvalds76-2430/+3420
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "There are no new features, the changes are in the core code, notably tree-log error handling and reporting improvements, and initial support for block size > page size. Performance improvements: - search data checksums in the commit root (previous transaction) to avoid locking contention, this improves parallelism of read heavy/low write workloads, and also reduces transaction commit time; on real and reproducer workload the sync time went from minutes to tens of seconds (workload and numbers are in the changelog) Core: - tree-log updates: - error handling improvements, transaction aborts - add new error state 'O' (printed in status messages) when log replay fails and is aborted - reduced number of btrfs_path allocations when traversing the tree - 'block size > page size' support - basic implementation with limitations, under experimental build - limitations: no direct io, raid56, encoded read (standalone and in send ioctl), encoded write - preparatory work for compression, removing implicit assumptions of page and block sizes - compression workspaces are now per-filesystem, we cannot assume common block size for work memory among different filesystems - tree-checker now verifies INODE_EXTREF item (which is implementing hardlinks) - tree leaf pretty printer updates, there were missing data from items, keys/items - move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG, it's a debugging feature and not needed to be enabled separately - more struct btrfs_path auto free updates - use ref_tracker API for tracking delayed inodes, enabled by mount option 'ref_verify', allowing to better pinpoint leaking references - in zoned mode, avoid selecting data relocation zoned for ordinary data block groups - updated and enhanced error messages - lots of cleanups and refactoring" * tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits) btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot() btrfs: add unlikely annotations to branches leading to transaction abort btrfs: add unlikely annotations to branches leading to EIO btrfs: add unlikely annotations to branches leading to EUCLEAN btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions btrfs: zoned: don't fail mount needlessly due to too many active zones btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc() btrfs: enable experimental bs > ps support btrfs: add extra ASSERT()s to catch unaligned bios btrfs: fix symbolic link reading when bs > ps btrfs: prepare scrub to support bs > ps cases btrfs: prepare zlib to support bs > ps cases btrfs: prepare lzo to support bs > ps cases btrfs: prepare zstd to support bs > ps cases btrfs: prepare compression folio alloc/free for bs > ps cases btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range() btrfs: remove pointless key offset setup in create_pending_snapshot() btrfs: annotate btrfs_is_testing() as unlikely and make it return bool btrfs: make the rule checking more readable for should_cow_block() btrfs: simplify inline extent end calculation at replay_one_extent() ...
2025-09-29Merge tag 'vfs-6.18-rc1.workqueue' of ↵Linus Torvalds5-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-29Merge tag 'vfs-6.18-rc1.inode' of ↵Linus Torvalds3-1/+11
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves out the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesytems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are a various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function
2025-09-29Merge tag 'vfs-6.18-rc1.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
2025-09-26btrfs: avoid potential out-of-bounds in btrfs_encode_fh()Anderson Nascimento1-1/+7
The function btrfs_encode_fh() does not properly account for the three cases it handles. Before writing to the file handle (fh), the function only returns to the user BTRFS_FID_SIZE_NON_CONNECTABLE (5 dwords, 20 bytes) or BTRFS_FID_SIZE_CONNECTABLE (8 dwords, 32 bytes). However, when a parent exists and the root ID of the parent and the inode are different, the function writes BTRFS_FID_SIZE_CONNECTABLE_ROOT (10 dwords, 40 bytes). If *max_len is not large enough, this write goes out of bounds because BTRFS_FID_SIZE_CONNECTABLE_ROOT is greater than BTRFS_FID_SIZE_CONNECTABLE originally returned. This results in an 8-byte out-of-bounds write at fid->parent_root_objectid = parent_root_id. A previous attempt to fix this issue was made but was lost. https://lore.kernel.org/all/4CADAEEC020000780001B32C@vpn.id2.novell.com/ Although this issue does not seem to be easily triggerable, it is a potential memory corruption bug that should be fixed. This patch resolves the issue by ensuring the function returns the appropriate size for all three cases and validates that *max_len is large enough before writing any data. Fixes: be6e8dc0ba84 ("NFS support for btrfs - v3") CC: stable@vger.kernel.org # 3.0+ Signed-off-by: Anderson Nascimento <anderson@allelesecurity.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-24Merge tag 'for-6.17-rc7-tag' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more regression fix for a problem in zoned mode: mounting would fail if the number of open and active zones reached a common limit that didn't use to be checked" * tag 'for-6.17-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: don't fail mount needlessly due to too many active zones
2025-09-23btrfs: zoned: don't fail mount needlessly due to too many active zonesJohannes Thumshirn1-0/+6
Previously BTRFS did not look at a device's reported max_open_zones limit, but starting with commit 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones"), zoned BTRFS limited the number of concurrently used block-groups to the number of max_open_zones a device reported, if it hadn't already reported a number of max_active_zones. Starting with commit 04147d8394e8 the number of open zones is treated the same way as active zones. But this leads to mount failures on filesystems which have been used before 04147d8394e8 because too many zones are in an open state. Ignore the new limitations on these filesystems, so zones can be finished or evacuated. Reported-by: Yuwei Han <hrx@bupt.moe> Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/ Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()Filipe Manana1-1/+1
After setting the BTRFS_ROOT_FORCE_COW flag on the root we are doing a full write barrier, smp_wmb(), but we don't need to, all we need is a smp_mb__after_atomic(). The use of the smp_wmb() is from the old days when we didn't use a bit and used instead an int field in the root to signal if cow is forced. After the int field was changed to a bit in the root's state (flags field), we forgot to update the memory barrier in create_pending_snapshot() to smp_mb__after_atomic(), but we did the change in commit_fs_roots() after clearing BTRFS_ROOT_FORCE_COW. That happened in commit 27cdeb7096b8 ("Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root"). On the reader side, in should_cow_block(), we also use the counterpart smp_mb__before_atomic() which generates further confusion. So change the smp_wmb() to smp_mb__after_atomic(). In fact we don't even need any barrier at all since create_pending_snapshot() is called in the critical section of a transaction commit and therefore no one can concurrently join/attach the transaction, or start a new one, until the transaction is unblocked. By the time someone starts a new transaction and enters should_cow_block(), a lot of implicit memory barriers already took place by having acquired several locks such as fs_info->trans_lock and extent buffer locks on the root node at least. Nevertlheless, for consistency use smp_mb__after_atomic() after setting the force cow bit in create_pending_snapshot(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: add unlikely annotations to branches leading to transaction abortDavid Sterba20-232/+231
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen. Transaction abort is one such error, the btrfs_abort_transaction() inlines code to check the state and print a warning, this ought to be out of the hot path. The most common pattern is when transaction abort is called after checking a return value and the control flow leads to a quick return. In other cases it may not be necessary to add unlikely() e.g. when the function returns anyway or the control flow is not changed noticeably. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: add unlikely annotations to branches leading to EIODavid Sterba19-83/+81
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: add unlikely annotations to branches leading to EUCLEANDavid Sterba19-110/+110
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EUCLEAN (a corruption) is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: more trivial BTRFS_PATH_AUTO_FREE conversionsSun YangKai10-351/+195
Trivial pattern for the auto freeing with goto -> return conversions if possible. The following cases are considered trivial in this patch: 1. Cases where there are no operations between btrfs_free_path() and the function returns. 2. Cases where only simple cleanup operations (such as kfree(), kvfree(), clear_bit(), and fs_path_free()) are present between btrfs_free_path() and the function return. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: zoned: don't fail mount needlessly due to too many active zonesJohannes Thumshirn1-0/+6
Previously BTRFS did not look at a device's reported max_open_zones limit, but starting with commit 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones"), zoned BTRFS limited the number of concurrently used block-groups to the number of max_open_zones a device reported, if it hadn't already reported a number of max_active_zones. Starting with commit 04147d8394e8 the number of open zones is treated the same way as active zones. But this leads to mount failures on filesystems which have been used before 04147d8394e8 because too many zones are in an open state. Ignore the new limitations on these filesystems, so zones can be finished or evacuated. Reported-by: Yuwei Han <hrx@bupt.moe> Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/ Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()Miquel Sabaté Solà2-8/+5
As pointed out in the documentation, calling 'kmalloc' with open-coded arithmetic can lead to unfortunate overflows and this particular way of using it has been deprecated. Instead, it's preferred to use 'kmalloc_array' in cases where it might apply so an overflow check is performed. Note this is an API cleanup and is not fixing any overflows because in all cases the multipliers are bounded small numbers derived from number of items in leaves/nodes. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: enable experimental bs > ps supportQu Wenruo5-15/+58
With all the preparation patches, we're able to finally enable btrfs block size (sector size) larger than page size support and give it a full fstests run. And obviously this new feature is hidden behind experimental flags, and should not be considered as a core feature yet as btrfs' default block size is still 4K. But this is still a feature that will shine in the future where 16K block sized device are widely adopted. For now there are some features explicitly disabled: - Direct IO This is the most complex part to support, the root reason is we can not control the pages of iov iter passed in. User space programs can only ensure the virtual addresses are contiguous, but have no control on their physical addresses. Our bs > ps support heavily relies on large folios, and direct IO memory can easily break it. So direct IO is disabled and will always fall back to buffered IO. - RAID56 In theory we can convert RAID56 to use large folios, but it will need to be converted back to page based if we want to support direct IO in the future. So just reject it for now. - Encoded send - Encoded read Both are utilizing btrfs_encoded_read_regular_fill_pages(), and send is utilizing vmallocated memory. Unfortunately for vmallocated memory we can not guarantee the minimal folio order. For send, it will just always fallback to regular writes, which reads from page cache and will follow the existing folio order requirement. - Encoded write Encoded write itself is allocating pages by themselves, and we can easily change it to follow the minimal order. But since encoded read is already disabled, there is no need to only enable encoded write. Finally just like what we did for bs < ps support in the past, add a warning message for bs > ps mounts. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: add extra ASSERT()s to catch unaligned biosQu Wenruo1-0/+27
Btrfs uses btrfs_bio to handle read/write of logical address, for the incoming bs > ps support, btrfs has extra requirements: - One folio must contain at least one fs block - No fs block can cross folio boundaries This requirement is not hard to maintain, thanks to the address space's minimal folio order. But not all btrfs bios are generated through address space, e.g. compression and scrub. To catch possible unaligned bios, introduce a helper, assert_bbio_alginment(), for each btrfs_bio in btrfs_submit_bbio(). This will check the following things: - bv_offset is aligned to block size - bv_len is aligned to block size With a btrfs bio passing above checks, unless it's empty it will ensure the requirements for bs > ps support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: fix symbolic link reading when bs > psQu Wenruo1-1/+1
[BUG DURING BS > PS TEST] When running the following script on a btrfs whose block size is larger than page size, e.g. 8K block size and 4K page size, it will trigger a kernel BUG: # mkfs.btrfs -s 8k $dev # mount $dev $mnt # mkdir $mnt/dir # ln -s dir $mnt/link # ls $mnt/link The call trace looks like this: BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing BTRFS info (device dm-2): checking UUID tree BTRFS info (device dm-2): enabling ssd optimizations BTRFS info (device dm-2): enabling free space tree ------------[ cut here ]------------ kernel BUG at /home/adam/linux/include/linux/highmem.h:275! Oops: invalid opcode: 0000 [#1] SMP CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G OE 6.17.0-rc4-custom+ #283 PREEMPT(full) Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs] Call Trace: <TASK> btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] filemap_read_folio+0x37/0xe0 do_read_cache_folio+0x94/0x3e0 __page_get_link.isra.0+0x20/0x90 page_get_link+0x16/0x40 step_into+0x69b/0x830 path_lookupat+0xa7/0x170 filename_lookup+0xf7/0x200 ? set_ptes.isra.0+0x36/0x70 vfs_statx+0x7a/0x160 do_statx+0x63/0xa0 __x64_sys_statx+0x90/0xe0 do_syscall_64+0x82/0xae0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Please note bs > ps support is still under development and the enablement patch is not even in btrfs development branch. [CAUSE] Btrfs reuses its data folio read path to handle symbolic links, as the symbolic link target is stored as an inline data extent. But for newly created inodes, btrfs only set the minimal order if the target inode is a regular file. Thus for above newly created symbolic link, it doesn't properly respect the minimal folio order, and triggered the above crash. [FIX] Call btrfs_set_inode_mapping_order() unconditionally inside btrfs_create_new_inode(). For symbolic links this will fix the crash as now the folio will meet the minimal order. For regular files this brings no change. For directory/bdev/char and all the other types of inodes, they won't go through the data read path, thus no effect either. Fixes: cc38d178ff33 ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: prepare scrub to support bs > ps casesQu Wenruo3-39/+60
This involves: - Migrate scrub_stripe::pages[] to folios[] - Use btrfs_alloc_folio_array() and folio_put() to alloc above array. - Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use folio interfaces - Migrate raid56_parity_cache_data_pages() to raid56_parity_cache_data_folios() Since scrub is the only caller still using pages. This helper will copy the folio array contents into rbio::stripe_pages, with sector uptodate flags updated. And a new ASSERT() to make sure bs > ps cases will not hit this path. Since most scrub code is based on kaddr/paddr, the migration itself is pretty straightforward. And since we're here, also move the loop to set the stripe_sectors[].uptodate out of the copy loop. As we always mark all the sectors as uptodate for the data stripe, it's easier to do in one go, other than doing it inside the copy loop. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: prepare zlib to support bs > ps casesQu Wenruo1-15/+32
This involves converting the following functions to use correct folio sizes/shifts: - zlib_compress_folios() - zlib_decompress_bio() There is a special handling for s390 hardware acceleration. With bs > ps cases, we can go with 16K block size on s390 (which uses fixed 4K page size). In that case we do not need to do the buffer copy as our folio is large enough for hardware acceleration. So factor out the s390 specific and folio size check into a helper, need_special_buffer(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: prepare lzo to support bs > ps casesQu Wenruo1-19/+22
This involves converting the following functions to use correct folio sizes/shifts: - copy_compress_data_to_page() - lzo_compress_folios() - lzo_decompress_bio() Just like zstd, lzo has some extra incorrect usage of kmap_local_folio() that the offset is always 0. This will not handle HIGHMEM large folios correctly, but those cases are already rejected explicitly so it should not cause problems when bs > ps support is enabled. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: prepare zstd to support bs > ps casesQu Wenruo2-14/+34
This involves converting the following functions to use proper folio sizes/shifts: - zstd_compress_folios() - zstd_decompress_bio() The function zstd_decompress() is already using block size correctly without using page size, thus it needs no modification. And since zstd compression is calling kmap_local_folio(), the existing code cannot handle large folios with HIGHMEM, as kmap_local_folio() requires us to handle one page range each time. I do not really think it's worth to spend time on some feature that will be deprecated eventually. So here just add an extra explicit rejection for bs > ps with HIGHMEM feature enabled kernels. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: prepare compression folio alloc/free for bs > ps casesQu Wenruo9-44/+78
This includes the following preparation for bs > ps cases: - Always alloc/free the folio directly if bs > ps This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus affecting all compression algorithms. For btrfs_free_compr_folio() it needs no parameter for now, as we can use the folio size to skip the caching part. For now the change is just to passing a @fs_info into the function, all the folio size assumption is still based on page size. - Properly zero the last folio in compress_file_range() Since the compressed folios can be larger than a page, we need to properly zero the whole folio. - Use correct folio size for btrfs_add_compressed_bio_folios() Instead of page size, use the correct folio size. - Use correct folio size/shift for btrfs_compress_filemap_get_folio() As we are not only using simple page sized folios anymore. - Use correct folio size for btrfs_decompress() There is an ASSERT() making sure the decompressed range is no larger than a page, which will be triggered for bs > ps cases. - Skip readahead for compressed pages Similar to subpage cases. - Make btrfs_alloc_folio_array() to accept a new @order parameter - Add a helper to calculate the minimal folio size All those changes should not affect the existing bs <= ps handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()Qu Wenruo1-3/+11
[BUG] With my local branch to enable bs > ps support for btrfs, sometimes I hit the following ASSERT() inside submit_one_sector(): ASSERT(block_start != EXTENT_MAP_HOLE); Please note that it's not yet possible to hit this ASSERT() in the wild yet, as it requires btrfs bs > ps support, which is not even in the development branch. But on the other hand, there is also a very low chance to hit above ASSERT() with bs < ps cases, so this is an existing bug affect not only the incoming bs > ps support but also the existing bs < ps support. [CAUSE] Firstly that ASSERT() means we're trying to submit a dirty block but without a real extent map nor ordered extent map backing it. Furthermore with extra debugging, the folio triggering such ASSERT() is always larger than the fs block size in my bs > ps case. (8K block size, 4K page size) After some more debugging, the ASSERT() is trigger by the following sequence: extent_writepage() | We got a 32K folio (4 fs blocks) at file offset 0, and the fs block | size is 8K, page size is 4K. | And there is another 8K folio at file offset 32K, which is also | dirty. | So the filemap layout looks like the following: | | "||" is the filio boundary in the filemap. | "//| is the dirty range. | | 0 8K 16K 24K 32K 40K | |////////| |//////////////////////||////////| | |- writepage_delalloc() | |- find_lock_delalloc_range() for [0, 8K) | | Now range [0, 8K) is properly locked. | | | |- find_lock_delalloc_range() for [16K, 40K) | | |- btrfs_find_delalloc_range() returned range [16K, 40K) | | |- lock_delalloc_folios() locked folio 0 successfully | | | | | | The filemap range [32K, 40K) got dropped from filemap. | | | | | |- lock_delalloc_folios() failed with -EAGAIN on folio 32K | | | As the folio at 32K is dropped. | | | | | |- loops = 1; | | |- max_bytes = PAGE_SIZE; | | |- goto again; | | | This will re-do the lookup for dirty delalloc ranges. | | | | | |- btrfs_find_delalloc_range() called with @max_bytes == 4K | | | This is smaller than block size, so | | | btrfs_find_delalloc_range() is unable to return any range. | | \- return false; | | | \- Now only range [0, 8K) has an OE for it, but for dirty range | [16K, 32K) it's dirty without an OE. | This breaks the assumption that writepage_delalloc() will find | and lock all dirty ranges inside the folio. | |- extent_writepage_io() |- submit_one_sector() for [0, 8K) | Succeeded | |- submit_one_sector() for [16K, 24K) Triggering the ASSERT(), as there is no OE, and the original extent map is a hole. Please note that, this also exposed the same problem for bs < ps support. E.g. with 64K page size and 4K block size. If we failed to lock a folio, and falls back into the "loops = 1;" branch, we will re-do the search using 64K as max_bytes. Which may fail again to lock the next folio, and exit early without handling all dirty blocks inside the folio. [FIX] Instead of using the fixed size PAGE_SIZE as @max_bytes, use @sectorsize, so that we are ensured to find and lock any remaining blocks inside the folio. And since we're here, add an extra ASSERT() to before calling btrfs_find_delalloc_range() to make sure the @max_bytes is at least no smaller than a block to avoid false negative. Cc: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: remove pointless key offset setup in create_pending_snapshot()Filipe Manana1-4/+2
There's no point in setting the key's offset to (u64)-1 since we never use it before setting it to the current transaction's ID. So remove the assignment of (u64)-1 to the key's offset and move the remainder of the key initialization close to where it's used. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: annotate btrfs_is_testing() as unlikely and make it return boolFilipe Manana3-10/+8
We can annotate btrfs_is_testing() as unlikely since that's the most expected scenario and it's desirable for the compiler to optimize for the case we are not running the self tests. So add the annotation to btrfs_is_testing() and while at it also make it return bool instead of int. Also make two of the existing callers use btrfs_is_testing() directly instead of storing its result in a local variable. On x86_64 with Debian's gcc 14.2.0-19 this resulted in a very tiny object code reduction. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913257 161567 15592 2090416 1fe5b0 fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>