| Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix memory leak in qgroup relation ioctl when qgroup levels are
invalid
- don't write back dirty metadata on filesystem with errors
- properly log renamed links
- properly mark prealloc extent range beyond inode size as dirty (when
no-noles is not enabled)
* tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: mark dirty extent range for out of bound prealloc extents
btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name
btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation
btrfs: ensure no dirty metadata is written back for an fs with errors
|
|
If we are logging a new name make sure our inode has the runtime flag
BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find
new inode refs/extrefs in the subvolume tree and copy them into the log
tree.
We are currently doing it when adding a new link but we are missing it
when renaming.
An example where this makes a new name not persisted:
1) create symlink with name foo in directory A
2) fsync directory A, which persists the symlink
3) rename the symlink from foo to bar
4) fsync directory A to persist the new symlink name
Step 4 isn't working correctly as it's not logging the new name and also
leaving the old inode ref in the log tree, so after a power failure the
symlink still has the old name of "foo". This is because when we first
fsync directoy A we log the symlink's inode (as it's a new entry) and at
btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because
we are using that mode and the inode has the runtime flag
BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag
BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode,
during the rename through the call to btrfs_log_new_name() (calling
btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search
the subvolume tree for new refs/extrefs and jump directory to the
'log_extents' label.
Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode
when we are about to log a new name. A test case for fstests will follow
soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "mm, swap: improve cluster scan strategy" from Kairui Song improves
performance and reduces the failure rate of swap cluster allocation
- "support large align and nid in Rust allocators" from Vitaly Wool
permits Rust allocators to set NUMA node and large alignment when
perforning slub and vmalloc reallocs
- "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend
DAMOS_STAT's handling of the DAMON operations sets for virtual
address spaces for ops-level DAMOS filters
- "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren
Baghdasaryan reduces mmap_lock contention during reads of
/proc/pid/maps
- "mm/mincore: minor clean up for swap cache checking" from Kairui Song
performs some cleanup in the swap code
- "mm: vm_normal_page*() improvements" from David Hildenbrand provides
code cleanup in the pagemap code
- "add persistent huge zero folio support" from Pankaj Raghav provides
a block layer speedup by optionalls making the
huge_zero_pagepersistent, instead of releasing it when its refcount
falls to zero
- "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to
the recently added Kexec Handover feature
- "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo
Stoakes turns mm_struct.flags into a bitmap. To end the constant
struggle with space shortage on 32-bit conflicting with 64-bit's
needs
- "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap
code
- "selftests/mm: Fix false positives and skip unsupported tests" from
Donet Tom fixes a few things in our selftests code
- "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised"
from David Hildenbrand "allows individual processes to opt-out of
THP=always into THP=madvise, without affecting other workloads on the
system".
It's a long story - the [1/N] changelog spells out the considerations
- "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on
the memdesc project. Please see
https://kernelnewbies.org/MatthewWilcox/Memdescs and
https://blogs.oracle.com/linux/post/introducing-memdesc
- "Tiny optimization for large read operations" from Chi Zhiling
improves the efficiency of the pagecache read path
- "Better split_huge_page_test result check" from Zi Yan improves our
folio splitting selftest code
- "test that rmap behaves as expected" from Wei Yang adds some rmap
selftests
- "remove write_cache_pages()" from Christoph Hellwig removes that
function and converts its two remaining callers
- "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD
selftests issues
- "introduce kernel file mapped folios" from Boris Burkov introduces
the concept of "kernel file pages". Using these permits btrfs to
account its metadata pages to the root cgroup, rather than to the
cgroups of random inappropriate tasks
- "mm/pageblock: improve readability of some pageblock handling" from
Wei Yang provides some readability improvements to the page allocator
code
- "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON
to understand arm32 highmem
- "tools: testing: Use existing atomic.h for vma/maple tests" from
Brendan Jackman performs some code cleanups and deduplication under
tools/testing/
- "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes
a couple of 32-bit issues in tools/testing/radix-tree.c
- "kasan: unify kasan_enabled() and remove arch-specific
implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific
initialization code into a common arch-neutral implementation
- "mm: remove zpool" from Johannes Weiner removes zspool - an
indirection layer which now only redirects to a single thing
(zsmalloc)
- "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a
couple of cleanups in the fork code
- "mm: remove nth_page()" from David Hildenbrand makes rather a lot of
adjustments at various nth_page() callsites, eventually permitting
the removal of that undesirable helper function
- "introduce kasan.write_only option in hw-tags" from Yeoreum Yun
creates a KASAN read-only mode for ARM, using that architecture's
memory tagging feature. It is felt that a read-only mode KASAN is
suitable for use in production systems rather than debug-only
- "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does
some tidying in the hugetlb folio allocation code
- "mm: establish const-correctness for pointer parameters" from Max
Kellermann makes quite a number of the MM API functions more accurate
about the constness of their arguments. This was getting in the way
of subsystems (in this case CEPH) when they attempt to improving
their own const/non-const accuracy
- "Cleanup free_pages() misuse" from Vishal Moola fixes a number of
code sites which were confused over when to use free_pages() vs
__free_pages()
- "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the
mapletree code accessible to Rust. Required by nouveau and by its
forthcoming successor: the new Rust Nova driver
- "selftests/mm: split_huge_page_test: split_pte_mapped_thp
improvements" from David Hildenbrand adds a fix and some cleanups to
the thp selftesting code
- "mm, swap: introduce swap table as swap cache (phase I)" from Chris
Li and Kairui Song is the first step along the path to implementing
"swap tables" - a new approach to swap allocation and state tracking
which is expected to yield speed and space improvements. This
patchset itself yields a 5-20% performance benefit in some situations
- "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc
layer to clean up the ptdesc code a little
- "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some
issues in our 5-level pagetable selftesting code
- "Minor fixes for memory allocation profiling" from Suren Baghdasaryan
addresses a couple of minor issues in relatively new memory
allocation profiling feature
- "Small cleanups" from Matthew Wilcox has a few cleanups in
preparation for more memdesc work
- "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from
Quanmin Yan makes some changes to DAMON in furtherance of supporting
arm highmem
- "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad
Anjum adds that compiler check to selftests code and fixes the
fallout, by removing dead code
- "Improvements to Victim Process Thawing and OOM Reaper Traversal
Order" from zhongjinji makes a number of improvements in the OOM
killer: mainly thawing a more appropriate group of victim threads so
they can release resources
- "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park
is a bunch of small and unrelated fixups for DAMON
- "mm/damon: define and use DAMON initialization check function" from
SeongJae Park implement reliability and maintainability improvements
to a recently-added bug fix
- "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from
SeongJae Park provides additional transparency to userspace clients
of the DAMON_STAT information
- "Expand scope of khugepaged anonymous collapse" from Dev Jain removes
some constraints on khubepaged's collapsing of anon VMAs. It also
increases the success rate of MADV_COLLAPSE against an anon vma
- "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()"
from Lorenzo Stoakes moves us further towards removal of
file_operations.mmap(). This patchset concentrates upon clearing up
the treatment of stacked filesystems
- "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau
provides some fixes and improvements to mlock's tracking of large
folios. /proc/meminfo's "Mlocked" field became more accurate
- "mm/ksm: Fix incorrect accounting of KSM counters during fork" from
Donet Tom fixes several user-visible KSM stats inaccuracies across
forks and adds selftest code to verify these counters
- "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses
some potential but presently benign issues in KSM's mm_slot handling
* tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits)
mm: swap: check for stable address space before operating on the VMA
mm: convert folio_page() back to a macro
mm/khugepaged: use start_addr/addr for improved readability
hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list
alloc_tag: fix boot failure due to NULL pointer dereference
mm: silence data-race in update_hiwater_rss
mm/memory-failure: don't select MEMORY_ISOLATION
mm/khugepaged: remove definition of struct khugepaged_mm_slot
mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL
hugetlb: increase number of reserving hugepages via cmdline
selftests/mm: add fork inheritance test for ksm_merging_pages counter
mm/ksm: fix incorrect KSM counter handling in mm_struct during fork
drivers/base/node: fix double free in register_one_node()
mm: remove PMD alignment constraint in execmem_vmalloc()
mm/memory_hotplug: fix typo 'esecially' -> 'especially'
mm/rmap: improve mlock tracking for large folios
mm/filemap: map entire large folio faultaround
mm/fault: try to map the entire file folio in finish_fault()
mm/rmap: mlock large folios in try_to_unmap_one()
mm/rmap: fix a mlock race condition in folio_referenced_one()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no new features, the changes are in the core code, notably
tree-log error handling and reporting improvements, and initial
support for block size > page size.
Performance improvements:
- search data checksums in the commit root (previous transaction) to
avoid locking contention, this improves parallelism of read
heavy/low write workloads, and also reduces transaction commit
time; on real and reproducer workload the sync time went from
minutes to tens of seconds (workload and numbers are in the
changelog)
Core:
- tree-log updates:
- error handling improvements, transaction aborts
- add new error state 'O' (printed in status messages) when log
replay fails and is aborted
- reduced number of btrfs_path allocations when traversing the
tree
- 'block size > page size' support
- basic implementation with limitations, under experimental build
- limitations: no direct io, raid56, encoded read (standalone and
in send ioctl), encoded write
- preparatory work for compression, removing implicit assumptions
of page and block sizes
- compression workspaces are now per-filesystem, we cannot assume
common block size for work memory among different filesystems
- tree-checker now verifies INODE_EXTREF item (which is implementing
hardlinks)
- tree leaf pretty printer updates, there were missing data from
items, keys/items
- move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG,
it's a debugging feature and not needed to be enabled separately
- more struct btrfs_path auto free updates
- use ref_tracker API for tracking delayed inodes, enabled by mount
option 'ref_verify', allowing to better pinpoint leaking references
- in zoned mode, avoid selecting data relocation zoned for ordinary
data block groups
- updated and enhanced error messages
- lots of cleanups and refactoring"
* tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits)
btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()
btrfs: add unlikely annotations to branches leading to transaction abort
btrfs: add unlikely annotations to branches leading to EIO
btrfs: add unlikely annotations to branches leading to EUCLEAN
btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions
btrfs: zoned: don't fail mount needlessly due to too many active zones
btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()
btrfs: enable experimental bs > ps support
btrfs: add extra ASSERT()s to catch unaligned bios
btrfs: fix symbolic link reading when bs > ps
btrfs: prepare scrub to support bs > ps cases
btrfs: prepare zlib to support bs > ps cases
btrfs: prepare lzo to support bs > ps cases
btrfs: prepare zstd to support bs > ps cases
btrfs: prepare compression folio alloc/free for bs > ps cases
btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()
btrfs: remove pointless key offset setup in create_pending_snapshot()
btrfs: annotate btrfs_is_testing() as unlikely and make it return bool
btrfs: make the rule checking more readable for should_cow_block()
btrfs: simplify inline extent end calculation at replay_one_extent()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"This contains a series I originally wrote and that Eric brought over
the finish line. It moves out the i_crypt_info and i_verity_info
pointers out of 'struct inode' and into the fs-specific part of the
inode.
So now the few filesytems that actually make use of this pay the price
in their own private inode storage instead of forcing it upon every
user of struct inode.
The pointer for the crypt and verity info is simply found by storing
an offset to its address in struct fsverity_operations and struct
fscrypt_operations. This shrinks struct inode by 16 bytes.
I hope to move a lot more out of it in the future so that struct inode
becomes really just about very core stuff that we need, much like
struct dentry and struct file, instead of the dumping ground it has
become over the years.
On top of this are a various changes associated with the ongoing inode
lifetime handling rework that multiple people are pushing forward:
- Stop accessing inode->i_count directly in f2fs and gfs2. They
simply should use the __iget() and iput() helpers
- Make the i_state flags an enum
- Rework the iput() logic
Currently, if we are the last iput, and we have the I_DIRTY_TIME
bit set, we will grab a reference on the inode again and then mark
it dirty and then redo the put. This is to make sure we delay the
time update for as long as possible
We can rework this logic to simply dec i_count if it is not 1, and
if it is do the time update while still holding the i_count
reference
Then we can replace the atomic_dec_and_lock with locking the
->i_lock and doing atomic_dec_and_test, since we did the
atomic_add_unless above
- Add an icount_read() helper and convert everyone that accesses
inode->i_count directly for this purpose to use the helper
- Expand dump_inode() to dump more information about an inode helping
in debugging
- Add some might_sleep() annotations to iput() and associated
helpers"
* tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: add might_sleep() annotation to iput() and more
fs: expand dump_inode()
inode: fix whitespace issues
fs: add an icount_read helper
fs: rework iput logic
fs: make the i_state flags an enum
fs: stop accessing ->i_count directly in f2fs and gfs2
fsverity: check IS_VERITY() in fsverity_cleanup_inode()
fs: remove inode::i_verity_info
btrfs: move verity info pointer to fs-specific part of inode
f2fs: move verity info pointer to fs-specific part of inode
ext4: move verity info pointer to fs-specific part of inode
fsverity: add support for info in fs-specific part of inode
fs: remove inode::i_crypt_info
ceph: move crypt info pointer to fs-specific part of inode
ubifs: move crypt info pointer to fs-specific part of inode
f2fs: move crypt info pointer to fs-specific part of inode
ext4: move crypt info pointer to fs-specific part of inode
fscrypt: add support for info in fs-specific part of inode
fscrypt: replace raw loads of info pointer with helper function
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual selections of misc updates for this cycle.
Features:
- Add "initramfs_options" parameter to set initramfs mount options.
This allows to add specific mount options to the rootfs to e.g.,
limit the memory size
- Add RWF_NOSIGNAL flag for pwritev2()
Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE
signal from being raised when writing on disconnected pipes or
sockets. The flag is handled directly by the pipe filesystem and
converted to the existing MSG_NOSIGNAL flag for sockets
- Allow to pass pid namespace as procfs mount option
Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs
mount is auto-selected based on the mounting process's active
pidns, and the pidns itself is basically hidden once the mount has
been constructed)
This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns
of a procfs mount as desired. Examples include:
* In order to bypass the mnt_too_revealing() check, Kubernetes
creates a procfs mount from an empty pidns so that user
namespaced containers can be nested (without this, the nested
containers would fail to mount procfs)
But this requires forking off a helper process because you cannot
just one-shot this using mount(2)
* Container runtimes in general need to fork into a container
before configuring its mounts, which can lead to security issues
in the case of shared-pidns containers (a privileged process in
the pidns can interact with your container runtime process)
While SUID_DUMP_DISABLE and user namespaces make this less of an
issue, the strict need for this due to a minor uAPI wart is kind
of unfortunate
Things would be much easier if there was a way for userspace to
just specify the pidns they want. So this pull request contains
changes to implement a new "pidns" argument which can be set
using fsconfig(2):
fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
or classic mount(2) / mount(8):
// mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
Cleanups:
- Remove the last references to EXPORT_OP_ASYNC_LOCK
- Make file_remove_privs_flags() static
- Remove redundant __GFP_NOWARN when GFP_NOWAIT is used
- Use try_cmpxchg() in start_dir_add()
- Use try_cmpxchg() in sb_init_done_wq()
- Replace offsetof() with struct_size() in ioctl_file_dedupe_range()
- Remove vfs_ioctl() export
- Replace rwlock() with spinlock in epoll code as rwlock causes
priority inversion on preempt rt kernels
- Make ns_entries in fs/proc/namespaces const
- Use a switch() statement() in init_special_inode() just like we do
in may_open()
- Use struct_size() in dir_add() in the initramfs code
- Use str_plural() in rd_load_image()
- Replace strcpy() with strscpy() in find_link()
- Rename generic_delete_inode() to inode_just_drop() and
generic_drop_inode() to inode_generic_drop()
- Remove unused arguments from fcntl_{g,s}et_rw_hint()
Fixes:
- Document @name parameter for name_contains_dotdot() helper
- Fix spelling mistake
- Always return zero from replace_fd() instead of the file descriptor
number
- Limit the size for copy_file_range() in compat mode to prevent a
signed overflow
- Fix debugfs mount options not being applied
- Verify the inode mode when loading it from disk in minixfs
- Verify the inode mode when loading it from disk in cramfs
- Don't trigger automounts with RESOLVE_NO_XDEV
If openat2() was called with RESOLVE_NO_XDEV it didn't traverse
through automounts, but could still trigger them
- Add FL_RECLAIM flag to show_fl_flags() macro so it appears in
tracepoints
- Fix unused variable warning in rd_load_image() on s390
- Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD
- Use ns_capable_noaudit() when determining net sysctl permissions
- Don't call path_put() under namespace semaphore in listmount() and
statmount()"
* tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits)
fcntl: trim arguments
listmount: don't call path_put() under namespace semaphore
statmount: don't call path_put() under namespace semaphore
pid: use ns_capable_noaudit() when determining net sysctl permissions
fs: rename generic_delete_inode() and generic_drop_inode()
init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD
initramfs: Replace strcpy() with strscpy() in find_link()
initrd: Use str_plural() in rd_load_image()
initramfs: Use struct_size() helper to improve dir_add()
initrd: Fix unused variable warning in rd_load_image() on s390
fs: use the switch statement in init_special_inode()
fs/proc/namespaces: make ns_entries const
filelock: add FL_RECLAIM to show_fl_flags() macro
eventpoll: Replace rwlock with spinlock
selftests/proc: add tests for new pidns APIs
procfs: add "pidns" mount option
pidns: move is-ancestor logic to helper
openat2: don't trigger automounts with RESOLVE_NO_XDEV
namei: move cross-device check to __traverse_mounts
namei: remove LOOKUP_NO_XDEV check from handle_mounts
...
|
|
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.
Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.
The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EUCLEAN (a corruption) is one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG DURING BS > PS TEST]
When running the following script on a btrfs whose block size is larger
than page size, e.g. 8K block size and 4K page size, it will trigger a
kernel BUG:
# mkfs.btrfs -s 8k $dev
# mount $dev $mnt
# mkdir $mnt/dir
# ln -s dir $mnt/link
# ls $mnt/link
The call trace looks like this:
BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing
BTRFS info (device dm-2): checking UUID tree
BTRFS info (device dm-2): enabling ssd optimizations
BTRFS info (device dm-2): enabling free space tree
------------[ cut here ]------------
kernel BUG at /home/adam/linux/include/linux/highmem.h:275!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G OE 6.17.0-rc4-custom+ #283 PREEMPT(full)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs]
Call Trace:
<TASK>
btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
filemap_read_folio+0x37/0xe0
do_read_cache_folio+0x94/0x3e0
__page_get_link.isra.0+0x20/0x90
page_get_link+0x16/0x40
step_into+0x69b/0x830
path_lookupat+0xa7/0x170
filename_lookup+0xf7/0x200
? set_ptes.isra.0+0x36/0x70
vfs_statx+0x7a/0x160
do_statx+0x63/0xa0
__x64_sys_statx+0x90/0xe0
do_syscall_64+0x82/0xae0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Please note bs > ps support is still under development and the
enablement patch is not even in btrfs development branch.
[CAUSE]
Btrfs reuses its data folio read path to handle symbolic links, as the
symbolic link target is stored as an inline data extent.
But for newly created inodes, btrfs only set the minimal order if the
target inode is a regular file.
Thus for above newly created symbolic link, it doesn't properly respect
the minimal folio order, and triggered the above crash.
[FIX]
Call btrfs_set_inode_mapping_order() unconditionally inside
btrfs_create_new_inode().
For symbolic links this will fix the crash as now the folio will meet
the minimal order.
For regular files this brings no change.
For directory/bdev/char and all the other types of inodes, they won't
go through the data read path, thus no effect either.
Fixes: cc38d178ff33 ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This includes the following preparation for bs > ps cases:
- Always alloc/free the folio directly if bs > ps
This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
affecting all compression algorithms.
For btrfs_free_compr_folio() it needs no parameter for now, as we can
use the folio size to skip the caching part.
For now the change is just to passing a @fs_info into the function,
all the folio size assumption is still based on page size.
- Properly zero the last folio in compress_file_range()
Since the compressed folios can be larger than a page, we need to
properly zero the whole folio.
- Use correct folio size for btrfs_add_compressed_bio_folios()
Instead of page size, use the correct folio size.
- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
As we are not only using simple page sized folios anymore.
- Use correct folio size for btrfs_decompress()
There is an ASSERT() making sure the decompressed range is no larger
than a page, which will be triggered for bs > ps cases.
- Skip readahead for compressed pages
Similar to subpage cases.
- Make btrfs_alloc_folio_array() to accept a new @order parameter
- Add a helper to calculate the minimal folio size
All those changes should not affect the existing bs <= ps handling.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently if we want to iterate a bio in block unit, we do something
like this:
while (iter->bi_size) {
struct bio_vec bv = bio_iter_iovec();
/* Do something with using the bv */
bio_advance_iter_single(&bbio->bio, iter, sectorsize);
}
That's fine for now, but it will not handle future bs > ps, as
bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not
exceed page size.
This means the code using that bv can only handle a block if bs <= ps.
To address this problem and handle future bs > ps cases better:
- Introduce a helper btrfs_bio_for_each_block()
Instead of bio_vec, which has single and multiple page version and
multiple page version has quite some limits, use my favorite way to
represent a block, phys_addr_t.
For bs <= ps cases, nothing is changed, except we will do a very
small overhead to convert phys_addr_t to a folio, then use the proper
folio helpers to handle the possible highmem cases.
For bs > ps cases, all blocks will be backed by large folios, meaning
every folio will cover at least one block. And still use proper folio
helpers to handle highmem cases.
With phys_addr_t, we will handle both large folio and highmem
properly. So there is no better single variable to present a btrfs
block than phys_addr_t.
- Extract the data block csum calculation into a helper
The new helper, btrfs_calculate_block_csum() will be utilized by
btrfs_csum_one_bio().
- Use btrfs_bio_for_each_block() to replace existing call sites
Including:
* index_one_bio() from raid56.c
Very straight-forward.
* btrfs_check_read_bio()
Also update repair_one_sector() to grab the folio using phys_addr_t,
and do extra checks to make sure the folio covers at least one
block.
We do not need to bother bv_len at all now.
* btrfs_csum_one_bio()
Now we can move the highmem handling into a dedicated helper,
calculate_block_csum(), and use btrfs_bio_for_each_block() helper.
There is one exception in btrfs_decompress_buf2page(), which is copying
decompressed data into the original bio, which is not iterating using
block size thus we don't need to bother.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently for btrfs checksum verification, we do it in the following
pattern:
kaddr = kmap_local_*();
ret = btrfs_check_csum_csum(kaddr);
kunmap_local(kaddr);
It's OK for now, but it's still not following the patterns of helpers
inside linux/highmem.h, which never requires a virt memory address.
In those highmem helpers, they mostly accept a folio, some offset/length
inside the folio, and in the implementation they check if the folio
needs partial kmap, and do the handling.
Inspired by those formal highmem helpers, enhance the highmem handling
of data checksum verification by:
- Rename btrfs_check_sector_csum() to btrfs_check_block_csum()
To follow the more common term "block" used in all other major
filesystems.
- Pass a physical address into btrfs_check_block_csum() and
btrfs_data_csum_ok()
The physical address is always available even for a highmem page.
Since it's page frame number << PAGE_SHIFT + offset in page.
And with that physical address, we can grab the folio covering the
page, and do extra checks to ensure it covers at least one block.
This also allows us to do the kmap inside btrfs_check_block_csum().
This means all the extra HIGHMEM handling will be concentrated into
btrfs_check_block_csum(), and no callers will need to bother highmem
by themselves.
- Properly zero out the block if csum mismatch
Since btrfs_data_csum_ok() only got a paddr, we can not and should not
use memzero_bvec(), which only accepts single page bvec.
Instead use paddr to grab the folio and call folio_zero_range()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using text
spellchecker with a custom dictionary of acceptable terms.
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a very low chance that DEBUG_WARN() inside
btrfs_writepage_cow_fixup() can be triggered when
CONFIG_BTRFS_EXPERIMENTAL is enabled.
This only happens after run_delalloc_nocow() failed.
Unfortunately I haven't hit it for a while thus no real world dmesg for
now.
[CAUSE]
There is a race window where after run_delalloc_nocow() failed, error
handling can race with writeback thread.
Before we hit run_delalloc_nocow(), there is an inode with the following
dirty pages: (4K page size, 4K block size, no large folio)
0 4K 8K 12K 16K
|/////////|///////////|///////////|////////////|
The inode also have NODATACOW flag, and the above dirty range will go
through different extents during run_delalloc_range():
0 4K 8K 12K 16K
| NOCOW | COW | COW | NOCOW |
The race happen like this:
writeback thread A | writeback thread B
----------------------------------+--------------------------------------
Writeback for folio 0 |
run_delalloc_nocow() |
|- nocow_one_range() |
| For range [0, 4K), ret = 0 |
| |
|- fallback_to_cow() |
| For range [4K, 8K), ret = 0 |
| Folio 4K *UNLOCKED* |
| | Writeback for folio 4K
|- fallback_to_cow() | extent_writepage()
| For range [8K, 12K), failure | |- writepage_delalloc()
| | |
|- btrfs_cleanup_ordered_extents()| |
|- btrfs_folio_clear_ordered() | |
| Folio 0 still locked, safe | |
| | | Ordered extent already allocated.
| | | Nothing to do.
| | |- extent_writepage_io()
| | |- btrfs_writepage_cow_fixup()
|- btrfs_folio_clear_ordered() | |
Folio 4K hold by thread B, | |
UNSAFE! | |- btrfs_test_ordered()
| | Cleared by thread A,
| |
| |- DEBUG_WARN();
This is only possible after run_delalloc_nocow() failure, as
cow_file_range() will keep all folios and io tree range locked, until
everything is finished or after error handling.
The root cause is we allow fallback_to_cow() and nocow_one_range() to
unlock the folios after a successful run, so that during error handling
we're no longer safe to use btrfs_cleanup_ordered_extents() as the
folios are already unlocked.
[FIX]
- Make fallback_to_cow() and nocow_one_range() to keep folios locked
after a successful run
For fallback_to_cow() we can pass COW_FILE_RANGE_KEEP_LOCKED flag
into cow_file_range().
For nocow_one_range() we have to remove the PAGE_UNLOCK flag from
extent_clear_unlock_delalloc().
- Unlock folios if everything is fine in run_delalloc_nocow()
- Use extent_clear_unlock_delalloc() to handle range [@start,
@cur_offset) inside run_delalloc_nocow()
Since folios are still locked, we do not need
cleanup_dirty_folios() to do the cleanup.
extent_clear_unlock_delalloc() with "PAGE_START_WRITEBACK |
PAGE_END_WRITEBACK" will clear the dirty flags.
- Remove cleanup_dirty_folios()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently if we hit an error inside nocow_one_range(), we do not clear
the page dirty, and let the caller to handle it.
This is very different compared to fallback_to_cow(), when that function
failed, everything will be cleaned up by cow_file_range().
Enhance the situation by:
- Use a common error handling for nocow_one_range()
If we failed anything, use the same btrfs_cleanup_ordered_extents()
and extent_clear_unlock_delalloc().
btrfs_cleanup_ordered_extents() is safe even if we haven't created new
ordered extent, in that case there should be no OE and that function
will do nothing.
The same applies to extent_clear_unlock_delalloc(), and since we're
passing PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK, it
will also clear folio dirty flag during error handling.
- Avoid touching the failed range of nocow_one_range()
As the failed range will be cleaned up and unlocked by that function.
Here we introduce a new variable @nocow_end to record the failed range,
so that we can skip it during the error handling of run_delalloc_nocow().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When running emulated write error tests like generic/475, we can hit
error messages like this:
BTRFS error (device dm-12 state EA): run_delalloc_nocow failed, root=596 inode=264 start=1605632 len=73728: -5
BTRFS error (device dm-12 state EA): failed to run delalloc range, root=596 ino=264 folio=1605632 submit_bitmap=0-7 start=1605632 len=73728: -5
Which is normally buried by direct IO error messages.
However above error messages are not enough to determine which is the
real range that caused the error.
Considering we can have multiple different extents in one delalloc
range (e.g. some COW extents along with some NOCOW extents), just
outputting the error at the end of run_delalloc_nocow() is not enough.
To enhance the error messages:
- Remove the rate limit on the existing error messages
In the generic/475 example, most error messages are from direct IO,
not really from the delalloc range.
Considering how useful the delalloc range error messages are, we don't
want they to be rate limited.
- Add extra @cur_offset output for cow_file_range()
- Add extra variable output for run_delalloc_nocow()
This is especially important for run_delalloc_nocow(), as there
are extra error paths where we can hit error without into
nocow_one_range() nor fallback_to_cow().
- Add an error message for nocow_one_range()
That's the missing part.
For fallback_to_cow(), we have error message from cow_file_range()
already.
- Constify the @len and @end local variables for nocow_one_range()
This makes it much easier to make sure @len and @end are not modified
at runtime.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the error handling of run_delalloc_nocow() modifies
@cur_offset to handle different parts of the delalloc range.
However the error handling can always be split into 3 parts:
1) The range with ordered extents allocated (OE cleanup)
We have to cleanup the ordered extents and unlock the folios.
2) The range that have been cleaned up (skip)
For now it's only for the range of fallback_to_cow().
We should not touch it at all.
3) The range that is not yet touched (untouched)
We have to unlock the folios and clear any reserved space.
This 3 ranges split has the same principle as cow_file_range(), however
the NOCOW/COW handling makes the above 3 range split much more complex:
a) Failure immediately after a successful OE allocation
Thus no @cow_start nor @cow_end set.
start cur_offset end
| OE cleanup | untouched |
b) Failure after hitting a COW range but before calling
fallback_to_cow()
start cow_start cur_offset end
| OE Cleanup | untouched |
c) Failure to call fallback_to_cow()
start cow_start cow_end end
| OE Cleanup | skip | untouched |
Instead of modifying @cur_offset, do proper range calculation for
OE-cleanup and untouched ranges using above 3 cases with proper range
charts.
This avoid updating @cur_offset, as it will an extra output for debug
purposes later, and explain the behavior easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We're almost done cleaning misused int/bool parameters. Convert a bunch
of them, found by manual grepping. Note that btrfs_sync_fs() needs an
int as it's mandated by the struct super_operations prototype.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For the 3 supported compression algorithms, two of them (zstd and zlib)
are already grabbing the btrfs inode for error messages.
It's more common to pass btrfs_inode and grab the address space from it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that btrfs_zone_finish_endio_workfn() is directly calling
do_zone_finish() the only caller of btrfs_zone_finish_endio() is
btrfs_finish_one_ordered().
btrfs_finish_one_ordered() already has error handling in-place so
btrfs_zone_finish_endio() can return an error if the block group lookup
fails.
Also as btrfs_zone_finish_endio() already checks for zoned filesystems and
returns early, there's no need to do this in the caller.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function cow_file_range() has two boolean parameters. Replace it
with a single @flags parameter, with two flags:
- COW_FILE_RANGE_NO_INLINE
- COW_FILE_RANGE_KEEP_LOCKED
And since we're here, also update the comments of cow_file_range() to
replace the old "page" usage with "folio".
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.
The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the suffix.
No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
btrfs_init_file_extent_tree() uses S_ISREG() to determine if the file is
a regular file. In the beginning of btrfs_read_locked_inode(), the i_mode
hasn't been read from inode item, then file_extent_tree won't be used at
all in volumes without NO_HOLES.
Fix this by calling btrfs_init_file_extent_tree() after i_mode is
initialized in btrfs_read_locked_inode().
Fixes: 3d7db6e8bd22e6 ("btrfs: don't allocate file extent tree for non regular files")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: austinchang <austinchang@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At this point MIGRATEPAGE_SUCCESS is misnamed for all folio users,
and now that we remove MIGRATEPAGE_UNMAP, it's really the only "success"
return value that the code uses and expects.
Let's just get rid of MIGRATEPAGE_SUCCESS completely and just use "0"
for success.
Link: https://lkml.kernel.org/r/20250811143949.1117439-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com> [mm]
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> [jfs]
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Byungchul Park <byungchul@sk.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Eugenio Pé rez <eperezma@redhat.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is a race condition between inode eviction and inode caching that
can cause a live struct btrfs_inode to be missing from the root->inodes
xarray. Specifically, there is a window during evict() between the inode
being unhashed and deleted from the xarray. If btrfs_iget() is called
for the same inode in that window, it will be recreated and inserted
into the xarray, but then eviction will delete the new entry, leaving
nothing in the xarray:
Thread 1 Thread 2
---------------------------------------------------------------
evict()
remove_inode_hash()
btrfs_iget_path()
btrfs_iget_locked()
btrfs_read_locked_inode()
btrfs_add_inode_to_root()
destroy_inode()
btrfs_destroy_inode()
btrfs_del_inode_from_root()
__xa_erase
In turn, this can cause issues for subvolume deletion. Specifically, if
an inode is in this lost state, and all other inodes are evicted, then
btrfs_del_inode_from_root() will call btrfs_add_dead_root() prematurely.
If the lost inode has a delayed_node attached to it, then when
btrfs_clean_one_deleted_snapshot() calls btrfs_kill_all_delayed_nodes(),
it will loop forever because the delayed_nodes xarray will never become
empty (unless memory pressure forces the inode out). We saw this
manifest as soft lockups in production.
Fix it by only deleting the xarray entry if it matches the given inode
(using __xa_cmpxchg()).
Fixes: 310b2f5d5a94 ("btrfs: use an xarray to track open inodes in a root")
Cc: stable@vger.kernel.org # 6.11+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Co-authored-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of doing direct access to ->i_count, add a helper to handle
this. This will make it easier to convert i_count to a refcount later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/9bc62a84c6b9d6337781203f60837bd98fbc4a96.1756222464.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
At inode_logged() if we find that the inode was not logged before we
update its ->last_dir_index_offset to (u64)-1 with the goal that the
next directory log operation will see the (u64)-1 and then figure out
it must check what was the index of the last logged dir index key and
update ->last_dir_index_offset to that key's offset (this is done in
update_last_dir_index_offset()).
This however has a possibility for a time window where a race can happen
and lead to directory logging skipping dir index keys that should be
logged. The race happens like this:
1) Task A calls inode_logged(), sees ->logged_trans as 0 and then checks
that the inode item was logged before, but before it sets the inode's
->last_dir_index_offset to (u64)-1...
2) Task B is at btrfs_log_inode() which calls inode_logged() early, and
that has set ->last_dir_index_offset to (u64)-1;
3) Task B then enters log_directory_changes() which calls
update_last_dir_index_offset(). There it sees ->last_dir_index_offset
is (u64)-1 and that the inode was logged before (ctx->logged_before is
true), and so it searches for the last logged dir index key in the log
tree and it finds that it has an offset (index) value of N, so it sets
->last_dir_index_offset to N, so that we can skip index keys that are
less than or equal to N (later at process_dir_items_leaf());
4) Task A now sets ->last_dir_index_offset to (u64)-1, undoing the update
that task B just did;
5) Task B will now skip every index key when it enters
process_dir_items_leaf(), since ->last_dir_index_offset is (u64)-1.
Fix this by making inode_logged() not touch ->last_dir_index_offset and
initializing it to 0 when an inode is loaded (at btrfs_alloc_inode()) and
then having update_last_dir_index_offset() treat a value of 0 as meaning
we must check the log tree and update with the index of the last logged
index key. This is fine since the minimum possible value for
->last_dir_index_offset is 1 (BTRFS_DIR_START_INDEX - 1 = 2 - 1 = 1).
This also simplifies the management of ->last_dir_index_offset and now
all accesses to it are done under the inode's log_mutex.
Fixes: 0f8ce49821de ("btrfs: avoid inode logging during rename and link when possible")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of incrementing the inode's link count and refcount early before
adding the link, updating the inode and deleting orphan item, do it after
all those steps succeeded right before calling d_instantiate(). This makes
the error handling logic simpler by avoiding the need for the 'drop_inode'
variable to signal if we need to undo the link count increment and the
inode refcount increase under the 'fail' label.
This also reduces the level of indentation by one, making the code easier
to read.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we fail to update the inode or delete the orphan item we leak the inode
since we update its refcount with the ihold() call to account for the
d_instantiate() call which never happens in case we fail those steps. Fix
this by setting 'drop_inode' to true in case we fail those steps.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we fail to update the inode or delete the orphan item, we must abort
the transaction to prevent persisting an inconsistent state. For example
if we fail to update the inode item, we have the inconsistency of having
a persisted inode item with a link count of N but we have N + 1 inode ref
items and N + 1 directory entries pointing to our inode in case the
transaction gets committed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move the fsverity_info pointer into the filesystem-specific part of the
inode by adding the field btrfs_inode::i_verity_info and configuring
fsverity_operations::inode_info_offs accordingly.
This is a prerequisite for a later commit that removes
inode::i_verity_info, saving memory and improving cache efficiency on
filesystems that don't support fsverity.
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/20250810075706.172910-12-ebiggers@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If we are doing an unlink for log replay, we are updating the directory's
mtime and ctime to the current time, and this is incorrect since it should
stay with the mtime and ctime that were set when the directory was logged.
This is the same as when adding a link to an inode during log replay (with
btrfs_add_link()), where we want the mtime and ctime to be the values that
were in place when the inode was logged.
This was found with generic/547 using LOAD_FACTOR=20 and TIME_FACTOR=20,
where due to large log trees we have longer log replay times and fssum
could detect a mismatch of the mtime and ctime of a directory.
Fix this by skipping the mtime and ctime update at __btrfs_unlink_inode()
if we are in log replay context (just like btrfs_add_link()).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Inside nocow_one_range(), if the checksum cloning for data reloc inode
failed, we call btrfs_cleanup_ordered_extents() to cleanup the just
allocated ordered extents.
But unlike extent_clear_unlock_delalloc(),
btrfs_cleanup_ordered_extents() requires a length, not an inclusive end
bytenr.
This can be problematic, as the @end is normally way larger than @len.
This means btrfs_cleanup_ordered_extents() can be called on folios
out of the correct range, and if the out-of-range folio is under
writeback, we can incorrectly clear the ordered flag of the folio, and
trigger the DEBUG_WARN() inside btrfs_writepage_cow_fixup().
Fix the wrong parameter with correct length instead.
Fixes: 94f6c5c17e52 ("btrfs: move ordered extent cleanup to where they are allocated")
CC: stable@vger.kernel.org # 6.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When hitting a large folio, btrfs_cleanup_ordered_extents() will get the
same large folio multiple times, and clearing the same range again and
again.
Thankfully this is not causing anything wrong, just inefficiency.
This is caused by the fact that we're iterating folios using the old
page index, thus can hit the same large folio again and again.
Enhance it by increasing @index to the index of the folio end, and only
increase @index by 1 if we failed to grab a folio.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the defrag ioctl cannot rewrite the extents without
compression. Add a new flag for that, as setting compression to 0 (or
"no compression") means to do no changes to compression so take what is
the current default, like mount options or properties.
The defrag setting overrides mount or properties. The compression
BTRFS_DEFRAG_DONT_COMPRESS is only used for in-memory operations and
does not need to have a fixed value.
Mount with zstd:9, copy test file from /usr/bin/ (about 260KB):
$ mount -o compress=zstd:9 /dev/vda /mnt
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13364.. 13491: 128: 13440: encoded
2: 256.. 291: 13424.. 13459: 36: 13492: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 42% 124K 292K 292K
zstd 42% 124K 292K 292K
Defrag to uncompressed:
$ btrfs fi defrag --nocomp testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 291: 291840.. 292131: 292: last,eof
testfile: 1 extent found
$ compsize testfile
Processed 1 file, 1 regular extents (1 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 292K 292K 292K
none 100% 292K 292K 292K
Compress again with LZO:
$ btrfs fi defrag -clzo testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13392.. 13519: 128: 13440: encoded
2: 256.. 291: 13480.. 13515: 36: 13520: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 64% 188K 292K 292K
lzo 64% 188K 292K 292K
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's just a simple wrapper around btrfs_clear_extent_bit() that passes a
NULL for its last argument (a cached extent state record), plus there is
not counter part - we have a btrfs_set_extent_bit() but we do not have a
btrfs_set_extent_bits() (plural version). So just remove it and make all
callers use btrfs_clear_extent_bit() directly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have a cached extent state record from the previous extent locking so
we can use when setting the EXTENT_NORESERVE in the range, allowing the
operation to be faster if the extent io tree is relatively large.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Set the EXTENT_NORESERVE bit in the io tree before unlocking the range so
that we can use the cached state and speedup the operation, since the
unlock operation releases the cached state.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We call btrfs_check_nocow_lock() to see if we can NOCOW a block sized
range but we don't check later if we can NOCOW the whole range.
It's unexpected to be able to NOCOW a range smaller than blocksize, so
add an assertion to check the NOCOW range matches the blocksize.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The token versions of set/get accessors will be removed, use the normal
helpers.
There's additional overhead of the token helpers that update the cached
address in case it moves to another page/folio. The normal versions
don't need to do that.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Any conversion of offsets in the logical or the physical mapping space
of the pages is done by a shift and the target type should be pgoff_t
(type of struct page::index). Fix the locations where it's still
unsigned long.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Almost everywhere we want to use a btrfs inode and therefore we have a
lot of calls to BTRFS_I(), making the code more verbose. Instead use btrfs
inode local variables to avoid so much use of BTRFS_I().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to call d_inode(dentry) when calling btrfs_unlink_inode()
since we have already stored that in a local inode variable. So just use
the local variable to make the code less verbose.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Simplify folio_pos() + folio_size() and use the new helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The way zero_end is calculated and used does a -1 and +1 that
effectively cancel out, so this can be simplified. This is also
preparatory patch for using a helper for folio_pos + folio_size with the
semantics of exclusive end of the range.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A few more remaining cases where we can use the helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Two remaining cases where we can use the helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can avoid potential memory allocation failure in btrfs_truncate() as
the block reserve lifetime is limited to the scope of the function. This
requires +48 bytes on stack.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can avoid potential memory allocation failure in btrfs_evict_inode()
as the block reserve lifetime is limited to the scope of the function.
This requires +48 bytes on stack.
Signed-off-by: David Sterba <dsterba@suse.com>
|