| Age | Commit message (Collapse) | Author | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Benjamin Marzinski:
"There are fixes for some corner case crashes in dm-cache and
dm-mirror, new setup functionality for dm-vdo, and miscellaneous minor
fixes and cleanups, especially to dm-verity.
dm-vdo:
- Make dm-vdo able to format the device itself, like other dm
targets, instead of needing a userspace formating program
- Add some sanity checks and code cleanup
dm-cache:
- Fix crashes and hangs when operating in passthrough mode (which
have been around, unnoticed, since 4.12), as well as a late
arriving fix for an error path bug in the passthrough fix
- Fix a corner case memory leak
dm-verity:
- Another set of minor bugfixes and code cleanups to the forward
error correction code
dm-mirror
- Fix minor initialization bug
- Fix overflow crash on a large devices with small region sizes
dm-crypt
- Reimplement elephant diffuser using AES library and minor cleanups
dm-core:
- Claude found a buffer overflow in /dev/mapper/contrl ioctl handling
- make dm_mod.wait_for correctly wait for partitions
- minor code fixes and cleanups"
* tag 'for-7.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
dm cache: fix missing return in invalidate_committed's error path
dm: fix a buffer overflow in ioctl processing
dm-crypt: Make crypt_iv_operations::post return void
dm vdo: Fix spelling mistake "postive" -> "positive"
dm: provide helper to set stacked limits
dm-integrity: always set the io hints
dm-integrity: fix mismatched queue limits
dm-bufio: use kzalloc_flex
dm vdo: save the formatted metadata to disk
dm vdo: add formatting logic and initialization
dm vdo: add synchronous metadata I/O submission helper
dm vdo: add geometry block structure
dm vdo: add geometry block encoding
dm vdo: add upfront validation for logical size
dm vdo: add formatting parameters to table line
dm vdo: add super block initialization to encodings.c
dm vdo: add geometry block initialization to encodings.c
dm-crypt: Make crypt_iv_operations::wipe return void
dm-crypt: Reimplement elephant diffuser using AES library
dm-verity-fec: warn even when there were no errors
...
|
|
In passthrough mode, dm-cache defers write submission until after
metadata commit completes via the invalidate_committed() continuation.
On commit error, invalidate_committed() calls invalidate_complete() to
end the bio and free the migration struct, after which it should return
immediately.
The patch 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode")
omitted this early return, causing execution to fall through into the
success path on error. This results in use-after-free on the migration
struct in the subsequent calls.
Fix by adding the missing return after the invalidate_complete() call.
Fixes: 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode")
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/dm-devel/adjMq6T5RRjv_uxM@stanley.mountain/
Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Tony Asleson (using Claude) found a buffer overflow in dm-ioctl in the
function retrieve_status:
1. The code in retrieve_status checks that the output string fits into
the output buffer and writes the output string there
2. Then, the code aligns the "outptr" variable to the next 8-byte
boundary:
outptr = align_ptr(outptr);
3. The alignment doesn't check overflow, so outptr could point past the
buffer end
4. The "for" loop is iterated again, it executes:
remaining = len - (outptr - outbuf);
5. If "outptr" points past "outbuf + len", the arithmetics wraps around
and the variable "remaining" contains unusually high number
6. With "remaining" being high, the code writes more data past the end of
the buffer
Luckily, this bug has no security implications because:
1. Only root can issue device mapper ioctls
2. The commonly used libraries that communicate with device mapper
(libdevmapper and devicemapper-rs) use buffer size that is aligned to
8 bytes - thus, "outptr = align_ptr(outptr)" can't overshoot the input
buffer and the bug can't happen accidentally
Reported-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Bryn M. Reeves <bmr@redhat.com>
Cc: stable@vger.kernel.org
|
|
When retry_aligned_read() encounters an overlapped stripe, it releases
the stripe via raid5_release_stripe() which puts it on the lockless
released_stripes llist. In the next raid5d loop iteration,
release_stripe_list() drains the stripe onto handle_list (since
STRIPE_HANDLE is set by the original IO), but retry_aligned_read()
runs before handle_active_stripes() and removes the stripe from
handle_list via find_get_stripe() -> list_del_init(). This prevents
handle_stripe() from ever processing the stripe to resolve the
overlap, causing an infinite loop and soft lockup.
Fix this by using __release_stripe() with temp_inactive_list instead
of raid5_release_stripe() in the failure path, so the stripe does not
go through the released_stripes llist. This allows raid5d to break out
of its loop, and the overlap will be resolved when the stripe is
eventually processed by handle_stripe().
Fixes: 773ca82fa1ee ("raid5: make release_stripe lockless")
Cc: stable@vger.kernel.org
Signed-off-by: FengWei Shih <dannyshih@synology.com>
Signed-off-by: Chia-Ming Chang <chiamingc@synology.com>
Link: https://lore.kernel.org/linux-raid/20260402061406.455755-1-chiamingc@synology.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
During raid456 reshape, direct IO across the reshape position can sleep
in raid5_make_request() waiting for reshape progress while still
holding an active_io reference. If userspace then freezes reshape and
writes md/suspend_lo or md/suspend_hi, mddev_suspend() kills active_io
and waits for all in-flight IO to drain.
This can deadlock: the IO needs reshape progress to continue, but the
reshape thread is already frozen, so the active_io reference is never
dropped and suspend never completes.
raid5_prepare_suspend() already wakes wait_for_reshape for dm-raid. Do
the same for normal md suspend when reshape is already interrupted, so
waiting raid456 IO can abort, drop its reference, and let suspend
finish.
The mdadm test tests/25raid456-reshape-deadlock reproduces the hang.
Fixes: 714d20150ed8 ("md: add new helpers to suspend/resume array")
Link: https://lore.kernel.org/linux-raid/20260327140729.2030564-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
Previously, using wait_event() would wake up all waiters simultaneously,
and they would compete for the tree lock. The bio which gets the lock
first will be handled, so the write sequence cannot be guaranteed.
For example:
bio1(100,200)
bio2(150,200)
bio3(150,300)
The write sequence of fast device is bio1,bio2,bio3. But the write sequence
of slow device could be bio1,bio3,bio2 due to lock competition. This causes
data corruption.
Replace waitqueue with a fifo list to guarantee the write sequence. And it
also needs to iterate the list when removing one entry. If not, it may miss
the opportunity to wake up the waiting io.
For example:
bio1(1,3), bio2(2,4)
bio3(5,7), bio4(6,8)
These four bios are in the same bucket. bio1 and bio3 are inserted into
the rbtree. bio2 and bio4 are added to the waiting list and bio2 is the
first one. bio3 returns from slow disk and tries to wake up the waiting
bios. bio2 is removed from the list and will be handled. But bio1 hasn't
finished. So bio2 will be added into waiting list again. Then bio1 returns
from slow disk and wakes up waiting bios. bio4 is removed from the list
and will be handled. Now bio1, bio3 and bio4 all finish and bio2 is left
on the waiting list. So it needs to iterate the waiting list to wake up
the right bio.
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260324072501.59865-1-xni@redhat.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
For RAID-456 arrays with llbitmap, if all underlying disks support
write_zeroes with unmap, issue write_zeroes to zero all disk data
regions and initialize the bitmap to BitCleanUnwritten instead of
BitUnwritten.
This optimization skips the initial XOR parity building because:
1. write_zeroes with unmap guarantees zeroed reads after the operation
2. For RAID-456, when all data is zero, parity is automatically
consistent (0 XOR 0 XOR ... = 0)
3. BitCleanUnwritten indicates parity is valid but no user data
has been written
The implementation adds two helper functions:
- llbitmap_all_disks_support_wzeroes_unmap(): Checks if all active
disks support write_zeroes with unmap
- llbitmap_zero_all_disks(): Issues blkdev_issue_zeroout() to each
rdev's data region to zero all disks
The zeroing and bitmap state setting happens in llbitmap_init_state()
during bitmap initialization. If any disk fails to zero, we fall back
to BitUnwritten and normal lazy recovery.
This significantly reduces array initialization time for RAID-456
arrays built on modern NVMe SSDs or other devices that support
write_zeroes with unmap.
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-4-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
Add new states to the llbitmap state machine to support proactive XOR
parity building for RAID-5 arrays. This allows users to pre-build parity
data for unwritten regions before any user data is written.
New states added:
- BitNeedSyncUnwritten: Transitional state when proactive sync is triggered
via sysfs on Unwritten regions.
- BitSyncingUnwritten: Proactive sync in progress for unwritten region.
- BitCleanUnwritten: XOR parity has been pre-built, but no user data
written yet. When user writes to this region, it transitions to BitDirty.
New actions added:
- BitmapActionProactiveSync: Trigger for proactive XOR parity building.
- BitmapActionClearUnwritten: Convert CleanUnwritten/NeedSyncUnwritten/
SyncingUnwritten states back to Unwritten before recovery starts.
State flows:
- Current (lazy): Unwritten -> (write) -> NeedSync -> (sync) -> Dirty -> Clean
- New (proactive): Unwritten -> (sysfs) -> NeedSyncUnwritten -> (sync) -> CleanUnwritten
- On write to CleanUnwritten: CleanUnwritten -> (write) -> Dirty -> Clean
- On disk replacement: CleanUnwritten regions are converted to Unwritten
before recovery starts, so recovery only rebuilds regions with user data
A new sysfs interface is added at /sys/block/mdX/md/llbitmap/proactive_sync
(write-only) to trigger proactive sync. This only works for RAID-456 arrays.
Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-3-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
If default bitmap version and on-disk version doesn't match, and mdadm
is not the latest version to set bitmap_type, set bitmap_ops based on
the disk version.
Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-2-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
r5c_recovery_analyze_meta_block() and
r5l_recovery_verify_data_checksum_for_mb() iterate over payloads in a
journal metadata block using on-disk payload size fields without
validating them against the remaining space in the metadata block.
A corrupted journal contains payload sizes extending beyond the PAGE_SIZE
boundary can cause out-of-bounds reads when accessing payload fields or
computing offsets.
Add bounds validation for each payload type to ensure the full payload
fits within meta_size before processing.
Fixes: b4c625c67362 ("md/r5cache: r5cache recovery: part 1")
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Link: https://lore.kernel.org/linux-raid/SYBPR01MB78815E78D829BB86CD7C8015AF5FA@SYBPR01MB7881.ausprd01.prod.outlook.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
The md_wq workqueue is defined as static and initialized in md_init(),
but it is not used anywhere within md.c.
All asynchronous and deferred work in this file is handled via
md_misc_wq or dedicated md threads.
Fixes: b75197e86e6d3 ("md: Remove flush handling")
Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Link: https://lore.kernel.org/linux-raid/20260328193522.3624-1-abd.masalkhi@gmail.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
syzbot reported a WARNING at mm/page_alloc.c:__alloc_frozen_pages_noprof()
triggered by create_strip_zones() in the RAID0 driver.
When raid_disks is large, the allocation size exceeds MAX_PAGE_ORDER (4MB
on x86), causing WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER).
Convert the strip_zone and devlist allocations from kzalloc/kzalloc_objs to
kvzalloc/kvzalloc_objs, which first attempts a contiguous allocation with
__GFP_NOWARN and then falls back to vmalloc for large sizes. Convert the
corresponding kfree calls to kvfree.
Both arrays are pure metadata lookup tables (arrays of pointers and zone
descriptors) accessed only via indexing, so they do not require physically
contiguous memory.
Reported-by: syzbot+924649752adf0d3ac9dd@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69adaba8.a00a0220.b130.0005.GAE@google.com/
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/linux-raid/20260308234202.3118119-1-gourry@gourry.net/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
When "clear" is written to array_state, md_attr_store() breaks sysfs
active protection so the array can delete itself from its own sysfs
store method.
However, md_attr_store() currently drops the mddev reference before
calling sysfs_unbreak_active_protection(). Once do_md_stop(..., 0)
has made the mddev eligible for delayed deletion, the temporary
kobject reference taken by sysfs_break_active_protection() can become
the last kobject reference protecting the md kobject.
That allows sysfs_unbreak_active_protection() to drop the last
kobject reference from the current sysfs writer context. kobject
teardown then recurses into kernfs removal while the current sysfs
node is still being unwound, and lockdep reports recursive locking on
kn->active with kernfs_drain() in the call chain.
Reproducer on an existing level:
1. Create an md0 linear array and activate it:
mknod /dev/md0 b 9 0
echo none > /sys/block/md0/md/metadata_version
echo linear > /sys/block/md0/md/level
echo 1 > /sys/block/md0/md/raid_disks
echo "$(cat /sys/class/block/sdb/dev)" > /sys/block/md0/md/new_dev
echo "$(($(cat /sys/class/block/sdb/size) / 2))" > \
/sys/block/md0/md/dev-sdb/size
echo 0 > /sys/block/md0/md/dev-sdb/slot
echo active > /sys/block/md0/md/array_state
2. Wait briefly for the array to settle, then clear it:
sleep 2
echo clear > /sys/block/md0/md/array_state
The warning looks like:
WARNING: possible recursive locking detected
bash/588 is trying to acquire lock:
(kn->active#65) at __kernfs_remove+0x157/0x1d0
but task is already holding lock:
(kn->active#65) at sysfs_unbreak_active_protection+0x1f/0x40
...
Call Trace:
kernfs_drain
__kernfs_remove
kernfs_remove_by_name_ns
sysfs_remove_group
sysfs_remove_groups
__kobject_del
kobject_put
md_attr_store
kernfs_fop_write_iter
vfs_write
ksys_write
Restore active protection before mddev_put() so the extra sysfs
kobject reference is dropped while the mddev is still held alive. The
actual md kobject deletion is then deferred until after the sysfs
write path has fully returned.
Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260330055213.3976052-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and
crash"), we adopted a simple modification suggestion from AI to fix the
use-after-free.
But in actual testing, we found an extreme case where the device is
stopped before calling bch_write_bdev_super().
At this point, struct closure sb_write has not been initialized yet.
For this patch, we ensure that sb_bio has been completed via
sb_write_mutex.
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com
Fixes: fec114a98b87 ("bcache: fix cached_dev.sb_bio use-after-free and crash")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In our production environment, we have received multiple crash reports
regarding libceph, which have caught our attention:
```
[6888366.280350] Call Trace:
[6888366.280452] blk_update_request+0x14e/0x370
[6888366.280561] blk_mq_end_request+0x1a/0x130
[6888366.280671] rbd_img_handle_request+0x1a0/0x1b0 [rbd]
[6888366.280792] rbd_obj_handle_request+0x32/0x40 [rbd]
[6888366.280903] __complete_request+0x22/0x70 [libceph]
[6888366.281032] osd_dispatch+0x15e/0xb40 [libceph]
[6888366.281164] ? inet_recvmsg+0x5b/0xd0
[6888366.281272] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[6888366.281405] ceph_con_process_message+0x79/0x140 [libceph]
[6888366.281534] ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
[6888366.281661] ceph_con_workfn+0x329/0x680 [libceph]
```
After analyzing the coredump file, we found that the address of
dc->sb_bio has been freed. We know that cached_dev is only freed when it
is stopped.
Since sb_bio is a part of struct cached_dev, rather than an alloc every
time. If the device is stopped while writing to the superblock, the
released address will be accessed at endio.
This patch hopes to wait for sb_write to complete in cached_dev_free.
It should be noted that we analyzed the cause of the problem, then tell
all details to the QWEN and adopted the modifications it made.
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Fixes: cafe563591446 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since all implementations of crypt_iv_operations::post now return 0,
change the return type to void.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
There is a spelling mistake in a vdo_log_error message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
There are multiple device mappers that set up their stacking limits
exactly the same for the logical, physical and minimum IO queue limits.
Provide a helper for it.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Don't depend on the defaults to be what is desired if the integrity
device was set up with 512b sector size. Always set the queue limits to
be at least what the device mapper wants.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
A user can integritysetup a device with a backing device using a 4k
logical block size, but request the dm device use 1k or 2k. This
mismatch creates an inconsistency such that the dm device would report
limits for IO that it can't actually execute. Fix this by using the
backing device's limits if they are larger.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Avoid manual size calculations and use the proper helper.
Add __counted_by for extra runtime analysis.
Signed-off-by: Rosen Penev <rosenp@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Add vdo_save_super_block() and vdo_save_geometry_block() to perform
asynchronous writes of the super block and geometry block respectively.
Add vdo_clear_layout() to zero the UDS index's first block, the block
map partition, and the recovery journal partition.
These operations are driven by new phases in the pre-load state machine
(PRE_LOAD_PHASE_FORMAT_*), ensuring that disk writes happen during
pre-resume rather than during dmsetup create.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Add the core formatting logic. The initialization path is updated to
read the geometry block (block 0 on the storage device). If the block
is entirely zeroed, the device is treated as unformatted and
vdo_format() is called. Otherwise, the existing geometry is parsed
and the VDO is loaded as before.
The vdo_format() function initializes the volume geometry and super
block, and marks the VDO as needing it's layout saved to disk.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Add vdo_submit_metadata_vio_wait(), a synchronous I/O submission
helper that blocks until completion. This is needed for I/O during
early initialization before work queues are available.
Refactor read_geometry_block() to use it.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Introduce a vdo_geometry_block structure, containing a vio and buffer,
mirroring the existing vdo_super_block structure. Both are now
initialized at VDO startup and freed at shutdown, establishing the
infrastructure needed to read and write the geometry block using the
same mechanisms as the super block.
Refactor read_geometry_block() to use the new structure.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Add vdo_encode_volume_geometry() to write the geometry block into a
buffer so that it can be written to disk. The corresponding decode
path already exists.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Add a validation check that the logical size passed via the table line
does not exceed MAXIMUM_VDO_LOGICAL_BLOCKS.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Extend the dm table line with three new optional parameters:
indexMemory (UDS index memory size), indexSparse (dense vs sparse
index), and slabSize (blocks per allocation slab). These values are
parsed, validated, and stored in the device configuration for use
during formatting.
Rework the slab size constants from the single MAX_VDO_SLAB_BITS into
explicit MIN_VDO_SLAB_BLOCKS, MAX_VDO_SLAB_BLOCKS, and
DEFAULT_VDO_SLAB_BLOCKS values.
Bump the target version from 9.1.0 to 9.2.0 to reflect this table
line change.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Add vdo_initialize_component_states() to populate the super block,
computing the space required for the main VDO components on disk.
Those include the slab depot, block map, and recovery journal.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Add vdo_initialize_volume_geometry() to populate the geometry block,
computing the space required for the two main regions on disk.
Add uds_compute_index_size() to calculate the space required for the
UDS indexer from the UDS configuration.
Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Since all implementations of crypt_iv_operations::wipe now return 0,
change the return type to void.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Simplify and optimize dm-crypt's implementation of Bitlocker's "elephant
diffuser" to use the AES library instead of an "ecb(aes)"
crypto_skcipher.
Note: struct aes_enckey is fixed-size, so it could be embedded directly
in struct iv_elephant_private. But I kept it as a separate allocation
so that the size of struct crypt_config doesn't increase. The elephant
diffuser is rarely used in dm-crypt.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Currently FEC logs a warning message if at least one error was
corrected, or an error message if there were uncorrectable errors.
However, it doesn't log anything if there were no errors.
"No errors" is actually unexpected, though, considering that dm-verity
calls verity_fec_decode() only when a block's digest doesn't match.
If there were to ever be a bug where verity_fec_decode() is called on
blocks with the correct digest, then there would be no indication in the
log that FEC is running and degrading performance.
Therefore, let's log the warning message even when there were no errors.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
The mddev argument in export_rdev() is never used. Remove it to
simplify callers.
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20260304111417.20777-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
Move the handle_stripe() documentation comment from above
analyse_stripe() to directly above handle_stripe() where it belongs.
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304111001.15767-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
Remove the unused md_raid5_kick_device() declaration from raid5.h -
no definition exists for this function.
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260304110919.15071-1-chencheng@fnnas.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
Interval tree uses [start, end] as a region which stores in the tree.
In raid1, it uses the wrong end value. For example:
bio(A,B) is too big and needs to be split to bio1(A,C-1), bio2(C,B).
The region of bio1 is [A,C] and the region of bio2 is [C,B]. So bio1 and
bio2 overlap which is not right.
Fix this problem by using right end value of the region.
Fixes: d0d2d8ba0494 ("md/raid1: introduce wait_for_serialization")
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260305011839.5118-2-xni@redhat.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
When skip_copy is enabled on a doubly-degraded RAID6, a device that is
being written to will be in R5_LOCKED state with R5_UPTODATE cleared.
If a new read triggers fetch_block() while the write is still in
flight, the 2-failure compute path may select this locked device as a
compute target because it is not R5_UPTODATE.
Because skip_copy makes the device page point directly to the bio page,
reconstructing data into it might be risky. Also, since the compute
marks the device R5_UPTODATE, it triggers WARN_ON in ops_run_io()
which checks that R5_SkipCopy and R5_UPTODATE are not both set.
This can be reproduced by running small-range concurrent read/write on
a doubly-degraded RAID6 with skip_copy enabled, for example:
mdadm -C /dev/md0 -l6 -n6 -R -f /dev/loop[0-3] missing missing
echo 1 > /sys/block/md0/md/skip_copy
fio --filename=/dev/md0 --rw=randrw --bs=4k --numjobs=8 \
--iodepth=32 --size=4M --runtime=30 --time_based --direct=1
Fix by checking R5_LOCKED before proceeding with the compute. The
compute will be retried once the lock is cleared on IO completion.
Signed-off-by: FengWei Shih <dannyshih@synology.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260319053351.3676794-1-dannyshih@synology.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
The command "dmsetup remove_all" may take a long time (a minute for
removing 1000 devices), so make it interruptible with fatal signals.
For better readability, the bool arguments were changed to flags.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
|
|
If dm_hash_remove_all was called from dm_deferred_remove, it would write
a warning "remove_all left %d open device(s)" if there are some other
devices active.
The warning is bogus, so let's disable it in this case.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Zdenek Kabelac <zkabelac@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 2c140a246dc0 ("dm: allow remove to be deferred")
|
|
The early_lookup_bdev() function returns successfully when the disk
device is present but not necessarily its partitions. In this situation,
dm_early_create() fails as the partition block device does not exist
yet.
In my case, this phenomenon occurs quite often because the device is
an SD card with slow reading times, on which kernel takes time to
enumerate available partitions.
Fortunately, the underlying device is back to "probing" state while
enumerating partitions. Waiting for all probing to end is enough to fix
this issue.
That's also the reason why this problem never occurs with rootwait=
parameter: the while loop inside wait_for_root() explicitly waits for
probing to be done and then the function calls async_synchronize_full().
These lines were omitted in 035641b, even though the commit says it's
based on the rootwait logic...
Anyway, calling wait_for_device_probe() after our while loop does the
job (it both waits for probing and calls async_synchronize_full).
Fixes: 035641b01e72 ("dm init: add dm-mod.waitfor to wait for asynchronously probed block devices")
Signed-off-by: Guillaume Gonnet <ggonnet.linux@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
Move the barrier raise operation before calling llbitmap_state_machine()
in both llbitmap_start_write() and llbitmap_start_discard(). This
ensures the barrier is in place before any state transitions occur,
preventing potential race conditions where the state machine could
complete before the barrier is properly raised.
Cc: stable@vger.kernel.org
Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap")
Link: https://lore.kernel.org/linux-raid/20260223024038.3084853-3-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
When reading bitmap pages from member disks, the code iterates through
all rdevs and attempts to read from the first available one. However,
it only checks for raid_disk assignment and Faulty flag, missing the
In_sync flag check.
This can cause bitmap data to be read from spare disks that are still
being rebuilt and don't have valid bitmap information yet. Reading
stale or uninitialized bitmap data from such disks can lead to
incorrect dirty bit tracking, potentially causing data corruption
during recovery or normal operation.
Add the In_sync flag check to ensure bitmap pages are only read from
fully synchronized member disks that have valid bitmap data.
Cc: stable@vger.kernel.org
Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap")
Link: https://lore.kernel.org/linux-raid/20260223024038.3084853-2-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
Set chunk_sectors to the full stripe width (io_opt) so that the block
layer splits I/O at full stripe boundaries. This ensures that large
writes are aligned to full stripes, avoiding the read-modify-write
overhead that occurs with partial stripe writes in RAID-5/6.
When chunk_sectors is set, the block layer's bio splitting logic in
get_max_io_size() uses blk_boundary_sectors_left() to limit I/O size
to the boundary. This naturally aligns split bios to full stripe
boundaries, enabling more efficient full stripe writes.
Test results with 24-disk RAID5 (chunk_size=64k):
dd if=/dev/zero of=/dev/md0 bs=10M oflag=direct
Before: 461 MB/s
After: 520 MB/s (+12.8%)
Link: https://lore.kernel.org/linux-raid/20260223035834.3132498-1-yukuai@fnnas.com
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
When an array check is running it will raise the barrier at which point
normal requests will become blocked and increment the nr_pending value to
signal there is work pending inside of wait_barrier(). NOWAIT requests
do not block and so will return immediately with an error, and additionally
do not increment nr_pending in wait_barrier(). Upstream change commit
43806c3d5b9b ("raid10: cleanup memleak at raid10_make_request") added a
call to raid_end_bio_io() to fix a memory leak when NOWAIT requests hit
this condition. raid_end_bio_io() eventually calls allow_barrier() and
it will unconditionally do an atomic_dec_and_test(&conf->nr_pending) even
though the corresponding increment on nr_pending didn't happen in the
NOWAIT case.
This can be easily seen by starting a check operation while an application
is doing nowait IO on the same array. This results in a deadlocked state
due to nr_pending value underflowing and so the md resync thread gets stuck
waiting for nr_pending to == 0.
Output of r10conf state of the array when we hit this condition:
crash> struct r10conf
barrier = 1,
nr_pending = {
counter = -41
},
nr_waiting = 15,
nr_queued = 0,
Example of md_sync thread stuck waiting on raise_barrier() and other
requests stuck in wait_barrier():
md1_resync
[<0>] raise_barrier+0xce/0x1c0
[<0>] raid10_sync_request+0x1ca/0x1ed0
[<0>] md_do_sync+0x779/0x1110
[<0>] md_thread+0x90/0x160
[<0>] kthread+0xbe/0xf0
[<0>] ret_from_fork+0x34/0x50
[<0>] ret_from_fork_asm+0x1a/0x30
kworker/u1040:2+flush-253:4
[<0>] wait_barrier+0x1de/0x220
[<0>] regular_request_wait+0x30/0x180
[<0>] raid10_make_request+0x261/0x1000
[<0>] md_handle_request+0x13b/0x230
[<0>] __submit_bio+0x107/0x1f0
[<0>] submit_bio_noacct_nocheck+0x16f/0x390
[<0>] ext4_io_submit+0x24/0x40
[<0>] ext4_do_writepages+0x254/0xc80
[<0>] ext4_writepages+0x84/0x120
[<0>] do_writepages+0x7a/0x260
[<0>] __writeback_single_inode+0x3d/0x300
[<0>] writeback_sb_inodes+0x1dd/0x470
[<0>] __writeback_inodes_wb+0x4c/0xe0
[<0>] wb_writeback+0x18b/0x2d0
[<0>] wb_workfn+0x2a1/0x400
[<0>] process_one_work+0x149/0x330
[<0>] worker_thread+0x2d2/0x410
[<0>] kthread+0xbe/0xf0
[<0>] ret_from_fork+0x34/0x50
[<0>] ret_from_fork_asm+0x1a/0x30
Fixes: 43806c3d5b9b ("raid10: cleanup memleak at raid10_make_request")
Cc: stable@vger.kernel.org
Signed-off-by: Josh Hunt <johunt@akamai.com>
Link: https://lore.kernel.org/linux-raid/20260303005619.1352958-1-johunt@akamai.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
dm-raid has external metadata management (mddev->external = 1) and
no persistent superblock (mddev->persistent = 0). For these arrays,
there's no superblock to update, so the error message is spurious.
The error appears as:
md_update_sb: can't update sb for read-only array md0
Fixes: 8c9e376b9d1a ("md: warn about updating super block failure")
Reported-by: Tj <tj.iam.tj@proton.me>
Closes: https://lore.kernel.org/all/20260128082430.96788-1-tj.iam.tj@proton.me/
Signed-off-by: Chen Cheng <chencheng@fnnas.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/linux-raid/20260210133847.269986-1-chencheng@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
|
|
bdev_nonrot() is simply the negative return value of bdev_rot().
So replace all call sites of bdev_nonrot() with calls to bdev_rot()
and remove bdev_nonrot().
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Update the comments in and above fec_read_bufs() to more clearly
describe what it does.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
The log message for a FEC error or correction includes the data device
name and index_in_region as the context. Although the result of FEC
(for a particular dm-verity instance) is expected to be the same for a
given index_in_region, index_in_region does not uniquely identify the
actual target block that is being corrected. Since that value
(target_block) is likely more useful, log it instead.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|
|
fec_decode_bufs() returns the number of errors corrected or a negative
errno value. However, the caller just checks for an errno value and
doesn't do anything with the number of errors corrected. Simplify the
code by just returning 0 instead of the number of errors corrected.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|