path: root/drivers/md
Age | Commit message | Author | Lines
2026-04-15 | Merge tag 'for-7.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm | Linus Torvalds | -840/+1239

Pull device mapper updates from Benjamin Marzinski:
 "There are fixes for some corner case crashes in dm-cache and dm-mirror, new setup functionality for dm-vdo, and miscellaneous minor fixes and cleanups, especially to dm-verity.

  dm-vdo:
   - Make dm-vdo able to format the device itself, like other dm targets, instead of needing a userspace formatting program
   - Add some sanity checks and code cleanup

  dm-cache:
   - Fix crashes and hangs when operating in passthrough mode (which have been around, unnoticed, since 4.12), as well as a late arriving fix for an error path bug in the passthrough fix
   - Fix a corner case memory leak

  dm-verity:
   - Another set of minor bugfixes and code cleanups to the forward error correction code

  dm-mirror:
   - Fix minor initialization bug
   - Fix overflow crash on large devices with small region sizes

  dm-crypt:
   - Reimplement elephant diffuser using AES library and minor cleanups

  dm-core:
   - Claude found a buffer overflow in /dev/mapper/control ioctl handling
   - Make dm-mod.waitfor correctly wait for partitions
   - Minor code fixes and cleanups"

* tag 'for-7.1/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
  dm cache: fix missing return in invalidate_committed's error path
  dm: fix a buffer overflow in ioctl processing
  dm-crypt: Make crypt_iv_operations::post return void
  dm vdo: Fix spelling mistake "postive" -> "positive"
  dm: provide helper to set stacked limits
  dm-integrity: always set the io hints
  dm-integrity: fix mismatched queue limits
  dm-bufio: use kzalloc_flex
  dm vdo: save the formatted metadata to disk
  dm vdo: add formatting logic and initialization
  dm vdo: add synchronous metadata I/O submission helper
  dm vdo: add geometry block structure
  dm vdo: add geometry block encoding
  dm vdo: add upfront validation for logical size
  dm vdo: add formatting parameters to table line
  dm vdo: add super block initialization to encodings.c
  dm vdo: add geometry block initialization to encodings.c
  dm-crypt: Make crypt_iv_operations::wipe return void
  dm-crypt: Reimplement elephant diffuser using AES library
  dm-verity-fec: warn even when there were no errors
  ...
2026-04-10 | dm cache: fix missing return in invalidate_committed's error path | Ming-Hung Tsai | -1/+3
In passthrough mode, dm-cache defers write submission until after metadata commit completes via the invalidate_committed() continuation. On commit error, invalidate_committed() calls invalidate_complete() to end the bio and free the migration struct, after which it should return immediately. The patch 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode") omitted this early return, causing execution to fall through into the success path on error. This results in use-after-free on the migration struct in the subsequent calls. Fix by adding the missing return after the invalidate_complete() call. Fixes: 4ca8b8bd952d ("dm cache: fix write hang in passthrough mode") Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/dm-devel/adjMq6T5RRjv_uxM@stanley.mountain/ Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-04-09 | dm: fix a buffer overflow in ioctl processing | Mikulas Patocka | -0/+4

Tony Asleson (using Claude) found a buffer overflow in dm-ioctl in the function retrieve_status:

1. The code in retrieve_status checks that the output string fits into the output buffer and writes the output string there
2. Then, the code aligns the "outptr" variable to the next 8-byte boundary: outptr = align_ptr(outptr);
3. The alignment doesn't check for overflow, so outptr could point past the end of the buffer
4. When the "for" loop iterates again, it executes: remaining = len - (outptr - outbuf);
5. If "outptr" points past "outbuf + len", the arithmetic wraps around and the variable "remaining" holds an unusually large number
6. With "remaining" being large, the code writes more data past the end of the buffer

Luckily, this bug has no security implications because:

1. Only root can issue device mapper ioctls
2. The commonly used libraries that communicate with device mapper (libdevmapper and devicemapper-rs) use a buffer size that is aligned to 8 bytes - thus, "outptr = align_ptr(outptr)" can't overshoot the input buffer and the bug can't happen accidentally

Reported-by: Tony Asleson <tasleson@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Bryn M. Reeves <bmr@redhat.com>
Cc: stable@vger.kernel.org
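The wraparound described in steps 3 to 5 can be reproduced in user space. The following is a minimal sketch, not the kernel code: align8(), remaining_unchecked() and remaining_checked() are hypothetical helpers, and buffer offsets stand in for the "outptr"/"outbuf" pointers so that the arithmetic stays well defined. The checked variant only illustrates one possible guard; the commit message does not say how the actual four-line fix is written.

```c
#include <stddef.h>

/* Hypothetical stand-in for align_ptr(): round an offset up to a multiple of 8. */
static size_t align8(size_t off)
{
	return (off + 7) & ~(size_t)7;
}

/*
 * Mirrors "remaining = len - (outptr - outbuf)" after alignment.
 * The subtraction is unsigned, so it wraps to a huge value whenever
 * the aligned offset is larger than len.
 */
static size_t remaining_unchecked(size_t len, size_t used)
{
	return len - align8(used);
}

/* Illustrative guard (an assumption): treat an overshoot as "no space left". */
static size_t remaining_checked(size_t len, size_t used)
{
	size_t aligned = align8(used);

	return aligned > len ? 0 : len - aligned;
}
```

With a 14-byte buffer and 13 bytes already written, the aligned offset is 16, so the unchecked version wraps while the checked version reports zero bytes left. A buffer length that is itself a multiple of 8 can never be overshot this way, which matches the commit's remark about libdevmapper and devicemapper-rs.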
2026-04-07 | md/raid5: fix soft lockup in retry_aligned_read() | Chia-Ming Chang | -1/+7
When retry_aligned_read() encounters an overlapped stripe, it releases the stripe via raid5_release_stripe() which puts it on the lockless released_stripes llist. In the next raid5d loop iteration, release_stripe_list() drains the stripe onto handle_list (since STRIPE_HANDLE is set by the original IO), but retry_aligned_read() runs before handle_active_stripes() and removes the stripe from handle_list via find_get_stripe() -> list_del_init(). This prevents handle_stripe() from ever processing the stripe to resolve the overlap, causing an infinite loop and soft lockup. Fix this by using __release_stripe() with temp_inactive_list instead of raid5_release_stripe() in the failure path, so the stripe does not go through the released_stripes llist. This allows raid5d to break out of its loop, and the overlap will be resolved when the stripe is eventually processed by handle_stripe(). Fixes: 773ca82fa1ee ("raid5: make release_stripe lockless") Cc: stable@vger.kernel.org Signed-off-by: FengWei Shih <dannyshih@synology.com> Signed-off-by: Chia-Ming Chang <chiamingc@synology.com> Link: https://lore.kernel.org/linux-raid/20260402061406.455755-1-chiamingc@synology.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md: wake raid456 reshape waiters before suspend | Yu Kuai | -0/+11
During raid456 reshape, direct IO across the reshape position can sleep in raid5_make_request() waiting for reshape progress while still holding an active_io reference. If userspace then freezes reshape and writes md/suspend_lo or md/suspend_hi, mddev_suspend() kills active_io and waits for all in-flight IO to drain. This can deadlock: the IO needs reshape progress to continue, but the reshape thread is already frozen, so the active_io reference is never dropped and suspend never completes. raid5_prepare_suspend() already wakes wait_for_reshape for dm-raid. Do the same for normal md suspend when reshape is already interrupted, so waiting raid456 IO can abort, drop its reference, and let suspend finish. The mdadm test tests/25raid456-reshape-deadlock reproduces the hang. Fixes: 714d20150ed8 ("md: add new helpers to suspend/resume array") Link: https://lore.kernel.org/linux-raid/20260327140729.2030564-1-yukuai@fnnas.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md/raid1: serialize overlap io for writemostly disk | Xiao Ni | -14/+39
Previously, using wait_event() would wake up all waiters simultaneously, and they would compete for the tree lock. The bio which gets the lock first will be handled, so the write sequence cannot be guaranteed. For example: bio1(100,200) bio2(150,200) bio3(150,300) The write sequence of fast device is bio1,bio2,bio3. But the write sequence of slow device could be bio1,bio3,bio2 due to lock competition. This causes data corruption. Replace waitqueue with a fifo list to guarantee the write sequence. And it also needs to iterate the list when removing one entry. If not, it may miss the opportunity to wake up the waiting io. For example: bio1(1,3), bio2(2,4) bio3(5,7), bio4(6,8) These four bios are in the same bucket. bio1 and bio3 are inserted into the rbtree. bio2 and bio4 are added to the waiting list and bio2 is the first one. bio3 returns from slow disk and tries to wake up the waiting bios. bio2 is removed from the list and will be handled. But bio1 hasn't finished. So bio2 will be added into waiting list again. Then bio1 returns from slow disk and wakes up waiting bios. bio4 is removed from the list and will be handled. Now bio1, bio3 and bio4 all finish and bio2 is left on the waiting list. So it needs to iterate the waiting list to wake up the right bio. Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20260324072501.59865-1-xni@redhat.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md/md-llbitmap: optimize initial sync with write_zeroes_unmap support | Yu Kuai | -1/+61

For RAID-456 arrays with llbitmap, if all underlying disks support write_zeroes with unmap, issue write_zeroes to zero all disk data regions and initialize the bitmap to BitCleanUnwritten instead of BitUnwritten.

This optimization skips the initial XOR parity building because:

1. write_zeroes with unmap guarantees zeroed reads after the operation
2. For RAID-456, when all data is zero, parity is automatically consistent (0 XOR 0 XOR ... = 0)
3. BitCleanUnwritten indicates parity is valid but no user data has been written

The implementation adds two helper functions:

- llbitmap_all_disks_support_wzeroes_unmap(): Checks if all active disks support write_zeroes with unmap
- llbitmap_zero_all_disks(): Issues blkdev_issue_zeroout() to each rdev's data region to zero all disks

The zeroing and bitmap state setting happens in llbitmap_init_state() during bitmap initialization. If any disk fails to zero, we fall back to BitUnwritten and normal lazy recovery.

This significantly reduces array initialization time for RAID-456 arrays built on modern NVMe SSDs or other devices that support write_zeroes with unmap.

Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-4-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
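The second point above (all-zero data implies consistent XOR parity) can be checked with a few lines of user-space C. This is only an illustrative sketch: xor_parity() is a hypothetical helper, not an md function, and it covers the XOR (P) parity named in the commit message only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the XOR parity of n_data equally sized data chunks.
 * parity[i] becomes data[0][i] ^ data[1][i] ^ ... for every byte i.
 */
static void xor_parity(const uint8_t *const *data, size_t n_data,
		       size_t chunk_len, uint8_t *parity)
{
	for (size_t i = 0; i < chunk_len; i++) {
		uint8_t p = 0;

		for (size_t d = 0; d < n_data; d++)
			p ^= data[d][i];
		parity[i] = p;
	}
}
```

If every data chunk reads back as zeros, the computed parity is all zeros as well, so a parity chunk that was zeroed by the same write_zeroes operation already matches and no resync pass is needed to make the stripe consistent.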
2026-04-07 | md/md-llbitmap: add CleanUnwritten state for RAID-5 proactive parity building | Yu Kuai | -12/+128

Add new states to the llbitmap state machine to support proactive XOR parity building for RAID-5 arrays. This allows users to pre-build parity data for unwritten regions before any user data is written.

New states added:

- BitNeedSyncUnwritten: Transitional state when proactive sync is triggered via sysfs on Unwritten regions.
- BitSyncingUnwritten: Proactive sync in progress for unwritten region.
- BitCleanUnwritten: XOR parity has been pre-built, but no user data written yet. When user writes to this region, it transitions to BitDirty.

New actions added:

- BitmapActionProactiveSync: Trigger for proactive XOR parity building.
- BitmapActionClearUnwritten: Convert CleanUnwritten/NeedSyncUnwritten/SyncingUnwritten states back to Unwritten before recovery starts.

State flows:

- Current (lazy): Unwritten -> (write) -> NeedSync -> (sync) -> Dirty -> Clean
- New (proactive): Unwritten -> (sysfs) -> NeedSyncUnwritten -> (sync) -> CleanUnwritten
- On write to CleanUnwritten: CleanUnwritten -> (write) -> Dirty -> Clean
- On disk replacement: CleanUnwritten regions are converted to Unwritten before recovery starts, so recovery only rebuilds regions with user data

A new sysfs interface is added at /sys/block/mdX/md/llbitmap/proactive_sync (write-only) to trigger proactive sync. This only works for RAID-456 arrays.

Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-3-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
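The state flows listed in this commit message can be written down as a small transition function. The sketch below is a reading aid only: the enum names, the ACT_IDLE action used for the unlabeled Dirty -> Clean step, and the single ACT_SYNC action are assumptions made for illustration, and the real llbitmap state machine is table-driven kernel code with more states and actions than shown here.

```c
/* Hypothetical names; they mirror the states named in the commit message. */
enum region_state {
	ST_UNWRITTEN,
	ST_NEED_SYNC,
	ST_DIRTY,
	ST_CLEAN,
	ST_NEED_SYNC_UNWRITTEN,
	ST_SYNCING_UNWRITTEN,
	ST_CLEAN_UNWRITTEN,
};

enum region_action {
	ACT_WRITE,
	ACT_SYNC,
	ACT_IDLE,            /* assumption: whatever moves Dirty to Clean */
	ACT_PROACTIVE_SYNC,  /* triggered through sysfs */
	ACT_CLEAR_UNWRITTEN, /* run before recovery starts */
};

/* Return the next state; combinations not listed in the message keep the state. */
static enum region_state next_state(enum region_state s, enum region_action a)
{
	switch (a) {
	case ACT_WRITE:
		if (s == ST_UNWRITTEN)
			return ST_NEED_SYNC;
		if (s == ST_CLEAN_UNWRITTEN)
			return ST_DIRTY;
		break;
	case ACT_SYNC:
		if (s == ST_NEED_SYNC)
			return ST_DIRTY;
		if (s == ST_NEED_SYNC_UNWRITTEN)
			return ST_CLEAN_UNWRITTEN;
		break;
	case ACT_IDLE:
		if (s == ST_DIRTY)
			return ST_CLEAN;
		break;
	case ACT_PROACTIVE_SYNC:
		if (s == ST_UNWRITTEN)
			return ST_NEED_SYNC_UNWRITTEN;
		break;
	case ACT_CLEAR_UNWRITTEN:
		if (s == ST_NEED_SYNC_UNWRITTEN || s == ST_SYNCING_UNWRITTEN ||
		    s == ST_CLEAN_UNWRITTEN)
			return ST_UNWRITTEN;
		break;
	}
	return s;
}
```

Walking the proactive path, Unwritten becomes NeedSyncUnwritten on the sysfs trigger and CleanUnwritten after sync; a later write moves the region to Dirty, and a clear-unwritten pass before recovery returns any of the three new states to Unwritten.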
2026-04-07 | md: add fallback to correct bitmap_ops on version mismatch | Yu Kuai | -1/+110

If the default bitmap version and the on-disk version don't match, and mdadm is not recent enough to set bitmap_type, set bitmap_ops based on the on-disk version.

Link: https://lore.kernel.org/linux-raid/20260323054644.3351791-2-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md/raid5: validate payload size before accessing journal metadata | Junrui Luo | -15/+33

r5c_recovery_analyze_meta_block() and r5l_recovery_verify_data_checksum_for_mb() iterate over payloads in a journal metadata block using on-disk payload size fields without validating them against the remaining space in the metadata block.

A corrupted journal containing payload sizes that extend beyond the PAGE_SIZE boundary can cause out-of-bounds reads when accessing payload fields or computing offsets.

Add bounds validation for each payload type to ensure the full payload fits within meta_size before processing.

Fixes: b4c625c67362 ("md/r5cache: r5cache recovery: part 1")
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Link: https://lore.kernel.org/linux-raid/SYBPR01MB78815E78D829BB86CD7C8015AF5FA@SYBPR01MB7881.ausprd01.prod.outlook.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
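The kind of check this commit describes, validating an on-disk size field against the space that is left before using it, can be sketched as follows. The payload layout here (struct payload_hdr with a 16-bit type and a 16-bit body length) is hypothetical and much simpler than the real r5l payload structures; only the order of the checks is the point.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical payload header; not the real raid5 journal layout. */
struct payload_hdr {
	uint16_t type;
	uint16_t len; /* bytes of payload body that follow this header */
};

/* Return true only if every payload lies completely inside meta_size bytes. */
static bool payloads_fit(const uint8_t *block, size_t meta_size)
{
	size_t off = 0;

	while (off < meta_size) {
		struct payload_hdr hdr;

		/* The header must fit before any of its fields is read. */
		if (meta_size - off < sizeof(hdr))
			return false;
		memcpy(&hdr, block + off, sizeof(hdr));
		off += sizeof(hdr);

		/* The length comes from disk, so compare it with what is left. */
		if (hdr.len > meta_size - off)
			return false;
		off += hdr.len;
	}
	return true;
}
```

Both comparisons are written as "needed > remaining" on values that cannot underflow, so a corrupted length cannot push the offset past the end of the block before it is rejected.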
2026-04-07 | md: remove unused static md_wq workqueue | Abd-Alrhman Masalkhi | -8/+0
The md_wq workqueue is defined as static and initialized in md_init(), but it is not used anywhere within md.c. All asynchronous and deferred work in this file is handled via md_misc_wq or dedicated md threads. Fixes: b75197e86e6d3 ("md: Remove flush handling") Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com> Link: https://lore.kernel.org/linux-raid/20260328193522.3624-1-abd.masalkhi@gmail.com/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-07 | md/raid0: use kvzalloc/kvfree for strip_zone and devlist allocations | Gregory Price | -9/+9
syzbot reported a WARNING at mm/page_alloc.c:__alloc_frozen_pages_noprof() triggered by create_strip_zones() in the RAID0 driver. When raid_disks is large, the allocation size exceeds MAX_PAGE_ORDER (4MB on x86), causing WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER). Convert the strip_zone and devlist allocations from kzalloc/kzalloc_objs to kvzalloc/kvzalloc_objs, which first attempts a contiguous allocation with __GFP_NOWARN and then falls back to vmalloc for large sizes. Convert the corresponding kfree calls to kvfree. Both arrays are pure metadata lookup tables (arrays of pointers and zone descriptors) accessed only via indexing, so they do not require physically contiguous memory. Reported-by: syzbot+924649752adf0d3ac9dd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69adaba8.a00a0220.b130.0005.GAE@google.com/ Signed-off-by: Gregory Price <gourry@gourry.net> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20260308234202.3118119-1-gourry@gourry.net/ Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-06 | md: fix array_state=clear sysfs deadlock | Yu Kuai | -1/+7

When "clear" is written to array_state, md_attr_store() breaks sysfs active protection so the array can delete itself from its own sysfs store method.

However, md_attr_store() currently drops the mddev reference before calling sysfs_unbreak_active_protection(). Once do_md_stop(..., 0) has made the mddev eligible for delayed deletion, the temporary kobject reference taken by sysfs_break_active_protection() can become the last kobject reference protecting the md kobject.

That allows sysfs_unbreak_active_protection() to drop the last kobject reference from the current sysfs writer context. kobject teardown then recurses into kernfs removal while the current sysfs node is still being unwound, and lockdep reports recursive locking on kn->active with kernfs_drain() in the call chain.

Reproducer on an existing level:

1. Create an md0 linear array and activate it:

   mknod /dev/md0 b 9 0
   echo none > /sys/block/md0/md/metadata_version
   echo linear > /sys/block/md0/md/level
   echo 1 > /sys/block/md0/md/raid_disks
   echo "$(cat /sys/class/block/sdb/dev)" > /sys/block/md0/md/new_dev
   echo "$(($(cat /sys/class/block/sdb/size) / 2))" > \
       /sys/block/md0/md/dev-sdb/size
   echo 0 > /sys/block/md0/md/dev-sdb/slot
   echo active > /sys/block/md0/md/array_state

2. Wait briefly for the array to settle, then clear it:

   sleep 2
   echo clear > /sys/block/md0/md/array_state

The warning looks like:

  WARNING: possible recursive locking detected
  bash/588 is trying to acquire lock:
  (kn->active#65) at __kernfs_remove+0x157/0x1d0
  but task is already holding lock:
  (kn->active#65) at sysfs_unbreak_active_protection+0x1f/0x40
  ...
  Call Trace:
   kernfs_drain
   __kernfs_remove
   kernfs_remove_by_name_ns
   sysfs_remove_group
   sysfs_remove_groups
   __kobject_del
   kobject_put
   md_attr_store
   kernfs_fop_write_iter
   vfs_write
   ksys_write

Restore active protection before mddev_put() so the extra sysfs kobject reference is dropped while the mddev is still held alive. The actual md kobject deletion is then deferred until after the sysfs write path has fully returned.

Fixes: 9e59d609763f ("md: call del_gendisk in control path")
Reviewed-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260330055213.3976052-1-yukuai@fnnas.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-04-03 | bcache: fix uninitialized closure object | Mingzhe Zou | -1/+2
In the previous patch ("bcache: fix cached_dev.sb_bio use-after-free and crash"), we adopted a simple modification suggestion from AI to fix the use-after-free. But in actual testing, we found an extreme case where the device is stopped before calling bch_write_bdev_super(). At this point, struct closure sb_write has not been initialized yet. For this patch, we ensure that sb_bio has been completed via sb_write_mutex. Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@fnnas.com> Link: https://patch.msgid.link/20260403042135.2221247-1-colyli@fnnas.com Fixes: fec114a98b87 ("bcache: fix cached_dev.sb_bio use-after-free and crash") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-03 | bcache: fix cached_dev.sb_bio use-after-free and crash | Mingzhe Zou | -0/+7

In our production environment, we have received multiple crash reports regarding libceph, which have caught our attention:

```
[6888366.280350] Call Trace:
[6888366.280452] blk_update_request+0x14e/0x370
[6888366.280561] blk_mq_end_request+0x1a/0x130
[6888366.280671] rbd_img_handle_request+0x1a0/0x1b0 [rbd]
[6888366.280792] rbd_obj_handle_request+0x32/0x40 [rbd]
[6888366.280903] __complete_request+0x22/0x70 [libceph]
[6888366.281032] osd_dispatch+0x15e/0xb40 [libceph]
[6888366.281164] ? inet_recvmsg+0x5b/0xd0
[6888366.281272] ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
[6888366.281405] ceph_con_process_message+0x79/0x140 [libceph]
[6888366.281534] ceph_con_v1_try_read+0x5d7/0xf30 [libceph]
[6888366.281661] ceph_con_workfn+0x329/0x680 [libceph]
```

After analyzing the coredump file, we found that the memory at dc->sb_bio had been freed. cached_dev is only freed when the device is stopped, and sb_bio is embedded in struct cached_dev rather than allocated for each write. So if the device is stopped while the superblock is being written, the freed memory is accessed at endio.

This patch waits for sb_write to complete in cached_dev_free.

Note that we analyzed the cause of the problem ourselves, then gave all the details to QWEN and adopted the modifications it made.

Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Fixes: cafe563591446 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Coly Li <colyli@fnnas.com>
Link: https://patch.msgid.link/20260322134102.480107-1-colyli@fnnas.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-30 | dm-crypt: Make crypt_iv_operations::post return void | Eric Biggers | -18/+13
Since all implementations of crypt_iv_operations::post now return 0, change the return type to void. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-30 | dm vdo: Fix spelling mistake "postive" -> "positive" | Colin Ian King | -1/+1
There is a spelling mistake in a vdo_log_error message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-27 | dm: provide helper to set stacked limits | Keith Busch | -27/+4
There are multiple device mappers that set up their stacking limits exactly the same for the logical, physical and minimum IO queue limits. Provide a helper for it. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-27 | dm-integrity: always set the io hints | Keith Busch | -13/+8
Don't depend on the defaults to be what is desired if the integrity device was set up with 512b sector size. Always set the queue limits to be at least what the device mapper wants. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-27 | dm-integrity: fix mismatched queue limits | Keith Busch | -3/+9
A user can integritysetup a device with a backing device using a 4k logical block size, but request the dm device use 1k or 2k. This mismatch creates an inconsistency such that the dm device would report limits for IO that it can't actually execute. Fix this by using the backing device's limits if they are larger. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26 | dm-bufio: use kzalloc_flex | Rosen Penev | -2/+2
Avoid manual size calculations and use the proper helper. Add __counted_by for extra runtime analysis. Signed-off-by: Rosen Penev <rosenp@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26 | dm vdo: save the formatted metadata to disk | Bruce Johnston | -20/+147
Add vdo_save_super_block() and vdo_save_geometry_block() to perform asynchronous writes of the super block and geometry block respectively. Add vdo_clear_layout() to zero the UDS index's first block, the block map partition, and the recovery journal partition. These operations are driven by new phases in the pre-load state machine (PRE_LOAD_PHASE_FORMAT_*), ensuring that disk writes happen during pre-resume rather than during dmsetup create. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26 | dm vdo: add formatting logic and initialization | Bruce Johnston | -25/+81

Add the core formatting logic. The initialization path is updated to read the geometry block (block 0 on the storage device). If the block is entirely zeroed, the device is treated as unformatted and vdo_format() is called. Otherwise, the existing geometry is parsed and the VDO is loaded as before.

The vdo_format() function initializes the volume geometry and super block, and marks the VDO as needing its layout saved to disk.

Signed-off-by: Bruce Johnston <bjohnsto@redhat.com>
Reviewed-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
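The "entirely zeroed block means unformatted" decision can be illustrated with a short user-space check. This is a sketch under assumptions: block_is_zeroed() and needs_format() are hypothetical names, the 4096-byte block size used in the usage note is an assumption, and the commit message does not show how dm-vdo itself performs the test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if all len bytes of the block are zero. */
static bool block_is_zeroed(const uint8_t *block, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (block[i] != 0)
			return false;
	}
	return true;
}

/*
 * Hypothetical decision helper: a zeroed geometry block means the device
 * has never been formatted; anything else is parsed as existing geometry.
 */
static bool needs_format(const uint8_t *geometry_block, size_t len)
{
	return block_is_zeroed(geometry_block, len);
}
```

Calling needs_format() on a freshly zeroed 4096-byte buffer returns true, and setting any single byte makes it return false, which is why a device that holds stale non-zero data in block 0 would be parsed rather than formatted.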
2026-03-26 | dm vdo: add synchronous metadata I/O submission helper | Bruce Johnston | -13/+34
Add vdo_submit_metadata_vio_wait(), a synchronous I/O submission helper that blocks until completion. This is needed for I/O during early initialization before work queues are available. Refactor read_geometry_block() to use it. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26 | dm vdo: add geometry block structure | Bruce Johnston | -45/+66
Introduce a vdo_geometry_block structure, containing a vio and buffer, mirroring the existing vdo_super_block structure. Both are now initialized at VDO startup and freed at shutdown, establishing the infrastructure needed to read and write the geometry block using the same mechanisms as the super block. Refactor read_geometry_block() to use the new structure. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26 | dm vdo: add geometry block encoding | Bruce Johnston | -0/+58
Add vdo_encode_volume_geometry() to write the geometry block into a buffer so that it can be written to disk. The corresponding decode path already exists. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26 | dm vdo: add upfront validation for logical size | Bruce Johnston | -0/+6
Add a validation check that the logical size passed via the table line does not exceed MAXIMUM_VDO_LOGICAL_BLOCKS. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26 | dm vdo: add formatting parameters to table line | Bruce Johnston | -17/+111
Extend the dm table line with three new optional parameters: indexMemory (UDS index memory size), indexSparse (dense vs sparse index), and slabSize (blocks per allocation slab). These values are parsed, validated, and stored in the device configuration for use during formatting. Rework the slab size constants from the single MAX_VDO_SLAB_BITS into explicit MIN_VDO_SLAB_BLOCKS, MAX_VDO_SLAB_BLOCKS, and DEFAULT_VDO_SLAB_BLOCKS values. Bump the target version from 9.1.0 to 9.2.0 to reflect this table line change. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26 | dm vdo: add super block initialization to encodings.c | Bruce Johnston | -0/+90
Add vdo_initialize_component_states() to populate the super block, computing the space required for the main VDO components on disk. Those include the slab depot, block map, and recovery journal. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-26 | dm vdo: add geometry block initialization to encodings.c | Bruce Johnston | -0/+103
Add vdo_initialize_volume_geometry() to populate the geometry block, computing the space required for the two main regions on disk. Add uds_compute_index_size() to calculate the space required for the UDS indexer from the UDS configuration. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-23 | dm-crypt: Make crypt_iv_operations::wipe return void | Eric Biggers | -14/+6
Since all implementations of crypt_iv_operations::wipe now return 0, change the return type to void. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-23 | dm-crypt: Reimplement elephant diffuser using AES library | Eric Biggers | -55/+31
Simplify and optimize dm-crypt's implementation of Bitlocker's "elephant diffuser" to use the AES library instead of an "ecb(aes)" crypto_skcipher. Note: struct aes_enckey is fixed-size, so it could be embedded directly in struct iv_elephant_private. But I kept it as a separate allocation so that the size of struct crypt_config doesn't increase. The elephant diffuser is rarely used in dm-crypt. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-23 | dm-verity-fec: warn even when there were no errors | Eric Biggers | -1/+1
Currently FEC logs a warning message if at least one error was corrected, or an error message if there were uncorrectable errors. However, it doesn't log anything if there were no errors. "No errors" is actually unexpected, though, considering that dm-verity calls verity_fec_decode() only when a block's digest doesn't match. If there were to ever be a bug where verity_fec_decode() is called on blocks with the correct digest, then there would be no indication in the log that FEC is running and degrading performance. Therefore, let's log the warning message even when there were no errors. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-23 | md: remove unused mddev argument from export_rdev | Chen Cheng | -14/+14
The mddev argument in export_rdev() is never used. Remove it to simplify callers. Signed-off-by: Chen Cheng <chencheng@fnnas.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/linux-raid/20260304111417.20777-1-chencheng@fnnas.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-23 | md/raid5: move handle_stripe() comment to correct location | Chen Cheng | -14/+12
Move the handle_stripe() documentation comment from above analyse_stripe() to directly above handle_stripe() where it belongs. Signed-off-by: Chen Cheng <chencheng@fnnas.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Link: https://lore.kernel.org/linux-raid/20260304111001.15767-1-chencheng@fnnas.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-23 | md/raid5: remove stale md_raid5_kick_device() declaration | Chen Cheng | -1/+0
Remove the unused md_raid5_kick_device() declaration from raid5.h - no definition exists for this function. Signed-off-by: Chen Cheng <chencheng@fnnas.com> Reviewed-by: Yu Kuai <yukuai@fnnas.com> Link: https://lore.kernel.org/linux-raid/20260304110919.15071-1-chencheng@fnnas.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-23 | md/raid1: fix the comparing region of interval tree | Xiao Ni | -2/+2

The interval tree stores each region as a closed interval [start, end]. raid1 uses the wrong end value. For example: bio(A,B) is too big and needs to be split into bio1(A,C-1) and bio2(C,B). The region of bio1 is [A,C] and the region of bio2 is [C,B], so bio1 and bio2 overlap, which is not right. Fix this problem by using the right end value of the region.

Fixes: d0d2d8ba0494 ("md/raid1: introduce wait_for_serialization")
Signed-off-by: Xiao Ni <xni@redhat.com>
Link: https://lore.kernel.org/linux-raid/20260305011839.5118-2-xni@redhat.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
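The off-by-one in this example is easy to see with a closed-interval overlap test. The helpers below are hypothetical and written for user space; they only illustrate why an end value of start + length, instead of start + length - 1, makes two adjacent bios look like they overlap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Closed intervals [a_lo, a_hi] and [b_lo, b_hi] overlap if each starts no later than the other ends. */
static bool closed_overlap(uint64_t a_lo, uint64_t a_hi,
			   uint64_t b_lo, uint64_t b_hi)
{
	return a_lo <= b_hi && b_lo <= a_hi;
}

/* Last sector covered by an I/O that starts at start and spans nr sectors (nr > 0). */
static uint64_t last_sector(uint64_t start, uint64_t nr)
{
	return start + nr - 1;
}
```

For two back-to-back 8-sector bios starting at sectors 0 and 8, using 0 + 8 as the end of the first one yields [0, 8] and [8, 15], which overlap at sector 8. Using last_sector() yields [0, 7] and [8, 15], which do not.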
2026-03-22 | md/raid5: skip 2-failure compute when other disk is R5_LOCKED | FengWei Shih | -0/+2

When skip_copy is enabled on a doubly-degraded RAID6, a device that is being written to will be in R5_LOCKED state with R5_UPTODATE cleared. If a new read triggers fetch_block() while the write is still in flight, the 2-failure compute path may select this locked device as a compute target because it is not R5_UPTODATE.

Because skip_copy makes the device page point directly to the bio page, reconstructing data into it might be risky. Also, since the compute marks the device R5_UPTODATE, it triggers WARN_ON in ops_run_io() which checks that R5_SkipCopy and R5_UPTODATE are not both set.

This can be reproduced by running small-range concurrent read/write on a doubly-degraded RAID6 with skip_copy enabled, for example:

  mdadm -C /dev/md0 -l6 -n6 -R -f /dev/loop[0-3] missing missing
  echo 1 > /sys/block/md0/md/skip_copy
  fio --filename=/dev/md0 --rw=randrw --bs=4k --numjobs=8 \
      --iodepth=32 --size=4M --runtime=30 --time_based --direct=1

Fix by checking R5_LOCKED before proceeding with the compute. The compute will be retried once the lock is cleared on IO completion.

Signed-off-by: FengWei Shih <dannyshih@synology.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Link: https://lore.kernel.org/linux-raid/20260319053351.3676794-1-dannyshih@synology.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2026-03-18 | dm: make "dmsetup remove_all" interruptible | Mikulas Patocka | -10/+25
The command "dmsetup remove_all" may take a long time (a minute for removing 1000 devices), so make it interruptible with fatal signals. For better readability, the bool arguments were changed to flags. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
2026-03-18 | dm: don't report warning when doing deferred remove | Mikulas Patocka | -1/+1
If dm_hash_remove_all was called from dm_deferred_remove, it would write a warning "remove_all left %d open device(s)" if there are some other devices active. The warning is bogus, so let's disable it in this case. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Cc: stable@vger.kernel.org Fixes: 2c140a246dc0 ("dm: allow remove to be deferred")
2026-03-18 | dm init: ensure device probing has finished in dm-mod.waitfor= | Guillaume Gonnet | -1/+3
The early_lookup_bdev() function returns successfully when the disk device is present but not necessarily its partitions. In this situation, dm_early_create() fails as the partition block device does not exist yet. In my case, this phenomenon occurs quite often because the device is an SD card with slow reading times, on which kernel takes time to enumerate available partitions. Fortunately, the underlying device is back to "probing" state while enumerating partitions. Waiting for all probing to end is enough to fix this issue. That's also the reason why this problem never occurs with rootwait= parameter: the while loop inside wait_for_root() explicitly waits for probing to be done and then the function calls async_synchronize_full(). These lines were omitted in 035641b, even though the commit says it's based on the rootwait logic... Anyway, calling wait_for_device_probe() after our while loop does the job (it both waits for probing and calls async_synchronize_full). Fixes: 035641b01e72 ("dm init: add dm-mod.waitfor to wait for asynchronously probed block devices") Signed-off-by: Guillaume Gonnet <ggonnet.linux@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-16md/md-llbitmap: raise barrier before state machine transitionYu Kuai-4/+4
Move the barrier raise operation before calling llbitmap_state_machine() in both llbitmap_start_write() and llbitmap_start_discard(). This ensures the barrier is in place before any state transitions occur, preventing potential race conditions where the state machine could complete before the barrier is properly raised. Cc: stable@vger.kernel.org Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap") Link: https://lore.kernel.org/linux-raid/20260223024038.3084853-3-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-03-16md/md-llbitmap: skip reading rdevs that are not in_syncYu Kuai-1/+2
When reading bitmap pages from member disks, the code iterates through all rdevs and attempts to read from the first available one. However, it only checks for raid_disk assignment and Faulty flag, missing the In_sync flag check. This can cause bitmap data to be read from spare disks that are still being rebuilt and don't have valid bitmap information yet. Reading stale or uninitialized bitmap data from such disks can lead to incorrect dirty bit tracking, potentially causing data corruption during recovery or normal operation. Add the In_sync flag check to ensure bitmap pages are only read from fully synchronized member disks that have valid bitmap data. Cc: stable@vger.kernel.org Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap") Link: https://lore.kernel.org/linux-raid/20260223024038.3084853-2-yukuai@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
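The selection rule described above can be sketched in userspace C. This is only an illustrative model: `struct rdev_model` and `pick_bitmap_source()` are hypothetical names, and the three fields stand in for the raid_disk assignment and the Faulty and In_sync flags mentioned in the commit message; the kernel's real data structures and loop differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a member device; not the kernel's struct md_rdev. */
struct rdev_model {
    int raid_disk;   /* negative means "no slot assigned" */
    bool faulty;     /* models the Faulty flag */
    bool in_sync;    /* models the In_sync flag */
};

/*
 * Return the index of the first device that is safe to read bitmap pages
 * from, or -1 if there is none. The third test is the one the commit
 * message says was missing.
 */
static int pick_bitmap_source(const struct rdev_model *rdevs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (rdevs[i].raid_disk < 0)
            continue;            /* not part of the array */
        if (rdevs[i].faulty)
            continue;            /* failed device */
        if (!rdevs[i].in_sync)
            continue;            /* still rebuilding: bitmap not valid yet */
        return (int)i;
    }
    return -1;
}
```

In this model a spare that already holds a slot but is still rebuilding is skipped, so the read falls through to the next fully synchronized member.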
2026-03-16md/raid5: set chunk_sectors to enable full stripe I/O splittingYu Kuai-0/+1
Set chunk_sectors to the full stripe width (io_opt) so that the block layer splits I/O at full stripe boundaries. This ensures that large writes are aligned to full stripes, avoiding the read-modify-write overhead that occurs with partial stripe writes in RAID-5/6. When chunk_sectors is set, the block layer's bio splitting logic in get_max_io_size() uses blk_boundary_sectors_left() to limit I/O size to the boundary. This naturally aligns split bios to full stripe boundaries, enabling more efficient full stripe writes. Test results with 24-disk RAID5 (chunk_size=64k): dd if=/dev/zero of=/dev/md0 bs=10M oflag=direct Before: 461 MB/s After: 520 MB/s (+12.8%) Link: https://lore.kernel.org/linux-raid/20260223035834.3132498-1-yukuai@fnnas.com Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yu Kuai <yukuai@fnnas.com>
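The effect of a boundary on splitting can be illustrated with a small userspace model. The helper names below are hypothetical and only mimic the idea of "sectors left before the next boundary"; they are not the block layer's implementation. The arithmetic assumes 512-byte sectors and a 24-disk RAID5 with 23 data chunks of 64k per stripe, so a full stripe is 23 * 128 = 2944 sectors.

```c
#include <assert.h>
#include <stdint.h>

/* Sectors remaining before the next multiple of `boundary` (illustrative only). */
static uint64_t boundary_sectors_left(uint64_t offset, uint64_t boundary)
{
    assert(boundary > 0);
    return boundary - (offset % boundary);
}

/* Count the pieces a request is cut into when no piece may cross a boundary. */
static unsigned count_splits(uint64_t start, uint64_t len, uint64_t boundary)
{
    unsigned pieces = 0;

    while (len > 0) {
        uint64_t room = boundary_sectors_left(start, boundary);
        uint64_t step = len < room ? len : room;

        start += step;
        len -= step;
        pieces++;
    }
    return pieces;
}
```

Under these assumptions, a 10M direct write (20480 sectors) starting at sector 0 is cut into six full-stripe pieces plus one 2816-sector tail, so every piece except the last covers exactly one stripe.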
2026-03-16md/raid10: fix deadlock with check operation and nowait requestsJosh Hunt-2/+2
When an array check is running it will raise the barrier at which point normal requests will become blocked and increment the nr_pending value to signal there is work pending inside of wait_barrier(). NOWAIT requests do not block and so will return immediately with an error, and additionally do not increment nr_pending in wait_barrier().

Upstream change commit 43806c3d5b9b ("raid10: cleanup memleak at raid10_make_request") added a call to raid_end_bio_io() to fix a memory leak when NOWAIT requests hit this condition. raid_end_bio_io() eventually calls allow_barrier() and it will unconditionally do an atomic_dec_and_test(&conf->nr_pending) even though the corresponding increment on nr_pending didn't happen in the NOWAIT case.

This can be easily seen by starting a check operation while an application is doing nowait IO on the same array. This results in a deadlocked state due to nr_pending value underflowing and so the md resync thread gets stuck waiting for nr_pending to == 0.

Output of r10conf state of the array when we hit this condition:

  crash> struct r10conf
    barrier = 1,
    nr_pending = {
      counter = -41
    },
    nr_waiting = 15,
    nr_queued = 0,

Example of md_sync thread stuck waiting on raise_barrier() and other requests stuck in wait_barrier():

  md1_resync
  [<0>] raise_barrier+0xce/0x1c0
  [<0>] raid10_sync_request+0x1ca/0x1ed0
  [<0>] md_do_sync+0x779/0x1110
  [<0>] md_thread+0x90/0x160
  [<0>] kthread+0xbe/0xf0
  [<0>] ret_from_fork+0x34/0x50
  [<0>] ret_from_fork_asm+0x1a/0x30

  kworker/u1040:2+flush-253:4
  [<0>] wait_barrier+0x1de/0x220
  [<0>] regular_request_wait+0x30/0x180
  [<0>] raid10_make_request+0x261/0x1000
  [<0>] md_handle_request+0x13b/0x230
  [<0>] __submit_bio+0x107/0x1f0
  [<0>] submit_bio_noacct_nocheck+0x16f/0x390
  [<0>] ext4_io_submit+0x24/0x40
  [<0>] ext4_do_writepages+0x254/0xc80
  [<0>] ext4_writepages+0x84/0x120
  [<0>] do_writepages+0x7a/0x260
  [<0>] __writeback_single_inode+0x3d/0x300
  [<0>] writeback_sb_inodes+0x1dd/0x470
  [<0>] __writeback_inodes_wb+0x4c/0xe0
  [<0>] wb_writeback+0x18b/0x2d0
  [<0>] wb_workfn+0x2a1/0x400
  [<0>] process_one_work+0x149/0x330
  [<0>] worker_thread+0x2d2/0x410
  [<0>] kthread+0xbe/0xf0
  [<0>] ret_from_fork+0x34/0x50
  [<0>] ret_from_fork_asm+0x1a/0x30

Fixes: 43806c3d5b9b ("raid10: cleanup memleak at raid10_make_request") Cc: stable@vger.kernel.org Signed-off-by: Josh Hunt <johunt@akamai.com> Link: https://lore.kernel.org/linux-raid/20260303005619.1352958-1-johunt@akamai.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
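The counter imbalance described in this commit can be reproduced with a minimal userspace model. Everything below is a hypothetical sketch: `struct conf_model` and the `model_*` functions only imitate the accounting of nr_pending for NOWAIT requests, and the "balanced" variant shows one way to keep the counter consistent, not necessarily what the actual two-line patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two fields involved; not the kernel's struct r10conf. */
struct conf_model {
    int barrier;     /* non-zero while a check/resync holds the barrier */
    int nr_pending;  /* requests that were admitted past the barrier */
};

/* A NOWAIT request that meets a raised barrier fails without counting itself. */
static bool model_wait_barrier_nowait(struct conf_model *c)
{
    if (c->barrier)
        return false;
    c->nr_pending++;
    return true;
}

/* Models the unconditional decrement done when a request is completed. */
static void model_allow_barrier(struct conf_model *c)
{
    c->nr_pending--;
}

/* Buggy path: the failure branch still decrements, so the counter underflows. */
static void model_nowait_request_buggy(struct conf_model *c)
{
    if (!model_wait_barrier_nowait(c)) {
        model_allow_barrier(c);   /* no matching increment happened */
        return;
    }
    model_allow_barrier(c);       /* normal completion */
}

/* Balanced path: a rejected request leaves nr_pending untouched. */
static void model_nowait_request_balanced(struct conf_model *c)
{
    if (!model_wait_barrier_nowait(c))
        return;
    model_allow_barrier(c);
}
```

With the barrier raised, each rejected request in the buggy path pushes nr_pending one step below zero, which is the kind of negative value shown in the crash output; a waiter that needs nr_pending to reach 0 then never makes progress.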
2026-03-16md: suppress spurious superblock update error message for dm-raidChen Cheng-1/+3
dm-raid has external metadata management (mddev->external = 1) and no persistent superblock (mddev->persistent = 0). For these arrays, there's no superblock to update, so the error message is spurious. The error appears as: md_update_sb: can't update sb for read-only array md0 Fixes: 8c9e376b9d1a ("md: warn about updating super block failure") Reported-by: Tj <tj.iam.tj@proton.me> Closes: https://lore.kernel.org/all/20260128082430.96788-1-tj.iam.tj@proton.me/ Signed-off-by: Chen Cheng <chencheng@fnnas.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://lore.kernel.org/linux-raid/20260210133847.269986-1-chencheng@fnnas.com Signed-off-by: Yu Kuai <yukuai@fnnas.com>
2026-03-09block: remove bdev_nonrot()Damien Le Moal-3/+3
bdev_nonrot() simply returns the logical negation of bdev_rot(). So replace all call sites of bdev_nonrot() with calls to bdev_rot() and remove bdev_nonrot(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-09dm-verity-fec: improve comments for fec_read_bufs()Eric Biggers-8/+22
Update the comments in and above fec_read_bufs() to more clearly describe what it does. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-09dm-verity-fec: log target_block instead of index_in_regionEric Biggers-12/+14
The log message for a FEC error or correction includes the data device name and index_in_region as the context. Although the result of FEC (for a particular dm-verity instance) is expected to be the same for a given index_in_region, index_in_region does not uniquely identify the actual target block that is being corrected. Since that value (target_block) is likely more useful, log it instead. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2026-03-09dm-verity-fec: make fec_decode_bufs() just return 0 or errorEric Biggers-7/+4
fec_decode_bufs() returns the number of errors corrected or a negative errno value. However, the caller just checks for an errno value and doesn't do anything with the number of errors corrected. Simplify the code by just returning 0 instead of the number of errors corrected. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>