From 981e3801443f507d74e2dae5710452642c96e8e3 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Wed, 26 Feb 2025 18:44:23 -0500 Subject: bcachefs: Kick devices out after too many write IO errors We're improving our handling of write errors - we shouldn't write degraded data just because a write failed once, we should retry it (on other devices, if possible). But for this to work, we need to kick devices out when they're only returning errors - otherwise those retries will loop infinitely. This adds a configurable timeout - if writes are failing for too long, we'll set that device read-only. In the future we should also implement more tracking and another knob for an "allowed error rate", so that we can kick out drives that are acting "unhealthy". Another thing we'll want is a mechanism (likely in userspace) for bringing a device back in after a transient error - perhaps a cable was jiggled, or there was a controller reset. After transient errors we also need a mechanism to walk (from the journal) recent btree updates that weren't flushed to that device and treat them as "degraded", since unflushed data may well not have been written. Out of scope for this patch, but becoming relevant. Signed-off-by: Kent Overstreet --- fs/bcachefs/super-io.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'fs/bcachefs/super-io.c') diff --git a/fs/bcachefs/super-io.c b/fs/bcachefs/super-io.c index 918e4e7704dd..ee32d043414a 100644 --- a/fs/bcachefs/super-io.c +++ b/fs/bcachefs/super-io.c @@ -454,6 +454,9 @@ static int bch2_sb_validate(struct bch_sb_handle *disk_sb, if (le16_to_cpu(sb->version) <= bcachefs_metadata_version_disk_accounting_v2) SET_BCH_SB_PROMOTE_WHOLE_EXTENTS(sb, true); + + if (!BCH_SB_WRITE_ERROR_TIMEOUT(sb)) + SET_BCH_SB_WRITE_ERROR_TIMEOUT(sb, 30); } #ifdef __KERNEL__ -- cgit v1.2.3