aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/infiniband/hw/mlx5 (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-10-03Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds6-25/+129
Pull rdma updates from Jason Gunthorpe: "A new Pensando ionic driver, a new Gen 3 HW support for Intel irdma, and lots of small bnxt_re improvements. - Small bug fixes and improves to hfi1, efa, mlx5, erdma, rdmarvt, siw - Allow userspace access to IB service records through the rdmacm - Optimize dma mapping for erdma - Fix shutdown of the GSI QP in mana - Support relaxed ordering MR and fix a corruption bug with mlx5 DMA Data Direct - Many improvement to bnxt_re: - Debugging features and counters - Improve performance of some commands - Change flow_label reporting in completions - Mirror vnic - RDMA flow support - New RDMA driver for Pensando Ethernet devices: ionic - Gen 3 hardware support for the Intel irdma driver - Fix rdma routing resolution with VRFs" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (85 commits) RDMA/ionic: Fix memory leak of admin q_wr RDMA/siw: Always report immediate post SQ errors RDMA/bnxt_re: improve clarity in ALLOC_PAGE handler RDMA/irdma: Remove unused struct irdma_cq fields RDMA/irdma: Fix positive vs negative error codes in irdma_post_send() RDMA/bnxt_re: Remove non-statistics counters from hw_counters RDMA/bnxt_re: Add debugfs info entry for device and resource information RDMA/bnxt_re: Fix incorrect errno used in function comments RDMA: Use %pe format specifier for error pointers RDMA/ionic: Use ether_addr_copy instead of memcpy RDMA/ionic: Fix build failure on SPARC due to xchg() operand size RDMA/rxe: Fix race in do_task() when draining IB/sa: Fix sa_local_svc_timeout_ms read race IB/ipoib: Ignore L3 master device RDMA/core: Use route entry flag to decide on loopback traffic RDMA/core: Resolve MAC of next-hop device without ARP support RDMA/core: Squash a single user static function RDMA/irdma: Update Kconfig RDMA/irdma: Extend CQE Error and Flush Handling for GEN3 Devices RDMA/irdma: Add Atomic Operations support ...
2025-09-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+1
Cross-merge networking fixes after downstream PR (net-6.17-rc8). Conflicts: drivers/net/can/spi/hi311x.c 6b6968084721 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled") 27ce71e1ce81 ("net: WQ_PERCPU added to alloc_workqueue users") https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org net/smc/smc_loopback.c drivers/dibs/dibs_loopback.c a35c04de2565 ("net/smc: fix warning in smc_rx_splice() when calling get_page()") cc21191b584c ("dibs: Move data path to dibs layer") https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org Adjacent changes: drivers/net/dsa/lantiq/lantiq_gswip.c c0054b25e2f1 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()") 7a1eaef0a791 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-21RDMA: Use %pe format specifier for error pointersLeon Romanovsky4-13/+21
Convert error logging throughout the RDMA subsystem to use the %pe format specifier instead of PTR_ERR() with integer format specifiers. Link: https://patch.msgid.link/e81ec02df1e474be20417fb62e779776e3f47a50.1758217936.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-09-17net/mlx5: Store the global doorbell in mlx5_privCosmin Ratiu1-2/+2
The global doorbell is used for more than just Ethernet resources, so move it out of mlx5e_hw_objs into a common place (mlx5_priv), to avoid non-Ethernet modules (e.g. HWS, ASO) depending on Ethernet structs. Use this opportunity to consolidate it with the 'uar' pointer already there, which was used as an RX doorbell. Underneath the 'uar' pointer is identical to 'bfreg->up', so store a single resource and use that instead. For CQ doorbells, care is taken to always use bfreg->up->index instead of bfreg->index, which may refer to a subsequent UAR page from the same ALLOC_UAR batch on some NICs. This paves the way for cleanly supporting multiple doorbells in the Ethernet driver. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11RDMA/mlx5: Fix page size bitmap calculation for KSM modeEdward Srouji1-0/+4
When using KSM (Key Scatter-gather Memory) access mode, the HW requires the IOVA to be aligned to the selected page size. Without this alignment, the HW may not function correctly. Currently, mlx5_umem_mkc_find_best_pgsz() does not filter out page sizes that would result in misaligned IOVAs for KSM mode. This can lead to selecting page sizes that are incompatible with the given IOVA. Fix this by filtering the page size bitmap when in KSM mode, keeping only page sizes to which the IOVA is aligned to. Fixes: fcfb03597b7d ("RDMA/mlx5: Align mkc page size capability check to PRM") Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20250824144839.154717-1-edwards@nvidia.com Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-08-25IB/mlx5: Fix obj_type mismatch for SRQ event subscriptionsOr Har-Toov1-0/+1
Fix a bug where the driver's event subscription logic for SRQ-related events incorrectly sets obj_type for RMP objects. When subscribing to SRQ events, get_legacy_obj_type() did not handle the MLX5_CMD_OP_CREATE_RMP case, which caused obj_type to be 0 (default). This led to a mismatch between the obj_type used during subscription (0) and the value used during notification (1, taken from the event's type field). As a result, event mapping for SRQ objects could fail and event notification would not be delivered correctly. This fix adds handling for MLX5_CMD_OP_CREATE_RMP in get_legacy_obj_type, returning MLX5_EVENT_QUEUE_TYPE_RQ so obj_type is consistent between subscription and notification. Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX") Link: https://patch.msgid.link/r/8f1048e3fdd1fde6b90607ce0ed251afaf8a148c.1755088962.git.leon@kernel.org Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-25RDMA/mlx5: Fix vport loopback forcing for MPV devicePatrisious Haddad2-4/+16
Previously loopback for MPV was supposed to be permanently enabled, however other driver flows were able to over-ride that configuration and disable it. Add force_lb parameter that indicates that loopback should always be enabled which prevents all other driver flows from disabling it. Fixes: a9a9e68954f2 ("RDMA/mlx5: Fix vport loopback for MPV device") Link: https://patch.msgid.link/r/cfc6b1f0f99f8100b087483cc14da6025317f901.1755088808.git.leon@kernel.org Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-25RDMA/mlx5: Better estimate max_qp_wr to reflect WQE countOr Har-Toov1-1/+47
The mlx5 driver currently derives max_qp_wr directly from the log_max_qp_sz HCA capability: props->max_qp_wr = 1 << MLX5_CAP_GEN(mdev, log_max_qp_sz); However, this value represents the number of WQEs in units of Basic Blocks (see MLX5_SEND_WQE_BB), not actual number of WQEs. Since the size of a WQE can vary depending on transport type and features (e.g., atomic operations, UMR, LSO), the actual number of WQEs can be significantly smaller than the WQEBB count suggests. This patch introduces a conservative estimation of the worst-case WQE size — considering largest segments possible with 1 SGE and no inline data or special features. It uses this to derive a more accurate max_qp_wr value. Fixes: 938fe83c8dcb ("net/mlx5_core: New device capabilities handling") Link: https://patch.msgid.link/r/7d992c9831c997ed5c33d30973406dc2dcaf5e89.1755088725.git.leon@kernel.org Reported-by: Chuck Lever <cel@kernel.org> Closes: https://lore.kernel.org/all/20250506142202.GJ2260621@ziepe.ca/ Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-25RDMA/mlx5: Enable Data-Direct with Relaxed OrderingYishai Hadas4-7/+41
Relaxed Ordering can improve performance in certain scenarios. Enable it in the Data-Direct use case as well. Link: https://patch.msgid.link/r/1221dcdda8061ba5f6bc3519044083c7438b257e.1755088503.git.leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Gal Shalom <galshalom@Nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-08-15{rdma,net}/mlx5: export mlx5_vport_get_vhca_idSaeed Mahameed1-23/+4
vhca id is already cached in the vport structure no need to query on every mlx5 layer, use the mlx5_vport_get_vhca_id, where possible. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Feng Liu <feliu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
2025-07-31Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds16-163/+689
Pull rdma updates from Jason Gunthorpe: - Various minor code cleanups and fixes for hns, iser, cxgb4, hfi1, rxe, erdma, mana_ib - Prefetch supprot for rxe ODP - Remove memory window support from hns as new device FW is no longer support it - Remove qib, it is very old and obsolete now, Cornelis wishes to restructure the hfi1/qib shared layer - Fix a race in destroying CQs where we can still end up with work running because the work is cancled before the driver stops triggering it - Improve interaction with namespaces: * Follow the devlink namespace for newly spawned RDMA devices * Create iopoib net devces in the parent IB device's namespace * Allow CAP_NET_RAW checks to pass in user namespaces - A new flow control scheme for IB MADs to try and avoid queue overflows in the network - Fix 2G message sizes in bnxt_re - Optimize mkey layout for mlx5 DMABUF - New "DMA Handle" concept to allow controlling PCI TPH and steering tags * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits) RDMA/siw: Change maintainer email address RDMA/mana_ib: add support of multiple ports RDMA/mlx5: Refactor optional counters steering code RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mr IB: Extend UVERBS_METHOD_REG_MR to get DMAH RDMA/mlx5: Add DMAH object support RDMA/core: Introduce a DMAH object and its alloc/free APIs IB/core: Add UVERBS_METHOD_REG_MR on the MR object net/mlx5: Add support for device steering tag net/mlx5: Expose IFC bits for TPH PCI/TPH: Expose pcie_tph_get_st_table_size() RDMA/mlx5: Fix incorrect MKEY masking RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey() RDMA/mlx5: remove redundant check on err on return expression RDMA/mana_ib: add additional port counters RDMA/mana_ib: Fix DSCP value in modify QP RDMA/efa: Add CQ with external memory support RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers RDMA/uverbs: Add a common way to create CQ with umem RDMA/mlx5: Optimize DMABUF mkey page size ...
2025-07-23RDMA/mlx5: Refactor optional counters steering codePatrisious Haddad4-62/+67
Currently there isn't a clear layer separation between the counters and the steering code, whereas the steering code is doing redundant access to the counter struct. Separate the fs.c and counters.c, where fs code won't access or be aware of counter structs but only the objects it needs. As a result, move mlx5_rdma_counter struct from the header file to be an internal struct for the counters file only. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/9854d1fdb140e4a6552b7a2fd1a028cfe488a935.1753004208.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mrYishai Hadas4-20/+94
As part of this enhancement, allow the creation of an MKEY associated with a DMA handle. Additional notes: MKEYs with TPH (i.e. TLP Processing Hints) attributes are currently not UMR-capable unless explicitly enabled by firmware or hardware. Therefore, to maintain such MKEYs in the MR cache, the TPH fields have been added to the rb_key structure, with a dedicated hash bucket. The ability to bypass the kernel verbs flow and create an MKEY with TPH attributes using DEVX has been restricted. TPH must follow the standard InfiniBand flow, where a DMAH is created with the appropriate security checks and management mechanisms in place. DMA handles are currently not supported in conjunction with On-Demand Paging (ODP). Re-registration of memory regions originally created with TPH attributes is currently not supported. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/1c485651cf8417694ddebb80446c5093d5a791a9.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23IB: Extend UVERBS_METHOD_REG_MR to get DMAHYishai Hadas2-3/+7
Extend UVERBS_METHOD_REG_MR to get DMAH and pass it to all drivers. It will be used in mlx5 driver as part of the next patch from the series. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/2ae1e628c0675db81f092cc00d3ad6fbf6139405.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23RDMA/mlx5: Add DMAH object supportYishai Hadas4-0/+83
This patch introduces support for allocating and deallocating the DMAH object. Further details: ---------------- The DMAH API is exposed to upper layers only if the underlying device supports TPH. It uses the mlx5_core steering tag (ST) APIs to get a steering tag index based on the provided input. The obtained index is stored in the device-specific mlx5_dmah structure for future use. Upcoming patches in the series will integrate the allocated DMAH into the memory region (MR) registration process. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/778550776799d82edb4d05da249a1cff00160b50.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-21RDMA/mlx5: Fix incorrect MKEY maskingLeon Romanovsky1-1/+2
mkey_mask is __be64 type, while MLX5_MKEY_MASK_PAGE_SIZE is declared as unsigned long long. This causes to the static checkers errors: drivers/infiniband/hw/mlx5/umr.c:663:49: warning: invalid assignment: &= drivers/infiniband/hw/mlx5/umr.c:663:49: left side has type restricted __be64 drivers/infiniband/hw/mlx5/umr.c:663:49: right side has type int Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size") Link: https://patch.msgid.link/e354d70b98dfa5ecf4c236a36cd36b64add9d9de.1753003467.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-07-21RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey()Leon Romanovsky1-14/+14
As Colin reported: "The variable zapped_blocks is a size_t type and is being assigned a int return value from the call to _mlx5r_umr_zap_mkey. Since zapped_blocks is an unsigned type, the error check for zapped_blocks < 0 will never be true." So separate return error and nblocks assignment. Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size") Reported-by: Colin King (gmail) <colin.i.king@gmail.com> Closes: https://lore.kernel.org/all/79166fb1-3b73-4d37-af02-a17b22eb8e64@gmail.com Link: https://patch.msgid.link/71d8ea208ac7eaa4438af683b9afaed78625e419.1753003467.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-07-17RDMA/mlx5: remove redundant check on err on return expressionColin Ian King1-1/+1
Currently all paths that set err and then check it for an error perform immediate returns, hence err always zero at the end of the function _mlx5r_umr_zap_mkey. The return expression err ? err : nblocks has a redundant check on the err since err is always zero, so just return nblocks instead. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://patch.msgid.link/20250717112108.4036171-1-colin.i.king@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-14Merge branch 'mlx5-next' of ↵Jakub Kicinski1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-07-14 * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: IFC updates for disabled host PF net/mlx5: Expose disciplined_fr_counter through HCA capabilities in mlx5_ifc RDMA/mlx5: Fix UMR modifying of mkey page size net/mlx5: Expose HCA capability bits for mkey max page size ==================== Link: https://patch.msgid.link/1752481357-34780-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-13RDMA/mlx5: Optimize DMABUF mkey page sizeEdward Srouji4-43/+327
The current implementation of DMABUF memory registration uses a fixed page size for the memory key (mkey), which can lead to suboptimal performance when the underlying memory layout may offer better page size. The optimization improves performance by reducing the number of page table entries required for the mkey, leading to less MTT/KSM descriptors that the HCA must go through to find translations, fewer cache-lines, and shorter UMR work requests on mkey updates such as when re-registering or reusing a cacheable mkey. To ensure safe page size updates, the implementation uses a 5-step process: 1. Make the first X entries non-present, while X is calculated to be minimal according to a large page shift that can be used to cover the MR length. 2. Update the page size to the large supported page size 3. Load the remaining N-X entries according to the (optimized) page shift 4. Update the page size according to the (optimized) page shift 5. Load the first X entries with the correct translations This ensures that at no point is the MR accessible with a partially updated translation table, maintaining correctness and preventing access to stale or inconsistent mappings, such as having an mkey advertising the new page size while some of the underlying page table entries still contain the old page size translations. Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/bc05a6b2142c02f96a90635f9a4458ee4bbbf39f.1751979184.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/mlx5: Align mkc page size capability check to PRMMichael Guralnik2-9/+52
Align the capabilities checked when using the log_page_size 6th bit in the mkey context to the PRM definition. The upper and lower bounds are set by max/min caps, and modification of the 6th bit by UMR is allowed only when a specific UMR cap is set. Current implementation falsely assumes all page sizes up-to 2^63 are supported when the UMR cap is set. In case the upper bound cap is lower than 63, this might result a FW syndrome on mkey creation, e.g: mlx5_core 0000:c1:00.0: mlx5_cmd_out_err:832:(pid 0): CREATE_MKEY(0×200) op_mod(0×0) failed, status bad parameter(0×3), syndrome (0×38a711), err(-22) Previous cap enforcement is still correct for all current HW, FW and driver combinations. However, this patch aligns the code to be PRM compliant in the general case. Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/eab4eeb4785105a4bb5eb362dc0b3662cd840412.1751979184.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13Optimize DMABUF mkey page size in mlx5Leon Romanovsky1-2/+4
From Edward: This patch series enables the mlx5 driver to dynamically choose the optimal page size for a DMABUF-based memory key (mkey), rather than always registering with a fixed page size. Previously, DMABUF memory registration used a fixed 4K page size for mkeys which could lead to suboptimal performance when the underlying memory layout may offer better page sizes. This approach did not take advantage of larger page size capabilities advertised by the HCA, and the driver was not setting the proper page size mask in the mkey mask when performing page size changes, potentially leading to invalid registrations when updating to a very large pages. This series improves DMABUF performance by dynamically selecting the best page size for a given memory region (MR) both at creation time and on page fault occurrences, based on the underlying layout and fixing related gaps and bugs. By doing so, we reduce the number of page table entries (and thus MTT/ KSM descriptors) that the HCA must traverse, which in turn reduces cache-line fetches. Thanks * mlx5-next: RDMA/mlx5: Fix UMR modifying of mkey page size net/mlx5: Expose HCA capability bits for mkey max page size Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/mlx5: Fix UMR modifying of mkey page sizeEdward Srouji1-2/+4
When changing the page size on an mkey, the driver needs to set the appropriate bits in the mkey mask to indicate which fields are being modified. The 6th bit of a page size in mlx5 driver is considered an extension, and this bit has a dedicated capability and mask bits. Previously, the driver was not setting this mask in the mkey mask when performing page size changes, regardless of its hardware support, potentially leading to an incorrect page size updates. This fixes the issue by setting the relevant bit in the mkey mask when performing page size changes on an mkey and the 6th bit of this field is supported by the hardware. Fixes: cef7dde8836a ("net/mlx5: Expand mkey page size to support 6 bits") Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9f43a9c73bf2db6085a99dc836f7137e76579f09.1751979184.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-08Merge branch 'mlx5-next' of ↵Jakub Kicinski1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-07-08 The following pull-request contains common mlx5 updates for your *net-next* tree. v2: https://lore.kernel.org/1751574385-24672-1-git-send-email-tariqt@nvidia.com * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Check device memory pointer before usage net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow net/mlx5: Add IFC bits for PCIe Congestion Event object net/mlx5: Small refactor for general object capabilities net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain ==================== Link: https://patch.msgid.link/1752002102-11316-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07Merge branch 'mlx5-next' into wip/leon-for-nextLeon Romanovsky1-1/+1
* mlx5-next: net/mlx5: Check device memory pointer before usage net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow net/mlx5: Add IFC bits for PCIe Congestion Event object net/mlx5: Small refactor for general object capabilities
2025-07-02net/mlx5: Check device memory pointer before usageStav Aviram1-1/+1
Add a NULL check before accessing device memory to prevent a crash if dev->dm allocation in mlx5_init_once() fails. Fixes: c9b9dcb430b3 ("net/mlx5: Move device memory management to mlx5_core") Signed-off-by: Stav Aviram <saviram@nvidia.com> Link: https://patch.msgid.link/c88711327f4d74d5cebc730dc629607e989ca187.1751370035.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02RDMA/mlx5: Check CAP_NET_RAW in user namespace for devx createParav Pandit1-1/+1
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the devx object. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: a8b92ca1b0e5 ("IB/mlx5: Introduce DEVX") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/36ee87e92defd81410c6a2b33f9d6c0d6dcfd64c.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01RDMA/mlx5: Check CAP_NET_RAW in user namespace for anchor createParav Pandit1-1/+1
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the anchor. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/c2376ca75e7658e2cbd1f619cf28fbe98c906419.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01RDMA/mlx5: Check CAP_NET_RAW in user namespace for flow createParav Pandit1-1/+1
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the flow. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 322694412400 ("IB/mlx5: Introduce driver create and destroy flow methods") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/a4dcd5e3ac6904ef50b19e56942ca6ab0728794c.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26RDMA/mlx5: Allocate IB device with net namespace supplied from core devMark Bloch2-3/+6
Use the new ib_alloc_device_with_net() API to allocate the IB device so that it is properly bound to the network namespace obtained via mlx5_core_net(). This change ensures correct namespace association (e.g., for containerized setups). Additionally, expose mlx5_core_net so that RDMA driver can use it. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-25RDMA/mlx5: Add multiple priorities support to RDMA TRANSPORT userspace tablesPatrisious Haddad3-19/+33
Support the creation of RDMA TRANSPORT tables over multiple priorities via matcher creation. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/bb38e50ae4504e979c6568d41939402a4cf15635.1750148083.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Support driver APIs pre_destroy_cq and post_destroy_cqMark Zhang3-4/+19
- pre_destroy_cq: Destroy FW CQ object so that no new CQ event would be generated; - post_destroy_cq: Release all resources. This patch, along with last one, fixes the crash below. Unable to handle kernel paging request at virtual address ffff8000114b1180 Mem abort info: ESR = 0x96000047 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000047 CM = 0, WnR = 1 swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000f4582000 [ffff8000114b1180] pgd=00000447fffff003, p4d=00000447fffff003, pud=00000447ffffe003, pmd=00000447ffffb003, pte=0000000000000000 Internal error: Oops: 96000047 [#1] SMP Modules linked in: udp_diag uio_pci_generic uio tcp_diag inet_diag binfmt_misc sn_core_odd(OE) rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) kpatch_9658536(OK) kpatch_9322385(OK) kpatch_8843421(OK) kpatch_8636216(OK) vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sm4_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt sg acpi_ipmi ipmi_si ipmi_msghandler m1_uncore_ddrss_pmu m1_uncore_cmn_pmu team_yosemite9rc6(OE) vnic(OE) ip_tables mlx5_ib(OE) sd_mod ast mlx5_core(OE) i2c_algo_bit drm_vram_helper psample drm_kms_helper mlxdevm(OE) auxiliary(OE) mlxfw(OE) syscopyarea sysfillrect tls sysimgblt fb_sys_fops drm_ttm_helper nvme ttm nvme_core drm t10_pi i2c_designware_platform i2c_designware_core i2c_core ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [last unloaded: ipmi_devintf] CPU: 83 PID: 59375 Comm: kworker/u253:1 Kdump: loaded Tainted: G OE K 5.10.84-004.ali5000.alios7.aarch64 #1 Hardware name: Inspur AliServer-Xuanwu2.0AM-02-2UM1P-5B/AS1221MG1, BIOS 1.2.M1.AL.P.158.00 08/31/2023 Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core] pstate: 82c00089 (Nzcv daIf +PAN +UAO +TCO BTYPE=--) pc : native_queued_spin_lock_slowpath+0x1c4/0x31c lr : mlx5_ib_poll_cq+0x18c/0x2f8 [mlx5_ib] sp : ffff80002be1bc80 x29: ffff80002be1bc80 x28: ffff000810e69000 x27: ffff000810e69000 x26: ffff000810e69200 x25: 0000000000000000 x24: ffff8000117db000 x23: ffff04000156b780 x22: 0000000000000000 x21: ffff04000ce6c160 x20: ffff0008196a4000 x19: 0000000000000010 x18: 0000000000000020 x17: 0000000000000000 x16: 0000000000000000 x15: ffff040055a364e8 x14: ffffffffffffffff x13: ffff80002318bda8 x12: ffff0400358836e8 x11: 0000000000000040 x10: 0000000000000eb0 x9 : 0000000000000000 x8 : 0000000000000000 x7 : ffff04477fa20140 x6 : ffff8000114b1140 x5 : ffff04477fa20140 x4 : ffff8000114b1180 x3 : ffff000810e69200 x2 : ffff8000114b1180 x1 : 0000000001500000 x0 : ffff04477fa20148 Call trace: native_queued_spin_lock_slowpath+0x1c4/0x31c __ib_process_cq+0x74/0x1b8 [ib_core] ib_cq_poll_work+0x34/0xa0 [ib_core] process_one_work+0x1d8/0x4b0 worker_thread+0x180/0x440 kthread+0x114/0x120 Code: 910020e0 8b0400c4 f862d929 aa0403e2 (f8296847) ---[ end trace 387be2290557729c ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x9850817,7a60aa38 Memory Limit: none Starting crashdump kernel... Bye! Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://patch.msgid.link/aaf0072f350d1c7e8731f43b79e11a560bafb9e0.1750070205.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Fix vport loopback for MPV devicePatrisious Haddad1-0/+33
Always enable vport loopback for both MPV devices on driver start. Previously in some cases related to MPV RoCE, packets weren't correctly executing loopback check at vport in FW, since it was disabled. Due to complexity of identifying such cases for MPV always enable vport loopback for both GVMIs when binding the slave to the master port. Fixes: 0042f9e458a5 ("RDMA/mlx5: Enable vport loopback when user context or QP mandate") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/d4298f5ebb2197459e9e7221c51ecd6a34699847.1750064969.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Fix CC counters query for MPVPatrisious Haddad1-1/+1
In case, CC counters are querying for the second port use the correct core device for the query instead of always using the master core device. Fixes: aac4492ef23a ("IB/mlx5: Update counter implementation for dual port RoCE") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9cace74dcf106116118bebfa9146d40d4166c6b0.1750064969.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Fix HW counters query for non-representor devicesPatrisious Haddad1-1/+1
To get the device HW counters, a non-representor switchdev device should use the mlx5_ib_query_q_counters() function and query all of the available counters. While a representor device in switchdev mode should use the mlx5_ib_query_q_counters_vport() function and query only the Q_Counters without the PPCNT counters and congestion control counters, since they aren't relevant for a representor device. Currently a non-representor switchdev device skips querying the PPCNT counters and congestion control counters, leaving them unupdated. Fix that by properly querying those counters for non-representor devices. Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/56bf8af4ca8c58e3fb9f7e47b1dca2009eeeed81.1750064969.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25IB/mlx5: Fix potential deadlock in MR deregistrationOr Har-Toov1-14/+47
The issue arises when kzalloc() is invoked while holding umem_mutex or any other lock acquired under umem_mutex. This is problematic because kzalloc() can trigger fs_reclaim_aqcuire(), which may, in turn, invoke mmu_notifier_invalidate_range_start(). This function can lead to mlx5_ib_invalidate_range(), which attempts to acquire umem_mutex again, resulting in a deadlock. The problematic flow: CPU0 | CPU1 ---------------------------------------|------------------------------------------------ mlx5_ib_dereg_mr() | → revoke_mr() | → mutex_lock(&umem_odp->umem_mutex) | | mlx5_mkey_cache_init() | → mutex_lock(&dev->cache.rb_lock) | → mlx5r_cache_create_ent_locked() | → kzalloc(GFP_KERNEL) | → fs_reclaim() | → mmu_notifier_invalidate_range_start() | → mlx5_ib_invalidate_range() | → mutex_lock(&umem_odp->umem_mutex) → cache_ent_find_and_store() | → mutex_lock(&dev->cache.rb_lock) | Additionally, when kzalloc() is called from within cache_ent_find_and_store(), we encounter the same deadlock due to re-acquisition of umem_mutex. Solve by releasing umem_mutex in dereg_mr() after umr_revoke_mr() and before acquiring rb_lock. This ensures that we don't hold umem_mutex while performing memory allocations that could trigger the reclaim path. This change prevents the deadlock by ensuring proper lock ordering and avoiding holding locks during memory allocation operations that could trigger the reclaim path. The following lockdep warning demonstrates the deadlock: python3/20557 is trying to acquire lock: ffff888387542128 (&umem_odp->umem_mutex){+.+.}-{4:4}, at: mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib] but task is already holding lock: ffffffff82f6b840 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: unmap_vmas+0x7b/0x1a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}: fs_reclaim_acquire+0x60/0xd0 mem_cgroup_css_alloc+0x6f/0x9b0 cgroup_init_subsys+0xa4/0x240 cgroup_init+0x1c8/0x510 start_kernel+0x747/0x760 x86_64_start_reservations+0x25/0x30 x86_64_start_kernel+0x73/0x80 common_startup_64+0x129/0x138 -> #2 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x91/0xd0 __kmalloc_cache_noprof+0x4d/0x4c0 mlx5r_cache_create_ent_locked+0x75/0x620 [mlx5_ib] mlx5_mkey_cache_init+0x186/0x360 [mlx5_ib] mlx5_ib_stage_post_ib_reg_umr_init+0x3c/0x60 [mlx5_ib] __mlx5_ib_add+0x4b/0x190 [mlx5_ib] mlx5r_probe+0xd9/0x320 [mlx5_ib] auxiliary_bus_probe+0x42/0x70 really_probe+0xdb/0x360 __driver_probe_device+0x8f/0x130 driver_probe_device+0x1f/0xb0 __driver_attach+0xd4/0x1f0 bus_for_each_dev+0x79/0xd0 bus_add_driver+0xf0/0x200 driver_register+0x6e/0xc0 __auxiliary_driver_register+0x6a/0xc0 do_one_initcall+0x5e/0x390 do_init_module+0x88/0x240 init_module_from_file+0x85/0xc0 idempotent_init_module+0x104/0x300 __x64_sys_finit_module+0x68/0xc0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #1 (&dev->cache.rb_lock){+.+.}-{4:4}: __mutex_lock+0x98/0xf10 __mlx5_ib_dereg_mr+0x6f2/0x890 [mlx5_ib] mlx5_ib_dereg_mr+0x21/0x110 [mlx5_ib] ib_dereg_mr_user+0x85/0x1f0 [ib_core] uverbs_free_mr+0x19/0x30 [ib_uverbs] destroy_hw_idr_uobject+0x21/0x80 [ib_uverbs] uverbs_destroy_uobject+0x60/0x3d0 [ib_uverbs] uobj_destroy+0x57/0xa0 [ib_uverbs] ib_uverbs_cmd_verbs+0x4d5/0x1210 [ib_uverbs] ib_uverbs_ioctl+0x129/0x230 [ib_uverbs] __x64_sys_ioctl+0x596/0xaa0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #0 (&umem_odp->umem_mutex){+.+.}-{4:4}: __lock_acquire+0x1826/0x2f00 lock_acquire+0xd3/0x2e0 __mutex_lock+0x98/0xf10 mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib] __mmu_notifier_invalidate_range_start+0x18e/0x1f0 unmap_vmas+0x182/0x1a0 exit_mmap+0xf3/0x4a0 mmput+0x3a/0x100 do_exit+0x2b9/0xa90 do_group_exit+0x32/0xa0 get_signal+0xc32/0xcb0 arch_do_signal_or_restart+0x29/0x1d0 syscall_exit_to_user_mode+0x105/0x1d0 do_syscall_64+0x79/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Chain exists of: &dev->cache.rb_lock --> mmu_notifier_invalidate_range_start --> &umem_odp->umem_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&umem_odp->umem_mutex); lock(mmu_notifier_invalidate_range_start); lock(&umem_odp->umem_mutex); lock(&dev->cache.rb_lock); *** DEADLOCK *** Fixes: abb604a1a9c8 ("RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with error") Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/3c8f225a8a9fade647d19b014df1172544643e4a.1750061612.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-17RDMA/mlx5: Initialize obj_event->obj_sub_list before xa_insertMark Zhang1-1/+1
The obj_event may be loaded immediately after inserted, then if the list_head is not initialized then we may get a poisonous pointer. This fixes the crash below: mlx5_core 0000:03:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced) mlx5_core.sf mlx5_core.sf.4: firmware version: 32.38.3056 mlx5_core 0000:03:00.0 en3f0pf0sf2002: renamed from eth0 mlx5_core.sf mlx5_core.sf.4: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps IPv6: ADDRCONF(NETDEV_CHANGE): en3f0pf0sf2002: link becomes ready Unable to handle kernel NULL pointer dereference at virtual address 0000000000000060 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000007760fb000 [0000000000000060] pgd=000000076f6d7003, p4d=000000076f6d7003, pud=0000000777841003, pmd=0000000000000000 Internal error: Oops: 96000006 [#1] SMP Modules linked in: ipmb_host(OE) act_mirred(E) cls_flower(E) sch_ingress(E) mptcp_diag(E) udp_diag(E) raw_diag(E) unix_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) bonding(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) isofs(E) cdrom(E) mst_pciconf(OE) ib_umad(OE) mlx5_ib(OE) ipmb_dev_int(OE) mlx5_core(OE) kpatch_15237886(OEK) mlxdevm(OE) auxiliary(OE) ib_uverbs(OE) ib_core(OE) psample(E) mlxfw(OE) tls(E) sunrpc(E) vfat(E) fat(E) crct10dif_ce(E) ghash_ce(E) sha1_ce(E) sbsa_gwdt(E) virtio_console(E) ext4(E) mbcache(E) jbd2(E) xfs(E) libcrc32c(E) mmc_block(E) virtio_net(E) net_failover(E) failover(E) sha2_ce(E) sha256_arm64(E) nvme(OE) nvme_core(OE) gpio_mlxbf3(OE) mlx_compat(OE) mlxbf_pmc(OE) i2c_mlxbf(OE) sdhci_of_dwcmshc(OE) pinctrl_mlxbf3(OE) mlxbf_pka(OE) gpio_generic(E) i2c_core(E) mmc_core(E) mlxbf_gige(OE) vitesse(E) pwr_mlxbf(OE) mlxbf_tmfifo(OE) micrel(E) mlxbf_bootctl(OE) virtio_ring(E) virtio(E) ipmi_devintf(E) ipmi_msghandler(E) [last unloaded: mst_pci] CPU: 11 PID: 20913 Comm: rte-worker-11 Kdump: loaded Tainted: G OE K 5.10.134-13.1.an8.aarch64 #1 Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.2.2.12968 Oct 26 2023 pstate: a0400089 (NzCv daIf +PAN -UAO -TCO BTYPE=--) pc : dispatch_event_fd+0x68/0x300 [mlx5_ib] lr : devx_event_notifier+0xcc/0x228 [mlx5_ib] sp : ffff80001005bcf0 x29: ffff80001005bcf0 x28: 0000000000000001 x27: ffff244e0740a1d8 x26: ffff244e0740a1d0 x25: ffffda56beff5ae0 x24: ffffda56bf911618 x23: ffff244e0596a480 x22: ffff244e0596a480 x21: ffff244d8312ad90 x20: ffff244e0596a480 x19: fffffffffffffff0 x18: 0000000000000000 x17: 0000000000000000 x16: ffffda56be66d620 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000040 x10: ffffda56bfcafb50 x9 : ffffda5655c25f2c x8 : 0000000000000010 x7 : 0000000000000000 x6 : ffff24545a2e24b8 x5 : 0000000000000003 x4 : ffff80001005bd28 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff244e0596a480 x0 : ffff244d8312ad90 Call trace: dispatch_event_fd+0x68/0x300 [mlx5_ib] devx_event_notifier+0xcc/0x228 [mlx5_ib] atomic_notifier_call_chain+0x58/0x80 mlx5_eq_async_int+0x148/0x2b0 [mlx5_core] atomic_notifier_call_chain+0x58/0x80 irq_int_handler+0x20/0x30 [mlx5_core] __handle_irq_event_percpu+0x60/0x220 handle_irq_event_percpu+0x3c/0x90 handle_irq_event+0x58/0x158 handle_fasteoi_irq+0xfc/0x188 generic_handle_irq+0x34/0x48 ... Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX") Link: https://patch.msgid.link/r/3ce7f20e0d1a03dc7de6e57494ec4b8eaf1f05c2.1750147949.git.leon@kernel.org Signed-off-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-17RDMA/mlx5: Fix unsafe xarray access in implicit ODP handlingOr Har-Toov1-4/+4
__xa_store() and __xa_erase() were used without holding the proper lock, which led to a lockdep warning due to unsafe RCU usage. This patch replaces them with xa_store() and xa_erase(), which perform the necessary locking internally. ============================= WARNING: suspicious RCPU usage 6.14.0-rc7_for_upstream_debug_2025_03_18_15_01 #1 Not tainted ----------------------------- ./include/linux/xarray.h:1211 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by kworker/u136:0/219: at: process_one_work+0xbe4/0x15f0 process_one_work+0x75c/0x15f0 pagefault_mr+0x9a5/0x1390 [mlx5_ib] stack backtrace: CPU: 14 UID: 0 PID: 219 Comm: kworker/u136:0 Not tainted 6.14.0-rc7_for_upstream_debug_2025_03_18_15_01 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib] Call Trace: dump_stack_lvl+0xa8/0xc0 lockdep_rcu_suspicious+0x1e6/0x260 xas_create+0xb8a/0xee0 xas_store+0x73/0x14c0 __xa_store+0x13c/0x220 ? xa_store_range+0x390/0x390 ? spin_bug+0x1d0/0x1d0 pagefault_mr+0xcb5/0x1390 [mlx5_ib] ? _raw_spin_unlock+0x1f/0x30 mlx5_ib_eqe_pf_action+0x3be/0x2620 [mlx5_ib] ? lockdep_hardirqs_on_prepare+0x400/0x400 ? mlx5_ib_invalidate_range+0xcb0/0xcb0 [mlx5_ib] process_one_work+0x7db/0x15f0 ? pwq_dec_nr_in_flight+0xda0/0xda0 ? assign_work+0x168/0x240 worker_thread+0x57d/0xcd0 ? rescuer_thread+0xc40/0xc40 kthread+0x3b3/0x800 ? kthread_is_per_cpu+0xb0/0xb0 ? lock_downgrade+0x680/0x680 ? do_raw_spin_lock+0x12d/0x270 ? spin_bug+0x1d0/0x1d0 ? finish_task_switch.isra.0+0x284/0x9e0 ? lockdep_hardirqs_on_prepare+0x284/0x400 ? kthread_is_per_cpu+0xb0/0xb0 ret_from_fork+0x2d/0x70 ? kthread_is_per_cpu+0xb0/0xb0 ret_from_fork_asm+0x11/0x20 Fixes: d3d930411ce3 ("RDMA/mlx5: Fix implicit ODP use after free") Link: https://patch.msgid.link/r/a85ddd16f45c8cb2bc0a188c2b0fcedfce975eb8.1750061791.git.leon@kernel.org Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-12RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanupArnd Bergmann1-1/+7
This function has an array of eight mlx5_async_cmd structures, which often fits on the stack, but depending on the configuration can end up blowing the stack frame warning limit: drivers/infiniband/hw/mlx5/devx.c:2670:6: error: stack frame size (1392) exceeds limit (1280) in 'mlx5_ib_ufile_hw_cleanup' [-Werror,-Wframe-larger-than] Change this to a dynamic allocation instead. While a kmalloc() can theoretically fail, a GFP_KERNEL allocation under a page will block until memory has been freed up, so in the worst case, this only adds extra time in an already constrained environment. Fixes: 7c891a4dbcc1 ("RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250610092846.2642535-1-arnd@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-08treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar1-1/+1
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-05-26Merge tag 'v6.15' into rdma.git for-nextJason Gunthorpe1-2/+0
Following patches need the RDMA rc branch since we are past the RC cycle now. Merge conflicts resolved based on Linux-next: - For RXE odp changes keep for-next version and fixup new places that need to call is_odp_mr() https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au - irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info change from for-next https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-25RDMA/mlx5: Avoid flexible array warningLeon Romanovsky1-37/+21
The following warning is reported by sparse tool: drivers/infiniband/hw/mlx5/fs.c:1664:26: warning: array of flexible structures Avoid it by simply splitting array into two separate structs. Link: https://patch.msgid.link/7b891b96a9fc053d01284c184d25ae98d35db2d4.1747827041.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-18RDMA/mlx5: Add support for 200Gbps per lane speedsPatrisious Haddad1-0/+12
Add support for 200Gbps per lane speeds speed when querying PTYS and report it back correctly when needed. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://patch.msgid.link/b842d2f523e9b82e221378c444ebd5860d612959.1747134197.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-18RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stageYishai Hadas2-18/+0
The MLX5_IB_STAGE_UAR stage in the RDMA driver is redundant and should be removed. Responsibility for initializing the device's UAR pointer (mdev->priv.uar) lies with mlx5_core, which already sets it during the mlx5_load() process. At present, the RDMA UAR stage overwrites this pointer, which was correctly initialized by mlx5_core, creating the risk of inconsistency. Ownership and management of the UAR pointer should remain exclusively within mlx5_core. In the current upstream code, we luckily receive the same pointer, since mlx5_get_uars_page() still finds available BF registers for that UAR, allowing it to be shared. However, future changes in mlx5_core may expose this flaw. For instance, if mlx5_alloc_bfreg() is invoked twice before the RDMA UAR stage runs, the RDMA driver may overwrite the UAR allocated by mlx5_core. This could lead to real bugs. For example, if mlx5_ib is unloaded (rmmod), it might free the UAR, leaving mlx5_core with a dangling reference to an invalid UAR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Fan Li <fanl@nvidia.com> Link: https://patch.msgid.link/feaa84ec6f20468b4935c439923e9266122a93d0.1747134130.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkageLeon Romanovsky3-19/+44
Reuse newly added DMA API to cache IOVA and only link/unlink pages in fast path for UMEM ODP flow. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-12RDMA/umem: Store ODP access mask information in PFNLeon Romanovsky2-18/+20
As a preparation to remove dma_list, store access mask in PFN pointer and not in dma_addr_t. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-05RDMA/mlx5: Fix error flow upon firmware failure for RQ destructionPatrisious Haddad1-2/+28
Upon RQ destruction if the firmware command fails which is the last resource to be destroyed some SW resources were already cleaned regardless of the failure. Now properly rollback the object to its original state upon such failure. In order to avoid a use-after free in case someone tries to destroy the object again, which results in the following kernel trace: refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 37589 at lib/refcount.c:28 refcount_warn_saturate+0xf4/0x148 Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) rfkill mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) psample mlxfw(OE) mlx_compat(OE) macsec tls pci_hyperv_intf sunrpc vfat fat virtio_net net_failover failover fuse loop nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_console virtio_gpu virtio_blk virtio_dma_buf virtio_mmio dm_mirror dm_region_hash dm_log dm_mod xpmem(OE) CPU: 0 UID: 0 PID: 37589 Comm: python3 Kdump: loaded Tainted: G OE ------- --- 6.12.0-54.el10.aarch64 #1 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : refcount_warn_saturate+0xf4/0x148 lr : refcount_warn_saturate+0xf4/0x148 sp : ffff80008b81b7e0 x29: ffff80008b81b7e0 x28: ffff000133d51600 x27: 0000000000000001 x26: 0000000000000000 x25: 00000000ffffffea x24: ffff00010ae80f00 x23: ffff00010ae80f80 x22: ffff0000c66e5d08 x21: 0000000000000000 x20: ffff0000c66e0000 x19: ffff00010ae80340 x18: 0000000000000006 x17: 0000000000000000 x16: 0000000000000020 x15: ffff80008b81b37f x14: 0000000000000000 x13: 2e656572662d7265 x12: ffff80008283ef78 x11: ffff80008257efd0 x10: ffff80008283efd0 x9 : ffff80008021ed90 x8 : 0000000000000001 x7 : 00000000000bffe8 x6 : c0000000ffff7fff x5 : ffff0001fb8e3408 x4 : 0000000000000000 x3 : ffff800179993000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000133d51600 Call trace: refcount_warn_saturate+0xf4/0x148 mlx5_core_put_rsc+0x88/0xa0 [mlx5_ib] mlx5_core_destroy_rq_tracked+0x64/0x98 [mlx5_ib] mlx5_ib_destroy_wq+0x34/0x80 [mlx5_ib] ib_destroy_wq_user+0x30/0xc0 [ib_core] uverbs_free_wq+0x28/0x58 [ib_uverbs] destroy_hw_idr_uobject+0x34/0x78 [ib_uverbs] uverbs_destroy_uobject+0x48/0x240 [ib_uverbs] __uverbs_cleanup_ufile+0xd4/0x1a8 [ib_uverbs] uverbs_destroy_ufile_hw+0x48/0x120 [ib_uverbs] ib_uverbs_close+0x2c/0x100 [ib_uverbs] __fput+0xd8/0x2f0 __fput_sync+0x50/0x70 __arm64_sys_close+0x40/0x90 invoke_syscall.constprop.0+0x74/0xd0 do_el0_svc+0x48/0xe8 el0_svc+0x44/0x1d0 el0t_64_sync_handler+0x120/0x130 el0t_64_sync+0x1a4/0x1a8 Fixes: e2013b212f9f ("net/mlx5_core: Add RQ and SQ event handling") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/3181433ccdd695c63560eeeb3f0c990961732101.1745839855.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-07RDMA/mlx5: Fix compilation warning when USER_ACCESS isn't setMark Bloch1-2/+0
The cited commit made fs.c always compile, even when INFINIBAND_USER_ACCESS isn't set. This results in a compilation warning about an unused object when compiling with W=1 and USER_ACCESS is unset. Fix this by defining uverbs_destroy_def_handler() even when USER_ACCESS isn't set. Fixes: 36e0d433672f ("RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS config") Link: https://patch.msgid.link/r/20250402070944.1022093-1-mbloch@nvidia.com Signed-off-by: Mark Bloch <mbloch@nvidia.com> Tested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-04-07RDMA/mlx5: convert timeouts to secs_to_jiffies()Easwar Hariharan1-3/+3
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies(E * 1000) +secs_to_jiffies(E) -msecs_to_jiffies(E * MSEC_PER_SEC) +secs_to_jiffies(E) Link: https://patch.msgid.link/r/20250219-rdma-secs-to-jiffies-v1-2-b506746561a9@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner1-1/+1
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>