linux - Mirror of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/

Age	Commit message (Collapse)	Author	Files	Lines
2025-09-15	x86/bugs: Use early_param() for spectre_v2	David Kaplan	1	-99/+82
	Most of the mitigations in bugs.c use early_param() for command line parsing. Rework the spectre_v2 and nospectre_v2 command line options to be consistent with the others. Remove spec_v2_print_cond() as informing the user of the their cmdline choice isn't interesting. [ bp: Zap spectre_v2_check_cmd(). ] Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/r/20250819192200.2003074-3-david.kaplan@amd.com
2025-09-15	x86/bugs: Use early_param() for spectre_v2_user	David Kaplan	1	-42/+26
	Most of the mitigations in bugs.c use early_param() to parse their command line options. Modify spectre_v2_user to use early_param() for consistency. Remove spec_v2_user_print_cond() because informing a user about their cmdline choice isn't very interesting and the chosen mitigation is already printed in spectre_v2_user_update_mitigation(). Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/r/20250819192200.2003074-2-david.kaplan@amd.com
2025-09-15	x86/cpu: Detect FreeBSD Bhyve hypervisor	David Woodhouse	3	-0/+70
	Detect the Bhyve hypervisor and enable 15-bit MSI support if available. Detecting Bhyve used to be a purely cosmetic issue of the kernel printing 'Hypervisor detected: Bhyve' at boot time. But FreeBSD 15.0 will support¹ the 15-bit MSI enlightenment to support more than 255 vCPUs (http://david.woodhou.se/ExtDestId.pdf) which means there's now actually some functional reason to do so. ¹ https://github.com/freebsd/freebsd-src/commit/313a68ea20b4 [ bp: Massage, move tail comment ontop. ] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Ahmed S. Darwish <darwi@linutronix.de> Link: https://lore.kernel.org/03802f6f7f5b5cf8c5e8adfe123c397ca8e21093.camel@infradead.org
2025-09-15	uprobes/x86: Return error from uprobe syscall when not called from trampoline	Jiri Olsa	1	-1/+1
	Currently uprobe syscall handles all errors with forcing SIGILL to current process. As suggested by Andrii it'd be helpful for uprobe syscall detection to return error value for the !in_uprobe_trampoline check. This way we could just call uprobe syscall and based on return value we will find out if the kernel has it. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com>
2025-09-15	x86/resctrl: Configure mbm_event mode if supported	Babu Moger	3	-0/+16
	Configure mbm_event mode on AMD platforms. On AMD platforms, it is recommended to use the mbm_event mode, if supported, to prevent the hardware from resetting counters between reads. This can result in misleading values or display "Unavailable" if no counter is assigned to the event. Enable mbm_event mode, known as ABMC (Assignable Bandwidth Monitoring Counters) on AMD, by default if the system supports it. Update ABMC across all logical processors within the resctrl domain to ensure proper functionality. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15	x86/resctrl: Implement resctrl_arch_reset_cntr() and resctrl_arch_cntr_read()	Babu Moger	2	-0/+75
	System software reads resctrl event data for a particular resource by writing the RMID and Event Identifier (EvtID) to the QM_EVTSEL register and then reading the event data from the QM_CTR register. In ABMC mode, the event data of a specific counter ID is read by setting the following fields: QM_EVTSEL.ExtendedEvtID = 1, QM_EVTSEL.EvtID = L3CacheABMC (=1) and setting QM_EVTSEL.RMID to the desired counter ID. Reading the QM_CTR then returns the contents of the specified counter ID. RMID_VAL_ERROR bit is set if the counter configuration is invalid, or if an invalid counter ID is set in the QM_EVTSEL.RMID field. RMID_VAL_UNAVAIL bit is set if the counter data is unavailable. Introduce resctrl_arch_reset_cntr() and resctrl_arch_cntr_read() to reset and read event data for a specific counter. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15	x86/resctrl: Refactor resctrl_arch_rmid_read()	Babu Moger	1	-15/+23
	resctrl_arch_rmid_read() adjusts the value obtained from MSR_IA32_QM_CTR to account for the overflow for MBM events and apply counter scaling for all the events. This logic is common to both reading an RMID and reading a hardware counter directly. Refactor the hardware value adjustment logic into get_corrected_val() to prepare for support of reading a hardware counter. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15	x86,fs/resctrl: Implement resctrl_arch_config_cntr() to assign a counter ↵	Babu Moger	1	-0/+36
	with ABMC The ABMC feature allows users to assign a hardware counter to an RMID, event pair and monitor bandwidth usage as long as it is assigned. The hardware continues to track the assigned counter until it is explicitly unassigned by the user. Implement an x86 architecture-specific handler to configure a counter. This architecture specific handler is called by resctrl fs when a counter is assigned or unassigned as well as when an already assigned counter's configuration should be updated. Configure counters by writing to the L3_QOS_ABMC_CFG MSR, specifying the counter ID, bandwidth source (RMID), and event configuration. The ABMC feature details are documented in APM [1] available from [2]. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth Monitoring (ABMC). Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
2025-09-15	x86/resctrl: Add data structures and definitions for ABMC assignment	Babu Moger	1	-0/+36
	The ABMC feature allows users to assign a hardware counter to an RMID, event pair and monitor bandwidth usage as long as it is assigned. The hardware continues to track the assigned counter until it is explicitly unassigned by the user. The ABMC feature implements an MSR L3_QOS_ABMC_CFG (C000_03FDh). ABMC counter assignment is done by setting the counter id, bandwidth source (RMID) and bandwidth configuration. Attempts to read or write the MSR when ABMC is not enabled will result in a #GP(0) exception. Introduce the data structures and definitions for MSR L3_QOS_ABMC_CFG (0xC000_03FDh): ========================================================================= Bits Mnemonic Description Access Reset Type Value ========================================================================= 63 CfgEn Configuration Enable R/W 0 62 CtrEn Enable/disable counting R/W 0 61:53 – Reserved MBZ 0 52:48 CtrID Counter Identifier R/W 0 47 IsCOS BwSrc field is a CLOSID R/W 0 (not an RMID) 46:44 – Reserved MBZ 0 43:32 BwSrc Bandwidth Source R/W 0 (RMID or CLOSID) 31:0 BwType Bandwidth configuration R/W 0 tracked by the CtrID ========================================================================== The ABMC feature details are documented in APM [1] available from [2]. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth Monitoring (ABMC). [ bp: Touchups. ] Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
2025-09-15	x86/resctrl: Add support to enable/disable AMD ABMC feature	Babu Moger	2	-0/+50
	Add the functionality to enable/disable the AMD ABMC feature. The AMD ABMC feature is enabled by setting enabled bit(0) in the L3_QOS_EXT_CFG MSR. When the state of ABMC is changed, the MSR needs to be updated on all the logical processors in the QOS Domain. Hardware counters will reset when ABMC state is changed. [ bp: Massage commit message. ] Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
2025-09-15	x86,fs/resctrl: Detect Assignable Bandwidth Monitoring feature details	Babu Moger	2	-5/+13
	ABMC feature details are reported via CPUID Fn8000_0020_EBX_x5. Bits Description 15:0 MAX_ABMC Maximum Supported Assignable Bandwidth Monitoring Counter ID + 1 The ABMC feature details are documented in APM [1] available from [2]. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth Monitoring (ABMC). Detect the feature and number of assignable counters supported. For backward compatibility, upon detecting the assignable counter feature, enable the mbm_total_bytes and mbm_local_bytes events that users are familiar with as part of original L3 MBM support. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
2025-09-15	x86,fs/resctrl: Consolidate monitoring related data from rdt_resource	Babu Moger	2	-7/+7
	The cache allocation and memory bandwidth allocation feature properties are consolidated into struct resctrl_cache and struct resctrl_membw respectively. In preparation for more monitoring properties that will clobber the existing resource struct more, re-organize the monitoring specific properties to also be in a separate structure. Also convert "bandwidth sources" terminology to "memory transactions" to have consistency within resctrl for related monitoring features. [ bp: Massage commit message. ] Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15	x86/resctrl: Add ABMC feature in the command line options	Babu Moger	1	-0/+2
	Add a kernel command-line parameter to enable or disable the exposure of the ABMC (Assignable Bandwidth Monitoring Counters) hardware feature to resctrl. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15	x86/cpufeatures: Add support for Assignable Bandwidth Monitoring Counters (ABMC)	Babu Moger	1	-0/+1
	Users can create as many monitor groups as RMIDs supported by the hardware. However, the bandwidth monitoring feature on AMD only guarantees that RMIDs currently assigned to a processor will be tracked by hardware. The counters of any other RMIDs which are no longer being tracked will be reset to zero. The MBM event counters return "Unavailable" for the RMIDs that are not tracked by hardware. So, there can be only limited number of groups that can give guaranteed monitoring numbers. With ever changing configurations there is no way to definitely know which of these groups are being tracked during a particular time. Users do not have the option to monitor a group or set of groups for a certain period of time without worrying about RMID being reset in between. The ABMC feature allows users to assign a hardware counter to an RMID, event pair and monitor bandwidth usage as long as it is assigned. The hardware continues to track the assigned counter until it is explicitly unassigned by the user. There is no need to worry about counters being reset during this period. Additionally, the user can specify the type of memory transactions (e.g., reads, writes) for the counter to track. Without ABMC enabled, monitoring will work in current mode without assignment option. The Linux resctrl subsystem provides an interface that allows monitoring of up to two memory bandwidth events per group, selected from a combination of available total and local events. When ABMC is enabled, two events will be assigned to each group by default, in line with the current interface design. Users will also have the option to configure which types of memory transactions are counted by these events. Due to the limited number of available counters (32), users may quickly exhaust the available counters. If the system runs out of assignable ABMC counters, the kernel will report an error. In such cases, users will need to unassign one or more active counters to free up counters for new assignments. resctrl will provide options to assign or unassign events through the group-specific interface file. The feature is detected via CPUID_Fn80000020_EBX_x00 bit 5: ABMC (Assignable Bandwidth Monitoring Counters). The ABMC feature details are documented in APM [1] available from [2]. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming Publication # 24593 Revision 3.41 section 19.3.3.3 Assignable Bandwidth Monitoring (ABMC). [ bp: Massage commit message, fixup enumeration due to VMSCAPE ] Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [2]
2025-09-15	x86,fs/resctrl: Prepare for more monitor events	Tony Luck	3	-41/+43
	There's a rule in computer programming that objects appear zero, once, or many times. So code accordingly. There are two MBM events and resctrl is coded with a lot of if (local) do one thing if (total) do a different thing Change the rdt_mon_domain and rdt_hw_mon_domain structures to hold arrays of pointers to per event data instead of explicit fields for total and local bandwidth. Simplify by coding for many events using loops on which are enabled. Move resctrl_is_mbm_event() to <linux/resctrl.h> so it can be used more widely. Also provide a for_each_mbm_event_id() helper macro. Cleanup variable names in functions touched to consistently use "eventid" for those with type enum resctrl_event_id. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15	x86/resctrl: Remove the rdt_mon_features global variable	Tony Luck	2	-9/+5
	rdt_mon_features is used as a bitmask of enabled monitor events. A monitor event's status is now maintained in mon_evt::enabled with all monitor events' mon_evt structures found in the filesystem's mon_event_all[] array. Remove the remaining uses of rdt_mon_features. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15	x86,fs/resctrl: Replace architecture event enabled checks	Tony Luck	2	-4/+4
	The resctrl file system now has complete knowledge of the status of every event. So there is no need for per-event function calls to check. Replace each of the resctrl_arch_is_{event}enabled() calls with resctrl_is_mon_event_enabled(QOS_{EVENT}). No functional change. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-15	x86,fs/resctrl: Consolidate monitor event descriptions	Tony Luck	1	-3/+9
	There are currently only three monitor events, all associated with the RDT_RESOURCE_L3 resource. Growing support for additional events will be easier with some restructuring to have a single point in file system code where all attributes of all events are defined. Place all event descriptions into an array mon_event_all[]. Doing this has the beneficial side effect of removing the need for rdt_resource::evt_list. Add resctrl_event_id::QOS_FIRST_EVENT for a lower bound on range checks for event ids and as the starting index to scan mon_event_all[]. Drop the code that builds evt_list and change the two places where the list is scanned to scan mon_event_all[] instead using a new helper macro for_each_mon_event(). Architecture code now informs file system code which events are available with resctrl_enable_mon_event(). Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Fenghua Yu <fenghuay@nvidia.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/cover.1757108044.git.babu.moger@amd.com
2025-09-14	Merge tag 'x86-urgent-2025-09-14' of ↵	Linus Torvalds	1	-11/+14
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Fix a CPU topology parsing bug on AMD guests, and address a lockdep warning in the resctrl filesystem" * tag 'x86-urgent-2025-09-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fs/resctrl: Eliminate false positive lockdep warning when reading SNC counters x86/cpu/topology: Always try cpu_parse_topology_ext() on AMD/Hygon
2025-09-13	x86/kexec: fix potential cmem->ranges out of memory	fuqiang wang	1	-4/+19
	In memmap_exclude_ranges(), elfheader will be excluded from crashk_res. In the current x86 architecture code, the elfheader is always allocated at crashk_res.start. It seems that there won't be a new split range. But it depends on the allocation position of elfheader in crashk_res. To avoid potential out of memory in future, add a extra slot. Otherwise loading the kdump kernel will fail because crash_exclude_mem_range will return -ENOMEM. random kexec_buf for passing dm crypt keys may cause a range split too, add another extra slot here. The similar issue also exists in fill_up_crash_elf_data(). The range to be excluded is [0, 1M], start (0) is special and will not appear in the middle of existing cmem->ranges[]. But in cast the low 1M could be changed in the future, add a extra slot too. Previous discussions: [1] https://lore.kernel.org/kexec/ZXk2oBf%2FT1Ul6o0c@MiWiFi-R3L-srv/ [2] https://lore.kernel.org/kexec/273284e8-7680-4f5f-8065-c5d780987e59@easystack.cn/ [3] https://lore.kernel.org/kexec/ZYQ6O%2F57sHAPxTHm@MiWiFi-R3L-srv/ Link: https://lkml.kernel.org/r/20250904093855.1180154-1-coxu@redhat.com Signed-off-by: fuqiang wang <fuqiang.wang@easystack.cn> Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Coiby Xu <coxu@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13	x86/crash: remove redundant 0 value initialization	Liao Yuanhong	1	-2/+0
	The crash_mem struct is already zeroed by vzalloc(). It's redundant to initialize cmem->nr_ranges to 0. Link: https://lkml.kernel.org/r/20250818123530.635234-1-liaoyuanhong@vivo.com Signed-off-by: Liao Yuanhong <liaoyuanhong@vivo.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Coiby Xu <coxu@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-13	x86/kexec: carry forward the boot DTB on kexec	Brian Mak	1	-3/+44
	Currently, the kexec_file_load syscall on x86 does not support passing a device tree blob to the new kernel. Some embedded x86 systems use device trees. On these systems, failing to pass a device tree to the new kernel causes a boot failure. To add support for this, we copy the behavior of ARM64 and PowerPC and copy the current boot's device tree blob for use in the new kernel. We do this on x86 by passing the device tree blob as a setup_data entry in accordance with the x86 boot protocol. This behavior is gated behind the KEXEC_FILE_FORCE_DTB flag. Link: https://lkml.kernel.org/r/20250805211527.122367-3-makb@juniper.net Signed-off-by: Brian Mak <makb@juniper.net> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Dave Young <dyoung@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-12	x86/bugs: Add attack vector controls for VMSCAPE	David Kaplan	1	-4/+10
	Use attack vector controls to select whether VMSCAPE requires mitigation, similar to other bugs. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-09-11	x86/kvm: Prefer native qspinlock for dedicated vCPUs irrespective of PV_UNHALT	Li RongQing	1	-10/+10
	The commit b2798ba0b876 ("KVM: X86: Choose qspinlock when dedicated physical CPUs are available") states that when PV_DEDICATED=1 (vCPU has dedicated pCPU), qspinlock should be preferred regardless of PV_UNHALT. However, the current implementation doesn't reflect this: when PV_UNHALT=0, we still use virt_spin_lock() even with dedicated pCPUs. This is suboptimal because: 1. Native qspinlocks should outperform virt_spin_lock() for dedicated vCPUs irrespective of HALT exiting 2. virt_spin_lock() should only be preferred when vCPUs may be preempted (non-dedicated case) So reorder the PV spinlock checks to: 1. First handle dedicated pCPU case (disable virt_spin_lock_key) 2. Second check single CPU, and nopvspin configuration 3. Only then check PV_UNHALT support This ensures we always use native qspinlock for dedicated vCPUs, delivering pretty performance gains at high contention levels. Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Tested-by: Wangyang Guo <wangyang.guo@intel.com> Link: https://lore.kernel.org/r/20250722110005.4988-1-lirongqing@baidu.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-11	x86/kvm: Make kvm_async_pf_task_wake() a local static helper	Sean Christopherson	1	-2/+1
	Make kvm_async_pf_task_wake() static and drop its export, as the symbol is only referenced from within kvm.c. No functional change intended. Link: https://lore.kernel.org/r/20250729153901.564123-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-11	x86/kvm: Force legacy PCI hole to UC when overriding MTRRs for TDX/SNP	Sean Christopherson	1	-2/+19
	When running as an SNP or TDX guest under KVM, force the legacy PCI hole, i.e. memory between Top of Lower Usable DRAM and 4GiB, to be mapped as UC via a forced variable MTRR range. In most KVM-based setups, legacy devices such as the HPET and TPM are enumerated via ACPI. ACPI enumeration includes a Memory32Fixed entry, and optionally a SystemMemory descriptor for an OperationRegion, e.g. if the device needs to be accessed via a Control Method. If a SystemMemory entry is present, then the kernel's ACPI driver will auto-ioremap the region so that it can be accessed at will. However, the ACPI spec doesn't provide a way to enumerate the memory type of SystemMemory regions, i.e. there's no way to tell software that a region must be mapped as UC vs. WB, etc. As a result, Linux's ACPI driver always maps SystemMemory regions using ioremap_cache(), i.e. as WB on x86. The dedicated device drivers however, e.g. the HPET driver and TPM driver, want to map their associated memory as UC or WC, as accessing PCI devices using WB is unsupported. On bare metal and non-CoCO, the conflicting requirements "work" as firmware configures the PCI hole (and other device memory) to be UC in the MTRRs. So even though the ACPI mappings request WB, they are forced to UC- in the kernel's tracking due to the kernel properly handling the MTRR overrides, and thus are compatible with the drivers' requested WC/UC-. With force WB MTRRs on SNP and TDX guests, the ACPI mappings get their requested WB if the ACPI mappings are established before the dedicated driver code attempts to initialize the device. E.g. if acpi_init() runs before the corresponding device driver is probed, ACPI's WB mapping will "win", and result in the driver's ioremap() failing because the existing WB mapping isn't compatible with the requested WC/UC-. E.g. when a TPM is emulated by the hypervisor (ignoring the security implications of relying on what is allegedly an untrusted entity to store measurements), the TPM driver will request UC and fail: [ 1.730459] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0 [ 1.732780] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12 Note, the '0x2' and '0x0' values refer to "enum page_cache_mode", not x86's memtypes (which frustratingly are an almost pure inversion; 2 == WB, 0 == UC). E.g. tracing mapping requests for TPM TIS yields: Mapping TPM TIS with req_type = 0 WARNING: CPU: 22 PID: 1 at arch/x86/mm/pat/memtype.c:530 memtype_reserve+0x2ab/0x460 Modules linked in: CPU: 22 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.16.0-rc7+ #2 VOLUNTARY Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/29/2025 RIP: 0010:memtype_reserve+0x2ab/0x460 __ioremap_caller+0x16d/0x3d0 ioremap_cache+0x17/0x30 x86_acpi_os_ioremap+0xe/0x20 acpi_os_map_iomem+0x1f3/0x240 acpi_os_map_memory+0xe/0x20 acpi_ex_system_memory_space_handler+0x273/0x440 acpi_ev_address_space_dispatch+0x176/0x4c0 acpi_ex_access_region+0x2ad/0x530 acpi_ex_field_datum_io+0xa2/0x4f0 acpi_ex_extract_from_field+0x296/0x3e0 acpi_ex_read_data_from_field+0xd1/0x460 acpi_ex_resolve_node_to_value+0x2ee/0x530 acpi_ex_resolve_to_value+0x1f2/0x540 acpi_ds_evaluate_name_path+0x11b/0x190 acpi_ds_exec_end_op+0x456/0x960 acpi_ps_parse_loop+0x27a/0xa50 acpi_ps_parse_aml+0x226/0x600 acpi_ps_execute_method+0x172/0x3e0 acpi_ns_evaluate+0x175/0x5f0 acpi_evaluate_object+0x213/0x490 acpi_evaluate_integer+0x6d/0x140 acpi_bus_get_status+0x93/0x150 acpi_add_single_object+0x43a/0x7c0 acpi_bus_check_add+0x149/0x3a0 acpi_bus_check_add_1+0x16/0x30 acpi_ns_walk_namespace+0x22c/0x360 acpi_walk_namespace+0x15c/0x170 acpi_bus_scan+0x1dd/0x200 acpi_scan_init+0xe5/0x2b0 acpi_init+0x264/0x5b0 do_one_initcall+0x5a/0x310 kernel_init_freeable+0x34f/0x4f0 kernel_init+0x1b/0x200 ret_from_fork+0x186/0x1b0 ret_from_fork_asm+0x1a/0x30 </TASK> The above traces are from a Google-VMM based VM, but the same behavior happens with a QEMU based VM that is modified to add a SystemMemory range for the TPM TIS address space. The only reason this doesn't cause problems for HPET, which appears to require a SystemMemory region, is because HPET gets special treatment via x86_init.timers.timer_init(), and so gets a chance to create its UC- mapping before acpi_init() clobbers things. Disabling the early call to hpet_time_init() yields the same behavior for HPET: [ 0.318264] ioremap error for 0xfed00000-0xfed01000, requested 0x2, got 0x0 Hack around the ACPI gap by forcing the legacy PCI hole to UC when overriding the (virtual) MTRRs for CoCo guest, so that ioremap handling of MTRRs naturally kicks in and forces the ACPI mappings to be UC. Note, the requested/mapped memtype doesn't actually matter in terms of accessing the device. In practically every setup, legacy PCI devices are emulated by the hypervisor, and accesses are intercepted and handled as emulated MMIO, i.e. never access physical memory and thus don't have an effective memtype. Even in a theoretical setup where such devices are passed through by the host, i.e. point at real MMIO memory, it is KVM's (as the hypervisor) responsibility to force the memory to be WC/UC, e.g. via EPT memtype under TDX or real hardware MTRRs under SNP. Not doing so cannot work, and the hypervisor is highly motivated to do the right thing as letting the guest access hardware MMIO with WB would likely result in a variety of fatal #MCs. In other words, forcing the range to be UC is all about coercing the kernel's tracking into thinking that it has established UC mappings, so that the ioremap code doesn't reject mappings from e.g. the TPM driver and thus prevent the driver from loading and the device from functioning. Note #2, relying on guest firmware to handle this scenario, e.g. by setting virtual MTRRs and then consuming them in Linux, is not a viable option, as the virtual MTRR state is managed by the untrusted hypervisor, and because OVMF at least has stopped programming virtual MTRRs when running as a TDX guest. Link: https://lore.kernel.org/all/8137d98e-8825-415b-9282-1d2a115bb51a@linux.intel.com Fixes: 8e690b817e38 ("x86/kvm: Override default caching mode for SEV-SNP and TDX") Cc: stable@vger.kernel.org Cc: Peter Gonda <pgonda@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Jürgen Groß <jgross@suse.com> Cc: Korakit Seemakhupt <korakit@google.com> Cc: Jianxiong Gao <jxgao@google.com> Cc: Nikolay Borisov <nik.borisov@suse.com> Suggested-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Korakit Seemakhupt <korakit@google.com> Link: https://lore.kernel.org/r/20250828005249.39339-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-11	x86/mce: Add a clear_bank() helper	Yazen Ghannam	3	-5/+18
	Add a helper at the end of the MCA polling function to collect vendor and/or feature actions. Start with a basic skeleton for now. Actions for AMD thresholding and deferred errors will be added later. [ bp: Drop the obvious comment too. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
2025-09-11	x86/mce: Move machine_check_poll() status checks to helper functions	Yazen Ghannam	1	-40/+48
	There are a number of generic and vendor-specific status checks in machine_check_poll(). These are used to determine if an error should be skipped. Move these into helper functions. Future vendor-specific checks will be added to the helpers. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
2025-09-11	x86/mce: Separate global and per-CPU quirks	Yazen Ghannam	3	-62/+65
	Many quirks are global configuration settings and a handful apply to each CPU. Move the per-CPU quirks to vendor init to execute them on each online CPU. Set the global quirks during BSP-only init so they're only executed once and early. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
2025-09-11	x86/mce: Do 'UNKNOWN' vendor check early	Yazen Ghannam	1	-10/+8
	The 'UNKNOWN' vendor check is handled as a quirk that is run on each online CPU. However, all CPUs are expected to have the same vendor. Move the 'UNKNOWN' vendor check to the BSP-only init so it is done early and once. Remove the unnecessary return value from the quirks check. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
2025-09-11	x86/mce: Define BSP-only SMCA init	Yazen Ghannam	3	-0/+11
	Currently, on AMD systems, MCA interrupt handler functions are set during CPU init. However, the functions only need to be set once for the whole system. Assign the handlers only during BSP init. Do so only for SMCA systems to maintain the old behavior for legacy systems. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
2025-09-11	x86/mce: Define BSP-only init	Yazen Ghannam	3	-10/+22
	Currently, MCA initialization is executed identically on each CPU as they are brought online. However, a number of MCA initialization tasks only need to be done once. Define a function to collect all 'global' init tasks and call this from the BSP only. Start with CPU features. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
2025-09-11	x86/mce: Set CR4.MCE last during init	Yazen Ghannam	1	-2/+3
	Set the CR4.MCE bit as the last step during init. This brings the MCA init order closer to what is described in the x86 docs. x86 docs: AMD Intel MCG_CTL MCA_CONFIG MCG_EXT_CTL MCi_CTL MCi_CTL MCG_CTL CR4.MCE CR4.MCE Current Linux: AMD Intel CR4.MCE CR4.MCE MCG_CTL MCG_CTL MCA_CONFIG MCG_EXT_CTL MCi_CTL MCi_CTL Updated Linux: AMD Intel MCG_CTL MCG_CTL MCA_CONFIG MCG_EXT_CTL MCi_CTL MCi_CTL CR4.MCE CR4.MCE The new init flow will match Intel's docs, but there will still be a mismatch for AMD regarding MCG_CTL. However, there is no known issue with this ordering, so leave it for now. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/20250908-wip-mca-updates-v6-0-eef5d6c74b9c@amd.com
2025-09-10	Merge tag 'vmscape-for-linus-20250904' of ↵	Linus Torvalds	2	-113/+258
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull vmescape mitigation fixes from Dave Hansen: "Mitigate vmscape issue with indirect branch predictor flushes. vmscape is a vulnerability that essentially takes Spectre-v2 and attacks host userspace from a guest. It particularly affects hypervisors like QEMU. Even if a hypervisor may not have any sensitive data like disk encryption keys, guest-userspace may be able to attack the guest-kernel using the hypervisor as a confused deputy. There are many ways to mitigate vmscape using the existing Spectre-v2 defenses like IBRS variants or the IBPB flushes. This series focuses solely on IBPB because it works universally across vendors and all vulnerable processors. Further work doing vendor and model-specific optimizations can build on top of this if needed / wanted. Do the normal issue mitigation dance: - Add the CPU bug boilerplate - Add a list of vulnerable CPUs - Use IBPB to flush the branch predictors after running guests" * tag 'vmscape-for-linus-20250904' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vmscape: Add old Intel CPUs to affected list x86/vmscape: Warn when STIBP is disabled with SMT x86/bugs: Move cpu_bugs_smt_update() down x86/vmscape: Enable the mitigation x86/vmscape: Add conditional IBPB mitigation x86/vmscape: Enumerate VMSCAPE bug Documentation/hw-vuln: Add VMSCAPE documentation
2025-09-08	clocksource: hyper-v: Skip unnecessary checks for the root partition	Wei Liu	1	-1/+10
	The HV_ACCESS_TSC_INVARIANT bit is always zero when Linux runs as the root partition. The root partition will see directly what the hardware provides. The old logic in ms_hyperv_init_platform caused the native TSC clock source to be incorrectly marked as unstable on x86. Fix it. Skip the unnecessary checks in code for the root partition. Add one extra comment in code to clarify the behavior. Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-08	x86/cpu/topology: Always try cpu_parse_topology_ext() on AMD/Hygon	K Prateek Nayak	1	-11/+14
	Support for parsing the topology on AMD/Hygon processors using CPUID leaf 0xb was added in 3986a0a805e6 ("x86/CPU/AMD: Derive CPU topology from CPUID function 0xB when available"). In an effort to keep all the topology parsing bits in one place, this commit also introduced a pseudo dependency on the TOPOEXT feature to parse the CPUID leaf 0xb. The TOPOEXT feature (CPUID 0x80000001 ECX[22]) advertises the support for Cache Properties leaf 0x8000001d and the CPUID leaf 0x8000001e EAX for "Extended APIC ID" however support for 0xb was introduced alongside the x2APIC support not only on AMD [1], but also historically on x86 [2]. Similar to 0xb, the support for extended CPU topology leaf 0x80000026 too does not depend on the TOPOEXT feature. The support for these leaves is expected to be confirmed by ensuring leaf <= {extended_}cpuid_level and then parsing the level 0 of the respective leaf to confirm EBX[15:0] (LogProcAtThisLevel) is non-zero as stated in the definition of "CPUID_Fn0000000B_EAX_x00 [Extended Topology Enumeration] (Core::X86::Cpuid::ExtTopEnumEax0)" in Processor Programming Reference (PPR) for AMD Family 19h Model 01h Rev B1 Vol1 [3] Sec. 2.1.15.1 "CPUID Instruction Functions". This has not been a problem on baremetal platforms since support for TOPOEXT (Fam 0x15 and later) predates the support for CPUID leaf 0xb (Fam 0x17[Zen2] and later), however, for AMD guests on QEMU, the "x2apic" feature can be enabled independent of the "topoext" feature where QEMU expects topology and the initial APICID to be parsed using the CPUID leaf 0xb (especially when number of cores > 255) which is populated independent of the "topoext" feature flag. Unconditionally call cpu_parse_topology_ext() on AMD and Hygon processors to first parse the topology using the XTOPOLOGY leaves (0x80000026 / 0xb) before using the TOPOEXT leaf (0x8000001e). While at it, break down the single large comment in parse_topology_amd() to better highlight the purpose of each CPUID leaf. Fixes: 3986a0a805e6 ("x86/CPU/AMD: Derive CPU topology from CPUID function 0xB when available") Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org # Only v6.9 and above; depends on x86 topology rewrite Link: https://lore.kernel.org/lkml/1529686927-7665-1-git-send-email-suravee.suthikulpanit@amd.com/ [1] Link: https://lore.kernel.org/lkml/20080818181435.523309000@linux-os.sc.intel.com/ [2] Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 [3]
2025-09-05	x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum	Kai Huang	1	-0/+16
	Some early TDX-capable platforms have an erratum: A kernel partial write (a write transaction of less than cacheline lands at memory controller) to TDX private memory poisons that memory, and a subsequent read triggers a machine check. On those platforms, the old kernel must reset TDX private memory before jumping to the new kernel, otherwise the new kernel may see unexpected machine check. Currently the kernel doesn't track which page is a TDX private page. For simplicity just fail kexec/kdump for those platforms. Leverage the existing machine_kexec_prepare() to fail kexec/kdump by adding the check of the presence of the TDX erratum (which is only checked for if the kernel is built with TDX host support). This rejects kexec/kdump when the kernel is loading the kexec/kdump kernel image. The alternative is to reject kexec/kdump when the kernel is jumping to the new kernel. But for kexec this requires adding a new check (e.g., arch_kexec_allowed()) in the common code to fail kernel_kexec() at early stage. Kdump (crash_kexec()) needs similar check, but it's hard to justify because crash_kexec() is not supposed to abort. It's feasible to further relax this limitation, i.e., only fail kexec when TDX is actually enabled by the kernel. But this is still a half measure compared to resetting TDX private memory so just do the simplest thing for now. The impact to userspace is the users will get an error when loading the kexec/kdump kernel image: kexec_load failed: Operation not supported This might be confusing to the users, thus also print the reason in the dmesg: [..] kexec: Not allowed on platform with tdx_pw_mce bug. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Link: https://lore.kernel.org/all/20250901160930.1785244-5-pbonzini%40redhat.com
2025-09-05	x86/sme: Use percpu boolean to control WBINVD during kexec	Kai Huang	4	-20/+48
	TL;DR: Prepare to unify how TDX and SME do cache flushing during kexec by making a percpu boolean control whether to do the WBINVD. -- Background -- On SME platforms, dirty cacheline aliases with and without encryption bit can coexist, and the CPU can flush them back to memory in random order. During kexec, the caches must be flushed before jumping to the new kernel otherwise the dirty cachelines could silently corrupt the memory used by the new kernel due to different encryption property. TDX also needs a cache flush during kexec for the same reason. It would be good to have a generic way to flush the cache instead of scattering checks for each feature all around. When SME is enabled, the kernel basically encrypts all memory including the kernel itself and a simple memory write from the kernel could dirty cachelines. Currently, the kernel uses WBINVD to flush the cache for SME during kexec in two places: 1) the one in stop_this_cpu() for all remote CPUs when the kexec-ing CPU stops them; 2) the one in the relocate_kernel() where the kexec-ing CPU jumps to the new kernel. -- Solution -- Unlike SME, TDX can only dirty cachelines when it is used (i.e., when SEAMCALLs are performed). Since there are no more SEAMCALLs after the aforementioned WBINVDs, leverage this for TDX. To unify the approach for SME and TDX, use a percpu boolean to indicate the cache may be in an incoherent state and needs flushing during kexec, and set the boolean for SME. TDX can then leverage it. While SME could use a global flag (since it's enabled at early boot and enabled on all CPUs), the percpu flag fits TDX better: The percpu flag can be set when a CPU makes a SEAMCALL, and cleared when another WBINVD on the CPU obviates the need for a kexec-time WBINVD. Saving kexec-time WBINVD is valuable, because there is an existing race[] where kexec could proceed while another CPU is active. WBINVD could make this race worse, so it's worth skipping it when possible. -- Side effect to SME -- Today the first WBINVD in the stop_this_cpu() is performed when SME is supported* by the platform, and the second WBINVD is done in relocate_kernel() when SME is activated by the kernel. Make things simple by changing to do the second WBINVD when the platform supports SME. This allows the kernel to simply turn on this percpu boolean when bringing up a CPU by checking whether the platform supports SME. No other functional change intended. [*] The aforementioned race: During kexec native_stop_other_cpus() is called to stop all remote CPUs before jumping to the new kernel. native_stop_other_cpus() firstly sends normal REBOOT vector IPIs to stop remote CPUs and waits them to stop. If that times out, it sends NMI to stop the CPUs that are still alive. The race happens when native_stop_other_cpus() has to send NMIs and could potentially result in the system hang (for more information please see [1]). Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/kvm/b963fcd60abe26c7ec5dc20b42f1a2ebbcc72397.1750934177.git.kai.huang@intel.com/ [1] Link: https://lore.kernel.org/all/20250901160930.1785244-3-pbonzini%40redhat.com
2025-09-05	x86/kexec: Consolidate relocate_kernel() function parameters	Kai Huang	2	-19/+28
	During kexec, the kernel jumps to the new kernel in relocate_kernel(), which is implemented in assembly and both 32-bit and 64-bit have their own version. Currently, for both 32-bit and 64-bit, the last two parameters of the relocate_kernel() are both 'unsigned int' but actually they only convey a boolean, i.e., one bit information. The 'unsigned int' has enough space to carry two bits information therefore there's no need to pass the two booleans in two separate 'unsigned int'. Consolidate the last two function parameters of relocate_kernel() into a single 'unsigned int' and pass flags instead. Only consolidate the 64-bit version albeit the similar optimization can be done for the 32-bit version too. Don't bother changing the 32-bit version while it is working (since assembly code change is required). Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/all/20250901160930.1785244-2-pbonzini%40redhat.com
2025-09-05	x86/mce: Remove __mcheck_cpu_init_early()	Yazen Ghannam	2	-14/+4
	The __mcheck_cpu_init_early() function was introduced so that some vendor-specific features are detected before the first MCA polling event done in __mcheck_cpu_init_generic(). Currently, __mcheck_cpu_init_early() is only used on AMD-based systems and additional code will be needed to support various system configurations. However, the current and future vendor-specific code should be done during vendor init. This keeps all the vendor code in a common location and simplifies the generic init flow. Move all the __mcheck_cpu_init_early() code into mce_amd_feature_init(). Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250825-wip-mca-updates-v5-6-865768a2eef8@amd.com
2025-09-05	x86/mce: Cleanup bank processing on init	Borislav Petkov	1	-45/+18
	Unify the bank preparation into __mcheck_cpu_init_clear_banks(), rename that function to what it does now - prepares banks. Do this so that generic and vendor banks init goes first so that settings done during that init can take effect before the first bank polling takes place. Move __mcheck_cpu_check_banks() into __mcheck_cpu_init_prepare_banks() as it already loops over the banks. The MCP_DONTLOG flag is no longer needed, since the MCA polling function is now called only if boot-time logging should be done. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250825-wip-mca-updates-v5-5-865768a2eef8@amd.com
2025-09-05	x86/mce/amd: Put list_head in threshold_bank	Yazen Ghannam	1	-31/+12
	The threshold_bank structure is a container for one or more threshold_block structures. Currently, the container has a single pointer to the 'first' threshold_block structure which then has a linked list of the remaining threshold_block structures. This results in an extra level of indirection where the 'first' block is checked before iterating over the remaining blocks. Remove the indirection by including the head of the block list in the threshold_bank structure which already acts as a container for all the bank's thresholding blocks. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-8-236dd74f645f@amd.com
2025-09-05	x86/mce/amd: Remove smca_banks_map	Yazen Ghannam	1	-41/+9
	The MCx_MISC0[BlkPtr] field was used on legacy systems to hold a register offset for the next MCx_MISC* register. In this way, an implementation-specific number of registers can be discovered at runtime. The MCAX/SMCA register space simplifies this by always including the MCx_MISC[1-4] registers. The MCx_MISC0[BlkPtr] field is used to indicate (true/false) whether any MCx_MISC[1-4] registers are present. Currently, MCx_MISC0[BlkPtr] is checked early and cached to be used during sysfs init later. This is unnecessary as the MCx_MISC0 register is read again later anyway. Remove the smca_banks_map variable as it is effectively redundant, and use a direct register/bit check instead. [ bp: Zap smca_get_block_address() too. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250825-wip-mca-updates-v5-3-865768a2eef8@amd.com
2025-09-05	x86/mce/amd: Remove return value for mce_threshold_{create,remove}_device()	Yazen Ghannam	2	-12/+14
	The return values are not checked, so set return type to 'void'. Also, move function declarations to internal.h, since these functions are only used within the MCE subsystem. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-6-236dd74f645f@amd.com
2025-09-05	x86/mce/amd: Rename threshold restart function	Yazen Ghannam	1	-6/+6
	It operates per block rather than per bank. So rename it for clarity. No functional changes. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-5-236dd74f645f@amd.com
2025-09-05	Merge branch 'x86/apic' into x86/sev, to resolve conflict	Ingo Molnar	4	-51/+491
	Conflicts: arch/x86/include/asm/sev-internal.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-09-04	x86,retpoline: Optimize patch_retpoline()	Peter Zijlstra	1	-16/+26
	Currently the very common retpoline: "CS CALL __x86_indirect_thunk_r11" is transformed into "CALL R11; NOP3" for eIBRS/BHI_NO parts. Similarly, paranoid fineibt has: "CALL R11; NOP". Recognise that CS stuffing can avoid the extra NOP. However, due to prefix decode penalties, make sure to not emit too many CS prefixes. Notably: "CS CALL __x86_indirect_thunk_rax" must not become "CS CS CS CS CALL RAX". Prefix decode penalties are typically many more cycles than decoding an extra NOP. Additionally, if the retpoline is a tail-call, the "JMP %\reg" should be followed by INT3 for straight-line-speculation mitigation, since emit_indirect() now has a length argument, move this into emit_indirect() such that other users (paranoid-fineibt) also do this. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250902104627.GM4068168@noisy.programming.kicks-ass.net
2025-09-04	x86,ibt: Use UDB instead of 0xEA	Peter Zijlstra	2	-97/+128
	A while ago [0] FineIBT started using the 0xEA instruction to raise #UD. All existing parts will generate #UD in 64bit mode on that instruction. However; Intel/AMD have not blessed using this instruction, it is on their 'reserved' opcode list for future use. Peter Anvin worked the committees and got use of 0xD6 blessed, it shall be called UDB (per the next SDM or so), and it being a single byte instruction is easy to slip into a single byte immediate -- as is done by this very patch. Reworking the FineIBT code to use UDB wasn't entirely trivial. Notably the FineIBT-BHI1 case ran out of bytes. In order to condense the encoding some it was required to move the hash register from R10D to EAX (thanks hpa!). Per the x86_64 ABI, RAX is used to pass the number of vector registers for vararg function calls -- something that should not happen in the kernel. More so, the kernel is built with -mskip-rax-setup, which should leave RAX completely unused, allowing its re-use. [ For BPF; while the bpf2bpf tail-call uses RAX in its calling convention, that does not use CFI and is unaffected. Only the 'regular' C->BPF transition is covered by CFI. ] The ENDBR poison value is changed from 'OSP NOP3' to 'NOPL -42(%RAX)', this is basically NOP4 but with UDB as its immediate. As such it is still a non-standard NOP value unique to prior ENDBR sites, but now also provides UDB. Per Agner Fog's optimization guide, Jcc is assumed not-taken. That is, the expected path should be the fallthrough case for improved throughput. Since the preamble now relies on the ENDBR poison to provide UDB, the code is changed to write the poison right along with the initial preamble -- this is possible because the ITS mitigation already disabled IBT over rewriting the CFI scheme. The scheme in detail: Preamble: FineIBT FineIBT-BHI1 FineIBT-BHI __cfi_\func: __cfi_\func: __cfi_\func: endbr endbr endbr subl $0x12345678, %eax subl $0x12345678, %eax subl $0x12345678, %eax jne.d32,np \func+3 cmovne %rax, %rdi cs cs call __bhi_args_N jne.d8,np \func+3 \func: \func: \func: nopl -42(%rax) nopl -42(%rax) nopl -42(%rax) Notably there are 7 bytes available after the SUBL; this enables the BHI1 case to fit without the nasty overlapping case it had previously. The !BHI case uses Jcc.d32,np to consume all 7 bytes without the need for an additional NOP, while the BHI case uses CS padding to align the CALL with the end of the preamble such that it returns to \func+0. Caller: FineIBT Paranoid-FineIBT fineibt_caller: fineibt_caller: mov $0x12345678, %eax mov $0x12345678, %eax lea -10(%r11), %r11 cmp -0x11(%r11), %eax nop5 cs lea -0x10(%r11), %r11 retpoline: retpoline: cs call __x86_indirect_thunk_r11 jne fineibt_caller+0xd call %r11 nop Notably this is before apply_retpolines() which will fix up the retpoline call -- since all parts with IBT also have eIBRS (lets ignore ITS). Typically the retpoline site is rewritten (when still intact) into: call %r11 nop3 [0] 06926c6cdb95 ("x86/ibt: Optimize the FineIBT instruction sequence") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250901191307.GI4067720@noisy.programming.kicks-ass.net
2025-09-04	x86/cfi: Add "debug" option to "cfi=" bootparam	Kees Cook	1	-0/+16
	Add "debug" option for "cfi=" bootparam to get details on early CFI initialization steps so future Kees can find breakage easier. Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250904034656.3670313-5-kees@kernel.org
2025-09-04	x86/cfi: Standardize on common "CFI:" prefix for CFI reports	Kees Cook	1	-7/+10
	Use a regular "CFI:" prefix for CFI reports during alternatives setup, including reporting when nothing has happened (i.e. CONFIG_FINEIBT=n). Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/r/20250904034656.3670313-4-kees@kernel.org