path: root/arch/x86/kvm/x86.c
Age  |  Commit message  |  Author  |  Files, lines (-/+)
2025-10-18  Merge tag 'kvm-x86-fixes-6.18-rc2' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)  1 file, -3/+4
KVM x86 fixes for 6.18:

 - Expand the KVM_PRE_FAULT_MEMORY selftest to add a regression test for the bug fixed by commit 3ccbf6f47098 ("KVM: x86/mmu: Return -EAGAIN if userspace deletes/moves memslot during prefault")
 - Don't try to get PMU capabilities from perf when running on a system with hybrid CPUs/PMUs, as perf will rightly WARN.
 - Rework KVM_CAP_GUEST_MEMFD_MMAP (newly introduced in 6.18) into a more generic KVM_CAP_GUEST_MEMFD_FLAGS
 - Add a guest_memfd INIT_SHARED flag and require userspace to explicitly set said flag to initialize memory as SHARED, irrespective of MMAP. The behavior merged in 6.18 is that enabling mmap() implicitly initializes memory as SHARED, which would result in an ABI collision for x86 CoCo VMs, as their memory is currently always initialized PRIVATE.
 - Allow mmap() on guest_memfd for x86 CoCo VMs, i.e. on VMs with private memory, to enable testing such setups, i.e. to hopefully flush out any other lurking ABI issues before 6.18 is officially released.
 - Add testcases to the guest_memfd selftest to cover guest_memfd without MMAP, and host userspace accesses to mmap()'d private memory.
2025-10-10  KVM: guest_memfd: Allow mmap() on guest_memfd for x86 VMs with private memory  (Sean Christopherson)  1 file, -3/+4
Allow mmap() on guest_memfd instances for x86 VMs with private memory, as the need to track private vs. shared state in the guest_memfd instance is only pertinent to INIT_SHARED. Doing mmap() on private memory isn't terribly useful (yet!), but it's now possible, and will be desirable when guest_memfd gains support for other VMA-based syscalls, e.g. mbind() to set NUMA policy. Lift the restriction now, before MMAP support is officially released, so that KVM doesn't need to add another capability to enumerate support for mmap() on private memory. Fixes: 3d3a04fad25a ("KVM: Allow and advertise support for host mmap() on guest_memfd files") Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20251003232606.4070510-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-10-07  Merge tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux  (Linus Torvalds)  1 file, -2/+1
Pull hyperv updates from Wei Liu:

 - Unify guest entry code for KVM and MSHV (Sean Christopherson)
 - Switch Hyper-V MSI domain to use msi_create_parent_irq_domain() (Nam Cao)
 - Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV (Mukesh Rathor)
 - Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov)
 - Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna Kumar T S M)
 - Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari, Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley)

* tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
  hyperv: Remove the spurious null directive line
  MAINTAINERS: Mark hyperv_fb driver Obsolete
  fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver
  Drivers: hv: Make CONFIG_HYPERV bool
  Drivers: hv: Add CONFIG_HYPERV_VMBUS option
  Drivers: hv: vmbus: Fix typos in vmbus_drv.c
  Drivers: hv: vmbus: Fix sysfs output format for ring buffer index
  Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()
  x86/hyperv: Switch to msi_create_parent_irq_domain()
  mshv: Use common "entry virt" APIs to do work in root before running guest
  entry: Rename "kvm" entry code assets to "virt" to genericize APIs
  entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper
  mshv: Handle NEED_RESCHED_LAZY before transferring to guest
  x86/hyperv: Add kexec/kdump support on Azure CVMs
  Drivers: hv: Simplify data structures for VMBus channel close message
  Drivers: hv: util: Cosmetic changes for hv_utils_transport.c
  mshv: Add support for a new parent partition configuration
  clocksource: hyper-v: Skip unnecessary checks for the root partition
  hyperv: Add missing field to hv_output_map_device_interrupt
2025-09-30  entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper  (Sean Christopherson)  1 file, -2/+1
Move KVM's morphing of pending signals into userspace exits into KVM proper, and drop the @vcpu param from xfer_to_guest_mode_handle_work(). How KVM responds to -EINTR is a detail that really belongs in KVM itself, and invoking kvm_handle_signal_exit() from kernel code creates an inverted module dependency. E.g. attempting to move kvm_handle_signal_exit() into kvm_main.c would generate a linker error when building kvm.ko as a module.

Dropping KVM details will also allow converting the KVM "entry" code into a more generic virtualization framework so that it can be used when running as a Hyper-V root partition.

Lastly, eliminating usage of "struct kvm_vcpu" outside of KVM is also nice to have for KVM x86 developers, as keeping the details of kvm_vcpu purely within KVM allows changing the layout of the structure without having to boot into a new kernel, e.g. allows rebuilding and reloading kvm.ko with a modified kvm_vcpu structure as part of debug/development. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30  KVM: x86: Export KVM-internal symbols for sub-modules only  (Sean Christopherson)  1 file, -110/+110
Rework almost all of KVM x86's exports to expose symbols only to KVM's vendor modules, i.e. to kvm-{amd,intel}.ko. Keep the generic exports that are guarded by CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, as they're explicitly designed/intended for external usage. Link: https://lore.kernel.org/r/20250919003303.1355064-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-09-30  KVM: x86: Drop pointless exports of kvm_arch_xxx() hooks  (Sean Christopherson)  1 file, -3/+0
Drop the exporting of several kvm_arch_xxx() hooks that are only called from arch-neutral code, i.e. that are only called from kvm.ko. Link: https://lore.kernel.org/r/20250919003303.1355064-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-09-30  Merge tag 'kvm-x86-cet-6.18' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)  1 file, -23/+387
KVM x86 CET virtualization support for 6.18

Add support for virtualizing Control-flow Enforcement Technology (CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks).

CET comprises two distinct features, Shadow Stacks (SHSTK) and Indirect Branch Tracking (IBT), that can be utilized by software to help provide Control-flow Integrity (CFI). SHSTK defends against backward-edge attacks (a.k.a. Return-Oriented Programming (ROP)), while IBT defends against forward-edge attacks (a.k.a. CALL/JMP-Oriented Programming (COP/JOP)). Attackers commonly use ROP and COP/JOP methodologies to redirect the control-flow to unauthorized targets in order to execute small snippets of code, a.k.a. gadgets, of the attacker's choice. By chaining together several gadgets, an attacker can perform arbitrary operations and circumvent the system's defenses.

SHSTK defends against backward-edge attacks, which execute gadgets by modifying the stack to branch to the attacker's target via RET, by providing a second stack that is used exclusively to track control transfer operations. The shadow stack is separate from the data/normal stack, and can be enabled independently in user and kernel mode. When SHSTK is enabled, CALL instructions push the return address on both the data and shadow stack. RET then pops the return address from both stacks and compares the addresses. If the return addresses from the two stacks do not match, the CPU generates a Control Protection (#CP) exception.

IBT defends against forward-edge attacks, which branch to gadgets by executing indirect CALL and JMP instructions with attacker-controlled register or memory state, by requiring the target of indirect branches to start with a special marker instruction, ENDBRANCH. If an indirect branch is executed and the next instruction is not an ENDBRANCH, the CPU generates a #CP. Note, ENDBRANCH behaves as a NOP if IBT is disabled or unsupported.

From a virtualization perspective, CET presents several problems. While SHSTK and IBT have two layers of enabling, a global control in the form of a CR4 bit, and per-feature controls in user and kernel (supervisor) MSRs (U_CET and S_CET respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS. Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable option due to complexity, and outright disallowing use of XSTATE to context switch SHSTK/IBT state would render the features unusable to most guests. To limit the overall complexity without sacrificing performance or usability, simply ignore the potential virtualization hole, but ensure that all paths in KVM treat SHSTK/IBT as usable by the guest if the feature is supported in hardware and the guest has access to at least one of SHSTK or IBT. I.e. allow userspace to advertise one of SHSTK or IBT if both are supported in hardware, even though doing so would allow a misbehaving guest to use the unadvertised feature.

Fully emulating SHSTK and IBT would also require significant complexity, e.g. to track and update branch state for IBT, and shadow stack state for SHSTK. Given that emulating large swaths of the guest code stream isn't necessary on modern CPUs, punt on emulating instructions that meaningfully impact or consume SHSTK or IBT state. However, instead of doing nothing, explicitly reject emulation of such instructions so that KVM's emulator can't be abused to circumvent CET.

Disable support for SHSTK and IBT if KVM is configured such that emulation of arbitrary guest instructions may be required, specifically if Unrestricted Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that is smaller than host.MAXPHYADDR. Lastly, disable SHSTK support if shadow paging is enabled, as the protections for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so that they can't be directly modified by software), i.e. would require non-trivial support in the Shadow MMU. A sketch of these checks follows below.

Note, AMD CPUs currently only support SHSTK. Explicitly disable IBT support so that KVM doesn't over-advertise if AMD CPUs gain IBT, as virtualizing IBT on SVM would require additional KVM modifications.
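As a rough sketch (not the exact patch), the enumeration-side checks described above could be expressed as follows, reusing KVM's existing enable_unrestricted_guest, allow_smaller_maxphyaddr and tdp_enabled knobs:

  /* Sketch: don't advertise CET if KVM may need to emulate arbitrary
   * guest instructions, as the emulator rejects CET-affected instructions. */
  if (!enable_unrestricted_guest || allow_smaller_maxphyaddr) {
          kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
          kvm_cpu_cap_clear(X86_FEATURE_IBT);
  }

  /* Shadow stacks need Writable=0,Dirty=1 protections, which the Shadow
   * MMU doesn't implement, so SHSTK additionally depends on TDP. */
  if (!tdp_enabled)
          kvm_cpu_cap_clear(X86_FEATURE_SHSTK);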
2025-09-30  Merge tag 'kvm-x86-misc-6.18' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)  1 file, -144/+186
KVM x86 changes for 6.18

 - Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw where a misbehaving userspace (a.k.a. syzkaller) could swizzle L1's intercepts and trigger a variety of WARNs in KVM.
 - Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is supposed to exist for v2 PMUs.
 - Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.
 - Clean up KVM's vector hashing code for delivering lowest priority IRQs.
 - Clean up the fastpath handler code to only handle IPIs and WRMSRs that are actually "fast", as opposed to handling those that KVM _hopes_ are fast, and in the process add fastpath support for TSC_DEADLINE writes on AMD CPUs.
 - Clean up a pile of PMU code in anticipation of adding support for mediated vPMUs.
 - Add support for the immediate forms of RDMSR and WRMSRNS, sans full emulator support (KVM should never need to emulate the MSRs outside of forced emulation and other contrived testing scenarios).
 - Clean up the MSR APIs in preparation for CET and FRED virtualization, as well as mediated vPMU support.
 - Reject a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs, as KVM can't faithfully emulate an I/O APIC for such guests.
 - Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation for mediated vPMU support, as KVM will need to recalculate MSR intercepts in response to PMU refreshes for guests with mediated vPMUs.
 - Misc cleanups and minor fixes.
2025-09-30  Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)  1 file, -1/+8
KVM SVM changes for 6.18

 - Require a minimum GHCB version of 2 when starting SEV-SNP guests via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors instead of latent guest failures.
 - Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted host from tampering with the guest's TSC frequency, while still allowing the VMM to configure the guest's TSC frequency prior to launch.
 - Mitigate the potential for TOCTOU bugs when accessing GHCB fields by wrapping all accesses via READ_ONCE().
 - Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a bogus XCR0 value in KVM's software model.
 - Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to avoid leaving behind stale state (thankfully not consumed in KVM).
 - Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE instead of subtly relying on guest_memfd to do the "heavy" lifting.
 - Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's TSC_AUX due to hardware not matching the value cached in the user-return MSR infrastructure.
 - Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported, and clean up the AVIC initialization code along the way.
2025-09-30  Merge tag 'loongarch-kvm-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD  (Paolo Bonzini)  1 file, -0/+9
LoongArch KVM changes for v6.18

 1. Add PTW feature detection on new hardware.
 2. Add sign extension with kernel MMIO/IOCSR emulation.
 3. Improve in-kernel IPI emulation.
 4. Improve in-kernel PCH-PIC emulation.
 5. Move kvm_iocsr tracepoint out of generic code.
2025-09-23  KVM: x86: Add XSS support for CET_KERNEL and CET_USER  (Yang Weijiang)  1 file, -3/+15
Add CET_KERNEL and CET_USER to KVM's set of supported XSS bits when IBT *or* SHSTK is supported. Like CR4.CET, XFEATURE support for IBT and SHSTK is bundled together under the CET umbrella, and thus prone to virtualization holes if KVM or the guest supports only one of IBT or SHSTK, but hardware supports both. However, again like CR4.CET, such virtualization holes are benign from the host's perspective so long as KVM takes care to always honor the "or" logic.

Require CET_KERNEL and CET_USER to come as a pair, and refuse to support IBT or SHSTK if one (or both) of the features is missing, as the (host) kernel expects them to come as a pair, i.e. may get confused and corrupt state if only one of CET_KERNEL or CET_USER is supported. A sketch of the pairing check follows below. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: split to separate patch, write changelog, add XFEATURE_MASK_CET_ALL] Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
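A minimal sketch of the pairing rule, using KVM's existing kvm_cpu_cap_clear() helper and the XFEATURE_MASK_CET_* bits (the exact placement of the check in the series may differ):

  #define XFEATURE_MASK_CET_ALL  (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)

  /* CET_USER and CET_KERNEL must be supported as a pair; otherwise
   * drop SHSTK/IBT and the CET XSS bits entirely. */
  if ((kvm_caps.supported_xss & XFEATURE_MASK_CET_ALL) != XFEATURE_MASK_CET_ALL) {
          kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
          kvm_cpu_cap_clear(X86_FEATURE_IBT);
          kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_ALL;
  }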
2025-09-23  KVM: x86: Emulate SSP[63:32]!=0 #GP(0) for FAR JMP to 32-bit mode  (Sean Christopherson)  1 file, -0/+9
Emulate the Shadow Stack restriction that the current SSP must be a 32-bit value on a FAR JMP from 64-bit mode to compatibility mode. From the SDM's pseudocode for FAR JMP:

  IF ShadowStackEnabled(CPL)
      IF (IA32_EFER.LMA and DEST(segment selector).L) = 0
          (* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
          IF (SSP & 0xFFFFFFFF00000000 != 0)
              THEN #GP(0); FI;
      FI;
  FI;

Note, only the current CPL needs to be considered, as FAR JMP can't be used for inter-privilege level transfers, and KVM rejects emulation of all other far branch instructions when Shadow Stacks are enabled. To give the emulator access to GUEST_SSP, special case handling MSR_KVM_INTERNAL_GUEST_SSP in emulator_get_msr() to treat the access as a host access (KVM doesn't allow guest accesses to internal "MSRs"). The ->get_msr() API is only used for implicit accesses from the emulator, i.e. is only used with hardcoded MSR indices, and so any access to MSR_KVM_INTERNAL_GUEST_SSP is guaranteed to be from KVM, i.e. not from the guest via RDMSR. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
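In C, the emulator-side check boils down to something like the sketch below; emulate_shstk_enabled() and new_desc_long_mode() are hypothetical predicates, while ->get_msr() and emulate_gp() are existing emulator facilities:

  u64 ssp;

  if (emulate_shstk_enabled(ctxt) && !new_desc_long_mode(ctxt)) {
          ctxt->ops->get_msr(ctxt, MSR_KVM_INTERNAL_GUEST_SSP, &ssp);
          if (ssp & 0xFFFFFFFF00000000ULL)
                  return emulate_gp(ctxt, 0);   /* SSP must be in low 4GB */
  }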
2025-09-23  KVM: x86: Don't emulate task switches when IBT or SHSTK is enabled  (Sean Christopherson)  1 file, -7/+28
Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if the guest triggers task switch emulation with Indirect Branch Tracking or Shadow Stacks enabled, as attempting to do the right thing would require non-trivial effort and complexity, KVM doesn't support emulating CET generally, and it's extremely unlikely that any guest will do task switches while also utilizing CET. Defer taking on the complexity until someone cares enough to put in the time and effort to add support. Per the SDM: If shadow stack is enabled, then the SSP of the task is located at the 4 bytes at offset 104 in the 32-bit TSS and is used by the processor to establish the SSP when a task switch occurs from a task associated with this TSS. Note that the processor does not write the SSP of the task initiating the task switch to the TSS of that task, and instead the SSP of the previous task is pushed onto the shadow stack of the new task. Note, per the SDM's pseudocode on TASK SWITCHING, IBT state for the new privilege level is updated. To keep things simple, check both S_CET and U_CET (again, anyone that wants more precise checking can have the honor of implementing support). Reported-by: Binbin Wu <binbin.wu@linux.intel.com> Closes: https://lore.kernel.org/all/819bd98b-2a60-4107-8e13-41f1e4c706b1@linux.intel.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
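A sketch of the rejection logic; the kvm_run fields are the real uAPI, but the CPUID gating and MSR reads are simplified:

  u64 s_cet = 0, u_cet = 0;

  /* Internal reads, i.e. not subject to guest CPUID or MSR filters. */
  kvm_msr_read(vcpu, MSR_IA32_S_CET, &s_cet);
  kvm_msr_read(vcpu, MSR_IA32_U_CET, &u_cet);

  if (s_cet || u_cet) {
          vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
          vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
          return 0;       /* punt to userspace */
  }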
2025-09-23  KVM: VMX: Set host constant supervisor states to VMCS fields  (Yang Weijiang)  1 file, -0/+12
Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields explicitly. Kernel IBT is supported and the setting in MSR_IA32_S_CET is static post-boot (the exception being the BIOS call case, but the vCPU thread never crosses it), so KVM doesn't need to refresh the HOST_S_CET field before every VM-Enter/VM-Exit sequence.

Host supervisor shadow stack is not enabled now and SSP is not accessible to kernel mode, thus it's safe to set the host IA32_INT_SSP_TAB/SSP VMCS fields to 0s. When shadow stack is enabled for CPL3, SSP is reloaded from PL3_SSP before it exits to userspace. Check SDM Vol 2A/B Chapter 3/4 for SYSCALL/SYSRET, SYSENTER/SYSEXIT, RDSSP, CALL, etc.

Prevent KVM module loading if host supervisor shadow stack SHSTK_EN is set in MSR_IA32_S_CET, as KVM cannot co-exist with it correctly. Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: snapshot host S_CET if SHSTK *or* IBT is supported] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23  KVM: VMX: Emulate read and write to CET MSRs  (Yang Weijiang)  1 file, -2/+62
Add an emulation interface for CET MSR access. The emulation code is split into a common part and a vendor-specific part. The former does common checks for MSRs, e.g., accessibility, data validity, etc., then passes the operation on to either the XSAVE-managed MSRs via the helpers or the CET VMCS fields.

SSP can only be read via RDSSP; even writing it requires destructive and potentially faulting operations such as SAVEPREVSSP/RSTORSSP or SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper for the GUEST_SSP field of the VMCS. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: drop call to kvm_set_xstate_msr() for S_CET, consolidate code] Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23  KVM: x86: Enable guest SSP read/write interface with new uAPIs  (Yang Weijiang)  1 file, -4/+33
Add a KVM-defined ONE_REG register, KVM_REG_GUEST_SSP, to let userspace save and restore the guest's Shadow Stack Pointer (SSP). On both Intel and AMD, SSP is a hardware register that can only be accessed by software via dedicated ISA (e.g. RDSSP) or via VMCS/VMCB fields (used by hardware to context switch SSP at entry/exit). As a result, SSP doesn't fit in any of KVM's existing interfaces for saving/restoring state. Internally, treat SSP as a fake/synthetic MSR, as the semantics of writes to SSP follow that of several other Shadow Stack MSRs, e.g. the PLx_SSP MSRs. Use a translation layer to hide the KVM-internal MSR index so that the arbitrary index doesn't become ABI, e.g. so that KVM can rework its implementation as needed, so long as the ONE_REG ABI is maintained. Explicitly reject accesses to SSP if the vCPU doesn't have Shadow Stack support to avoid running afoul of ignore_msrs, which unfortunately applies to host-initiated accesses (which is a discussion for another day). I.e. ensure consistent behavior for KVM-defined registers irrespective of ignore_msrs. Link: https://lore.kernel.org/all/aca9d389-f11e-4811-90cf-d98e345a5cc2@intel.com Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-14-seanjc@google.com Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
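The translation layer is presumably a trivial switch; a sketch with a hypothetical helper name:

  static int kvm_translate_kvm_defined_reg(u64 id, u32 *msr)
  {
          switch (id) {
          case KVM_REG_GUEST_SSP:
                  *msr = MSR_KVM_INTERNAL_GUEST_SSP;  /* index is not ABI */
                  return 0;
          default:
                  return -EINVAL;
          }
  }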
2025-09-23  KVM: x86: Report KVM supported CET MSRs as to-be-saved  (Yang Weijiang)  1 file, -0/+18
Add CET MSRs to the list of MSRs reported to userspace if the feature, i.e. IBT or SHSTK, associated with the MSRs is supported by KVM. Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23  KVM: x86: Add fault checks for guest CR4.CET setting  (Yang Weijiang)  1 file, -0/+6
Check potential faults for CR4.CET setting per Intel SDM requirements. CET can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET == 1 faults if CR0.WP == 0 and setting CR0.WP == 0 fails if CR4.CET == 1. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250919223258.1604852-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
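The two symmetric checks can be sketched as follows (kvm_is_cr{0,4}_bit_set() are existing KVM helpers; exact placement may differ):

  /* In kvm_set_cr4(): CR4.CET=1 requires CR0.WP=1. */
  if ((cr4 & X86_CR4_CET) && !kvm_is_cr0_bit_set(vcpu, X86_CR0_WP))
          return 1;

  /* In kvm_set_cr0(): CR0.WP can't be cleared while CR4.CET=1. */
  if (!(cr0 & X86_CR0_WP) && kvm_is_cr4_bit_set(vcpu, X86_CR4_CET))
          return 1;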
2025-09-23  KVM: x86: Load guest FPU state when accessing XSAVE-managed MSRs  (Sean Christopherson)  1 file, -1/+86
Load the guest's FPU state if userspace is accessing MSRs whose values are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(), to facilitate access to such MSRs. If MSRs supported in kvm_caps.supported_xss are passed through to the guest, the guest MSRs are swapped with the host's before the vCPU exits to userspace and swapped back after it re-enters the kernel, before the next VM-Entry.

Because the modified code is also used for the KVM_GET_MSRS device ioctl(), explicitly check that @vcpu is non-null before attempting to load guest state. The XSAVE-managed MSRs cannot be retrieved via the device ioctl() without loading guest FPU state (which doesn't exist).

Note that guest_cpuid_has() is not queried, as host userspace is allowed to access MSRs that have not been exposed to the guest, e.g. it might do KVM_SET_MSRS prior to KVM_SET_CPUID2. The two helpers live here to make it clear that accessing XSAVE-managed MSRs requires special checks and handling to guarantee the correctness of reads/writes to the MSRs. Co-developed-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: drop S_CET, add big comment, move accessors to x86.c] Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250919223258.1604852-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
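The ioctl plumbing likely reduces to the pattern below; msrs_touch_xstate() is a hypothetical stand-in for whatever predicate decides an FPU load is needed, while kvm_{load,put}_guest_fpu() and __msr_io() are existing x86.c functions:

  bool fpu_loaded = vcpu && msrs_touch_xstate(entries, msrs.nmsrs);

  if (fpu_loaded)                 /* @vcpu is NULL for the device ioctl() */
          kvm_load_guest_fpu(vcpu);

  r = __msr_io(vcpu, &msrs, entries, do_msr);

  if (fpu_loaded)
          kvm_put_guest_fpu(vcpu);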
2025-09-23  KVM: x86: Initialize kvm_caps.supported_xss  (Yang Weijiang)  1 file, -8/+15
Set kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS) if XSAVES is supported. host_xss contains the host-supported xstate feature bits for thread FPU context switch; KVM_SUPPORTED_XSS includes all KVM-enabled XSS feature bits; the resulting value represents the supervisor xstates that are available to the guest and are backed by the host FPU framework for swapping {guest,host} XSAVE-managed registers/MSRs. A sketch of the initialization follows below. [sean: relocate and enhance comment about PT / XSS[8] ] Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
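The initialization itself is presumably a couple of lines along these exact semantics:

  if (boot_cpu_has(X86_FEATURE_XSAVES)) {
          rdmsrl(MSR_IA32_XSS, host_xss);
          kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
  }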
2025-09-23  KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS  (Yang Weijiang)  1 file, -0/+2
Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the current required xstate size when the XSS MSR is modified. CPUID(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled xstate features in (XCR0 | IA32_XSS). The guest can consult this CPUID value before allocating a sufficiently sized XSAVE buffer. Note, KVM does not yet support any XSS-based features, i.e. supported_xss is guaranteed to be zero at this time. Opportunistically skip CPUID updates if the XSS value doesn't change. Suggested-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com> Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
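A sketch of the WRMSR path; the guest_supported_xss mask comes from the follow-up "Check XSS validity against guest CPUIDs" change, and the field names are approximations:

  case MSR_IA32_XSS:
          if (data & ~vcpu->arch.guest_supported_xss)
                  return 1;
          if (vcpu->arch.ia32_xss == data)
                  break;
          vcpu->arch.ia32_xss = data;
          vcpu->arch.cpuid_dynamic_bits_dirty = true;
          break;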
2025-09-23  KVM: x86: Check XSS validity against guest CPUIDs  (Chao Gao)  1 file, -4/+3
Maintain per-guest valid XSS bits and check XSS validity against them rather than against KVM capabilities. This is to prevent bits that are supported by KVM but not supported for a guest from being set. Opportunistically return KVM_MSR_RET_UNSUPPORTED on IA32_XSS MSR accesses if guest CPUID doesn't enumerate X86_FEATURE_XSAVES. Since KVM_MSR_RET_UNSUPPORTED takes care of host_initiated cases, drop the host_initiated check. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23  KVM: x86: Report XSS as to-be-saved if there are supported features  (Sean Christopherson)  1 file, -1/+5
Add MSR_IA32_XSS to the list of MSRs reported to userspace if supported_xss is non-zero, i.e. if KVM supports at least one XSS-based feature. Before the CET virtualization series, guest MSR_IA32_XSS is guaranteed to be 0, i.e., XSAVES/XRSTORS is executed in non-root mode with XSS == 0, which is equivalent to XSAVE/XRSTOR. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23  KVM: x86: Introduce KVM_{G,S}ET_ONE_REG uAPIs support  (Yang Weijiang)  1 file, -0/+100
Enable KVM_{G,S}ET_ONE_REG uAPIs so that userspace can access MSRs and other non-MSR registers through them, along with support for KVM_GET_REG_LIST to enumerate support for KVM-defined registers. This is in preparation for allowing userspace to read/write the guest SSP register, which is needed for the upcoming CET virtualization support. Currently, two types of registers are supported: KVM_X86_REG_TYPE_MSR and KVM_X86_REG_TYPE_KVM. All MSRs are in the former type; the latter type is added for registers that lack existing KVM uAPIs to access them. The "KVM" in the name is intended to be vague to give KVM flexibility to include other potential registers. More precise names like "SYNTHETIC" and "SYNTHETIC_MSR" were considered, but were deemed too confusing (e.g. can be conflated with synthetic guest-visible MSRs) and may put KVM into a corner (e.g. if KVM wants to change how a KVM-defined register is modeled internally). Enumerate only KVM-defined registers in KVM_GET_REG_LIST to avoid duplicating KVM_GET_MSR_INDEX_LIST, and so that KVM can return _only_ registers that are fully supported (KVM_GET_REG_LIST is vCPU-scoped, i.e. can be precise, whereas KVM_GET_MSR_INDEX_LIST is system-scoped). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Link: https://lore.kernel.org/all/20240219074733.122080-18-weijiang.yang@intel.com [1] Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250919223258.1604852-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23  KVM: x86: Merge 'svm' into 'cet' to pick up GHCB dependencies  (Sean Christopherson)  1 file, -1/+2
Merge the queue of SVM changes for 6.18 to pick up the KVM-defined GHCB helpers so that kvm_ghcb_get_xss() can be used to virtualize CET for SEV-ES+ guests.
2025-09-23  KVM: x86: Add helper to retrieve current value of user return MSR  (Hou Wenlong)  1 file, -0/+6
In the user return MSR support, the cached value is always the hardware value of the specific MSR. Therefore, add a helper to retrieve the cached value, which can replace the need for RDMSR, for example, to allow SEV-ES guests to restore the correct host hardware value without using RDMSR. Cc: stable@vger.kernel.org Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> [sean: drop "cache" from the name, make it a one-liner, tag for stable] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250923153738.1875174-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
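Per the changelog, the helper is a one-liner over the per-CPU cache; a sketch:

  u64 kvm_get_user_return_msr(unsigned int slot)
  {
          return this_cpu_ptr(user_return_msrs)->values[slot].curr;
  }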
2025-09-23  KVM: SEV: Validate XCR0 provided by guest in GHCB  (Sean Christopherson)  1 file, -1/+2
Use __kvm_set_xcr() to propagate XCR0 changes from the GHCB to KVM's software model in order to validate the new XCR0 against KVM's view of the supported XCR0. Allowing garbage is thankfully mostly benign, as kvm_load_{guest,host}_xsave_state() bail early for vCPUs with protected state, xstate_required_size() will simply provide garbage back to the guest, and attempting to save/restore the bad value via KVM_{G,S}ET_XCRS will only harm the guest (setting XCR0 will fail). However, allowing the guest to put junk into a field that KVM assumes is valid is a CVE waiting to happen. And as a bonus, using the proper API eliminates the ugly open coding of setting arch.cpuid_dynamic_bits_dirty.

Simply ignore bad values, as either the guest managed to get an unsupported value into hardware, or the guest is misbehaving and providing pure garbage. In either case, KVM can't fix the broken guest. Note, using __kvm_set_xcr() also avoids recomputing dynamic CPUID bits if XCR0 isn't actually changing (relative to KVM's previous snapshot). Cc: Tom Lendacky <thomas.lendacky@amd.com> Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18  KVM: x86: Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS  (Sean Christopherson)  1 file, -8/+7
Rework the MSR_FILTER_CHANGED request into a more generic RECALC_INTERCEPTS request, and expand the responsibilities of vendor code to recalculate all intercepts that vary based on userspace input, e.g. instruction intercepts that are tied to guest CPUID. Providing a generic recalc request will allow the upcoming mediated PMU support to trigger a recalc when PMU features, e.g. PERF_CAPABILITIES, are set by userspace, without having to make multiple calls to/from PMU code. As a bonus, using a request will effectively coalesce recalcs, e.g. will reduce the number of recalcs for normal usage from 3+ to 1 (vCPU create, set CPUID, set PERF_CAPABILITIES (Intel only), set filter). The downside is that MSR filter changes that are done in isolation will do a small amount of unnecessary work, but that's already a relatively slow path, and the cost of recalculating instruction intercepts is negligible. Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://lore.kernel.org/r/20250806195706.1650976-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
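Usage-wise, producers raise the request and vcpu_enter_guest() services it once, which is what coalesces back-to-back recalcs; a sketch (the vendor hook name is an approximation):

  /* Anywhere userspace input may change intercepts: */
  kvm_make_request(KVM_REQ_RECALC_INTERCEPTS, vcpu);

  /* In vcpu_enter_guest(): */
  if (kvm_check_request(KVM_REQ_RECALC_INTERCEPTS, vcpu))
          kvm_x86_call(recalc_intercepts)(vcpu);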
2025-09-16  KVM: TDX: Reject fully in-kernel irqchip if EOIs are protected, i.e. for TDX VMs  (Sagi Shahar)  1 file, -0/+9
Reject KVM_CREATE_IRQCHIP if the VM type has protected EOIs, i.e. if KVM can't intercept EOI and thus can't faithfully emulate level-triggered interrupts that are routed through the I/O APIC. For TDX VMs, the TDX-Module owns the VMX EOI-bitmap and configures all IRQ vectors to have the CPU accelerate EOIs, i.e. doesn't allow KVM to intercept any EOIs. KVM already requires a split irqchip[1], but does so during vCPU creation, which is both too late to allow userspace to fall back to a split irqchip and a less-than-stellar experience for userspace, since an -EINVAL on KVM_CREATE_VCPU is far harder to debug/triage than failing exactly on KVM_CREATE_IRQCHIP. And of course, allowing an action that ultimately fails is arguably a bug regardless of the impact on userspace. Link: https://lore.kernel.org/lkml/20250222014757.897978-11-binbin.wu@linux.intel.com [1] Link: https://lore.kernel.org/lkml/aK3vZ5HuKKeFuuM4@google.com Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sagi Shahar <sagis@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250827011726.2451115-1-sagis@google.com [sean: massage shortlog+changelog, relocate setting has_protected_eoi] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-10  Merge tag 'vmscape-for-linus-20250904' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)  1 file, -0/+9
Pull vmscape mitigation fixes from Dave Hansen:
 "Mitigate vmscape issue with indirect branch predictor flushes.

  vmscape is a vulnerability that essentially takes Spectre-v2 and attacks host userspace from a guest. It particularly affects hypervisors like QEMU.

  Even if a hypervisor may not have any sensitive data like disk encryption keys, guest-userspace may be able to attack the guest-kernel using the hypervisor as a confused deputy.

  There are many ways to mitigate vmscape using the existing Spectre-v2 defenses like IBRS variants or the IBPB flushes. This series focuses solely on IBPB because it works universally across vendors and all vulnerable processors. Further work doing vendor and model-specific optimizations can build on top of this if needed / wanted.

  Do the normal issue mitigation dance:
   - Add the CPU bug boilerplate
   - Add a list of vulnerable CPUs
   - Use IBPB to flush the branch predictors after running guests"

* tag 'vmscape-for-linus-20250904' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vmscape: Add old Intel CPUs to affected list
  x86/vmscape: Warn when STIBP is disabled with SMT
  x86/bugs: Move cpu_bugs_smt_update() down
  x86/vmscape: Enable the mitigation
  x86/vmscape: Add conditional IBPB mitigation
  x86/vmscape: Enumerate VMSCAPE bug
  Documentation/hw-vuln: Add VMSCAPE documentation
2025-09-10  KVM: x86: Move vector_hashing into lapic.c  (Sean Christopherson)  1 file, -8/+0
Move the vector_hashing module param into lapic.c now that all usage is contained within the local APIC emulation code. Opportunistically drop the accessor and append "_enabled" to the variable to help capture that it's a boolean module param. No functional change intended. Link: https://lore.kernel.org/r/20250821214209.3463350-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-27  KVM: guest_memfd: Add plumbing to host to map guest_memfd pages  (Fuad Tabba)  1 file, -0/+11
Introduce the core infrastructure to enable host userspace to mmap() guest_memfd-backed memory. This is needed for several evolving KVM use cases:

 * Non-CoCo VM backing: Allows VMMs like Firecracker to run guests entirely backed by guest_memfd, even for non-CoCo VMs [1]. This provides a unified memory management model and simplifies guest memory handling.

 * Direct map removal for enhanced security: This is an important step for direct map removal of guest memory [2]. By allowing host userspace to fault in guest_memfd pages directly, we can avoid maintaining host kernel direct maps of guest memory. This provides additional hardening against Spectre-like transient execution attacks by removing a potential attack surface within the kernel.

 * Future guest_memfd features: This also lays the groundwork for future enhancements to guest_memfd, such as supporting huge pages and enabling in-place sharing of guest memory with the host for CoCo platforms that permit it [3].

Enable the basic mmap and fault handling logic within guest_memfd, but hold off on allowing userspace to actually do mmap() until the architecture support is also in place.

[1] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[2] https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com
[3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com/T/#u

Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Co-developed-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-19  KVM: x86: Zero XSTATE components on INIT by iterating over supported features  (Chao Gao)  1 file, -3/+9
Tweak the code a bit to facilitate resetting more xstate components in the future, e.g., CET's xstate-managed MSRs. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20250812025606.74625-6-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Manually clear MPX state only on INIT  (Sean Christopherson)  1 file, -16/+30
Don't manually clear/zero MPX state on RESET, as the guest FPU state is zero allocated and KVM only does RESET during vCPU creation, i.e. the relevant state is guaranteed to be all zeroes. Opportunistically move the relevant code into a helper in anticipation of adding support for CET shadow stacks, which also has state that is zeroed on INIT. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20250812025606.74625-5-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Add kvm_msr_{read,write}() helpers  (Yang Weijiang)  1 file, -3/+13
Wrap __kvm_{get,set}_msr() in two new helpers for KVM-internal usage and use the helpers to replace existing usage of the raw functions. kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs to get/set an MSR value for emulating CPU behavior, i.e., host_initiated == %true in the helpers. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20250812025606.74625-4-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
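The wrappers are presumably thin shims that hardcode host_initiated; a sketch:

  int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
  {
          return __kvm_get_msr(vcpu, index, data, true);
  }

  int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
  {
          return __kvm_set_msr(vcpu, index, data, true);
  }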
2025-08-19  KVM: x86: Use double-underscore read/write MSR helpers as appropriate  (Sean Christopherson)  1 file, -13/+16
Use the double-underscore helpers for emulating MSR reads and writes in the no-underscore versions to better capture the relationship between the two sets of APIs (the double-underscore versions don't honor userspace MSR filters). No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20250812025606.74625-3-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Rename kvm_{g,s}et_msr()* to show that they emulate guest accesses  (Yang Weijiang)  1 file, -13/+14
Rename kvm_{g,s}et_msr_with_filter() and kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() and __kvm_emulate_msr_{read,write}() respectively, to make it more obvious that KVM uses these helpers to emulate guest behaviors, i.e., host_initiated == false in these helpers. Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Link: https://lore.kernel.org/r/20250812025606.74625-2-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: VMX: Support the immediate form of WRMSRNS in the VM-Exit fastpath  (Xin Li)  1 file, -4/+13
Add support for handling "WRMSRNS with an immediate" VM-Exits in KVM's fastpath. On Intel, all writes to the x2APIC ICR and to the TSC Deadline MSR are non-serializing, i.e. it's highly likely guest kernels will switch to using WRMSRNS when possible. And in general, any MSR written via WRMSRNS is probably worth handling in the fastpath, as the entire point of WRMSRNS is to shave cycles in hot paths. Signed-off-by: Xin Li (Intel) <xin@zytor.com> [sean: rewrite changelog, split rename to separate patch] Link: https://lore.kernel.org/r/20250805202224.1475590-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Add support for RDMSR/WRMSRNS w/ immediate on Intel  (Xin Li)  1 file, -10/+45
Add support for the immediate forms of RDMSR and WRMSRNS (currently Intel-only). The immediate variants are only valid in 64-bit mode, and use a single general purpose register for the data (the register is also encoded in the instruction, i.e. not implicit like regular RDMSR/WRMSR). The immediate variants are primarily motivated by performance, not code size: by having the MSR index in an immediate, it is available *much* earlier in the CPU pipeline, which allows hardware much more leeway about how a particular MSR is handled.

Intel VMX support for the immediate forms of MSR accesses communicates exit information to the host as follows:

 1) The immediate form of RDMSR uses VM-Exit Reason 84.
 2) The immediate form of WRMSRNS uses VM-Exit Reason 85.
 3) For both VM-Exit reasons 84 and 85, the Exit Qualification field is set to the MSR index that triggered the VM-Exit.
 4) Bits 3 ~ 6 of the VM-Exit Instruction Information field are set to the register encoding used by the immediate form of the instruction, i.e. the destination register for RDMSR, and the source for WRMSRNS.
 5) The VM-Exit Instruction Length field records the size of the immediate form of the MSR instruction.

To deal with userspace RDMSR exits, stash the destination register in a new kvm_vcpu_arch field, similar to cui_linear_rip, pio, etc. Alternatively, the register could be saved in kvm_run.msr or re-retrieved from the VMCS, but the former would require sanitizing the value to ensure userspace doesn't clobber the value to an out-of-bounds index, and the latter would require a new one-off kvm_x86_ops hook.

Don't bother adding support for the instructions in KVM's emulator, as the only way for RDMSR/WRMSR to be encountered is if KVM is emulating large swaths of code due to invalid guest state, and a vCPU cannot have invalid guest state while in 64-bit mode. Signed-off-by: Xin Li (Intel) <xin@zytor.com> [sean: minor tweaks, massage and expand changelog] Link: https://lore.kernel.org/r/20250805202224.1475590-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
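Decoding an immediate-form exit per items 3 and 4 above then looks roughly like the sketch below; the exit-reason and handler names are approximations, not necessarily the merged ones:

  u32 msr = vmx_get_exit_qual(vcpu);                            /* item 3 */
  int reg = (vmcs_read32(VMX_INSTRUCTION_INFO) >> 3) & 0xf;     /* item 4 */

  if (exit_reason == EXIT_REASON_MSR_READ_IMM)                  /* reason 84 */
          return kvm_emulate_rdmsr_imm(vcpu, msr, reg);
  else                                                          /* reason 85 */
          return kvm_emulate_wrmsrns_imm(vcpu, msr, reg);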
2025-08-19  KVM: x86: Rename handle_fastpath_set_msr_irqoff() to handle_fastpath_wrmsr()  (Xin Li)  1 file, -2/+2
Rename the WRMSR fastpath API to drop "irqoff", as that information is redundant (the fastpath always runs with IRQs disabled), and to prepare for adding a fastpath for the immediate variant of WRMSRNS. No functional change intended. Signed-off-by: Xin Li (Intel) <xin@zytor.com> [sean: split to separate patch, write changelog] Link: https://lore.kernel.org/r/20250805202224.1475590-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Rename local "ecx" variables to "msr" and "pmc" as appropriate  (Sean Christopherson)  1 file, -12/+12
Rename "ecx" variables in {RD,WR}MSR and RDPMC helpers to "msr" and "pmc" respectively, in anticipation of adding support for the immediate variants of RDMSR and WRMSRNS, and to better document what the variables hold (versus where the data originated). No functional change intended. Link: https://lore.kernel.org/r/20250805202224.1475590-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Add a fastpath handler for INVD  (Sean Christopherson)  1 file, -0/+9
Add a fastpath handler for INVD so that the common fastpath logic can be trivially tested on both Intel and AMD. Under KVM, INVD is always: (a) intercepted, (b) available to the guest, and (c) emulated as a nop, with no side effects. Combined with INVD not having any inputs or outputs, i.e. no register constraints, INVD is the perfect instruction for exercising KVM's fastpath as it can be inserted into practically any guest-side code stream. Link: https://lore.kernel.org/r/20250805190526.1453366-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
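Given that kvm_emulate_invd() is a nop plus an instruction skip, the fastpath handler is presumably just:

  static fastpath_t handle_fastpath_invd(struct kvm_vcpu *vcpu)
  {
          if (!kvm_emulate_invd(vcpu))    /* 0 == exit to userspace */
                  return EXIT_FASTPATH_NONE;

          return EXIT_FASTPATH_REENTER_GUEST;
  }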
2025-08-19  KVM: x86: Push acquisition of SRCU in fastpath into kvm_pmu_trigger_event()  (Sean Christopherson)  1 file, -13/+5
Acquire SRCU in the VM-Exit fastpath if and only if KVM needs to check the PMU event filter, to further trim the amount of code that is executed with SRCU protection in the fastpath. Counter-intuitively, holding SRCU can do more harm than good due to masking potential bugs, and introducing a new SRCU-protected asset to code reachable via kvm_skip_emulated_instruction() would be quite notable, i.e. definitely worth auditing. E.g. the primary user of kvm->srcu is KVM's memslots, accessing memslots all but guarantees guest memory may be accessed, accessing guest memory can fault, and page faults might sleep, which isn't allowed while IRQs are disabled. Not acquiring SRCU means the (hypothetical) illegal sleep would be flagged when running with PROVE_RCU=y, even if DEBUG_ATOMIC_SLEEP=n. Note, performance is NOT a motivating factor, as SRCU lock/unlock only adds ~15 cycles of latency to fastpath VM-Exits. I.e. overhead isn't a concern _if_ SRCU protection needs to be extended beyond PMU events, e.g. to honor userspace MSR filters. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250805190526.1453366-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86/pmu: Add wrappers for counting emulated instructions/branches  (Sean Christopherson)  1 file, -3/+3
Add wrappers for triggering instruction retired and branch retired PMU events in anticipation of reworking the internal mechanisms to track which PMCs need to be evaluated, e.g. to avoid having to walk and check every PMC. Opportunistically bury "struct kvm_pmu_emulated_event_selectors" in pmu.c. No functional change intended. Link: https://lore.kernel.org/r/20250805190526.1453366-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Fold WRMSR fastpath helpers into the main handler  (Sean Christopherson)  1 file, -29/+5
Fold the per-MSR WRMSR fastpath helpers into the main handler now that the IPI path in particular is relatively tiny. In addition to eliminating a decent amount of boilerplate, this removes the ugly -errno/1/0 => bool conversion (which is "necessitated" by kvm_x2apic_icr_write_fast()). Opportunistically drop the comment about IPIs, as the purpose of the fastpath is hopefully self-evident, and _if_ it needs more documentation, the documentation (and rules!) should be placed in a more central location. No functional change intended. Link: https://lore.kernel.org/r/20250805190526.1453366-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Unconditionally grab data from EDX:EAX in WRMSR fastpath  (Sean Christopherson)  1 file, -3/+1
Always grab EDX:EAX in the WRMSR fastpath to deduplicate and simplify the case statements, and to prepare for handling immediate variants of WRMSRNS in the fastpath (the data register is explicitly provided in that case). There's no harm in reading the registers, as their values are always available, i.e. don't require VMREADs (or similarly slow operations). No real functional change intended. Cc: Xin Li <xin@zytor.com> Link: https://lore.kernel.org/r/20250805190526.1453366-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
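The resulting shape of the handler, roughly (return-value plumbing elided; kvm_read_edx_eax() and the x2APIC ICR index expression are existing KVM idioms):

  u64 data = kvm_read_edx_eax(vcpu);      /* GPRs are always available */

  switch (msr) {
  case APIC_BASE_MSR + (APIC_ICR >> 4):
          if (kvm_x2apic_icr_write_fast(vcpu->arch.apic, data))
                  return EXIT_FASTPATH_NONE;
          break;
  case MSR_IA32_TSC_DEADLINE:
          kvm_set_lapic_tscdeadline_msr(vcpu, data);
          break;
  default:
          return EXIT_FASTPATH_NONE;
  }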
2025-08-19  KVM: x86: Acquire SRCU in WRMSR fastpath iff instruction needs to be skipped  (Sean Christopherson)  1 file, -13/+8
Acquire SRCU in the WRMSR fastpath if and only if an instruction needs to be skipped, i.e. only if the fastpath succeeds. The reasoning in commit 3f2739bd1e0b ("KVM: x86: Acquire SRCU read lock when handling fastpath MSR writes") about "avoid having to play whack-a-mole" seems sound, but in hindsight unconditionally acquiring SRCU does more harm than good. While acquiring/releasing SRCU isn't slow per se, the things that are _protected_ by kvm->srcu are generally safe to access only in the "slow" VM-Exit path. E.g. accessing memslots in generic helpers is never safe, because accessing guest memory with IRQs disabled is generally unsafe (except when kvm_vcpu_read_guest_atomic() is used, but that API should never be used in emulation helpers). In other words, playing whack-a-mole is actually desirable in this case, because every access to an asset protected by kvm->srcu warrants further scrutiny. Link: https://lore.kernel.org/r/20250805190526.1453366-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Unconditionally handle MSR_IA32_TSC_DEADLINE in fastpath exits  (Sean Christopherson)  1 file, -3/+0
Drop the fastpath VM-Exit requirement that KVM can use the hypervisor timer to emulate the APIC timer in TSC deadline mode. I.e. unconditionally handle MSR_IA32_TSC_DEADLINE WRMSRs in the fastpath. Restricting the fastpath to *maybe* using the VMX preemption timer is ineffective and unnecessary. If the requested deadline can't be programmed into the VMX preemption timer, KVM will fall back to hrtimers, i.e. the restriction is ineffective as far as preventing any kind of worst case scenario. But guarding against a worst case scenario is completely unnecessary as the "slow" path, start_sw_tscdeadline() => hrtimer_start(), explicitly disables IRQs. In fact, the worst case scenario is when KVM thinks it can use the VMX preemption timer, as KVM will eat the overhead of calling into vmx_set_hv_timer() and falling back to hrtimers. Opportunistically limit kvm_can_use_hv_timer() to lapic.c as the fastpath code was the only external user. Stating the obvious, this allows handling MSR_IA32_TSC_DEADLINE writes in the fastpath on AMD CPUs. Link: https://lore.kernel.org/r/20250805190526.1453366-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Drop semi-arbitrary restrictions on IPI type in fastpath  (Sean Christopherson)  1 file, -7/+1
Drop the restrictions on fastpath IPIs only working for fixed IRQs with a physical destination now that the fastpath is explicitly limited to "fast" delivery. Limiting delivery to a single physical APIC ID guarantees only one vCPU will receive the event, but that isn't necessarily "fast", e.g. if the targeted vCPU is the last of 4096 vCPUs. And logical destination mode or shorthand (to self) can also be fast, e.g. if only a few vCPUs are being targeted. Lastly, there's nothing inherently slow about delivering an NMI, INIT, SIPI, SMI, etc., i.e. there's no reason to artificially limit fastpath delivery to fixed vector IRQs. Link: https://lore.kernel.org/r/20250805190526.1453366-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19  KVM: x86: Only allow "fast" IPIs in fastpath WRMSR(X2APIC_ICR) handler  (Sean Christopherson)  1 file, -1/+1
Explicitly restrict fastpath ICR writes to IPIs that are "fast", i.e. can be delivered without having to walk all vCPUs, and that target at most 16 vCPUs. Artificially restricting ICR writes to physical mode guarantees at most one vCPU will receive an IPI (because x2APIC IDs are read-only), but that delivery might not be "fast". E.g. even if the vCPU exists, KVM might have to iterate over 4096 vCPUs to find the right one. Limiting delivery to fast IPIs aligns the WRMSR fastpath with kvm_arch_set_irq_inatomic() (which also runs with IRQs disabled), and will allow dropping the semi-arbitrary restrictions on delivery mode and type. Link: https://lore.kernel.org/r/20250805190526.1453366-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>