aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/kvm/svm (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-09-30Merge tag 'kvm-x86-cet-6.18' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini4-19/+88
KVM x86 CET virtualization support for 6.18 Add support for virtualizing Control-flow Enforcement Technology (CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks). CET is comprised of two distinct features, Shadow Stacks (SHSTK) and Indirect Branch Tracking (IBT), that can be utilized by software to help provide Control-flow integrity (CFI). SHSTK defends against backward-edge attacks (a.k.a. Return-oriented programming (ROP)), while IBT defends against forward-edge attacks (a.k.a. similarly CALL/JMP-oriented programming (COP/JOP)). Attackers commonly use ROP and COP/JOP methodologies to redirect the control- flow to unauthorized targets in order to execute small snippets of code, a.k.a. gadgets, of the attackers choice. By chaining together several gadgets, an attacker can perform arbitrary operations and circumvent the system's defenses. SHSTK defends against backward-edge attacks, which execute gadgets by modifying the stack to branch to the attacker's target via RET, by providing a second stack that is used exclusively to track control transfer operations. The shadow stack is separate from the data/normal stack, and can be enabled independently in user and kernel mode. When SHSTK is is enabled, CALL instructions push the return address on both the data and shadow stack. RET then pops the return address from both stacks and compares the addresses. If the return addresses from the two stacks do not match, the CPU generates a Control Protection (#CP) exception. IBT defends against backward-edge attacks, which branch to gadgets by executing indirect CALL and JMP instructions with attacker controlled register or memory state, by requiring the target of indirect branches to start with a special marker instruction, ENDBRANCH. If an indirect branch is executed and the next instruction is not an ENDBRANCH, the CPU generates a #CP. Note, ENDBRANCH behaves as a NOP if IBT is disabled or unsupported. From a virtualization perspective, CET presents several problems. While SHSTK and IBT have two layers of enabling, a global control in the form of a CR4 bit, and a per-feature control in user and kernel (supervisor) MSRs (U_CET and S_CET respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS. Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable option due to complexity, and outright disallowing use of XSTATE to context switch SHSTK/IBT state would render the features unusable to most guests. To limit the overall complexity without sacrificing performance or usability, simply ignore the potential virtualization hole, but ensure that all paths in KVM treat SHSTK/IBT as usable by the guest if the feature is supported in hardware, and the guest has access to at least one of SHSTK or IBT. I.e. allow userspace to advertise one of SHSTK or IBT if both are supported in hardware, even though doing so would allow a misbehaving guest to use the unadvertised feature. Fully emulating SHSTK and IBT would also require significant complexity, e.g. to track and update branch state for IBT, and shadow stack state for SHSTK. Given that emulating large swaths of the guest code stream isn't necessary on modern CPUs, punt on emulating instructions that meaningful impact or consume SHSTK or IBT. However, instead of doing nothing, explicitly reject emulation of such instructions so that KVM's emulator can't be abused to circumvent CET. Disable support for SHSTK and IBT if KVM is configured such that emulation of arbitrary guest instructions may be required, specifically if Unrestricted Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that is smaller than host.MAXPHYADDR. Lastly disable SHSTK support if shadow paging is enabled, as the protections for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so that they can't be directly modified by software), i.e. would require non-trivial support in the Shadow MMU. Note, AMD CPUs currently only support SHSTK. Explicitly disable IBT support so that KVM doesn't over-advertise if AMD CPUs add IBT, and virtualizing IBT in SVM requires KVM modifications.
2025-09-30Merge tag 'kvm-x86-misc-6.18' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2-13/+25
KVM x86 changes for 6.18 - Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw where a misbehaving usersepace (a.k.a. syzkaller) could swizzle L1's intercepts and trigger a variety of WARNs in KVM. - Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is supposed to exist for v2 PMUs. - Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs. - Clean up KVM's vector hashing code for delivering lowest priority IRQs. - Clean up the fastpath handler code to only handle IPIs and WRMSRs that are actually "fast", as opposed to handling those that KVM _hopes_ are fast, and in the process of doing so add fastpath support for TSC_DEADLINE writes on AMD CPUs. - Clean up a pile of PMU code in anticipation of adding support for mediated vPMUs. - Add support for the immediate forms of RDMSR and WRMSRNS, sans full emulator support (KVM should never need to emulate the MSRs outside of forced emulation and other contrived testing scenarios). - Clean up the MSR APIs in preparation for CET and FRED virtualization, as well as mediated vPMU support. - Rejecting a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs, as KVM can't faithfully emulate an I/O APIC for such guests. - KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation for mediated vPMU support, as KVM will need to recalculate MSR intercepts in response to PMU refreshes for guests with mediated vPMUs. - Misc cleanups and minor fixes.
2025-09-30Merge tag 'kvm-x86-ciphertext-6.18' of https://github.com/kvm-x86/linux into ↵Paolo Bonzini1-10/+58
HEAD KVM SEV-SNP CipherText Hiding support for 6.18 Add support for SEV-SNP's CipherText Hiding, an opt-in feature that prevents unauthorized CPU accesses from reading the ciphertext of SNP guest private memory, e.g. to attempt an offline attack. Instead of ciphertext, the CPU will always read back all FFs when CipherText Hiding is enabled. Add new module parameter to the KVM module to enable CipherText Hiding and control the number of ASIDs that can be used for VMs with CipherText Hiding, which is in effect the number of SNP VMs. When CipherText Hiding is enabled, the shared SEV-ES/SEV-SNP ASID space is split into separate ranges for SEV-ES and SEV-SNP guests, i.e. ASIDs that can be used for CipherText Hiding cannot be used to run SEV-ES guests.
2025-09-30Merge tag 'kvm-x86-svm-6.18' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini7-244/+308
KVM SVM changes for 6.18 - Require a minimum GHCB version of 2 when starting SEV-SNP guests via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors instead of latent guest failures. - Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted host from tampering with the guest's TSC frequency, while still allowing the the VMM to configure the guest's TSC frequency prior to launch. - Mitigate the potential for TOCTOU bugs when accessing GHCB fields by wrapping all accesses via READ_ONCE(). - Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a bogous XCR0 value in KVM's software model. - Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to avoid leaving behind stale state (thankfully not consumed in KVM). - Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE instead of subtly relying on guest_memfd to do the "heavy" lifting. - Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's TSC_AUX due to hardware not matching the value cached in the user-return MSR infrastructure. - Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported, and clean up the AVIC initialization code along the way.
2025-09-30Merge tag 'loongarch-kvm-6.18' of ↵Paolo Bonzini1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.18 1. Add PTW feature detection on new hardware. 2. Add sign extension with kernel MMIO/IOCSR emulation. 3. Improve in-kernel IPI emulation. 4. Improve in-kernel PCH-PIC emulation. 5. Move kvm_iocsr tracepoint out of generic code.
2025-09-23KVM: SVM: Enable shadow stack virtualization for SVMJohn Allen1-3/+0
Remove the explicit clearing of shadow stack CPU capabilities. Reviewed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-41-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's validSean Christopherson3-2/+6
Synchronize XSS from the GHCB to KVM's internal tracking if the guest marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy of XSS in order to compute the required XSTATE size when emulating CPUID.0xD.0x1 for the guest. Treat the incoming XSS change as an emulated write, i.e. validatate the guest-provided value, to avoid letting the guest load garbage into KVM's tracking. Simply ignore bad values, as either the guest managed to get an unsupported value into hardware, or the guest is misbehaving and providing pure garbage. In either case, KVM can't fix the broken guest. Explicitly allow access to XSS at all times, as KVM needs to ensure its copy of XSS stays up-to-date. E.g. KVM supports migration of SEV-ES guests and so needs to allow the host to save/restore XSS, otherwise a guest that *knows* its XSS hasn't change could get stale/bad CPUID emulation if the guest doesn't provide XSS in the GHCB on every exit. This creates a hypothetical problem where a guest could request emulation of RDMSR or WRMSR on XSS, but arguably that's not even a problem, e.g. it would be entirely reasonable for a guest to request "emulation" as a way to inform the hypervisor that its XSS value has been modified. Note, emulating the change as an MSR write also takes care of side effects, e.g. marking dynamic CPUID bits as dirty. Suggested-by: John Allen <john.allen@amd.com> base-commit: 14298d819d5a6b7180a4089e7d2121ca3551dc6c Link: https://lore.kernel.org/r/20250919223258.1604852-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Pass through shadow stack MSRs as appropriateJohn Allen1-0/+11
Pass through XSAVE managed CET MSRs on SVM when KVM supports shadow stack. These cannot be intercepted without also intercepting XSAVE which would likely cause unacceptable performance overhead. MSR_IA32_INT_SSP_TAB is not managed by XSAVE, so it is intercepted. Reviewed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-39-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Update dump_vmcb with shadow stack save area additionsJohn Allen1-0/+11
Add shadow stack VMCB fields to dump_vmcb. PL0_SSP, PL1_SSP, PL2_SSP, PL3_SSP, and U_CET are part of the SEV-ES save area and are encrypted, but can be decrypted and dumped if the guest policy allows debugging. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-38-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: nSVM: Save/load CET Shadow Stack state to/from vmcb12/vmcb02Sean Christopherson1-0/+20
Transfer the three CET Shadow Stack VMCB fields (S_CET, ISST_ADDR, and SSP) on VMRUN, #VMEXIT, and loading nested state (saving nested state simply copies the entire save area). SVM doesn't provide a way to disallow L1 from enabling Shadow Stacks for L2, i.e. KVM *must* provide nested support before advertising SHSTK to userspace. Link: https://lore.kernel.org/r/20250919223258.1604852-37-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Emulate reads and writes to shadow stack MSRsJohn Allen2-1/+23
Emulate shadow stack MSR access by reading and writing to the corresponding fields in the VMCB. Signed-off-by: John Allen <john.allen@amd.com> [sean: mark VMCB_CET dirty/clean as appropriate] Link: https://lore.kernel.org/r/20250919223258.1604852-36-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: x86: Enable CET virtualization for VMX and advertise to userspaceYang Weijiang1-0/+4
Add support for the LOAD_CET_STATE VM-Enter and VM-Exit controls, the CET XFEATURE bits in XSS, and advertise support for IBT and SHSTK to userspace. Explicitly clear IBT and SHSTK onn SVM, as additional work is needed to enable CET on SVM, e.g. to context switch S_CET and other state. Disable KVM CET feature if unrestricted_guest is unsupported/disabled as KVM does not support emulating CET, as running without Unrestricted Guest can result in KVM emulating large swaths of guest code. While it's highly unlikely any guest will trigger emulation while also utilizing IBT or SHSTK, there's zero reason to allow CET without Unrestricted Guest as that combination should only be possible when explicitly disabling unrestricted_guest for testing purposes. Disable CET if VMX_BASIC[bit56] == 0, i.e. if hardware strictly enforces the presence of an Error Code based on exception vector, as attempting to inject a #CP with an Error Code (#CP architecturally has an Error Code) will fail due to the #CP vector historically not having an Error Code. Clear S_CET and SSP-related VMCS on "reset" to emulate the architectural of CET MSRs and SSP being reset to 0 after RESET, power-up and INIT. Note, KVM already clears guest CET state that is managed via XSTATE in kvm_xstate_reset(). Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Tested-by: Mathias Krause <minipli@grsecurity.net> Tested-by: John Allen <john.allen@amd.com> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: move some bits to separate patches, massage changelog] Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250919223258.1604852-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: x86: Initialize allow_smaller_maxphyaddr earlier in setupSean Christopherson1-15/+15
Initialize allow_smaller_maxphyaddr during hardware setup as soon as KVM knows whether or not TDP will be utilized. To avoid having to teach KVM's emulator all about CET, KVM's upcoming CET virtualization support will be mutually exclusive with allow_smaller_maxphyaddr, i.e. will disable SHSTK and IBT if allow_smaller_maxphyaddr is enabled. In general, allow_smaller_maxphyaddr should be initialized as soon as possible since it's globally visible while its only input is whether or not EPT/NPT is enabled. I.e. there's effectively zero risk of setting allow_smaller_maxphyaddr too early, and substantial risk of setting it too late. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250922184743.1745778-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: x86: Merge 'svm' into 'cet' to pick up GHCB dependenciesSean Christopherson4-102/+131
Merge the queue of SVM changes for 6.18 to pick up the KVM-defined GHCB helpers so that kvm_ghcb_get_xss() can be used to virtualize CET for SEV-ES+ guests.
2025-09-23KVM: SVM: Enable AVIC by default for Zen4+ if x2AVIC is supportNaveen N Rao1-4/+36
AVIC and x2AVIC are fully functional since Zen 4, with no known hardware errata. Enable AVIC and x2AVIC by default on Zen4+ so long as x2AVIC is supported (to avoid enabling partial support for APIC virtualization by default). Internally, convert "avic" to an integer so that KVM can identify if the user has asked to explicitly enable or disable AVIC, i.e. so that KVM doesn't override an explicit 'y' from the user. Arbitrarily use -1 to denote auto-mode, and accept the string "auto" for the module param in addition to standard boolean values, i.e. continue to allow the user to configure the "avic" module parameter to explicitly enable/disable AVIC. To again maintain backward compatibility with a standard boolean param, set KERNEL_PARAM_OPS_FL_NOARG, which tells the params infrastructure to allow empty values for %true, i.e. to interpret a bare "avic" as "avic=y". Take care to check for a NULL @val when looking for "auto"! Lastly, always print "avic" as a boolean, since auto-mode is resolved during module initialization, i.e. the user should never see "auto" in sysfs. Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250919215934.1590410-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Move global "avic" variable to avic.cSean Christopherson2-18/+26
Move "avic" to avic.c so that it's colocated with the other AVIC specific globals and module params, and so that avic_hardware_setup() is a bit more self-contained, e.g. similar to sev_hardware_setup(). Deliberately set enable_apicv in svm.c as it's already globally visible (defined by kvm.ko, not by kvm-amd.ko), and to clearly capture the dependency on enable_apicv being initialized (svm_hardware_setup() clears several AVIC-specific hooks when enable_apicv is disabled). Alternatively, clearing of the hooks (and enable_ipiv) could be moved to avic_hardware_setup(), but that's not obviously better, e.g. it's helpful to isolate the setting of enable_apicv when reading code from the generic x86 side of the world. No functional change intended. Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Don't advise the user to do force_avic=y (when x2AVIC is detected)Sean Christopherson1-4/+2
Don't advise the end user to try to force enable AVIC when x2AVIC is reported as supported in CPUID, as forcefully enabling AVIC isn't something that should be done lightly. E.g. some Zen4 client systems hide AVIC but leave x2AVIC behind, and while such a configuration is indeed due to buggy firmware in the sense the reporting x2AVIC without AVIC is nonsensical, KVM has no idea _why_ firmware disabled AVIC in the first place. Suggesting that the user try to run with force_avic=y is sketchy even if the user explicitly tries to enable AVIC, and will be downright irresponsible once KVM starts enabling AVIC by default. Alternatively, KVM could print the message only when the user explicitly asks for AVIC, but running with force_avic=y isn't something that should be encouraged for random users. force_avic is a useful knob for developers and perhaps even advanced users, but isn't something that KVM should advertise broadly. Opportunistically append a newline to the pr_warn() so that it prints out immediately, and tweak the message to say that AVIC is unsupported instead of disabled (disabled suggests that the kernel/KVM is somehow responsible). Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Always print "AVIC enabled" separately, even when force enabledSean Christopherson1-10/+9
Print the customary "AVIC enabled" informational message even when AVIC is force enabled on a system that doesn't advertise supported for AVIC in CPUID, as not printing the standard message can confuse users and tools. Opportunistically clean up the scary message when AVIC is force enabled, but keep it as separate message so that it is printed at level "warn", versus the standard message only being printed for level "info". Suggested-by: Naveen N Rao (AMD) <naveen@kernel.org> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Update "APICv in x2APIC without x2AVIC" in avic.c, not svm.cSean Christopherson3-6/+5
Set the "allow_apicv_in_x2apic_without_x2apic_virtualization" flag as part of avic_hardware_setup() instead of handling in svm_hardware_setup(), and make x2avic_enabled local to avic.c (setting the flag was the only use in svm.c). Tag avic_hardware_setup() with __init as necessary (it should have been tagged __init long ago). No functional change intended (aside from the side effects of tagging avic_hardware_setup() with __init). Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Move x2AVIC MSR interception helper to avic.cSean Christopherson3-53/+54
Move svm_set_x2apic_msr_interception() to avic.c as it's only relevant when x2AVIC is enabled/supported and only called by AVIC code. In addition to scoping AVIC code to avic.c, this will allow burying the global x2avic_enabled variable in avic. Opportunistically rename the helper to explicitly scope it to "avic". No functional change intended. Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Tested-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250919215934.1590410-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Make svm_x86_ops globally visible, clean up on-HyperV usageSean Christopherson4-32/+31
Make svm_x86_ops globally visible in anticipation of modifying the struct in avic.c, and clean up the KVM-on-HyperV usage, as declaring _and using_ a local variable in a header that's only defined in one specific .c-file is all kinds of ugly. Opportunistically make svm_hv_enable_l2_tlb_flush() local to svm_onhyperv.c, as the only reason it was visible was due to the aforementioned shenanigans in svm_onhyperv.h. Alternatively, svm_x86_ops could be explicitly passed to svm_hv_hardware_setup() as a parameter. While that approach is slightly safer, e.g. avoids "hidden" updates, for better or worse, the Intel side of KVM has already chosen to expose vt_x86_ops (and vt_init_ops). Given that svm_x86_ops is only truly consumed by kvm_ops_update, the odds of a "hidden" update causing problems are extremely low. So, absent a strong reason to rework the VMX/TDX code, make svm_x86_ops visible, as having all updates use exactly "svm_x86_ops." is advantageous in its own right. No functional change intended. Link: https://lore.kernel.org/r/20250919215934.1590410-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SVM: Re-load current, not host, TSC_AUX on #VMEXIT from SEV-ES guestHou Wenlong3-19/+18
Prior to running an SEV-ES guest, set TSC_AUX in the host save area to the current value in hardware, as tracked by the user return infrastructure, instead of always loading the host's desired value for the CPU. If the pCPU is also running a non-SEV-ES vCPU, loading the host's value on #VMEXIT could clobber the other vCPU's value, e.g. if the SEV-ES vCPU preempted the non-SEV-ES vCPU, in which case KVM expects the other vCPU's TSC_AUX value to be resident in hardware. Note, unlike TDX, which blindly _zeroes_ TSC_AUX on TD-Exit, SEV-ES CPUs can load an arbitrary value. Stuff the current value in the host save area instead of refreshing the user return cache so that KVM doesn't need to track whether or not the vCPU actually enterred the guest and thus loaded TSC_AUX from the host save area. Opportunistically tag tsc_aux_uret_slot as read-only after init to guard against unexpected modifications, and to make it obvious that using the variable in sev_es_prepare_switch_to_guest() is safe. Fixes: 916e3e5f26ab ("KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX") Cc: stable@vger.kernel.org Suggested-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> [sean: handle the SEV-ES case in sev_es_prepare_switch_to_guest()] Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250923153738.1875174-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SEV: Reject non-positive effective lengths during LAUNCH_UPDATESean Christopherson1-1/+1
Check for an invalid length during LAUNCH_UPDATE at the start of snp_launch_update() instead of subtly relying on kvm_gmem_populate() to detect the bad state. Code that directly handles userspace input absolutely should sanitize those inputs; failure to do so is asking for bugs where KVM consumes an invalid "npages". Keep the check in gmem, but wrap it in a WARN to flag any bad usage by the caller. Note, this is technically an ABI change as KVM would previously allow a length of '0'. But allowing a length of '0' is nonsensical and creates pointless conundrums in KVM. E.g. an empty range is arguably neither private nor shared, but LAUNCH_UPDATE will fail if the starting gpa can't be made private. In practice, no known or well-behaved VMM passes a length of '0'. Note #2, the PAGE_ALIGNED(params.len) check ensures that lengths between 1 and 4095 (inclusive) are also rejected, i.e. that KVM won't end up with npages=0 when doing "npages = params.len / PAGE_SIZE". Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Michael Roth <michael.roth@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919211649.1575654-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SEV: Validate XCR0 provided by guest in GHCBSean Christopherson1-4/+2
Use __kvm_set_xcr() to propagate XCR0 changes from the GHCB to KVM's software model in order to validate the new XCR0 against KVM's view of the supported XCR0. Allowing garbage is thankfully mostly benign, as kvm_load_{guest,host}_xsave_state() bail early for vCPUs with protected state, xstate_required_size() will simply provide garbage back to the guest, and attempting to save/restore the bad value via KVM_{G,S}ET_XCRS will only harm the guest (setting XCR0 will fail). However, allowing the guest to put junk into a field that KVM assumes is valid is a CVE waiting to happen. And as a bonus, using the proper API eliminates the ugly open coding of setting arch.cpuid_dynamic_bits_dirty. Simply ignore bad values, as either the guest managed to get an unsupported value into hardware, or the guest is misbehaving and providing pure garbage. In either case, KVM can't fix the broken guest. Note, using __kvm_set_xcr() also avoids recomputing dynamic CPUID bits if XCR0 isn't actually changing (relatively to KVM's previous snapshot). Cc: Tom Lendacky <thomas.lendacky@amd.com> Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SEV: Read save fields from GHCB exactly onceSean Christopherson2-21/+26
Wrap all reads of GHCB save fields with READ_ONCE() via a KVM-specific GHCB get() utility to help guard against TOCTOU bugs. Using READ_ONCE() doesn't completely prevent such bugs, e.g. doesn't prevent KVM from redoing get() after checking the initial value, but at least addresses all potential TOCTOU issues in the current KVM code base. To prevent unintentional use of the generic helpers, take only @svm for the kvm_ghcb_get_xxx() helpers and retrieve the ghcb instead of explicitly passing it in. Opportunistically reduce the indentation of the macro-defined helpers and clean up the alignment. Fixes: 4e15a0ddc3ff ("KVM: SEV: snapshot the GHCB before accessing it") Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-23KVM: SEV: Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code()Sean Christopherson1-4/+4
Rename kvm_ghcb_get_sw_exit_code() to kvm_get_cached_sw_exit_code() to make it clear that KVM is getting the cached value, not reading directly from the guest-controlled GHCB. More importantly, vacating kvm_ghcb_get_sw_exit_code() will allow adding a KVM-specific macro-built kvm_ghcb_get_##field() helper to read values from the GHCB. No functional change intended. Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250919223258.1604852-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18KVM: x86/pmu: Move initialization of valid PMCs bitmask to common x86Sean Christopherson1-1/+0
Move all initialization of all_valid_pmc_idx to common code, as the logic is 100% common to Intel and AMD, and KVM heavily relies on Intel and AMD having the same semantics. E.g. the fact that AMD doesn't support fixed counters doesn't allow KVM to use all_valid_pmc_idx[63:32] for other purposes. Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://lore.kernel.org/r/20250806195706.1650976-31-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18KVM: x86/pmu: Use BIT_ULL() instead of open coded equivalentsDapeng Mi1-2/+2
Replace a variety of "1ull << N" and "(u64)1 << N" snippets with BIT_ULL() in the PMU code. No functional change intended. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> [sean: split to separate patch, write changelog] Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://lore.kernel.org/r/20250806195706.1650976-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18KVM: x86: Use KVM_REQ_RECALC_INTERCEPTS to react to CPUID updatesSean Christopherson1-3/+1
Defer recalculating MSR and instruction intercepts after a CPUID update via RECALC_INTERCEPTS to converge on RECALC_INTERCEPTS as the "official" mechanism for triggering recalcs. As a bonus, because KVM does a "recalc" during vCPU creation, and every functional VMM sets CPUID at least once, for all intents and purposes this saves at least one recalc. Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://lore.kernel.org/r/20250806195706.1650976-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18KVM: x86: Rework KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTSSean Christopherson1-4/+4
Rework the MSR_FILTER_CHANGED request into a more generic RECALC_INTERCEPTS request, and expand the responsibilities of vendor code to recalculate all intercepts that vary based on userspace input, e.g. instruction intercepts that are tied to guest CPUID. Providing a generic recalc request will allow the upcoming mediated PMU support to trigger a recalc when PMU features, e.g. PERF_CAPABILITIES, are set by userspace, without having to make multiple calls to/from PMU code. As a bonus, using a request will effectively coalesce recalcs, e.g. will reduce the number of recalcs for normal usage from 3+ to 1 (vCPU create, set CPUID, set PERF_CAPABILITIES (Intel only), set filter). The downside is that MSR filter changes that are done in isolation will do a small amount of unnecessary work, but that's already a relatively slow path, and the cost of recalculating instruction intercepts is negligible. Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://lore.kernel.org/r/20250806195706.1650976-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-18KVM: SVM: Check pmu->version, not enable_pmu, when getting PMC MSRsSean Christopherson1-1/+1
Gate access to PMC MSRs based on pmu->version, not on kvm->arch.enable_pmu, to more accurately reflect KVM's behavior. This is a glorified nop, as pmu->version and pmu->nr_arch_gp_counters can only be non-zero if amd_pmu_refresh() is reached, kvm_pmu_refresh() invokes amd_pmu_refresh() if and only if kvm->arch.enable_pmu is true, and amd_pmu_refresh() forces pmu->version to be 1 or 2. I.e. the following holds true: !pmu->nr_arch_gp_counters || kvm->arch.enable_pmu == (pmu->version > 0) and so the only way for amd_pmu_get_pmc() to return a non-NULL value is if both kvm->arch.enable_pmu and pmu->version evaluate to true. No real functional change intended. Reviewed-by: Sandipan Das <sandipan.das@amd.com> Tested-by: Xudong Hao <xudong.hao@intel.com> Link: https://lore.kernel.org/r/20250806195706.1650976-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-15KVM: SEV: Add SEV-SNP CipherTextHiding supportAshish Kalra1-1/+31
Ciphertext hiding prevents host accesses from reading the ciphertext of SNP guest private memory. Instead of reading ciphertext, the host reads will see constant default values (0xff). The SEV ASID space is split into SEV and SEV-ES/SEV-SNP ASID ranges. Enabling ciphertext hiding further splits the SEV-ES/SEV-SNP ASID space into separate ASID ranges for SEV-ES and SEV-SNP guests. Add a new off-by-default kvm-amd module parameter to enable ciphertext hiding and allow the admin to configure the SEV-ES and SEV-SNP ASID ranges. Simply cap the maximum SEV-SNP ASID as appropriate, i.e. don't reject loading KVM or disable ciphertest hiding for a too-big value, as KVM's general approach for module params is to sanitize inputs based on hardware/kernel support, not burn the world down. This also allows the admin to use -1u to assign all SEV-ES/SNP ASIDs to SNP without needing dedicated handling in KVM. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/95abc49edfde36d4fb791570ea2a4be6ad95fd0d.1755721927.git.ashish.kalra@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-15KVM: SEV: Introduce new min,max sev_es and sev_snp asid variablesAshish Kalra1-9/+27
Introduce new min, max sev_es_asid and sev_snp_asid variables. The new {min,max}_{sev_es,snp}_asid variables along with existing {min,max}_sev_asid variable simplifies partitioning of the SEV and SEV-ES+ ASID space. Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Link: https://lore.kernel.org/r/1db48277e8e96a633d734786ea69bf830f014857.1755721927.git.ashish.kalra@amd.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-11KVM: nSVM: Replace kzalloc() + copy_from_user() with memdup_user()Thorsten Blum1-11/+9
Replace kzalloc() followed by copy_from_user() with memdup_user() to improve and simplify svm_set_nested_state(). Return early if an error occurs instead of trying to allocate memory for 'save' when memory allocation for 'ctl' already failed. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20250903002951.118912-1-thorsten.blum@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-10KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is activeMaciej S. Szmigiero1-2/+1
Commit 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") inhibited pre-VMRUN sync of TPR from LAPIC into VMCB::V_TPR in sync_lapic_to_cr8() when AVIC is active. AVIC does automatically sync between these two fields, however it does so only on explicit guest writes to one of these fields, not on a bare VMRUN. This meant that when AVIC is enabled host changes to TPR in the LAPIC state might not get automatically copied into the V_TPR field of VMCB. This is especially true when it is the userspace setting LAPIC state via KVM_SET_LAPIC ioctl() since userspace does not have access to the guest VMCB. Practice shows that it is the V_TPR that is actually used by the AVIC to decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), so any leftover value in V_TPR will cause serious interrupt delivery issues in the guest when AVIC is enabled. Fix this issue by doing pre-VMRUN TPR sync from LAPIC into VMCB::V_TPR even when AVIC is enabled. Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") Cc: stable@vger.kernel.org Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/c231be64280b1461e854e1ce3595d70cde3a2e9d.1756139678.git.maciej.szmigiero@oracle.com [sean: tag for stable@] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-09-08KVM: SEV: Save the SEV policy if and only if LAUNCH_START succeedsSean Christopherson1-4/+2
Wait until LAUNCH_START fully succeeds to set a VM's SEV/SNP policy so that KVM doesn't keep a potentially stale policy. In practice, the issue is benign as the policy is only used to detect if the VMSA can be decrypted, and the VMSA only needs to be decrypted if LAUNCH_UPDATE and thus LAUNCH_START succeeded. Fixes: 962e2b6152ef ("KVM: SVM: Decrypt SEV VMSA in dump_vmcb() if debugging is enabled") Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Kim Phillips <kim.phillips@amd.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20250821213841.3462339-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-27KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappingsSean Christopherson2-3/+3
Rework kvm_mmu_max_mapping_level() to consult guest_memfd for all mappings, not just private mappings, so that hugepage support plays nice with the upcoming support for backing non-private memory with guest_memfd. In addition to getting the max order from guest_memfd for gmem-only memslots, update TDX's hook to effectively ignore shared mappings, as TDX's restrictions on page size only apply to Secure EPT mappings. Do nothing for SNP, as RMP restrictions apply to both private and shared memory. Suggested-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Message-ID: <20250729225455.670324-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: x86/mmu: Rename .private_max_mapping_level() to .gmem_max_mapping_level()Ackerley Tng3-4/+4
Rename kvm_x86_ops.private_max_mapping_level() to .gmem_max_mapping_level() in anticipation of extending guest_memfd support to non-private memory. No functional change intended. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Message-ID: <20250729225455.670324-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-27KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()Fuad Tabba1-2/+2
Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve clarity and accurately reflect its purpose. The function kvm_slot_can_be_private() was previously used to check if a given kvm_memory_slot is backed by guest_memfd. However, its name implied that the memory in such a slot was exclusively "private". As guest_memfd support expands to include non-private memory (e.g., shared host mappings), it's important to remove this association. The new name, kvm_slot_has_gmem(), states that the slot is backed by guest_memfd without making assumptions about the memory's privacy attributes. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250729225455.670324-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-08-21KVM: SVM: Enable Secure TSC for SNP guestsNikunj A Dadhania1-0/+28
Add support for Secure TSC, allowing userspace to configure the Secure TSC feature for SNP guests. Use the SNP specification's desired TSC frequency parameter during the SNP_LAUNCH_START command to set the mean TSC frequency in KHz for Secure TSC enabled guests. Always use kvm->arch.arch.default_tsc_khz as the TSC frequency that is passed to SNP guests in the SNP_LAUNCH_START command. The default value is the host TSC frequency. The userspace can optionally change the TSC frequency via the KVM_SET_TSC_KHZ ioctl before calling the SNP_LAUNCH_START ioctl. Introduce the read-only MSR GUEST_TSC_FREQ (0xc0010134) that returns guest's effective frequency in MHZ when Secure TSC is enabled for SNP guests. Disable interception of this MSR when Secure TSC is enabled. Note that GUEST_TSC_FREQ MSR is accessible only to the guest and not from the hypervisor context. Co-developed-by: Ketan Chaturvedi <Ketan.Chaturvedi@amd.com> Signed-off-by: Ketan Chaturvedi <Ketan.Chaturvedi@amd.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> [sean: contain Secure TSC to sev.c] Link: https://lore.kernel.org/r/20250819234833.3080255-9-seanjc@google.com [sean: return -EINVAL if TSC frequency is '0'] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-21KVM: SEV: Fold sev_es_vcpu_reset() into sev_vcpu_create()Sean Christopherson3-9/+2
Fold the remaining line of sev_es_vcpu_reset() into sev_vcpu_create() as there's no need for a dedicated RESET hook just to init a mutex, and the mutex should be initialized as early as possible anyways. No functional change intended. Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-21KVM: SEV: Set RESET GHCB MSR value during sev_es_init_vmcb()Sean Christopherson1-13/+11
Set the RESET value for the GHCB "MSR" during sev_es_init_vmcb() instead of sev_es_vcpu_reset() to allow for dropping sev_es_vcpu_reset() entirely. Note, the call to sev_init_vmcb() from sev_migrate_from() also kinda sorta emulates a RESET, but sev_migrate_from() immediately overwrites ghcb_gpa with the source's current value, so whether or not stuffing the GHCB version is correct/desirable is moot. No functional change intended. Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-21KVM: SEV: Move init of SNP guest state into sev_init_vmcb()Sean Christopherson3-16/+13
Move the initialization of SNP guest state from svm_vcpu_reset() into sev_init_vmcb() to reduce the number of paths that deal with INIT/RESET for SEV+ vCPUs from 4+ to 1. Plumb in @init_event as necessary. Opportunistically check for an SNP guest outside of sev_snp_init_protected_guest_state() so that sev_init_vmcb() is consistent with respect to checking for SEV-ES+ and SNP+ guests. No functional change intended. Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-21KVM: SVM: Move SEV-ES VMSA allocation to a dedicated sev_vcpu_create() helperSean Christopherson3-18/+29
Add a dedicated sev_vcpu_create() helper to allocate the VMSA page for SEV-ES+ vCPUs, and to allow for consolidating a variety of related SEV+ code in the near future. No functional change intended. Reviewed-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-21KVM: SEV: Enforce minimum GHCB version requirement for SEV-SNP guestsNikunj A Dadhania1-2/+6
Require a minimum GHCB version of 2 when starting SEV-SNP guests through KVM_SEV_INIT2. When a VMM attempts to start an SEV-SNP guest with an incompatible GHCB version (less than 2), reject the request early rather than allowing the guest kernel to start with an incorrect protocol version and fail later with GHCB_SNP_UNSUPPORTED guest termination. Not enforcing the minimum version typically causes the guest to request termination with GHCB_SNP_UNSUPPORTED error code: kvm_amd: SEV-ES guest requested termination: 0x0:0x2 Fixes: 4af663c2f64a ("KVM: SEV: Allow per-guest configuration of GHCB protocol version") Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Michael Roth <michael.roth@amd.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-21KVM: SEV: Drop GHCB_VERSION_DEFAULT and open code itNikunj A Dadhania1-9/+8
Remove the GHCB_VERSION_DEFAULT macro and open code it with '2'. The macro is used conditionally and is not a true default. KVM ABI does not advertise/emumerates the default GHCB version. Any future change to this macro would silently alter the ABI and potentially break existing deployments that rely on the current behavior. Additionally, move the GHCB version assignment earlier in the code flow and update the comment to clarify that KVM_SEV_INIT2 defaults to version 2, while KVM_SEV_INIT forces version 1. No functional change intended. Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Michael Roth <michael.roth@amd.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Link: https://lore.kernel.org/r/20250819234833.3080255-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19KVM: x86: Advertise support for the immediate form of MSR instructionsXin Li1-1/+5
Advertise support for the immediate form of MSR instructions to userspace if the instructions are supported by the underlying CPU, and KVM is using VMX, i.e. is running on an Intel-compatible CPU. For SVM, explicitly clear X86_FEATURE_MSR_IMM to ensure KVM doesn't over- report support if AMD-compatible CPUs ever implement the immediate forms, as SVM will likely require explicit enablement in KVM. Signed-off-by: Xin Li (Intel) <xin@zytor.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20250805202224.1475590-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19KVM: x86: Rename handle_fastpath_set_msr_irqoff() to handle_fastpath_wrmsr()Xin Li1-1/+1
Rename the WRMSR fastpath API to drop "irqoff", as that information is redundant (the fastpath always runs with IRQs disabled), and to prepare for adding a fastpath for the immediate variant of WRMSRNS. No functional change intended. Signed-off-by: Xin Li (Intel) <xin@zytor.com> [sean: split to separate patch, write changelog] Link: https://lore.kernel.org/r/20250805202224.1475590-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19KVM: x86: Add a fastpath handler for INVDSean Christopherson1-0/+2
Add a fastpath handler for INVD so that the common fastpath logic can be trivially tested on both Intel and AMD. Under KVM, INVD is always: (a) intercepted, (b) available to the guest, and (c) emulated as a nop, with no side effects. Combined with INVD not having any inputs or outputs, i.e. no register constraints, INVD is the perfect instruction for exercising KVM's fastpath as it can be inserted into practically any guest-side code stream. Link: https://lore.kernel.org/r/20250805190526.1453366-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-08-19KVM: SVM: Skip fastpath emulation on VM-Exit if next RIP isn't validSean Christopherson1-2/+10
Skip the WRMSR and HLT fastpaths in SVM's VM-Exit handler if the next RIP isn't valid, e.g. because KVM is running with nrips=false. SVM must decode and emulate to skip the instruction if the CPU doesn't provide the next RIP, and getting the instruction bytes to decode requires reading guest memory. Reading guest memory through the emulator can fault, i.e. can sleep, which is disallowed since the fastpath handlers run with IRQs disabled. BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:106 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 32611, name: qemu preempt_count: 1, expected: 0 INFO: lockdep is turned off. irq event stamp: 30580 hardirqs last enabled at (30579): [<ffffffffc08b2527>] vcpu_run+0x1787/0x1db0 [kvm] hardirqs last disabled at (30580): [<ffffffffb4f62e32>] __schedule+0x1e2/0xed0 softirqs last enabled at (30570): [<ffffffffb4247a64>] fpu_swap_kvm_fpstate+0x44/0x210 softirqs last disabled at (30568): [<ffffffffb4247a64>] fpu_swap_kvm_fpstate+0x44/0x210 CPU: 298 UID: 0 PID: 32611 Comm: qemu Tainted: G U 6.16.0-smp--e6c618b51cfe-sleep #782 NONE Tainted: [U]=USER Hardware name: Google Astoria-Turin/astoria, BIOS 0.20241223.2-0 01/17/2025 Call Trace: <TASK> dump_stack_lvl+0x7d/0xb0 __might_resched+0x271/0x290 __might_fault+0x28/0x80 kvm_vcpu_read_guest_page+0x8d/0xc0 [kvm] kvm_fetch_guest_virt+0x92/0xc0 [kvm] __do_insn_fetch_bytes+0xf3/0x1e0 [kvm] x86_decode_insn+0xd1/0x1010 [kvm] x86_emulate_instruction+0x105/0x810 [kvm] __svm_skip_emulated_instruction+0xc4/0x140 [kvm_amd] handle_fastpath_invd+0xc4/0x1a0 [kvm] vcpu_run+0x11a1/0x1db0 [kvm] kvm_arch_vcpu_ioctl_run+0x5cc/0x730 [kvm] kvm_vcpu_ioctl+0x578/0x6a0 [kvm] __se_sys_ioctl+0x6d/0xb0 do_syscall_64+0x8a/0x2c0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f479d57a94b </TASK> Note, this is essentially a reapply of commit 5c30e8101e8d ("KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"), but with different justification (KVM now grabs SRCU when skipping the instruction for other reasons). Fixes: b439eb8ab578 ("Revert "KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250805190526.1453366-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>