path: root/arch/x86/kvm/x86.c

Age | Commit message | Author | Files | Lines
2025-05-13 | Merge commit 'its-for-linus-20250509-merge' into x86/core, to resolve conflicts | Ingo Molnar | 1 | -1/+3
Conflicts:
  Documentation/admin-guide/hw-vuln/index.rst
  arch/x86/include/asm/cpufeatures.h
  arch/x86/kernel/alternative.c
  arch/x86/kernel/cpu/bugs.c
  arch/x86/kernel/cpu/common.c
  drivers/base/cpu.c
  include/linux/cpu.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13 | Merge branch 'x86/msr' into x86/core, to resolve conflicts | Ingo Molnar | 1 | -19/+19
Conflicts:
  arch/x86/boot/startup/sme.c
  arch/x86/coco/sev/core.c
  arch/x86/kernel/fpu/core.c
  arch/x86/kernel/fpu/xstate.c

Semantic conflict:
  arch/x86/include/asm/sev-internal.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-11 | Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -1/+3
Pull x86 ITS mitigation from Dave Hansen:
 "Mitigate the Indirect Target Selection (ITS) issue. I'd describe this
  one as a good old CPU bug where the behavior is _obviously_ wrong, but
  since it just results in bad predictions it wasn't wrong enough to
  notice. Well, the researchers noticed, and also realized that this bug
  undermined a bunch of existing indirect branch mitigations; thus the
  unusually wide impact of this one. Details:

  ITS is a bug in some Intel CPUs that affects indirect branches,
  including RETs, in the first half of a cacheline. Due to ITS such
  branches may get wrongly predicted to a target of a (direct or
  indirect) branch that is located in the second half of a cacheline.
  Researchers at VUSec found this behavior and reported it to Intel.

  Affected processors:
  - Cascade Lake, Cooper Lake, Whiskey Lake V, Coffee Lake R, Comet
    Lake, Ice Lake, Tiger Lake and Rocket Lake.

  Scope of impact:
  - Guest/host isolation: When eIBRS is used for guest/host isolation,
    the indirect branches in the VMM may still be predicted with targets
    corresponding to direct branches in the guest.
  - Intra-mode using cBPF: cBPF can be used to poison the branch history
    to exploit ITS. Realigning the indirect branches and RETs mitigates
    this attack vector.
  - User/kernel: With eIBRS enabled, user/kernel isolation is *not*
    impacted by ITS.
  - Indirect Branch Prediction Barrier (IBPB): Due to this bug, indirect
    branches may be predicted with targets corresponding to direct
    branches which were executed prior to IBPB. This will be fixed in
    microcode.

  Mitigation:
  As indirect branches in the first half of a cacheline are affected,
  the mitigation is to replace those indirect branches with a call to a
  thunk that is aligned to the second half of the cacheline. RETs that
  take their prediction from the RSB are not affected, but they may be
  affected by an RSB-underflow condition. So, RETs in the first half of
  a cacheline are also patched to a return thunk that executes the RET
  aligned to the second half of the cacheline"

* tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftest/x86/bugs: Add selftests for ITS
  x86/its: FineIBT-paranoid vs ITS
  x86/its: Use dynamic thunks for indirect branches
  x86/ibt: Keep IBT disabled during alternative patching
  mm/execmem: Unify early execmem_cache behaviour
  x86/its: Align RETs in BHB clear sequence to avoid thunking
  x86/its: Add support for RSB stuffing mitigation
  x86/its: Add "vmexit" option to skip mitigation on some CPUs
  x86/its: Enable Indirect Target Selection mitigation
  x86/its: Add support for ITS-safe return thunk
  x86/its: Add support for ITS-safe indirect thunk
  x86/its: Enumerate Indirect Target Selection (ITS) bug
  Documentation: x86/bugs/its: Add ITS documentation
2025-05-09 | x86/its: Enumerate Indirect Target Selection (ITS) bug | Pawan Gupta | 1 | -1/+3
The ITS bug in some pre-Alderlake Intel CPUs may allow indirect branches
in the first half of a cache line to get predicted to a target of a
branch located in the second half of the cache line.

Set X86_BUG_ITS on affected CPUs. The mitigation follows in later
commits.

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
2025-05-02 | KVM: x86: Revert kvm_x86_ops.mem_enc_ioctl() back to an OPTIONAL hook | Sean Christopherson | 1 | -2/+5
Restore KVM's handling of a NULL kvm_x86_ops.mem_enc_ioctl, as the hook
is NULL on SVM when CONFIG_KVM_AMD_SEV=n, and TDX will soon follow suit.

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 1 at arch/x86/include/asm/kvm-x86-ops.h:130 kvm_x86_vendor_init+0x178b/0x18e0
  Modules linked in:
  CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.15.0-rc2-dc1aead1a985-sink-vm #2 NONE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_x86_vendor_init+0x178b/0x18e0
  Call Trace:
   <TASK>
   svm_init+0x2e/0x60
   do_one_initcall+0x56/0x290
   kernel_init_freeable+0x192/0x1e0
   kernel_init+0x16/0x130
   ret_from_fork+0x30/0x50
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  ---[ end trace 0000000000000000 ]---

Opportunistically drop the superfluous curly braces.

Link: https://lore.kernel.org/all/20250318-vverma7-cleanup_x86_ops-v2-4-701e82d6b779@intel.com
Fixes: b2aaf38ced69 ("KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl")
Link: https://lore.kernel.org/r/20250502203421.865686-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
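For context, the OPTIONAL-hook pattern being restored looks roughly like
the following sketch; the helper name and exact error shape here are
illustrative, not the verbatim x86.c code:

/* Hedged sketch: guard the OPTIONAL vendor hook before invoking it,
 * returning -ENOTTY when the vendor module did not provide one
 * (e.g. SVM built with CONFIG_KVM_AMD_SEV=n). */
static long kvm_vm_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
{
	if (!kvm_x86_ops.mem_enc_ioctl)
		return -ENOTTY;	/* hook is OPTIONAL again */

	return kvm_x86_ops.mem_enc_ioctl(kvm, argp);
}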
2025-05-02 | Merge tag 'v6.15-rc4' into x86/msr, to pick up fixes and resolve conflicts | Ingo Molnar | 1 | -9/+23
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-29 | KVM: x86: Unify cross-vCPU IBPB | Sean Christopherson | 1 | -1/+19
Both SVM and VMX have similar implementations for executing an IBPB
between running different vCPUs on the same CPU, to create separate
prediction domains for different vCPUs.

For VMX, when the currently loaded VMCS is changed in
vmx_vcpu_load_vmcs(), an IBPB is executed if there is no 'buddy', which
is the case on vCPU load. The intention is to execute an IBPB when
switching vCPUs, but not when switching the VMCS within the same vCPU.
Executing an IBPB on nested transitions within the same vCPU is handled
separately and conditionally in nested_vmx_vmexit().

For SVM, the current VMCB is tracked on vCPU load and an IBPB is
executed when it is changed. The intention is also to execute an IBPB
when switching vCPUs, although it is possible that in some cases an IBPB
is executed when switching VMCBs for the same vCPU. Executing an IBPB on
nested transitions should be handled separately, and is proposed at [1].

Unify the logic by tracking the last loaded vCPU and executing the IBPB
on vCPU change in kvm_arch_vcpu_load() instead. When a vCPU is
destroyed, make sure all references to it are removed from any CPU. This
is similar to how SVM clears the current_vmcb tracking on vCPU
destruction. Remove the current VMCB tracking in SVM as it is no longer
required, as well as the 'buddy' parameter to vmx_vcpu_load_vmcs().

[1] https://lore.kernel.org/lkml/20250221163352.3818347-4-yosry.ahmed@linux.dev

Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
[sean: tweak comment to stay at/under 80 columns]
Signed-off-by: Sean Christopherson <seanjc@google.com>
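A rough sketch of the unified scheme, assuming a per-CPU tracker; the
variable name is invented, and the mitigation-enablement checks the
real code performs are omitted:

#include <linux/percpu.h>

/* Last vCPU that ran on each physical CPU. */
static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_last_vcpu);

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	/* New prediction domain: issue an IBPB only when a *different*
	 * vCPU is being loaded on this CPU. */
	if (per_cpu(kvm_last_vcpu, cpu) != vcpu) {
		indirect_branch_prediction_barrier();
		per_cpu(kvm_last_vcpu, cpu) = vcpu;
	}
	/* ... remainder of vCPU load ... */
}

On vCPU destruction, the real patch also clears any kvm_last_vcpu slots
that still point at the dying vCPU, mirroring SVM's old current_vmcb
cleanup.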
2025-04-24 | KVM: x86: Add module param to control and enumerate device posted IRQs | Sean Christopherson | 1 | -0/+6
Add a module param to each KVM vendor module to allow disabling device
posted interrupts without having to sacrifice all of APICv/AVIC, and to
also effectively enumerate to userspace whether or not KVM may be
utilizing device posted IRQs. Disabling device posted interrupts is very
desirable for testing, and can even be desirable for production
environments, e.g. if the host kernel wants to interpose on device
interrupts.

Put the module param in kvm-{amd,intel}.ko instead of kvm.ko to match
the overall APICv/AVIC controls, and to avoid complications with said
controls. E.g. if the param were in kvm.ko, KVM would need to snapshot
the original user-defined value to play nice with a vendor module being
reloaded with different enable_apicv settings.

Link: https://lore.kernel.org/r/20250401161804.842968-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
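As a sketch, the vendor-module parameter pattern described above could
look like this; the parameter and helper names are illustrative, not
necessarily the ones the patch uses:

/* In the vendor module (kvm-intel.ko / kvm-amd.ko): a read-only param
 * that both controls and enumerates device posted-interrupt usage. */
static bool enable_device_posted_irqs = true;
module_param(enable_device_posted_irqs, bool, 0444);

/* Device posted IRQs additionally require APICv and IOMMU posting
 * support, so the effective setting folds those in. */
static bool vmx_can_use_device_posted_irqs(void)
{
	return enable_device_posted_irqs && enable_apicv &&
	       irq_remapping_cap(IRQ_POSTING_CAP);
}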
2025-04-24 | KVM: x86: Rescan I/O APIC routes after EOI interception for old routing | weizijie | 1 | -0/+1
Rescan I/O APIC routes for a vCPU after handling an intercepted I/O APIC
EOI for an IRQ that is not targeting said vCPU, i.e. after handling
what's effectively a stale EOI VM-Exit. If a level-triggered IRQ is
in-flight when IRQ routing changes, e.g. because the guest changes
routing from its IRQ handler, then KVM intercepts EOIs on both the new
and old target vCPUs, so that the in-flight IRQ can be de-asserted when
it's EOI'd.

However, only the EOI for the in-flight IRQ needs to be intercepted, as
IRQs on the same vector with the new routing are coincidental, i.e.
occur only if the guest is reusing the vector for multiple interrupt
sources. If the I/O APIC routes aren't rescanned, KVM will unnecessarily
intercept EOIs for the vector and negatively impact the vCPU's interrupt
performance.

Note, both commit db2bdcbbbd32 ("KVM: x86: fix edge EOI and IOAPIC
reconfig race") and commit 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix
level-triggered EOI and IOAPIC reconfigure race") mentioned this issue,
but it was considered a "rare" occurrence and thus was not addressed.
However, in real environments this issue can happen even in a
well-behaved guest.

Cc: Kai Huang <kai.huang@intel.com>
Co-developed-by: xuyun <xuyun_xy.xy@linux.alibaba.com>
Signed-off-by: xuyun <xuyun_xy.xy@linux.alibaba.com>
Signed-off-by: weizijie <zijie.wei@linux.alibaba.com>
[sean: massage changelog and comments, use int/-1, reset at scan]
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250304013335.4155703-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 | KVM: x86: clean up a return | Dan Carpenter | 1 | -1/+1
Returning a literal X86EMUL_CONTINUE is slightly clearer than returning
rc.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/7604cbbf-15e6-45a8-afec-cf5be46c2924@stanley.mountain
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 | KVM: x86: Check that the high 32bits are clear in kvm_arch_vcpu_ioctl_run() | Dan Carpenter | 1 | -2/+2
The "kvm_run->kvm_valid_regs" and "kvm_run->kvm_dirty_regs" variables
are u64 type. We are only using the lowest 3 bits, but we want to ensure
that users are not passing invalid bits so that we can use the remaining
bits in the future. However, "sync_valid_fields" and
kvm_sync_valid_fields() are u32 type, so the check only ensures that the
lower 32 bits are clear. Fix this by changing the types to u64.

Fixes: 74c1807f6c4f ("KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/ec25aad1-113e-4c6e-8941-43d432251398@stanley.mountain
Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24 | Merge branch 'kvm-fixes-6.15-rc4' into HEAD | Paolo Bonzini | 1 | -9/+19
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI
  code, adding an open-coded erratum check for Cavium ThunderX

* Bugfixes from a planned posted interrupt rework

* Do not use kvm_rip_read() unconditionally to cater for guests with
  inaccessible register state.
2025-04-24 | KVM: x86: Do not use kvm_rip_read() unconditionally for KVM_PROFILING | Adrian Hunter | 1 | -1/+2
Not all VMs allow access to RIP. Check guest_state_protected before
calling kvm_rip_read(). This avoids, for example, hitting WARN_ON_ONCE
in vt_cache_reg() for TDX VMs.

Fixes: 81bf912b2c15 ("KVM: TDX: Implement TDX vcpu enter/exit path")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20250415104821.247234-3-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
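The shape of the fix is roughly the following guard; this is a sketch,
and the exact call site in x86.c may differ:

/* Only read RIP when the register state is actually accessible;
 * for protected guests (e.g. TDX) fall back to 0. */
if (unlikely(prof_on == KVM_PROFILING)) {
	unsigned long rip = vcpu->arch.guest_state_protected ?
			    0 : kvm_rip_read(vcpu);
	profile_hit(KVM_PROFILING, (void *)rip);
}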
2025-04-24 | KVM: x86: Take irqfds.lock when adding/deleting IRQ bypass producer | Sean Christopherson | 1 | -2/+15
Take irqfds.lock when adding/deleting an IRQ bypass producer to ensure
irqfd->producer isn't modified while kvm_irq_routing_update() is
running. The only lock held when a producer is added/removed is
irqbypass's mutex.

Fixes: 872768800652 ("KVM: x86: select IRQ_BYPASS_MANAGER")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
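A hedged sketch of the locking change on the add-producer path; error
handling and the posted-interrupt routing update itself are elided:

int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
				     struct irq_bypass_producer *prod)
{
	struct kvm_kernel_irqfd *irqfd =
		container_of(cons, struct kvm_kernel_irqfd, consumer);
	struct kvm *kvm = irqfd->kvm;
	int ret = 0;

	/* Serialize against kvm_irq_routing_update(), which walks the
	 * irqfd list under this same lock. */
	spin_lock_irq(&kvm->irqfds.lock);
	irqfd->producer = prod;
	/* ... update IRQ bypass / posted-interrupt state for irqfd ... */
	spin_unlock_irq(&kvm->irqfds.lock);

	return ret;
}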
2025-04-24 | KVM: x86: Explicitly treat routing entry type changes as changes | Sean Christopherson | 1 | -1/+2
Explicitly treat type differences as GSI routing changes, as comparing
MSI data between two entries could get a false negative, e.g. if
userspace changed the type but left the type-specific data as-is.

Fixes: 515a0c79e796 ("kvm: irqfd: avoid update unmodified entries of the routing")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250404193923.1413163-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-24 | KVM: arm64, x86: make kvm_arch_has_irq_bypass() inline | Paolo Bonzini | 1 | -5/+0
kvm_arch_has_irq_bypass() is a small function and even though it does
not appear in any *really* hot paths, it's also not entirely rare. Make
it inline; that also works out nicely in preparation for using it in
kvm-intel.ko and kvm-amd.ko, since the function is not currently
exported.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-10 | x86/msr: Rename 'wrmsrl_safe()' to 'wrmsrq_safe()' | Ingo Molnar | 1 | -3/+3
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 | x86/msr: Rename 'rdmsrl_safe()' to 'rdmsrq_safe()' | Ingo Molnar | 1 | -6/+6
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 | x86/msr: Rename 'wrmsrl()' to 'wrmsrq()' | Ingo Molnar | 1 | -8/+8
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10 | x86/msr: Rename 'rdmsrl()' to 'rdmsrq()' | Ingo Molnar | 1 | -2/+2
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Xin Li <xin@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-07 | Merge branch 'kvm-tdx-initial' into HEAD | Paolo Bonzini | 1 | -31/+68
This large commit contains the initial support for TDX in KVM. All x86
parts enable the host-side hypercalls that KVM uses to talk to the TDX
module, a software component that runs in a special CPU mode called SEAM
(Secure Arbitration Mode).

The series is in turn split into multiple sub-series, each with a
separate merge commit:

- Initialization: basic setup for using the TDX module from KVM, plus
  ioctls to create TDX VMs and vCPUs.

- MMU: in TDX, private and shared halves of the address space are
  mapped by different EPT roots, and the private half is managed by the
  TDX module. Using the support that was added to the generic MMU code
  in 6.14, add support for TDX's secure page tables to the Intel side of
  KVM. Generic KVM code takes care of maintaining a mirror of the secure
  page tables so that they can be queried efficiently, and ensuring that
  changes are applied to both the mirror and the secure EPT.

- vCPU enter/exit: implement the callbacks that handle the entry of a
  TDX vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding
  save/restore of host state.

- Userspace exits: introduce support for guest TDVMCALLs that KVM
  forwards to userspace. These correspond to the usual KVM_EXIT_*
  "heavyweight vmexits" but are triggered through a different mechanism,
  similar to VMGEXIT for SEV-ES and SEV-SNP.

- Interrupt handling: support for virtual interrupt injection as well as
  handling VM-Exits that are caused by vectored events. Exclusive to TDX
  are machine-check SMIs, which the kernel already knows how to handle
  through the kernel machine check handler (commit 7911f145de5f,
  "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode").

- Loose ends: handling of the remaining exits from the TDX module,
  including EPT violation/misconfig and several TDVMCALL leaves that are
  handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus
  returning an error or ignoring operations that are not supported by
  TDX guests.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04 | KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses | Sean Christopherson | 1 | -0/+4
Acquire a lock on kvm->srcu when userspace is getting MP state to handle
a rather extreme edge case where "accepting" APIC events, i.e.
processing a pending INIT or SIPI, can trigger accesses to guest memory.
If the vCPU is in L2 with INIT *and* a TRIPLE_FAULT request pending,
then getting MP state will trigger a nested VM-Exit by way of
->check_nested_events(), and emulating the nested VM-Exit can access
guest memory.

The splat was originally hit by syzkaller on a Google-internal kernel,
and reproduced on an upstream kernel by hacking the
triple_fault_event_test selftest to stuff a pending INIT, store an MSR
on VM-Exit (to generate a memory access on VMX), and do
vcpu_mp_state_get() to trigger the scenario.

  =============================
  WARNING: suspicious RCU usage
  6.14.0-rc3-b112d356288b-vmx/pi_lockdep_false_pos-lock #3 Not tainted
  -----------------------------
  include/linux/kvm_host.h:1058 suspicious rcu_dereference_check() usage!

  other info that might help us debug this:
  rcu_scheduler_active = 2, debug_locks = 1
  1 lock held by triple_fault_ev/1256:
   #0: ffff88810df5a330 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x8b/0x9a0 [kvm]

  stack backtrace:
  CPU: 11 UID: 1000 PID: 1256 Comm: triple_fault_ev Not tainted 6.14.0-rc3-b112d356288b-vmx #3
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x7f/0x90
   lockdep_rcu_suspicious+0x144/0x190
   kvm_vcpu_gfn_to_memslot+0x156/0x180 [kvm]
   kvm_vcpu_read_guest+0x3e/0x90 [kvm]
   read_and_check_msr_entry+0x2e/0x180 [kvm_intel]
   __nested_vmx_vmexit+0x550/0xde0 [kvm_intel]
   kvm_check_nested_events+0x1b/0x30 [kvm]
   kvm_apic_accept_events+0x33/0x100 [kvm]
   kvm_arch_vcpu_ioctl_get_mpstate+0x30/0x1d0 [kvm]
   kvm_vcpu_ioctl+0x33e/0x9a0 [kvm]
   __x64_sys_ioctl+0x8b/0xb0
   do_syscall_64+0x6c/0x170
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250401150504.829812-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
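Conceptually, the fix is a standard SRCU read-side critical section
around the APIC-event processing; a hedged sketch, with the rest of the
handler's plumbing omitted:

int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
				    struct kvm_mp_state *mp_state)
{
	int idx, ret;

	vcpu_load(vcpu);

	/* Accepting APIC events may emulate a nested VM-Exit, which can
	 * read guest memory and thus requires kvm->srcu to be held. */
	idx = srcu_read_lock(&vcpu->kvm->srcu);
	ret = kvm_apic_accept_events(vcpu);
	srcu_read_unlock(&vcpu->kvm->srcu, idx);

	/* ... fill in mp_state ... */

	vcpu_put(vcpu);
	return ret < 0 ? ret : 0;
}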
2025-03-20 | Merge branch 'kvm-pre-tdx' into HEAD | Paolo Bonzini | 1 | -13/+27
- Add common secure TSC infrastructure for use within SNP and in the
  future TDX

- Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make
  sense to use the capability if the relevant registers are not
  available for reading or writing.
2025-03-20 | Merge branch 'kvm-nvmx-and-vm-teardown' into HEAD | Paolo Bonzini | 1 | -26/+14
The immediate issue being fixed here is a nVMX bug where KVM fails to
detect that, after nested VM-Exit, L1 has a pending IRQ (or NMI).
However, checking for a pending interrupt accesses the legacy PIC, and
x86's kvm_arch_destroy_vm() currently frees the PIC before destroying
vCPUs, i.e. checking for IRQs during the forced nested VM-Exit results
in a NULL pointer deref; fixing that is a prerequisite for the nVMX fix.

The remaining patches attempt to bring a bit of sanity to x86's VM
teardown code, which has accumulated a lot of cruft over the years. E.g.
KVM currently unloads each vCPU's MMUs in a separate operation from
destroying vCPUs, all because when guest SMP support was added, KVM had
a kludgy MMU teardown flow that broke when a VM had more than one vCPU.
And that oddity lived on, for 18 years...

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-19 | Merge tag 'kvm-x86-xen-6.15' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini | 1 | -4/+9
KVM Xen changes for 6.15:

- Don't write to the Xen hypercall page on MSR writes that are initiated
  by the host (userspace or KVM) to fix a class of bugs where KVM can
  write to guest memory at unexpected times, e.g. during vCPU creation
  if userspace has set the Xen hypercall MSR index to collide with an
  MSR that KVM emulates.

- Restrict the Xen hypercall MSR index to the unofficial synthetic range
  to reduce the set of possible collisions with MSRs that are emulated
  by KVM (collisions can still happen as KVM emulates Hyper-V MSRs,
  which also reside in the synthetic range).

- Clean up and optimize KVM's handling of Xen MSR writes and
  xen_hvm_config.

- Update Xen TSC leaves during CPUID emulation instead of modifying the
  CPUID entries when updating PV clocks, as there is no guarantee PV
  clocks will be updated between TSC frequency changes and CPUID
  emulation, and guest reads of the Xen TSC leaves should be rare, i.e.
  are not a hot path.
2025-03-19 | Merge tag 'kvm-x86-pvclock-6.15' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini | 1 | -58/+57
KVM PV clock changes for 6.15:

- Don't take kvm->lock when iterating over vCPUs in the suspend notifier
  to fix a largely theoretical deadlock.

- Use the vCPU's actual Xen PV clock information when starting the Xen
  timer, as the cached state in arch.hv_clock can be stale/bogus.

- Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different
  PV clocks.

- Restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier
  only accounts for kvmclock, and there's no evidence that the flag is
  actually supported by Xen guests.

- Clean up the per-vCPU "cache" of its reference pvclock, and instead
  only track the vCPU's TSC scaling (multiplier+shift) metadata (which
  is moderately expensive to compute, and rarely changes for modern
  setups).
2025-03-19 | Merge tag 'kvm-x86-misc-6.15' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini | 1 | -67/+69
KVM x86 misc changes for 6.15:

- Fix a bug in PIC emulation that caused KVM to emit a spurious
  KVM_REQ_EVENT.

- Add a helper to consolidate handling of mp_state transitions, and use
  it to clear pv_unhalted whenever a vCPU is made RUNNABLE.

- Defer runtime CPUID updates until KVM emulates a CPUID instruction, to
  coalesce updates when multiple pieces of vCPU state are changing, e.g.
  as part of a nested transition.

- Fix a variety of nested emulation bugs, and add VMX support for
  synthesizing nested VM-Exit on interception (instead of injecting #UD
  into L2).

- Drop "support" for PV Async #PF with protected guests without
  SEND_ALWAYS, as KVM can't get the current CPL.

- Misc cleanups
2025-03-14 | KVM: x86: remove shadow_memtype_mask | Paolo Bonzini | 1 | -1/+3
The IGNORE_GUEST_PAT quirk is inapplicable, and thus always-disabled, if
shadow_memtype_mask is zero. As long as vmx_get_mt_mask is not called
for the shadow paging case, there is no need to consult
shadow_memtype_mask and it can be removed altogether.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT | Yan Zhao | 1 | -1/+5
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.

On AMD platforms, KVM always honors guest PAT. On Intel, however, there
are two issues. First, KVM *cannot* honor guest PAT if the CPU feature
self-snoop is not supported. Second, UC access on certain Intel
platforms can be very slow[1] and honoring guest PAT on those platforms
may break some old guests that accidentally specify video RAM as UC.
Those old guests may never expect the slowness since KVM always forced
WB previously. See [2].

So, introduce a quirk that KVM can enable by default on all Intel
platforms to avoid breaking old unmodifiable guests. Newer userspace can
disable this quirk if it wishes KVM to honor guest PAT; disabling the
quirk will fail if self-snoop is not supported, i.e. if KVM cannot obey
the wish.

The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED
is another example of a quirk that is sometimes automatically disabled.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1]
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2]
Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
 no-self-snoop cases, as well as to remove the shadow_memtype_mask check
 from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: Introduce supported_quirks to block disabling quirks | Yan Zhao | 1 | -4/+5
Introduce supported_quirks in kvm_caps to store platform-specific
force-enabled quirks. No functional changes intended.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20250224070832.31394-1-yan.y.zhao@intel.com>
[Remove unsupported quirks at KVM_ENABLE_CAP time. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: Allow vendor code to disable quirks | Paolo Bonzini | 1 | -0/+2
In some cases, the handling of quirks is split between platform-specific
code and generic code, or it is done entirely in generic code, but the
relevant bug does not trigger on some platforms; for example, this will
be the case for "ignore guest PAT". Allow unaffected vendor modules to
disable handling of a quirk for all VMs via a new entry in kvm_caps.

Such quirks remain available in KVM_CAP_DISABLE_QUIRKS2, because that
API tells userspace that KVM *knows* that some of its past behavior was
bogus or just undesirable. In other words, it's plausible for userspace
to refuse to run if a quirk is not listed by KVM_CAP_DISABLE_QUIRKS2, so
preserve that and make it part of the API.

As an example, mark KVM_X86_QUIRK_CD_NW_CLEARED as auto-disabled on
Intel systems.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: do not allow re-enabling quirks | Paolo Bonzini | 1 | -1/+1
Allowing arbitrary re-enabling of quirks puts a limit on what the quirks
themselves can do, since you cannot assume that the quirk prevents a
particular state. More importantly, it also prevents KVM from disabling
a quirk at VM creation time, because userspace could always go back and
re-enable it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: Move KVM_MAX_MCE_BANKS to header file | Isaku Yamahata | 1 | -1/+0
Move KVM_MAX_MCE_BANKS to a header file so that it can be used for TDX
in a future patch.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: split into new patch]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: TDX: Force APICv active for TDX guest | Isaku Yamahata | 1 | -0/+6
Force APICv active for TDX guests in KVM, because APICv is always
enabled by the TDX module.

From the view of KVM, whether the APICv state is active or not is
decided by:
1. APIC is hw enabled
2. VM and vCPU have no inhibit reasons set.

After TDX vCPU init, APIC is set to x2APIC mode. KVM_SET_{SREGS,SREGS2}
are rejected because has_protected_state for TDs and
guest_state_protected for TDX vCPUs are set. Reject
KVM_{GET,SET}_LAPIC from userspace since migration is not supported yet,
so that userspace cannot disable the APIC.

For the various APICv inhibit reasons:
- APICV_INHIBIT_REASON_DISABLED is impossible after checking
  enable_apicv in tdx_bringup(). If !enable_apicv, TDX support will be
  disabled.
- APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED is impossible since x2APIC is
  mandatory and KVM emulates APIC_ID as read-only for x2APIC mode.
  (Note: APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED could be set if the
  memory allocation fails for the KVM apic_map.)
- APICV_INHIBIT_REASON_HYPERV is impossible since TDX doesn't support
  HyperV guests yet.
- APICV_INHIBIT_REASON_ABSENT is impossible since the in-kernel LAPIC is
  checked in tdx_vcpu_create().
- APICV_INHIBIT_REASON_BLOCKIRQ is impossible since TDX doesn't support
  KVM_SET_GUEST_DEBUG.
- APICV_INHIBIT_REASON_APIC_ID_MODIFIED is impossible since x2APIC is
  mandatory.
- APICV_INHIBIT_REASON_APIC_BASE_MODIFIED is impossible since KVM
  rejects userspace attempts to set the APIC base.
- The rest of the inhibit reasons are relevant only to AMD's AVIC,
  including APICV_INHIBIT_REASON_NESTED, APICV_INHIBIT_REASON_IRQWIN,
  APICV_INHIBIT_REASON_PIT_REINJ, APICV_INHIBIT_REASON_SEV, and
  APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED. (For
  APICV_INHIBIT_REASON_PIT_REINJ, similar to AVIC, KVM can't intercept
  EOI for TDX guests either, but KVM enforces KVM_IRQCHIP_SPLIT for TDX
  guests, which eliminates the in-kernel PIT.)

Implement vt_refresh_apicv_exec_ctrl() to call KVM_BUG_ON() if APICv is
disabled for TDX guests.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014757.897978-12-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: TDX: Handle TDX PV MMIO hypercall | Sean Christopherson | 1 | -0/+1
Handle the TDX PV MMIO hypercall when a TDX guest calls TDVMCALL with
the leaf #VE.RequestMMIO (same value as EXIT_REASON_EPT_VIOLATION),
according to the TDX Guest Host Communication Interface (GHCI) spec.

For TDX guests, the VMM is not allowed to access vCPU registers or
private memory, and code instructions must be fetched from private
memory. So the MMIO emulation implemented for non-TDX VMs is not
possible for TDX guests.

In TDX, the MMIO regions are instead configured by the VMM to trigger a
#VE exception in the guest. The #VE handling is supposed to emulate the
MMIO instruction inside the guest and convert it into a TDVMCALL with
the leaf #VE.RequestMMIO, which equals EXIT_REASON_EPT_VIOLATION.

The requested MMIO address must be in shared GPA space. The shared bit
is stripped after the check because the existing code for MMIO emulation
is not aware of the shared bit.

The MMIO GPA shouldn't have a valid memslot, and the attribute of the
GPA should be shared. KVM could do the checks before exiting to
userspace; however, even if KVM does the check, there will still be race
conditions between the check in KVM and the emulation of the MMIO access
in userspace, due to a memslot hotplug or a memory attribute conversion.
If userspace doesn't check the attribute of the GPA and the attribute
happens to be private, it will not pose a security risk or cause an MCE,
but it can lead to another issue. E.g., in QEMU, treating a GPA with
private attribute as shared when it falls within RAM's range can result
in extra memory consumption during emulation of the access to the HVA of
the GPA. There are two options:
1) Do the check both in KVM and userspace.
2) Do the check only in QEMU.
This patch chooses option 2, i.e. KVM omits the memslot and attribute
checks, and expects userspace to do the checks.

Similar to normal MMIO emulation, try to handle the MMIO in the kernel
first; if the kernel can't support it, forward the request to userspace.
Export the symbols needed for MMIO handling.

Fragment handling is not needed for TDX PV MMIO because the GPA is
provided; if an MMIO access crosses a page boundary, it should be
continuous in GPA. Also, the size is limited to 1, 2, 4, or 8 bytes, so
no further split is needed. Allow cross-page access because no extra
handling is needed after checking that both the start and end GPAs are
shared GPAs.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250222014225.897298-10-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: Move pv_unhalted check out of kvm_vcpu_has_events() | Binbin Wu | 1 | -6/+5
Move the pv_unhalted check out of kvm_vcpu_has_events(), check
pv_unhalted explicitly when handling PV unhalt, and expose
kvm_vcpu_has_events().

kvm_vcpu_has_events() returns true if pv_unhalted is set, and
pv_unhalted is only cleared on transitions to KVM_MP_STATE_RUNNABLE. If
the guest initiates a spurious wakeup, pv_unhalted could be left set in
perpetuity. Currently, this is not problematic because
kvm_vcpu_has_events() is only called when handling PV unhalt. However,
if kvm_vcpu_has_events() is used for other purposes in the future, it
could return unexpected results.

Export kvm_vcpu_has_events() for its usage in broader contexts.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-3-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: Have ____kvm_emulate_hypercall() read the GPRs | Binbin Wu | 1 | -7/+8
Have ____kvm_emulate_hypercall() read the GPRs instead of passing them
in via the macro.

When emulating KVM hypercalls via TDVMCALL, TDX will marshall registers
of the TDVMCALL ABI into KVM's x86 registers to match the definition of
the KVM hypercall ABI _before_ ____kvm_emulate_hypercall() gets called.
Therefore, ____kvm_emulate_hypercall() can just read registers
internally based on the KVM hypercall ABI, and those registers can be
removed from the __kvm_emulate_hypercall() macro.

Also, op_64_bit can be determined inside ____kvm_emulate_hypercall(), so
remove it from the __kvm_emulate_hypercall() macro as well.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20250222014225.897298-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
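Roughly, the register reads move inside the helper along these lines; a
sketch using KVM's register accessors, with the real function's extra
parameters and plumbing omitted:

int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, int cpl)
{
	/* Read the hypercall arguments per the KVM hypercall ABI
	 * instead of receiving them from the macro; TDX has already
	 * marshalled the TDVMCALL registers into these GPRs. */
	unsigned long nr = kvm_rax_read(vcpu);
	unsigned long a0 = kvm_rbx_read(vcpu);
	unsigned long a1 = kvm_rcx_read(vcpu);
	unsigned long a2 = kvm_rdx_read(vcpu);
	unsigned long a3 = kvm_rsi_read(vcpu);
	int op_64_bit = is_64_bit_hypercall(vcpu);

	/* ... dispatch on nr ... */
}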
2025-03-14 | KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior | Isaku Yamahata | 1 | -1/+3
Add a flag KVM_DEBUGREG_AUTO_SWITCH to skip saving/restoring guest DRs.

TDX-SEAM unconditionally saves/restores guest DRs on TD exit/enter, and
resets DRs to architectural INIT state on TD exit. Use the new flag
KVM_DEBUGREG_AUTO_SWITCH to indicate that KVM doesn't need to
save/restore guest DRs.

KVM still needs to restore host DRs after TD exit if there are active
breakpoints in the host, which is covered by the existing code.

MOV-DR exiting is always cleared for TDX guests, so the handler for DR
access is never called, and KVM_DEBUGREG_WONT_EXIT is never set. Add a
warning if both KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_AUTO_SWITCH are
set.

Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
[binbin: rework changelog]
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20241210004946.3718496-2-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-13-adrian.hunter@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr | Chao Gao | 1 | -5/+19
Several MSRs are constant and only used in userspace (ring 3). But VMs
may have different values. KVM uses kvm_set_user_return_msr() to switch
to the guest's values and leverages the user return notifier to restore
them when the kernel is to return to userspace. To eliminate unnecessary
wrmsr, KVM also caches the value it wrote to an MSR last time.

The TDX module unconditionally resets some of these MSRs to
architectural INIT state on TD exit. This makes the cached values in
kvm_user_return_msrs inconsistent with the values in hardware. This
inconsistency needs to be fixed; otherwise, it may mislead
kvm_on_user_return() into skipping the restoration of some MSRs to the
host's values. kvm_set_user_return_msr() can help correct this case, but
it is not optimal as it always does a wrmsr. So, introduce a variation
of kvm_set_user_return_msr() to update the cached values and skip that
wrmsr.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250129095902.16391-9-adrian.hunter@intel.com>
Reviewed-by: Xiayao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
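A sketch of the cache-only variant; the function and field names here
are modeled on the changelog's description, not verified against the
actual patch:

/* Record the value hardware already holds for a user-return MSR
 * without issuing a WRMSR, e.g. after the TDX module has reset the
 * MSR to INIT state and KVM knows the new hardware value. */
void kvm_user_return_msr_update_cache(unsigned int slot, u64 value)
{
	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);

	msrs->values[slot].curr = value;
	/* Ensure the notifier fires so the host value is restored on
	 * return to userspace. */
	kvm_user_return_register_notifier(msrs);
}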
2025-03-14 | KVM: x86: Make cpu_dirty_log_size a per-VM value | Yan Zhao | 1 | -3/+3
Make cpu_dirty_log_size (the CPU's dirty log buffer size) a per-VM value
and set the per-VM cpu_dirty_log_size only for normal VMs when PML is
enabled. Do not set it for TDs.

Until now, cpu_dirty_log_size was a system-wide value that is used for
all VMs and is set to the PML buffer size when PML is enabled in VMX.
However, PML is not currently supported for TDs, though PML remains
available for normal VMs as long as the feature is supported by hardware
and enabled in VMX.

Making cpu_dirty_log_size a per-VM value allows it to be the PML buffer
size for normal VMs and 0 for TDs. This allows functions like
kvm_arch_sync_dirty_log() and kvm_mmu_update_cpu_dirty_logging() to
determine if PML is supported, in order to kick off vCPUs or request
them to update CPU dirty logging status (turn on/off PML in VMCS).

This fixes an issue first reported in [1], where QEMU attaches an
emulated VGA device to a TD; note that KVM_MEM_LOG_DIRTY_PAGES still
works if the corresponding memslot has no KVM_MEM_GUEST_MEMFD flag. KVM
then invokes kvm_mmu_update_cpu_dirty_logging() and from there
vmx_update_cpu_dirty_logging(), which incorrectly accesses a kvm_vmx
struct for a TDX VM.

Reported-by: ANAND NARSHINHA PATIL <Anand.N.Patil@ibm.com>
Reported-by: Pedro Principeza <pedro.principeza@canonical.com>
Reported-by: Farrah Chen <farrah.chen@intel.com>
Closes: https://github.com/canonical/tdx/issues/202
Link: https://github.com/canonical/tdx/issues/202 [1]
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
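Consumers then read a per-VM field instead of a global, conceptually
like this; the accessor shape is illustrative:

/* 0 means "no PML / no CPU dirty logging" for this VM (e.g. a TD);
 * a nonzero value is the PML buffer size for a normal VM. */
static inline int kvm_cpu_dirty_log_size(struct kvm *kvm)
{
	return kvm->arch.cpu_dirty_log_size;
}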
2025-03-14 | KVM: TDX: Do TDX specific vcpu initialization | Isaku Yamahata | 1 | -0/+7
TD guest vCPUs need TDX specific initialization before running.
Repurpose KVM_MEMORY_ENCRYPT_OP to be vCPU-scoped, add a new sub-command
KVM_TDX_INIT_VCPU, and implement the callback for it.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Co-developed-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix comment: https://lore.kernel.org/kvm/Z36OYfRW9oPjW8be@google.com/ (Sean)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: TDX: Support per-VM KVM_CAP_MAX_VCPUS extension check | Isaku Yamahata | 1 | -0/+2
Change to report the KVM_CAP_MAX_VCPUS extension on a per-VM basis
instead of globally, so that userspace can query the maximum vCPUs for a
TDX guest via a per-VM KVM_CAP_MAX_VCPUS extension check.

Today KVM x86 reports KVM_MAX_VCPUS as the maximum vCPUs for all guests
globally, and userspace, i.e. Qemu, queries the KVM_CAP_MAX_VCPUS
extension globally but not on a per-VM basis.

TDX has its own limit on the maximum vCPUs it can support for all TDX
guests, in addition to KVM_MAX_VCPUS. The TDX module reports this limit
via the MAX_VCPU_PER_TD global metadata. Different modules may report
different values. In practice, the reported value reflects the maximum
logical CPUs that ALL the platforms that the module supports can
possibly have. Note some old modules may also not support this metadata,
in which case the limit is U16_MAX.

The current way of always reporting KVM_MAX_VCPUS in the
KVM_CAP_MAX_VCPUS extension is not enough for TDX. To accommodate TDX,
change to report the KVM_CAP_MAX_VCPUS extension on a per-VM basis.
Specifically, override kvm->max_vcpus in tdx_vm_init() for TDX guests,
and report kvm->max_vcpus in the KVM_CAP_MAX_VCPUS extension check.

Change to report "the number of logical CPUs the platform has" as the
maximum vCPUs for TDX guests. Simply forwarding the MAX_VCPU_PER_TD
reported by the TDX module would result in an unpredictable ABI, because
the value reported to userspace would depend on the whims of TDX
modules. This works in practice because the MAX_VCPU_PER_TD reported by
the TDX module will never be smaller than the number reported to
userspace. But to make sure KVM never reports an unsupported value,
sanity check that the MAX_VCPU_PER_TD reported by the TDX module is not
smaller than the number of logical CPUs the platform has, and otherwise
refuse to use TDX.

Note, when creating a TDX guest, TDX actually requires the "maximum
vCPUs for _this_ TDX guest" as an input to initialize the TDX guest. But
a TDX guest's maximum vCPUs is not part of TDREPORT, thus not part of
attestation, thus there's no need to allow userspace to explicitly
_configure_ the maximum vCPUs on a per-VM basis. KVM will simply use
kvm->max_vcpus as input when initializing the TDX guest.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
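The per-VM extension check then has roughly this shape in the
KVM_CHECK_EXTENSION handler; a sketch mirroring the changelog:

case KVM_CAP_MAX_VCPUS:
	r = KVM_MAX_VCPUS;
	/* With a VM fd, report the per-VM limit, which tdx_vm_init()
	 * lowers for TDX guests. */
	if (kvm)
		r = kvm->max_vcpus;
	break;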
2025-03-14 | KVM: TDX: create/destroy VM structure | Isaku Yamahata | 1 | -0/+1
Implement managing the TDX private KeyID to create, destroy, and free a
TDX guest.

When creating a TDX guest, assign a TDX private KeyID for the TDX guest
for memory encryption, and allocate pages for the guest. These are used
for the Trust Domain Root (TDR) and Trust Domain Control Structure
(TDCS). On destruction, free the allocated pages and the KeyID.

Before tearing down the private page tables, TDX requires the guest TD
to be destroyed by reclaiming the KeyID. Do it in the vm_pre_destroy()
kvm_x86_ops hook. The TDR control structures can be freed in the
vm_destroy() hook, which runs last.

Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Fix build issue in kvm-coco-queue
- Init ret earlier to fix __tdx_td_init() error handling. (Chao)
- Standardize -EAGAIN for __tdx_td_init() retry errors (Rick)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl | Isaku Yamahata | 1 | -4/+0
KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific
to guest state-protected VMs. It defined subcommands for
technology-specific operations under KVM_MEMORY_ENCRYPT_OP. Despite its
name, the subcommands are not limited to memory encryption; various
technology-specific operations are defined. It's natural to repurpose
KVM_MEMORY_ENCRYPT_OP for TDX specific operations and define
subcommands.

Add a place holder function for the TDX specific VM-scoped ioctl as
mem_enc_op. TDX specific sub-commands will be added to retrieve/pass TDX
specific parameters. Make mem_enc_ioctl non-optional as it's always
filled.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
- Drop the misleading "defined for consistency" line. It's a copy-paste
  error introduced in the earlier patches. Earlier there was padding at
  the end to match struct kvm_sev_cmd size. (Tony)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected | Paolo Bonzini | 1 | -4/+11
KVM_CAP_SYNC_REGS does not make sense for VMs with protected guest
state, since the register values cannot actually be written. Return 0
when using the VM-level KVM_CHECK_EXTENSION ioctl, and accordingly
return -EINVAL from KVM_RUN if the valid/dirty fields are nonzero.

However, on exit from KVM_RUN userspace could have placed a nonzero
value into kvm_run->kvm_valid_regs, so check guest_state_protected again
and skip store_regs() in that case.

Cc: stable@vger.kernel.org
Fixes: 517987e3fb19 ("KVM: x86: add fields to struct kvm_arch for CoCo features")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20250306202923.646075-1-pbonzini@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
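The KVM_RUN side of the change amounts to guards of roughly this shape;
a sketch of the described logic, not the verbatim patch:

/* Protected guest state: the sync-regs machinery cannot work, so
 * reject any attempt to use it. */
if (vcpu->arch.guest_state_protected &&
    (kvm_run->kvm_valid_regs || kvm_run->kvm_dirty_regs))
	return -EINVAL;

/* ... after the vCPU runs: userspace may have set kvm_valid_regs in
 * the meantime, so re-check before storing registers. */
if (kvm_run->kvm_valid_regs && !vcpu->arch.guest_state_protected)
	store_regs(vcpu);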
2025-03-14 | KVM: x86: Add infrastructure for secure TSC | Isaku Yamahata | 1 | -2/+9
Add a guest_tsc_protected member to struct kvm_arch_vcpu and prohibit
changing the TSC offset/multiplier when guest_tsc_protected is true.

X86 confidential computing technology defines a protected guest TSC so
that the VMM can't change the TSC offset/multiplier once the vCPU is
initialized. SEV-SNP defines Secure TSC as optional, whereas TDX
mandates it.

KVM has common logic on x86 that tries to guess or adjust the TSC
offset/multiplier for better guest TSC and TSC interrupt latency at KVM
vCPU creation (kvm_arch_vcpu_postcreate()), vCPU migration over pCPU
(kvm_arch_vcpu_load()), vCPU TSC device attributes
(kvm_arch_tsc_set_attr()) and guest/host writing to the TSC or TSC
adjust MSR (kvm_set_msr_common()). The current x86 KVM implementation
conflicts with protected TSC because the VMM can't change the TSC
offset/multiplier. Because KVM emulates the TSC timer or the TSC
deadline timer with the TSC offset/multiplier, the TSC timer interrupt
is injected to the guest at the wrong time if the KVM TSC offset is
different from what the TDX module determined.

Originally this issue was found by the cyclic test of rt-tests [1], as
the latency in the TDX case is worse than the VMX value + TDX SEAMCALL
overhead. It turned out that the KVM TSC offset is different from what
the TDX module determines.

Disable or ignore the KVM logic to change/adjust the TSC
offset/multiplier, thus keeping the KVM TSC offset/multiplier the same
as the value of the TDX module. Writes to MSR_IA32_TSC are also blocked
as they amount to a change in the TSC offset.

[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git

Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <3a7444aec08042fe205666864b6858910e86aa98.1728719037.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
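The prohibition described above amounts to early-outs in the TSC write
paths, roughly as follows; only the new checks are shown, and the
function names are taken from KVM's existing TSC code:

static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
{
	/* Protected guest TSC: the offset is owned by the TDX module /
	 * SNP firmware, never by KVM. */
	if (vcpu->arch.guest_tsc_protected)
		return;
	/* ... existing offset write ... */
}

static void kvm_vcpu_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 l1_multiplier)
{
	if (vcpu->arch.guest_tsc_protected)
		return;
	/* ... existing multiplier write ... */
}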
2025-03-14 | KVM: x86: Push down setting vcpu.arch.user_set_tsc | Isaku Yamahata | 1 | -6/+6
Push down setting vcpu.arch.user_set_tsc to true from
kvm_synchronize_tsc() to __kvm_synchronize_tsc() so that the two callers
don't have to modify user_set_tsc directly, as preparation. A later
change will prohibit changing TSC synchronization for TDX guests by
modifying __kvm_synchronize_tsc(); we don't want to touch the caller
sites, so as not to change user_set_tsc.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <62b1a7a35d6961844786b6e47e8ecb774af7a228.1728719037.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 | KVM: x86: move vm_destroy callback at end of kvm_arch_destroy_vm | Paolo Bonzini | 1 | -1/+1
TDX needs to free the TDR control structures last, after all paging
structures have been torn down; move the vm_destroy callback to a
suitable place. The new place is also okay for AMD; the main difference
is that the MMU has been torn down and, if anything, that is better done
before the SNP ASID is released.

Extracted from a patch by Yan Zhao.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-09 | Merge tag 'kvm-x86-fixes-6.14-rcN.2' of https://github.com/kvm-x86/linux into HEAD | Paolo Bonzini | 1 | -0/+2
KVM x86 fixes for 6.14-rcN #2:

- Set RFLAGS.IF in C code on SVM to get VMRUN out of the STI shadow.

- Ensure DEBUGCTL is context switched on AMD to avoid running the guest
  with the host's value, which can lead to unexpected bus lock #DBs.

- Suppress DEBUGCTL.BTF on AMD (to match Intel), as KVM doesn't properly
  emulate BTF. KVM's lack of context switching has meant BTF has always
  been broken to some extent.

- Always save DR masks for SNP vCPUs if DebugSwap is *supported*, as the
  guest can enable DebugSwap without KVM's knowledge.

- Fix a bug in mmu_stress_tests where a vCPU could finish the "writes to
  RO memory" phase without actually generating a write-protection fault.

- Fix a printf() goof in the SEV smoke test that causes build failures
  with -Werror.

- Explicitly zero EAX and EBX in CPUID.0x8000_0022 output when
  PERFMON_V2 isn't supported by KVM.
2025-02-28 | KVM: x86: Always set mp_state to RUNNABLE on wakeup from HLT | Sean Christopherson | 1 | -3/+2
When emulating HLT and a wake event is already pending, explicitly mark
the vCPU RUNNABLE (via kvm_set_mp_state()) instead of assuming the vCPU
is already in the appropriate state. Barring a KVM bug, it should be
impossible for the vCPU to be in a non-RUNNABLE state, but there is no
advantage to relying on that to hold true, and ensuring the vCPU is made
RUNNABLE avoids non-deterministic behavior with respect to pv_unhalted.

E.g. if the vCPU is not already RUNNABLE, then depending on when
pv_unhalted is set, KVM could either leave the vCPU in the non-RUNNABLE
state (set before __kvm_emulate_halt()), or transition the vCPU to
HALTED and then RUNNABLE (pv_unhalted set after the
kvm_vcpu_has_events() check).

Link: https://lore.kernel.org/r/20250224174156.2362059-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
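As a hedged sketch, the HLT-emulation path becomes something like the
following; kvm_set_mp_state() is the consolidation helper mentioned in
the 6.15 misc changes above, and the surrounding bookkeeping is
simplified:

static int __kvm_emulate_halt(struct kvm_vcpu *vcpu, int state, int reason)
{
	/* A wake event is already pending: don't actually halt, and
	 * explicitly make the vCPU RUNNABLE rather than assuming it
	 * already is. */
	if (kvm_vcpu_has_events(vcpu)) {
		kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
		return 1;
	}

	kvm_set_mp_state(vcpu, state);	/* e.g. KVM_MP_STATE_HALTED */
	return 0;
}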