| Age | Commit message (Collapse) | Author | Files | Lines |
|
KVM x86 fixes for 6.18:
- Expand the KVM_PRE_FAULT_MEMORY selftest to add a regression test for the
bug fixed by commit 3ccbf6f47098 ("KVM: x86/mmu: Return -EAGAIN if userspace
deletes/moves memslot during prefault")
- Don't try to get PMU capabbilities from perf when running a CPU with hybrid
CPUs/PMUs, as perf will rightly WARN.
- Rework KVM_CAP_GUEST_MEMFD_MMAP (newly introduced in 6.18) into a more
generic KVM_CAP_GUEST_MEMFD_FLAGS
- Add a guest_memfd INIT_SHARED flag and require userspace to explicitly set
said flag to initialize memory as SHARED, irrespective of MMAP. The
behavior merged in 6.18 is that enabling mmap() implicitly initializes
memory as SHARED, which would result in an ABI collision for x86 CoCo VMs
as their memory is currently always initialized PRIVATE.
- Allow mmap() on guest_memfd for x86 CoCo VMs, i.e. on VMs with private
memory, to enable testing such setups, i.e. to hopefully flush out any
other lurking ABI issues before 6.18 is officially released.
- Add testcases to the guest_memfd selftest to cover guest_memfd without MMAP,
and host userspace accesses to mmap()'d private memory.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more x86 updates from Borislav Petkov:
- Remove a bunch of asm implementing condition flags testing in KVM's
emulator in favor of int3_emulate_jcc() which is written in C
- Replace KVM fastops with C-based stubs which avoids problems with the
fastop infra related to latter not adhering to the C ABI due to their
special calling convention and, more importantly, bypassing compiler
control-flow integrity checking because they're written in asm
- Remove wrongly used static branches and other ugliness accumulated
over time in hyperv's hypercall implementation with a proper static
function call to the correct hypervisor call variant
- Add some fixes and modifications to allow running FRED-enabled
kernels in KVM even on non-FRED hardware
- Add kCFI improvements like validating indirect calls and prepare for
enabling kCFI with GCC. Add cmdline params documentation and other
code cleanups
- Use the single-byte 0xd6 insn as the official #UD single-byte
undefined opcode instruction as agreed upon by both x86 vendors
- Other smaller cleanups and touchups all over the place
* tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86,retpoline: Optimize patch_retpoline()
x86,ibt: Use UDB instead of 0xEA
x86/cfi: Remove __noinitretpoline and __noretpoline
x86/cfi: Add "debug" option to "cfi=" bootparam
x86/cfi: Standardize on common "CFI:" prefix for CFI reports
x86/cfi: Document the "cfi=" bootparam options
x86/traps: Clarify KCFI instruction layout
compiler_types.h: Move __nocfi out of compiler-specific header
objtool: Validate kCFI calls
x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y
x86/fred: Play nice with invoking asm_fred_entry_from_kvm() on non-FRED hardware
x86/fred: Install system vector handlers even if FRED isn't fully enabled
x86/hyperv: Use direct call to hypercall-page
x86/hyperv: Clean up hv_do_hypercall()
KVM: x86: Remove fastops
KVM: x86: Convert em_salc() to C
KVM: x86: Introduce EM_ASM_3WCL
KVM: x86: Introduce EM_ASM_1SRC2
KVM: x86: Introduce EM_ASM_2CL
KVM: x86: Introduce EM_ASM_2W
...
|
|
Allow mmap() on guest_memfd instances for x86 VMs with private memory as
the need to track private vs. shared state in the guest_memfd instance is
only pertinent to INIT_SHARED. Doing mmap() on private memory isn't
terrible useful (yet!), but it's now possible, and will be desirable when
guest_memfd gains support for other VMA-based syscalls, e.g. mbind() to
set NUMA policy.
Lift the restriction now, before MMAP support is officially released, so
that KVM doesn't need to add another capability to enumerate support for
mmap() on private memory.
Fixes: 3d3a04fad25a ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20251003232606.4070510-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly zero kvm_host_pmu instead of attempting to get the perf PMU
capabilities when running on a hybrid CPU to avoid running afoul of perf's
sanity check.
------------[ cut here ]------------
WARNING: arch/x86/events/core.c:3089 at perf_get_x86_pmu_capability+0xd/0xc0,
Call Trace:
<TASK>
kvm_x86_vendor_init+0x1b0/0x1a40 [kvm]
vmx_init+0xdb/0x260 [kvm_intel]
vt_init+0x12/0x9d0 [kvm_intel]
do_one_initcall+0x60/0x3f0
do_init_module+0x97/0x2b0
load_module+0x2d08/0x2e30
init_module_from_file+0x96/0xe0
idempotent_init_module+0x117/0x330
__x64_sys_finit_module+0x73/0xe0
Always read the capabilities for non-hybrid CPUs, i.e. don't entirely
revert to reading capabilities if and only if KVM wants to use a PMU, as
it may be useful to have the host PMU capabilities available, e.g. if only
or debug.
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/all/70b64347-2aca-4511-af78-a767d5fa8226@intel.com/
Fixes: 51f34b1e650f ("KVM: x86/pmu: Snapshot host (i.e. perf's) reported PMU capabilities")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20251010005239.146953-1-dapeng1.mi@linux.intel.com
[sean: rework changelog, call out hybrid CPUs in shortlog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Unify guest entry code for KVM and MSHV (Sean Christopherson)
- Switch Hyper-V MSI domain to use msi_create_parent_irq_domain()
(Nam Cao)
- Add CONFIG_HYPERV_VMBUS and limit the semantics of CONFIG_HYPERV
(Mukesh Rathor)
- Add kexec/kdump support on Azure CVMs (Vitaly Kuznetsov)
- Deprecate hyperv_fb in favor of Hyper-V DRM driver (Prasanna
Kumar T S M)
- Miscellaneous enhancements, fixes and cleanups (Abhishek Tiwari,
Alok Tiwari, Nuno Das Neves, Wei Liu, Roman Kisel, Michael Kelley)
* tag 'hyperv-next-signed-20251006' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hyperv: Remove the spurious null directive line
MAINTAINERS: Mark hyperv_fb driver Obsolete
fbdev/hyperv_fb: deprecate this in favor of Hyper-V DRM driver
Drivers: hv: Make CONFIG_HYPERV bool
Drivers: hv: Add CONFIG_HYPERV_VMBUS option
Drivers: hv: vmbus: Fix typos in vmbus_drv.c
Drivers: hv: vmbus: Fix sysfs output format for ring buffer index
Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()
x86/hyperv: Switch to msi_create_parent_irq_domain()
mshv: Use common "entry virt" APIs to do work in root before running guest
entry: Rename "kvm" entry code assets to "virt" to genericize APIs
entry/kvm: KVM: Move KVM details related to signal/-EINTR into KVM proper
mshv: Handle NEED_RESCHED_LAZY before transferring to guest
x86/hyperv: Add kexec/kdump support on Azure CVMs
Drivers: hv: Simplify data structures for VMBus channel close message
Drivers: hv: util: Cosmetic changes for hv_utils_transport.c
mshv: Add support for a new parent partition configuration
clocksource: hyper-v: Skip unnecessary checks for the root partition
hyperv: Add missing field to hv_output_map_device_interrupt
|
|
Pull x86 kvm updates from Paolo Bonzini:
"Generic:
- Rework almost all of KVM's exports to expose symbols only to KVM's
x86 vendor modules (kvm-{amd,intel}.ko and PPC's kvm-{pr,hv}.ko
x86:
- Rework almost all of KVM x86's exports to expose symbols only to
KVM's vendor modules, i.e. to kvm-{amd,intel}.ko
- Add support for virtualizing Control-flow Enforcement Technology
(CET) on Intel (Shadow Stacks and Indirect Branch Tracking) and AMD
(Shadow Stacks).
It is worth noting that while SHSTK and IBT can be enabled
separately in CPUID, it is not really possible to virtualize them
separately. Therefore, Intel processors will really allow both
SHSTK and IBT under the hood if either is made visible in the
guest's CPUID. The alternative would be to intercept
XSAVES/XRSTORS, which is not feasible for performance reasons
- Fix a variety of fuzzing WARNs all caused by checking L1 intercepts
when completing userspace I/O. KVM has already committed to
allowing L2 to to perform I/O at that point
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the
MSR is supposed to exist for v2 PMUs
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside
of forced emulation and other contrived testing scenarios)
- Clean up the MSR APIs in preparation for CET and FRED
virtualization, as well as mediated vPMU support
- Clean up a pile of PMU code in anticipation of adding support for
mediated vPMUs
- Reject in-kernel IOAPIC/PIT for TDX VMs, as KVM can't obtain EOI
vmexits needed to faithfully emulate an I/O APIC for such guests
- Many cleanups and minor fixes
- Recover possible NX huge pages within the TDP MMU under read lock
to reduce guest jitter when restoring NX huge pages
- Return -EAGAIN during prefault if userspace concurrently
deletes/moves the relevant memslot, to fix an issue where
prefaulting could deadlock with the memslot update
x86 (AMD):
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is
supported
- Require a minimum GHCB version of 2 when starting SEV-SNP guests
via KVM_SEV_INIT2 so that invalid GHCB versions result in immediate
errors instead of latent guest failures
- Add support for SEV-SNP's CipherText Hiding, an opt-in feature that
prevents unauthorized CPU accesses from reading the ciphertext of
SNP guest private memory, e.g. to attempt an offline attack. This
feature splits the shared SEV-ES/SEV-SNP ASID space into separate
ranges for SEV-ES and SEV-SNP guests, therefore a new module
parameter is needed to control the number of ASIDs that can be used
for VMs with CipherText Hiding vs. how many can be used to run
SEV-ES guests
- Add support for Secure TSC for SEV-SNP guests, which prevents the
untrusted host from tampering with the guest's TSC frequency, while
still allowing the the VMM to configure the guest's TSC frequency
prior to launch
- Validate the XCR0 provided by the guest (via the GHCB) to avoid
bugs resulting from bogus XCR0 values
- Save an SEV guest's policy if and only if LAUNCH_START fully
succeeds to avoid leaving behind stale state (thankfully not
consumed in KVM)
- Explicitly reject non-positive effective lengths during SNP's
LAUNCH_UPDATE instead of subtly relying on guest_memfd to deal with
them
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the
host's desired TSC_AUX, to fix a bug where KVM was keeping a
different vCPU's TSC_AUX in the host MSR until return to userspace
KVM (Intel):
- Preparation for FRED support
- Don't retry in TDX's anti-zero-step mitigation if the target
memslot is invalid, i.e. is being deleted or moved, to fix a
deadlock scenario similar to the aforementioned prefaulting case
- Misc bugfixes and minor cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (142 commits)
KVM: x86: Export KVM-internal symbols for sub-modules only
KVM: x86: Drop pointless exports of kvm_arch_xxx() hooks
KVM: x86: Move kvm_intr_is_single_vcpu() to lapic.c
KVM: Export KVM-internal symbols for sub-modules only
KVM: s390/vfio-ap: Use kvm_is_gpa_in_memslot() instead of open coded equivalent
KVM: VMX: Make CR4.CET a guest owned bit
KVM: selftests: Verify MSRs are (not) in save/restore list when (un)supported
KVM: selftests: Add coverage for KVM-defined registers in MSRs test
KVM: selftests: Add KVM_{G,S}ET_ONE_REG coverage to MSRs test
KVM: selftests: Extend MSRs test to validate vCPUs without supported features
KVM: selftests: Add support for MSR_IA32_{S,U}_CET to MSRs test
KVM: selftests: Add an MSR test to exercise guest/host and read/write
KVM: x86: Define AMD's #HV, #VC, and #SX exception vectors
KVM: x86: Define Control Protection Exception (#CP) vector
KVM: x86: Add human friendly formatting for #XM, and #VE
KVM: SVM: Enable shadow stack virtualization for SVM
KVM: SEV: Synchronize MSR_IA32_XSS from the GHCB when it's valid
KVM: SVM: Pass through shadow stack MSRs as appropriate
KVM: SVM: Update dump_vmcb with shadow stack save area additions
KVM: nSVM: Save/load CET Shadow Stack state to/from vmcb12/vmcb02
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 TDX updates from Dave Hansen:
"The biggest change here is making TDX and kexec play nicely together.
Before this, the memory encryption hardware (which doesn't respect
cache coherency) could write back old cachelines on top of data in the
new kernel, so kexec and TDX were made mutually exclusive. This
removes the limitation.
There is also some work to tighten up a hardware bug workaround and
some MAINTAINERS updates.
- Make TDX and kexec work together
- Skip TDX bug workaround when the bug is not present
- Update maintainers entries"
* tag 'x86_tdx_for_6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/virt/tdx: Use precalculated TDVPR page physical address
KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs
x86/virt/tdx: Update the kexec section in the TDX documentation
x86/virt/tdx: Remove the !KEXEC_CORE dependency
x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum
x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL
x86/sme: Use percpu boolean to control WBINVD during kexec
x86/kexec: Consolidate relocate_kernel() function parameters
x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present
x86/tdx: Tidy reset_pamt functions
x86/tdx: Eliminate duplicate code in tdx_clear_page()
MAINTAINERS: Add KVM mail list to the TDX entry
MAINTAINERS: Add Rick Edgecombe as a TDX reviewer
MAINTAINERS: Update the file list in the TDX entry.
|
|
Pull kvm updates from Paolo Bonzini:
"This excludes the bulk of the x86 changes, which I will send
separately. They have two not complex but relatively unusual conflicts
so I will wait for other dust to settle.
guest_memfd:
- Add support for host userspace mapping of guest_memfd-backed memory
for VM types that do NOT use support KVM_MEMORY_ATTRIBUTE_PRIVATE
(which isn't precisely the same thing as CoCo VMs, since x86's
SEV-MEM and SEV-ES have no way to detect private vs. shared).
This lays the groundwork for removal of guest memory from the
kernel direct map, as well as for limited mmap() for
guest_memfd-backed memory.
For more information see:
- commit a6ad54137af9 ("Merge branch 'guest-memfd-mmap' into HEAD")
- guest_memfd in Firecracker:
https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
- direct map removal:
https://lore.kernel.org/all/20250221160728.1584559-1-roypat@amazon.co.uk/
- mmap support:
https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/
ARM:
- Add support for FF-A 1.2 as the secure memory conduit for pKVM,
allowing more registers to be used as part of the message payload.
- Change the way pKVM allocates its VM handles, making sure that the
privileged hypervisor is never tricked into using uninitialised
data.
- Speed up MMIO range registration by avoiding unnecessary RCU
synchronisation, which results in VMs starting much quicker.
- Add the dump of the instruction stream when panic-ing in the EL2
payload, just like the rest of the kernel has always done. This
will hopefully help debugging non-VHE setups.
- Add 52bit PA support to the stage-1 page-table walker, and make use
of it to populate the fault level reported to the guest on failing
to translate a stage-1 walk.
- Add NV support to the GICv3-on-GICv5 emulation code, ensuring
feature parity for guests, irrespective of the host platform.
- Fix some really ugly architecture problems when dealing with debug
in a nested VM. This has some bad performance impacts, but is at
least correct.
- Add enough infrastructure to be able to disable EL2 features and
give effective values to the EL2 control registers. This then
allows a bunch of features to be turned off, which helps cross-host
migration.
- Large rework of the selftest infrastructure to allow most tests to
transparently run at EL2. This is the first step towards enabling
NV testing.
- Various fixes and improvements all over the map, including one BE
fix, just in time for the removal of the feature.
LoongArch:
- Detect page table walk feature on new hardware
- Add sign extension with kernel MMIO/IOCSR emulation
- Improve in-kernel IPI emulation
- Improve in-kernel PCH-PIC emulation
- Move kvm_iocsr tracepoint out of generic code
RISC-V:
- Added SBI FWFT extension for Guest/VM with misaligned delegation
and pointer masking PMLEN features
- Added ONE_REG interface for SBI FWFT extension
- Added Zicbop and bfloat16 extensions for Guest/VM
- Enabled more common KVM selftests for RISC-V
- Added SBI v3.0 PMU enhancements in KVM and perf driver
s390:
- Improve interrupt cpu for wakeup, in particular the heuristic to
decide which vCPU to deliver a floating interrupt to.
- Clear the PTE when discarding a swapped page because of CMMA; this
bug was introduced in 6.16 when refactoring gmap code.
x86 selftests:
- Add #DE coverage in the fastops test (the only exception that's
guest- triggerable in fastop-emulated instructions).
- Fix PMU selftests errors encountered on Granite Rapids (GNR),
Sierra Forest (SRF) and Clearwater Forest (CWF).
- Minor cleanups and improvements
x86 (guest side):
- For the legacy PCI hole (memory between TOLUD and 4GiB) to UC when
overriding guest MTRR for TDX/SNP to fix an issue where ACPI
auto-mapping could map devices as WB and prevent the device drivers
from mapping their devices with UC/UC-.
- Make kvm_async_pf_task_wake() a local static helper and remove its
export.
- Use native qspinlocks when running in a VM with dedicated
vCPU=>pCPU bindings even when PV_UNHALT is unsupported.
Generic:
- Remove a redundant __GFP_NOWARN from kvm_setup_async_pf() as
__GFP_NOWARN is now included in GFP_NOWAIT.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (178 commits)
KVM: s390: Fix to clear PTE when discarding a swapped page
KVM: arm64: selftests: Cover ID_AA64ISAR3_EL1 in set_id_regs
KVM: arm64: selftests: Remove a duplicate register listing in set_id_regs
KVM: arm64: selftests: Cope with arch silliness in EL2 selftest
KVM: arm64: selftests: Add basic test for running in VHE EL2
KVM: arm64: selftests: Enable EL2 by default
KVM: arm64: selftests: Initialize HCR_EL2
KVM: arm64: selftests: Use the vCPU attr for setting nr of PMU counters
KVM: arm64: selftests: Use hyp timer IRQs when test runs at EL2
KVM: arm64: selftests: Select SMCCC conduit based on current EL
KVM: arm64: selftests: Provide helper for getting default vCPU target
KVM: arm64: selftests: Alias EL1 registers to EL2 counterparts
KVM: arm64: selftests: Create a VGICv3 for 'default' VMs
KVM: arm64: selftests: Add unsanitised helpers for VGICv3 creation
KVM: arm64: selftests: Add helper to check for VGICv3 support
KVM: arm64: selftests: Initialize VGICv3 only once
KVM: arm64: selftests: Provide kvm_arch_vm_post_create() in library code
KVM: selftests: Add ex_str() to print human friendly name of exception vectors
selftests/kvm: remove stale TODO in xapic_state_test
KVM: selftests: Handle Intel Atom errata that leads to PMU event overcount
...
|
|
Rename the "kvm" entry code files and Kconfigs to use generic "virt"
nomenclature so that the code can be reused by other hypervisors (or
rather, their root/dom0 partition drivers), without incorrectly suggesting
the code somehow relies on and/or involves KVM.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Move KVM's morphing of pending signals into userspace exits into KVM
proper, and drop the @vcpu param from xfer_to_guest_mode_handle_work().
How KVM responds to -EINTR is a detail that really belongs in KVM itself,
and invoking kvm_handle_signal_exit() from kernel code creates an inverted
module dependency. E.g. attempting to move kvm_handle_signal_exit() into
kvm_main.c would generate an linker error when building kvm.ko as a module.
Dropping KVM details will also converting the KVM "entry" code into a more
generic virtualization framework so that it can be used when running as a
Hyper-V root partition.
Lastly, eliminating usage of "struct kvm_vcpu" outside of KVM is also nice
to have for KVM x86 developers, as keeping the details of kvm_vcpu purely
within KVM allows changing the layout of the structure without having to
boot into a new kernel, e.g. allows rebuilding and reloading kvm.ko with a
modified kvm_vcpu structure as part of debug/development.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Core perf code updates:
- Convert mmap() related reference counts to refcount_t. This is in
reaction to the recently fixed refcount bugs, which could have been
detected earlier and could have mitigated the bug somewhat (Thomas
Gleixner, Peter Zijlstra)
- Clean up and simplify the callchain code, in preparation for
sframes (Steven Rostedt, Josh Poimboeuf)
Uprobes updates:
- Add support to optimize usdt probes on x86-64, which gives a
substantial speedup (Jiri Olsa)
- Cleanups and fixes on x86 (Peter Zijlstra)
PMU driver updates:
- Various optimizations and fixes to the Intel PMU driver (Dapeng Mi)
Misc cleanups and fixes:
- Remove redundant __GFP_NOWARN (Qianfeng Rong)"
* tag 'perf-core-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
selftests/bpf: Fix uprobe_sigill test for uprobe syscall error value
uprobes/x86: Return error from uprobe syscall when not called from trampoline
perf: Skip user unwind if the task is a kernel thread
perf: Simplify get_perf_callchain() user logic
perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL
perf: Have get_perf_callchain() return NULL if crosstask and user are set
perf: Remove get_perf_callchain() init_nr argument
perf/x86: Print PMU counters bitmap in x86_pmu_show_pmu_cap()
perf/x86/intel: Add ICL_FIXED_0_ADAPTIVE bit into INTEL_FIXED_BITS_MASK
perf/x86/intel: Change macro GLOBAL_CTRL_EN_PERF_METRICS to BIT_ULL(48)
perf/x86: Add PERF_CAP_PEBS_TIMING_INFO flag
perf/x86/intel: Fix IA32_PMC_x_CFG_B MSRs access error
perf/x86/intel: Use early_initcall() to hook bts_init()
uprobes: Remove redundant __GFP_NOWARN
selftests/seccomp: validate uprobe syscall passes through seccomp
seccomp: passthrough uprobe systemcall without filtering
selftests/bpf: Fix uprobe syscall shadow stack test
selftests/bpf: Change test_uretprobe_regs_change for uprobe and uretprobe
selftests/bpf: Add uprobe_regs_equal test
selftests/bpf: Add optimized usdt variant for basic usdt test
...
|
|
Rework almost all of KVM x86's exports to expose symbols only to KVM's
vendor modules, i.e. to kvm-{amd,intel}.ko. Keep the generic exports that
are guarded by CONFIG_KVM_EXTERNAL_WRITE_TRACKING=y, as they're explicitly
designed/intended for external usage.
Link: https://lore.kernel.org/r/20250919003303.1355064-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Drop the exporting of several kvm_arch_xxx() hooks that are only called
from arch-neutral code, i.e. that are only called from kvm.ko.
Link: https://lore.kernel.org/r/20250919003303.1355064-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move kvm_intr_is_single_vcpu() to lapic.c, drop its export, and make its
"fast" helper local to lapic.c. kvm_intr_is_single_vcpu() is only usable
if the local APIC is in-kernel, i.e. it most definitely belongs in the
local APIC code.
No functional change intended.
Fixes: cf04ec393ed0 ("KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU")
Link: https://lore.kernel.org/r/20250919003303.1355064-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM x86 CET virtualization support for 6.18
Add support for virtualizing Control-flow Enforcement Technology (CET) on
Intel (Shadow Stacks and Indirect Branch Tracking) and AMD (Shadow Stacks).
CET is comprised of two distinct features, Shadow Stacks (SHSTK) and Indirect
Branch Tracking (IBT), that can be utilized by software to help provide
Control-flow integrity (CFI). SHSTK defends against backward-edge attacks
(a.k.a. Return-oriented programming (ROP)), while IBT defends against
forward-edge attacks (a.k.a. similarly CALL/JMP-oriented programming (COP/JOP)).
Attackers commonly use ROP and COP/JOP methodologies to redirect the control-
flow to unauthorized targets in order to execute small snippets of code,
a.k.a. gadgets, of the attackers choice. By chaining together several gadgets,
an attacker can perform arbitrary operations and circumvent the system's
defenses.
SHSTK defends against backward-edge attacks, which execute gadgets by modifying
the stack to branch to the attacker's target via RET, by providing a second
stack that is used exclusively to track control transfer operations. The
shadow stack is separate from the data/normal stack, and can be enabled
independently in user and kernel mode.
When SHSTK is is enabled, CALL instructions push the return address on both the
data and shadow stack. RET then pops the return address from both stacks and
compares the addresses. If the return addresses from the two stacks do not
match, the CPU generates a Control Protection (#CP) exception.
IBT defends against backward-edge attacks, which branch to gadgets by executing
indirect CALL and JMP instructions with attacker controlled register or memory
state, by requiring the target of indirect branches to start with a special
marker instruction, ENDBRANCH. If an indirect branch is executed and the next
instruction is not an ENDBRANCH, the CPU generates a #CP. Note, ENDBRANCH
behaves as a NOP if IBT is disabled or unsupported.
From a virtualization perspective, CET presents several problems. While SHSTK
and IBT have two layers of enabling, a global control in the form of a CR4 bit,
and a per-feature control in user and kernel (supervisor) MSRs (U_CET and S_CET
respectively), the {S,U}_CET MSRs can be context switched via XSAVES/XRSTORS.
Practically speaking, intercepting and emulating XSAVES/XRSTORS is not a viable
option due to complexity, and outright disallowing use of XSTATE to context
switch SHSTK/IBT state would render the features unusable to most guests.
To limit the overall complexity without sacrificing performance or usability,
simply ignore the potential virtualization hole, but ensure that all paths in
KVM treat SHSTK/IBT as usable by the guest if the feature is supported in
hardware, and the guest has access to at least one of SHSTK or IBT. I.e. allow
userspace to advertise one of SHSTK or IBT if both are supported in hardware,
even though doing so would allow a misbehaving guest to use the unadvertised
feature.
Fully emulating SHSTK and IBT would also require significant complexity, e.g.
to track and update branch state for IBT, and shadow stack state for SHSTK.
Given that emulating large swaths of the guest code stream isn't necessary on
modern CPUs, punt on emulating instructions that meaningful impact or consume
SHSTK or IBT. However, instead of doing nothing, explicitly reject emulation
of such instructions so that KVM's emulator can't be abused to circumvent CET.
Disable support for SHSTK and IBT if KVM is configured such that emulation of
arbitrary guest instructions may be required, specifically if Unrestricted
Guest (Intel only) is disabled, or if KVM will emulate a guest.MAXPHYADDR that
is smaller than host.MAXPHYADDR.
Lastly disable SHSTK support if shadow paging is enabled, as the protections
for the shadow stack are novel (shadow stacks require Writable=0,Dirty=1, so
that they can't be directly modified by software), i.e. would require
non-trivial support in the Shadow MMU.
Note, AMD CPUs currently only support SHSTK. Explicitly disable IBT support
so that KVM doesn't over-advertise if AMD CPUs add IBT, and virtualizing IBT
in SVM requires KVM modifications.
|
|
KVM x86 changes for 6.18
- Don't (re)check L1 intercepts when completing userspace I/O to fix a flaw
where a misbehaving usersepace (a.k.a. syzkaller) could swizzle L1's
intercepts and trigger a variety of WARNs in KVM.
- Emulate PERF_CNTR_GLOBAL_STATUS_SET for PerfMonV2 guests, as the MSR is
supposed to exist for v2 PMUs.
- Allow Centaur CPU leaves (base 0xC000_0000) for Zhaoxin CPUs.
- Clean up KVM's vector hashing code for delivering lowest priority IRQs.
- Clean up the fastpath handler code to only handle IPIs and WRMSRs that are
actually "fast", as opposed to handling those that KVM _hopes_ are fast, and
in the process of doing so add fastpath support for TSC_DEADLINE writes on
AMD CPUs.
- Clean up a pile of PMU code in anticipation of adding support for mediated
vPMUs.
- Add support for the immediate forms of RDMSR and WRMSRNS, sans full
emulator support (KVM should never need to emulate the MSRs outside of
forced emulation and other contrived testing scenarios).
- Clean up the MSR APIs in preparation for CET and FRED virtualization, as
well as mediated vPMU support.
- Rejecting a fully in-kernel IRQCHIP if EOIs are protected, i.e. for TDX VMs,
as KVM can't faithfully emulate an I/O APIC for such guests.
- KVM_REQ_MSR_FILTER_CHANGED into a generic RECALC_INTERCEPTS in preparation
for mediated vPMU support, as KVM will need to recalculate MSR intercepts in
response to PMU refreshes for guests with mediated vPMUs.
- Misc cleanups and minor fixes.
|
|
HEAD
KVM SEV-SNP CipherText Hiding support for 6.18
Add support for SEV-SNP's CipherText Hiding, an opt-in feature that prevents
unauthorized CPU accesses from reading the ciphertext of SNP guest private
memory, e.g. to attempt an offline attack. Instead of ciphertext, the CPU
will always read back all FFs when CipherText Hiding is enabled.
Add new module parameter to the KVM module to enable CipherText Hiding and
control the number of ASIDs that can be used for VMs with CipherText Hiding,
which is in effect the number of SNP VMs. When CipherText Hiding is enabled,
the shared SEV-ES/SEV-SNP ASID space is split into separate ranges for SEV-ES
and SEV-SNP guests, i.e. ASIDs that can be used for CipherText Hiding cannot
be used to run SEV-ES guests.
|
|
KVM SVM changes for 6.18
- Require a minimum GHCB version of 2 when starting SEV-SNP guests via
KVM_SEV_INIT2 so that invalid GHCB versions result in immediate errors
instead of latent guest failures.
- Add support for Secure TSC for SEV-SNP guests, which prevents the untrusted
host from tampering with the guest's TSC frequency, while still allowing the
the VMM to configure the guest's TSC frequency prior to launch.
- Mitigate the potential for TOCTOU bugs when accessing GHCB fields by
wrapping all accesses via READ_ONCE().
- Validate the XCR0 provided by the guest (via the GHCB) to avoid tracking a
bogous XCR0 value in KVM's software model.
- Save an SEV guest's policy if and only if LAUNCH_START fully succeeds to
avoid leaving behind stale state (thankfully not consumed in KVM).
- Explicitly reject non-positive effective lengths during SNP's LAUNCH_UPDATE
instead of subtly relying on guest_memfd to do the "heavy" lifting.
- Reload the pre-VMRUN TSC_AUX on #VMEXIT for SEV-ES guests, not the host's
desired TSC_AUX, to fix a bug where KVM could clobber a different vCPU's
TSC_AUX due to hardware not matching the value cached in the user-return MSR
infrastructure.
- Enable AVIC by default for Zen4+ if x2AVIC (and other prereqs) is supported,
and clean up the AVIC initialization code along the way.
|
|
KVM VMX changes for 6.18
- Add read/write helpers for MSRs that need to be accessed with preemption
disable to prepare for virtualizing FRED RSP0.
- Fix a bug where KVM would return 0/success from __tdx_bringup() on error,
i.e. where KVM would load with enable_tdx=true despite TDX not being usable.
- Minor cleanups.
|
|
KVM x86 MMU changes for 6.18
- Recover possible NX huge pages within the TDP MMU under read lock to
reduce guest jitter when restoring NX huge pages.
- Return -EAGAIN during prefault if userspace concurrently deletes/moves the
relevant memslot to fix an issue where prefaulting could deadlock with the
memslot update.
- Don't retry in TDX's anti-zero-step mitigation if the target memslot is
invalid, i.e. is being deleted or moved, to fix a deadlock scenario similar
to the aforementioned prefaulting case.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.18
1. Add PTW feature detection on new hardware.
2. Add sign extension with kernel MMIO/IOCSR emulation.
3. Improve in-kernel IPI emulation.
4. Improve in-kernel PCH-PIC emulation.
5. Move kvm_iocsr tracepoint out of generic code.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.18
- Add support for FF-A 1.2 as the secure memory conduit for pKVM,
allowing more registers to be used as part of the message payload.
- Change the way pKVM allocates its VM handles, making sure that the
privileged hypervisor is never tricked into using uninitialised
data.
- Speed up MMIO range registration by avoiding unnecessary RCU
synchronisation, which results in VMs starting much quicker.
- Add the dump of the instruction stream when panic-ing in the EL2
payload, just like the rest of the kernel has always done. This will
hopefully help debugging non-VHE setups.
- Add 52bit PA support to the stage-1 page-table walker, and make use
of it to populate the fault level reported to the guest on failing
to translate a stage-1 walk.
- Add NV support to the GICv3-on-GICv5 emulation code, ensuring
feature parity for guests, irrespective of the host platform.
- Fix some really ugly architecture problems when dealing with debug
in a nested VM. This has some bad performance impacts, but is at
least correct.
- Add enough infrastructure to be able to disable EL2 features and
give effective values to the EL2 control registers. This then allows
a bunch of features to be turned off, which helps cross-host
migration.
- Large rework of the selftest infrastructure to allow most tests to
transparently run at EL2. This is the first step towards enabling
NV testing.
- Various fixes and improvements all over the map, including one BE
fix, just in time for the removal of the feature.
|
|
Make CR4.CET a guest-owned bit under VMX by extending
KVM_POSSIBLE_CR4_GUEST_BITS accordingly.
There's no need to intercept changes to CR4.CET, as it's neither
included in KVM's MMU role bits, nor does KVM specifically care about
the actual value of a (nested) guest's CR4.CET value, beside for
enforcing architectural constraints, i.e. make sure that CR0.WP=1 if
CR4.CET=1.
Intercepting writes to CR4.CET is particularly bad for grsecurity
kernels with KERNEXEC or, even worse, KERNSEAL enabled. These features
heavily make use of read-only kernel objects and use a cpu-local CR0.WP
toggle to override it, when needed. Under a CET-enabled kernel, this
also requires toggling CR4.CET, hence the motivation to make it
guest-owned.
Using the old test from [1] gives the following runtime numbers (perf
stat -r 5 ssdd 10 50000):
* grsec guest on linux-6.16-rc5 + cet patches:
2.4647 +- 0.0706 seconds time elapsed ( +- 2.86% )
* grsec guest on linux-6.16-rc5 + cet patches + CR4.CET guest-owned:
1.5648 +- 0.0240 seconds time elapsed ( +- 1.53% )
Not only does not intercepting CR4.CET make the test run ~35% faster,
it's also more stable with less fluctuation due to fewer VMEXITs.
Therefore, make CR4.CET a guest-owned bit where possible.
This change is VMX-specific, as SVM has no such fine-grained control
register intercept control.
If KVM's assumptions regarding MMU role handling wrt. a guest's CR4.CET
value ever change, the BUILD_BUG_ON()s related to KVM_MMU_CR4_ROLE_BITS
and KVM_POSSIBLE_CR4_GUEST_BITS will catch that early.
Link: https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/ [1]
Reviewed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-52-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add {HV,CP,SX}_VECTOR definitions for AMD's Hypervisor Injection Exception,
VMM Communication Exception, and SVM Security Exception vectors, along with
human friendly formatting for trace_kvm_inj_exception().
Note, KVM is all but guaranteed to never observe or inject #SX, and #HV is
also unlikely to go unused. Add the architectural collateral mostly for
completeness, and on the off chance that hardware goes off the rails.
Link: https://lore.kernel.org/r/20250919223258.1604852-44-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a CP_VECTOR definition for CET's Control Protection Exception (#CP),
along with human friendly formatting for trace_kvm_inj_exception().
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-43-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add XM_VECTOR and VE_VECTOR pretty-printing for
trace_kvm_inj_exception().
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-42-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove the explicit clearing of shadow stack CPU capabilities.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-41-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Synchronize XSS from the GHCB to KVM's internal tracking if the guest
marks XSS as valid on a #VMGEXIT. Like XCR0, KVM needs an up-to-date copy
of XSS in order to compute the required XSTATE size when emulating
CPUID.0xD.0x1 for the guest.
Treat the incoming XSS change as an emulated write, i.e. validatate the
guest-provided value, to avoid letting the guest load garbage into KVM's
tracking. Simply ignore bad values, as either the guest managed to get an
unsupported value into hardware, or the guest is misbehaving and providing
pure garbage. In either case, KVM can't fix the broken guest.
Explicitly allow access to XSS at all times, as KVM needs to ensure its
copy of XSS stays up-to-date. E.g. KVM supports migration of SEV-ES guests
and so needs to allow the host to save/restore XSS, otherwise a guest
that *knows* its XSS hasn't change could get stale/bad CPUID emulation if
the guest doesn't provide XSS in the GHCB on every exit. This creates a
hypothetical problem where a guest could request emulation of RDMSR or
WRMSR on XSS, but arguably that's not even a problem, e.g. it would be
entirely reasonable for a guest to request "emulation" as a way to inform
the hypervisor that its XSS value has been modified.
Note, emulating the change as an MSR write also takes care of side effects,
e.g. marking dynamic CPUID bits as dirty.
Suggested-by: John Allen <john.allen@amd.com>
base-commit: 14298d819d5a6b7180a4089e7d2121ca3551dc6c
Link: https://lore.kernel.org/r/20250919223258.1604852-40-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Pass through XSAVE managed CET MSRs on SVM when KVM supports shadow
stack. These cannot be intercepted without also intercepting XSAVE which
would likely cause unacceptable performance overhead.
MSR_IA32_INT_SSP_TAB is not managed by XSAVE, so it is intercepted.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-39-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add shadow stack VMCB fields to dump_vmcb. PL0_SSP, PL1_SSP, PL2_SSP,
PL3_SSP, and U_CET are part of the SEV-ES save area and are encrypted,
but can be decrypted and dumped if the guest policy allows debugging.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: John Allen <john.allen@amd.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-38-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Transfer the three CET Shadow Stack VMCB fields (S_CET, ISST_ADDR, and
SSP) on VMRUN, #VMEXIT, and loading nested state (saving nested state
simply copies the entire save area). SVM doesn't provide a way to
disallow L1 from enabling Shadow Stacks for L2, i.e. KVM *must* provide
nested support before advertising SHSTK to userspace.
Link: https://lore.kernel.org/r/20250919223258.1604852-37-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Emulate shadow stack MSR access by reading and writing to the
corresponding fields in the VMCB.
Signed-off-by: John Allen <john.allen@amd.com>
[sean: mark VMCB_CET dirty/clean as appropriate]
Link: https://lore.kernel.org/r/20250919223258.1604852-36-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Advertise the LOAD_CET_STATE VM-Entry/Exit control bits in the nested VMX
MSRS, as all nested support for CET virtualization, including consistency
checks, is in place.
Advertise support if and only if KVM supports at least one of IBT or SHSTK.
While it's userspace's responsibility to provide a consistent CPU model to
the guest, that doesn't mean KVM should set userspace up to fail.
Note, the existing {CLEAR,LOAD}_BNDCFGS behavior predates
KVM_X86_QUIRK_STUFF_FEATURE_MSRS, i.e. KVM "solved" the inconsistent CPU
model problem by overwriting the VMX MSRs provided by userspace.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-35-seanjc@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Introduce consistency checks for CET states during nested VM-entry.
A VMCS contains both guest and host CET states, each comprising the
IA32_S_CET MSR, SSP, and IA32_INTERRUPT_SSP_TABLE_ADDR MSR. Various
checks are applied to CET states during VM-entry as documented in SDM
Vol3 Chapter "VM ENTRIES". Implement all these checks during nested
VM-entry to emulate the architectural behavior.
In summary, there are three kinds of checks on guest/host CET states
during VM-entry:
A. Checks applied to both guest states and host states:
* The IA32_S_CET field must not set any reserved bits; bits 10 (SUPPRESS)
and 11 (TRACKER) cannot both be set.
* SSP should not have bits 1:0 set.
* The IA32_INTERRUPT_SSP_TABLE_ADDR field must be canonical.
B. Checks applied to host states only
* IA32_S_CET MSR and SSP must be canonical if the CPU enters 64-bit mode
after VM-exit. Otherwise, IA32_S_CET and SSP must have their higher 32
bits cleared.
C. Checks applied to guest states only:
* IA32_S_CET MSR and SSP are not required to be canonical (i.e., 63:N-1
are identical, where N is the CPU's maximum linear-address width). But,
bits 63:N of SSP must be identical.
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-34-seanjc@google.com
[sean: have common helper return 0/-EINVAL, not true/false]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add consistency checks for CR4.CET and CR0.WP in guest-state or host-state
area in the VMCS12. This ensures that configurations with CR4.CET set and
CR0.WP not set result in VM-entry failure, aligning with architectural
behavior.
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
to enable CET for nested VM.
vmcs12 and vmcs02 needs to be synced when L2 exits to L1 or when L1 wants
to resume L2, that way correct CET states can be observed by one another.
Please note that consistency checks regarding CET state during VM-Entry
will be added later to prevent this patch from becoming too large.
Advertising the new CET VM_ENTRY/EXIT control bits are also be deferred
until after the consistency checks are added.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Tested-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Per SDM description(Vol.3D, Appendix A.1):
"If bit 56 is read as 1, software can use VM entry to deliver a hardware
exception with or without an error code, regardless of vector"
Modify has_error_code check before inject events to nested guest. Only
enforce the check when guest is in real mode, the exception is not hard
exception and the platform doesn't enumerate bit56 in VMX_BASIC, in all
other case ignore the check to make the logic consistent with SDM.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Swap the order between configuring nested VMX capabilities and base CPU
capabilities, so that nested VMX support can be conditioned on core KVM
support, e.g. to allow conditioning support for LOAD_CET_STATE on the
presence of IBT or SHSTK. Because the sanity checks on nested VMX config
performed by vmx_check_processor_compat() run _after_ vmx_hardware_setup(),
any use of kvm_cpu_cap_has() when configuring nested VMX support will lead
to failures in vmx_check_processor_compat().
While swapping the order of two (or more) configuration flows can lead to
a game of whack-a-mole, in this case nested support inarguably should be
done after base support. KVM should never condition base support on nested
support, because nested support is fully optional, while obviously it's
desirable to condition nested support on base support. And there's zero
evidence the current ordering was intentional, e.g. commit 66a6950f9995
("KVM: x86: Introduce kvm_cpu_caps to replace runtime CPUID masking")
likely placed the call to kvm_set_cpu_caps() after nested setup because it
looked pretty.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add support for the LOAD_CET_STATE VM-Enter and VM-Exit controls, the
CET XFEATURE bits in XSS, and advertise support for IBT and SHSTK to
userspace. Explicitly clear IBT and SHSTK onn SVM, as additional work is
needed to enable CET on SVM, e.g. to context switch S_CET and other state.
Disable KVM CET feature if unrestricted_guest is unsupported/disabled as
KVM does not support emulating CET, as running without Unrestricted Guest
can result in KVM emulating large swaths of guest code. While it's highly
unlikely any guest will trigger emulation while also utilizing IBT or
SHSTK, there's zero reason to allow CET without Unrestricted Guest as that
combination should only be possible when explicitly disabling
unrestricted_guest for testing purposes.
Disable CET if VMX_BASIC[bit56] == 0, i.e. if hardware strictly enforces
the presence of an Error Code based on exception vector, as attempting to
inject a #CP with an Error Code (#CP architecturally has an Error Code)
will fail due to the #CP vector historically not having an Error Code.
Clear S_CET and SSP-related VMCS on "reset" to emulate the architectural
of CET MSRs and SSP being reset to 0 after RESET, power-up and INIT. Note,
KVM already clears guest CET state that is managed via XSTATE in
kvm_xstate_reset().
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: move some bits to separate patches, massage changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Make IBT and SHSTK virtualization mutually exclusive with "officially"
supporting setups with guest.MAXPHYADDR < host.MAXPHYADDR, i.e. if the
allow_smaller_maxphyaddr module param is set. Running a guest with a
smaller MAXPHYADDR requires intercepting #PF, and can also trigger
emulation of arbitrary instructions. Intercepting and reacting to #PFs
doesn't play nice with SHSTK, as KVM's MMU hasn't been taught to handle
Shadow Stack accesses, and emulating arbitrary instructions doesn't play
nice with IBT or SHSTK, as KVM's emulator doesn't handle the various side
effects, e.g. doesn't enforce end-branch markers or model Shadow Stack
updates.
Note, hiding IBT and SHSTK based solely on allow_smaller_maxphyaddr is
overkill, as allow_smaller_maxphyaddr is only problematic if the guest is
actually configured to have a smaller MAXPHYADDR. However, KVM's ABI
doesn't provide a way to express that IBT and SHSTK may break if enabled
in conjunction with guest.MAXPHYADDR < host.MAXPHYADDR. I.e. the
alternative is to do nothing in KVM and instead update documentation and
hope KVM users are thorough readers. Go with the conservative-but-correct
approach; worst case scenario, this restriction can be dropped if there's
a strong use case for enabling CET on hosts with allow_smaller_maxphyaddr.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Initialize allow_smaller_maxphyaddr during hardware setup as soon as KVM
knows whether or not TDP will be utilized. To avoid having to teach KVM's
emulator all about CET, KVM's upcoming CET virtualization support will be
mutually exclusive with allow_smaller_maxphyaddr, i.e. will disable SHSTK
and IBT if allow_smaller_maxphyaddr is enabled.
In general, allow_smaller_maxphyaddr should be initialized as soon as
possible since it's globally visible while its only input is whether or
not EPT/NPT is enabled. I.e. there's effectively zero risk of setting
allow_smaller_maxphyaddr too early, and substantial risk of setting it
too late.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250922184743.1745778-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Make TDP a hard requirement for Shadow Stacks, as there are no plans to
add Shadow Stack support to the Shadow MMU. E.g. KVM hasn't been taught
to understand the magic Writable=0,Dirty=1 combination that is required
for Shadow Stack accesses, and so enabling Shadow Stacks when using
shadow paging will put the guest into an infinite #PF loop (KVM thinks the
shadow page tables have a valid mapping, hardware says otherwise).
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add CET_KERNEL and CET_USER to KVM's set of supported XSS bits when IBT
*or* SHSTK is supported. Like CR4.CET, XFEATURE support for IBT and SHSTK
are bundle together under the CET umbrella, and thus prone to
virtualization holes if KVM or the guest supports only one of IBT or SHSTK,
but hardware supports both. However, again like CR4.CET, such
virtualization holes are benign from the host's perspective so long as KVM
takes care to always honor the "or" logic.
Require CET_KERNEL and CET_USER to come as a pair, and refuse to support
IBT or SHSTK if one (or both) features is missing, as the (host) kernel
expects them to come as a pair, i.e. may get confused and corrupt state if
only one of CET_KERNEL or CET_USER is supported.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog, add XFEATURE_MASK_CET_ALL]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Unconditionally forward XSAVES/XRSTORS VM-Exits from L2 to L1, as KVM
doesn't utilize the XSS-bitmap (KVM relies on controlling the XSS value
in hardware to prevent unauthorized access to XSAVES state). KVM always
loads vmcs02 with vmcs12's bitmap, and so any exit _must_ be due to
vmcs12's XSS-bitmap.
Drop the comment about XSS never being non-zero in anticipation of
enabling CET_KERNEL and CET_USER support.
Opportunistically WARN if XSAVES is not enabled for L2, as the CPU is
supposed to generate #UD before checking the XSS-bitmap.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop X86_CR4_CET from CR4_RESERVED_BITS and instead mark CET as reserved
if and only if IBT *and* SHSTK are unsupported, i.e. allow CR4.CET to be
set if IBT or SHSTK is supported. This creates a virtualization hole if
the CPU supports both IBT and SHSTK, but the kernel or vCPU model only
supports one of the features. However, it's entirely legal for a CPU to
have only one of IBT or SHSTK, i.e. the hole is a flaw in the architecture,
not in KVM.
More importantly, so long as KVM is careful to initialize and context
switch both IBT and SHSTK state (when supported in hardware) if either
feature is exposed to the guest, a misbehaving guest can only harm itself.
E.g. VMX initializes host CET VMCS fields based solely on hardware
capabilities.
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: Mathias Krause <minipli@grsecurity.net>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
[sean: split to separate patch, write changelog]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add PK (Protection Keys), SS (Shadow Stacks), and SGX (Software Guard
Extensions) to the set of #PF error flags handled via
kvm_mmu_trace_pferr_flags. While KVM doesn't expect PK or SS #PFs in
particular, pretty print their names instead of the raw hex value saves
the user from having to go spelunking in the SDM to figure out what's
going on.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add PFERR_SS_MASK, a.k.a. Shadow Stack access, and WARN if KVM attempts to
check permissions for a Shadow Stack access as KVM hasn't been taught to
understand the magic Writable=0,Dirty=1 combination that is required for
Shadow Stack accesses, and likely will never learn. There are no plans to
support Shadow Stacks with the Shadow MMU, and the emulator rejects all
instructions that affect Shadow Stacks, i.e. it should be impossible for
KVM to observe a #PF due to a shadow stack access.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Emulate the Shadow Stack restriction that the current SSP must be a 32-bit
value on a FAR JMP from 64-bit mode to compatibility mode. From the SDM's
pseudocode for FAR JMP:
IF ShadowStackEnabled(CPL)
IF (IA32_EFER.LMA and DEST(segment selector).L) = 0
(* If target is legacy or compatibility mode then the SSP must be in low 4GB *)
IF (SSP & 0xFFFFFFFF00000000 != 0); THEN
#GP(0);
FI;
FI;
FI;
Note, only the current CPL needs to be considered, as FAR JMP can't be
used for inter-privilege level transfers, and KVM rejects emulation of all
other far branch instructions when Shadow Stacks are enabled.
To give the emulator access to GUEST_SSP, special case handling
MSR_KVM_INTERNAL_GUEST_SSP in emulator_get_msr() to treat the access as a
host access (KVM doesn't allow guest accesses to internal "MSRs"). The
->get_msr() API is only used for implicit accesses from the emulator, i.e.
is only used with hardcoded MSR indices, and so any access to
MSR_KVM_INTERNAL_GUEST_SSP is guaranteed to be from KVM, i.e. not from the
guest via RDMSR.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if the guest triggers
task switch emulation with Indirect Branch Tracking or Shadow Stacks
enabled, as attempting to do the right thing would require non-trivial
effort and complexity, KVM doesn't support emulating CET generally, and
it's extremely unlikely that any guest will do task switches while also
utilizing CET. Defer taking on the complexity until someone cares enough
to put in the time and effort to add support.
Per the SDM:
If shadow stack is enabled, then the SSP of the task is located at the
4 bytes at offset 104 in the 32-bit TSS and is used by the processor to
establish the SSP when a task switch occurs from a task associated with
this TSS. Note that the processor does not write the SSP of the task
initiating the task switch to the TSS of that task, and instead the SSP
of the previous task is pushed onto the shadow stack of the new task.
Note, per the SDM's pseudocode on TASK SWITCHING, IBT state for the new
privilege level is updated. To keep things simple, check both S_CET and
U_CET (again, anyone that wants more precise checking can have the honor
of implementing support).
Reported-by: Binbin Wu <binbin.wu@linux.intel.com>
Closes: https://lore.kernel.org/all/819bd98b-2a60-4107-8e13-41f1e4c706b1@linux.intel.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't emulate branch instructions, e.g. CALL/RET/JMP etc., that are
affected by Shadow Stacks and/or Indirect Branch Tracking when said
features are enabled in the guest, as fully emulating CET would require
significant complexity for no practical benefit (KVM shouldn't need to
emulate branch instructions on modern hosts). Simply doing nothing isn't
an option as that would allow a malicious entity to subvert CET
protections via the emulator.
To detect instructions that are subject to IBT or affect IBT state, use
the existing IsBranch flag along with the source operand type to detect
indirect branches, and the existing NearBranch flag to detect far JMPs
and CALLs, all of which are effectively indirect. Explicitly check for
emulation of IRET, FAR RET (IMM), and SYSEXIT (the ret-like far branches)
instead of adding another flag, e.g. IsRet, as it's unlikely the emulator
will ever need to check for return-like instructions outside of this one
specific flow. Use an allow-list instead of a deny-list because (a) it's
a shorter list and (b) so that a missed entry gets a false positive, not a
false negative (i.e. reject emulation instead of clobbering CET state).
For Shadow Stacks, explicitly track instructions that directly affect the
current SSP, as KVM's emulator doesn't have existing flags that can be
used to precisely detect such instructions. Alternatively, the em_xxx()
helpers could directly check for ShadowStack interactions, but using a
dedicated flag is arguably easier to audit, and allows for handling both
IBT and SHSTK in one fell swoop.
Note! On far transfers, do NOT consult the current privilege level and
instead treat SHSTK/IBT as being enabled if they're enabled for User *or*
Supervisor mode. On inter-privilege level far transfers, SHSTK and IBT
can be in play for the target privilege level, i.e. checking the current
privilege could get a false negative, and KVM doesn't know the target
privilege level until emulation gets under way.
Note #2, FAR JMP from 64-bit mode to compatibility mode interacts with
the current SSP, but only to ensure SSP[63:32] == 0. Don't tag FAR JMP
as SHSTK, which would be rather confusing and would result in FAR JMP
being rejected unnecessarily the vast majority of the time (ignoring that
it's unlikely to ever be emulated). A future commit will add the #GP(0)
check for the specific FAR JMP scenario.
Note #3, task switches also modify SSP and so need to be rejected. That
too will be addressed in a future commit.
Suggested-by: Chao Gao <chao.gao@intel.com>
Originally-by: Yang Weijiang <weijiang.yang@intel.com>
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: John Allen <john.allen@amd.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|