14 months agoKVM: pfncache: stop open-coding offset_in_page()
Paul Durrant [Fri, 10 Nov 2023 09:29:01 +0000 (09:29 +0000)]
KVM: pfncache: stop open-coding offset_in_page()

Some code in pfncache uses offset_in_page() but in other places it is open-
coded. Use offset_in_page() consistently everywhere.
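
For reference, the two forms in question look roughly like this (a minimal
sketch, not the actual patch hunks; 'gpa' is an illustrative variable):

  /* Open-coded: mask off the page-aligned bits to get the in-page offset. */
  offset = gpa & ~PAGE_MASK;

  /* Equivalent, using the helper from <linux/mm.h>. */
  offset = offset_in_page(gpa);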

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
v8:
 - New in this version.

14 months agoKVM: pfncache: remove KVM_GUEST_USES_PFN usage
Paul Durrant [Thu, 9 Nov 2023 15:43:41 +0000 (15:43 +0000)]
KVM: pfncache: remove KVM_GUEST_USES_PFN usage

As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any
callers of kvm_gpc_init(), which also makes the 'vcpu' argument redundant.
Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage
check in hva_to_pfn_retry() and hence the 'usage' argument to
kvm_gpc_init() are also redundant.
Remove the pfn_cache_usage enumeration and remove the redundant arguments,
fields of struct gfn_to_pfn_cache, and all the related code.

[1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com/

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
v8:
 - New in this version.

14 months agoKVM: pfncache: add a mark-dirty helper
Paul Durrant [Thu, 7 Sep 2023 16:03:40 +0000 (16:03 +0000)]
KVM: pfncache: add a mark-dirty helper

At the moment pages are marked dirty by open-coded calls to
mark_page_dirty_in_slot(), directly dereferencing the gpa and memslot
from the cache. After a subsequent patch these may not always be set,
so add a helper now so that callers are protected from the need to know
about this detail.
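
A minimal sketch of such a helper, assuming the field names of struct
gfn_to_pfn_cache (illustrative, not the patch itself):

  /* Sketch: exact body is assumed from the description above. */
  static inline void kvm_gpc_mark_dirty_in_slot(struct gfn_to_pfn_cache *gpc)
  {
          lockdep_assert_held(&gpc->lock);
          if (gpc->memslot)       /* may be NULL after a subsequent patch */
                  mark_page_dirty_in_slot(gpc->kvm, gpc->memslot,
                                          gpc->gpa >> PAGE_SHIFT);
  }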

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
v13:
 - s/kvm_gpc_mark_dirty/kvm_gpc_mark_dirty_in_slot
 - Add a check for a NULL memslot pointer

v8:
 - Make the helper a static inline.

14 months agoKVM: x86/xen: mark guest pages dirty with the pfncache lock held
Paul Durrant [Thu, 9 Nov 2023 14:17:02 +0000 (14:17 +0000)]
KVM: x86/xen: mark guest pages dirty with the pfncache lock held

Sampling gpa and memslot from an unlocked pfncache may yield inconsistent
values so, since there is no problem with calling mark_page_dirty_in_slot()
with the pfncache lock held, relocate the calls in
kvm_xen_update_runstate_guest() and kvm_xen_inject_pending_events()
accordingly.
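
Sketched, the resulting shape is (field names per struct gfn_to_pfn_cache;
not the literal diff):

  read_lock_irqsave(&gpc->lock, flags);
  ...
  /* Sample gpa/memslot while the lock guarantees they are consistent. */
  mark_page_dirty_in_slot(vcpu->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT);
  read_unlock_irqrestore(&gpc->lock, flags);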

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
v13:
 - Patch title change.

v8:
 - New in this version.

14 months agoKVM: pfncache: remove unnecessary exports
Paul Durrant [Thu, 9 Nov 2023 13:09:22 +0000 (13:09 +0000)]
KVM: pfncache: remove unnecessary exports

There is no need for the existing kvm_gpc_XXX() functions to be exported.
Clean up now before additional functions are added in subsequent patches.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
v8:
 - New in this version.

14 months agoKVM: pfncache: Add a map helper function
Paul Durrant [Thu, 7 Sep 2023 12:53:13 +0000 (12:53 +0000)]
KVM: pfncache: Add a map helper function

There is a pfncache unmap helper but mapping is open-coded. Arguably this
is fine because mapping is done in only one place, hva_to_pfn_retry(), but
adding the helper does make that function more readable.

No functional change intended.
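
A sketch of what such a map helper can look like, mirroring the existing
unmap path (the shape, including the CONFIG_HAS_IOMEM guard noted in v8
below, is assumed):

  /* Sketch of a map helper; mirrors the existing unmap path. */
  static void *gpc_map(kvm_pfn_t pfn)
  {
          if (pfn_valid(pfn))
                  return kmap(pfn_to_page(pfn));
  #ifdef CONFIG_HAS_IOMEM
          return memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
  #else
          return NULL;
  #endif
  }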

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
v8:
 - Re-work commit comment.
 - Fix CONFIG_HAS_IOMEM=n build.

14 months agoMerge branches 'asyncpf', 'asyncpf_abi', 'fixes', 'generic', 'misc', 'mmu', 'pmu...
Sean Christopherson [Fri, 9 Feb 2024 00:18:44 +0000 (00:18 +0000)]
Merge branches 'asyncpf', 'asyncpf_abi', 'fixes', 'generic', 'misc', 'mmu', 'pmu', 'selftests', 'svm' and 'vmx'

* asyncpf:
  KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed
  KVM: Get reference to VM's address space in the async #PF worker
  KVM: Put mm immediately after async #PF worker completes remote gup()
  KVM: Always flush async #PF workqueue when vCPU is being destroyed

* asyncpf_abi:
  KVM: x86: Improve documentation of MSR_KVM_ASYNC_PF_EN
  x86/kvm: Use separate percpu variable to track the enabling of asyncpf

* fixes:
  KVM: x86: Fix KVM_GET_MSRS stack info leak
  KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
  KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu

* generic:
  KVM: Harden against unpaired kvm_mmu_notifier_invalidate_range_end() calls

* misc:
  KVM: x86: rename push to emulate_push for consistency
  KVM: x86: Clean up partially uninitialized integer in emulate_pop()
  KVM: x86/emulator: emulate movbe with operand-size prefix
  KVM: x86: Fix broken debugregs ABI for 32 bit kernels
  KVM: x86: Use mutex guards to eliminate __kvm_x86_vendor_init()

* mmu:
  KVM: x86/mmu: Use KMEM_CACHE instead of kmem_cache_create()

* pmu: (39 commits)
  KVM: x86/pmu: Avoid CPL lookup if PMC enabling for USER and KERNEL is the same
  KVM: x86/pmu: Check eventsel first when emulating (branch) insns retired
  KVM: x86/pmu: Expand the comment about what bits are checked when emulating events
  KVM: x86/pmu: Snapshot event selectors that KVM emulates in software
  KVM: x86/pmu: Process only enabled PMCs when emulating events in software
  KVM: x86/pmu: Add macros to iterate over all PMCs given a bitmap
  KVM: x86/pmu: Snapshot and clear reprogramming bitmap before reprogramming
  KVM: x86/pmu: Move pmc_idx => pmc translation helper to common code
  KVM: x86/pmu: Add common define to capture fixed counters offset
  KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled
  KVM: selftests: Extend PMU counters test to validate RDPMC after WRMSR
  KVM: selftests: Add helpers for safe and safe+forced RDMSR, RDPMC, and XGETBV
  KVM: selftests: Add a forced emulation variation of KVM_ASM_SAFE()
  KVM: selftests: Test PMC virtualization with forced emulation
  KVM: selftests: Move KVM_FEP macro into common library header
  KVM: selftests: Query module param to detect FEP in MSR filtering test
  KVM: selftests: Add helpers to read integer module params
  KVM: selftests: Add a helper to query if the PMU module param is enabled
  KVM: selftests: Expand PMU counters test to verify LLC events
  KVM: selftests: Add functional test for Intel's fixed PMU counters
  ...

* selftests:
  KVM: selftests: Don't assert on exact number of 4KiB in dirty log split test
  KVM: selftests: Fix a semaphore imbalance in the dirty ring logging test
  KVM: x86: Make gtod_is_based_on_tsc() return 'bool'
  KVM: selftests: Make hyperv_clock require TSC based system clocksource
  KVM: selftests: Run clocksource dependent tests with hyperv_clocksource_tsc_page too
  KVM: selftests: Use generic sys_clocksource_is_tsc() in vmx_nested_tsc_scaling_test
  KVM: selftests: Generalize check_clocksource() from kvm_clock_test
  KVM: selftests: Fail tests when open() fails with !ENOENT
  KVM: selftests: Avoid infinite loop in hyperv_features when invtsc is missing
  KVM: selftests: Delete superfluous, unused "stage" variable in AMX test
  KVM: selftests: x86_64: Remove redundant newlines
  KVM: selftests: s390x: Remove redundant newlines
  KVM: selftests: riscv: Remove redundant newlines
  KVM: selftests: aarch64: Remove redundant newlines
  KVM: selftests: Remove redundant newlines
  KVM: selftests: Reword the NX hugepage test's skip message to be more helpful

* svm:
  KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
  KVM: SVM: Add support for allowing zero SEV ASIDs
  KVM: SVM: Use unsigned integers when dealing with ASIDs
  KVM: SVM: Set sev->asid in sev_asid_new() instead of overloading the return

* vmx:
  KVM: VMX: Report up-to-date exit qualification to userspace

14 months agoKVM: x86: rename push to emulate_push for consistency
Julian Stecklina [Mon, 9 Oct 2023 09:20:54 +0000 (11:20 +0200)]
KVM: x86: rename push to emulate_push for consistency

push and emulate_pop are counterparts. Rename push to emulate_push and
harmonize its function signature with emulate_pop. This should remove
a bit of cognitive load when reading this code.

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20231009092054.556935-2-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: x86: Clean up partially uninitialized integer in emulate_pop()
Julian Stecklina [Mon, 9 Oct 2023 09:20:53 +0000 (11:20 +0200)]
KVM: x86: Clean up partially uninitialized integer in emulate_pop()

Explicitly zero out variables passed to emulate_pop() as output params
to harden against consuming uninitialized data, and to make sanitizers
happy.  Many flows that use emulate_pop() pass an "unsigned long" so as
to be able to hold the largest possible operand, but the actual number
of bytes written is usually the word width of the vCPU.  E.g. if the vCPU
is in 16-bit or 32-bit mode (on a 64-bit host), the upper portion of the
output param will be uninitialized.

Passing around the uninitialized data is benign, as actual KVM usage of
the output is also tied to the word width, but passing around
uninitialized data makes some sanitizers rightly complain.

Note, initializing the data in emulate_pop() is not a safe alternative,
e.g. it would result in em_leave() clobbering RBP[31:16] if LEAVE were
emulated with a 16-bit stack.
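
As an illustration of the caller-side fix (a sketch, not a specific hunk
from the patch):

  unsigned long val = 0; /* zero-init so bytes beyond the word width are defined */

  rc = emulate_pop(ctxt, &val, ctxt->op_bytes);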

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20231009092054.556935-1-julian.stecklina@cyberus-technology.de
[sean: massage changelog, drop em_popa() variable size change]
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: x86/emulator: emulate movbe with operand-size prefix
Thomas Prescher [Tue, 12 Dec 2023 09:59:37 +0000 (10:59 +0100)]
KVM: x86/emulator: emulate movbe with operand-size prefix

The MOVBE instruction can come with an operand-size prefix (66h). In
this case, the x86 emulation code returns EMULATION_FAILED.

It turns out that em_movbe can already handle this case and all that
is missing is an entry in the respective opcode tables to populate
gprefix->pfx_66.

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231212095938.26731-1-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: VMX: Report up-to-date exit qualification to userspace
Chao Gao [Fri, 29 Dec 2023 02:26:52 +0000 (10:26 +0800)]
KVM: VMX: Report up-to-date exit qualification to userspace

Use vmx_get_exit_qual() to read the exit qualification.

vcpu->arch.exit_qualification is cached for EPT violation only and even
for EPT violation, it is stale at this point because the up-to-date
value is cached later in handle_ept_violation().

Fixes: 70bcd708dfd1 ("KVM: vmx: expose more information for KVM_INTERNAL_ERROR_DELIVERY_EV exits")
Signed-off-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20231229022652.300095-1-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES
Sean Christopherson [Wed, 31 Jan 2024 23:56:09 +0000 (15:56 -0800)]
KVM: SVM: Return -EINVAL instead of -EBUSY on attempt to re-init SEV/SEV-ES

Return -EINVAL instead of -EBUSY if userspace attempts KVM_SEV{,ES}_INIT
on a VM that already has SEV active.  Returning -EBUSY is nonsensical as
it's impossible to deactivate SEV without destroying the VM, i.e. the VM
isn't "busy" in any sane sense of the word, and the odds of any userspace
wanting exactly -EBUSY on a userspace bug are minuscule.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240131235609.4161407-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: SVM: Add support for allowing zero SEV ASIDs
Ashish Kalra [Wed, 31 Jan 2024 23:56:08 +0000 (15:56 -0800)]
KVM: SVM: Add support for allowing zero SEV ASIDs

Some BIOSes allow the end user to set the minimum SEV ASID value
(CPUID 0x8000001F_EDX) to be greater than the maximum number of
encrypted guests, or maximum SEV ASID value (CPUID 0x8000001F_ECX)
in order to dedicate all the SEV ASIDs to SEV-ES or SEV-SNP.

The SEV support, as coded, does not handle the case where the minimum
SEV ASID value can be greater than the maximum SEV ASID value.
As a result, the following confusing message is issued:

[   30.715724] kvm_amd: SEV enabled (ASIDs 1007 - 1006)

Fix the support to properly handle this case.

Fixes: 916391a2d1dc ("KVM: SVM: Add support for SEV-ES capability in KVM")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Cc: stable@vger.kernel.org
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240104190520.62510-1-Ashish.Kalra@amd.com
Link: https://lore.kernel.org/r/20240131235609.4161407-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: SVM: Use unsigned integers when dealing with ASIDs
Sean Christopherson [Wed, 31 Jan 2024 23:56:07 +0000 (15:56 -0800)]
KVM: SVM: Use unsigned integers when dealing with ASIDs

Convert all local ASID variables and parameters throughout the SEV code
from signed integers to unsigned integers, as ASIDs are fundamentally
unsigned values, and the global min/max variables are already unsigned
integers.

Functionally, this is a glorified nop as KVM guarantees min_sev_asid is
non-zero, and no CPU supports -1u as the _only_ asid, i.e. the signed vs.
unsigned goof won't cause problems in practice.

Opportunistically use sev_get_asid() in sev_flush_encrypted_page() instead
of open coding an equivalent.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240131235609.4161407-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: SVM: Set sev->asid in sev_asid_new() instead of overloading the return
Sean Christopherson [Wed, 31 Jan 2024 23:56:06 +0000 (15:56 -0800)]
KVM: SVM: Set sev->asid in sev_asid_new() instead of overloading the return

Explicitly set sev->asid in sev_asid_new() when a new ASID is successfully
allocated, and return '0' to indicate success instead of overloading the
return value to multiplex the ASID with error codes.  There is exactly one
caller of sev_asid_new(), and sev_asid_free() already consumes sev->asid,
i.e. returning the ASID isn't necessary for flexibility, nor does it
provide symmetry between related APIs.

Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20240131235609.4161407-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed
Sean Christopherson [Wed, 10 Jan 2024 01:15:33 +0000 (17:15 -0800)]
KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freed

Nullify the async #PF worker's local "apf" pointer immediately after the
point where the structure can be freed by the vCPU.  The existing comment
is helpful, but easy to overlook as there is no associated code.

Update the comment to clarify that it can be freed as soon as the lock
is dropped, as "after this point" isn't strictly accurate, nor does it
help understand what prevents the structure from being freed earlier.
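
Sketched, the idea is (names approximate; "apf" per the changelog):

  spin_lock(&vcpu->async_pf.lock);
  list_add_tail(&apf->link, &vcpu->async_pf.done);
  spin_unlock(&vcpu->async_pf.lock);

  /*
   * The vCPU can free "apf" as soon as the lock is dropped; nullify the
   * local pointer so any later use fails loudly instead of subtly.
   */
  apf = NULL;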

Reviewed-by: Xu Yilun <yilun.xu@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240110011533.503302-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: Get reference to VM's address space in the async #PF worker
Sean Christopherson [Wed, 10 Jan 2024 01:15:32 +0000 (17:15 -0800)]
KVM: Get reference to VM's address space in the async #PF worker

Get a reference to the target VM's address space in async_pf_execute()
instead of gifting a reference from kvm_setup_async_pf().  Keeping the
address space alive just to service an async #PF is counter-productive,
i.e. if the process is exiting and all vCPUs are dead, then NOT doing
get_user_pages_remote() and freeing the address space asap is desirable.

Handling the mm reference entirely within async_pf_execute() also
simplifies the async #PF flows as a whole, e.g. it's not immediately
obvious when the worker task vs. the vCPU task is responsible for putting
the gifted mm reference.

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@intel.com>
Link: https://lore.kernel.org/r/20240110011533.503302-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: Put mm immediately after async #PF worker completes remote gup()
Sean Christopherson [Wed, 10 Jan 2024 01:15:31 +0000 (17:15 -0800)]
KVM: Put mm immediately after async #PF worker completes remote gup()

Put the async #PF worker's reference to the VM's address space as soon as
the worker is done with the mm.  This will allow deferring getting a
reference to the worker itself without having to track whether or not
getting a reference succeeded.

Note, if the vCPU is still alive, there is no danger of the worker getting
stuck with tearing down the host page tables, as userspace also holds a
reference (obviously), i.e. there is no risk of delaying the page-present
notification due to triggering the slow path in mmput().

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Xu Yilun <yilun.xu@intel.com>
Link: https://lore.kernel.org/r/20240110011533.503302-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: Always flush async #PF workqueue when vCPU is being destroyed
Sean Christopherson [Wed, 10 Jan 2024 01:15:30 +0000 (17:15 -0800)]
KVM: Always flush async #PF workqueue when vCPU is being destroyed

Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
completion queue, e.g. when a VM and all its vCPUs are being destroyed.
KVM must ensure that none of its workqueue callbacks is running when the
last reference to the KVM _module_ is put.  Gifting a reference to the
associated VM prevents the workqueue callback from dereferencing freed
vCPU/VM memory, but does not prevent the KVM module from being unloaded
before the callback completes.

Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:

 WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
 Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
 CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 Workqueue: events async_pf_execute [kvm]
 RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
 Call Trace:
  <TASK>
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>
 ---[ end trace 0000000000000000 ]---
 INFO: task kworker/8:1:251 blocked for more than 120 seconds.
       Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
 Workqueue: events async_pf_execute [kvm]
 Call Trace:
  <TASK>
  __schedule+0x33f/0xa40
  schedule+0x53/0xc0
  schedule_timeout+0x12a/0x140
  __wait_for_common+0x8d/0x1d0
  __flush_work.isra.0+0x19f/0x2c0
  kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
  kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
  kvm_put_kvm+0x1c1/0x320 [kvm]
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>

If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
then there's no need to gift async_pf_execute() a reference because all
invocations of async_pf_execute() will be forced to complete before the
vCPU and its VM are destroyed/freed.  And that in turn fixes the module
unloading bug as __fput() won't do module_put() on the last vCPU reference
until the vCPU has been freed, e.g. if closing the vCPU file also puts the
last reference to the KVM module.

Note that kvm_check_async_pf_completion() may also take the work item off
the completion queue and so also needs to flush the work queue, as the
work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
on the workqueue could theoretically delay a vCPU due to waiting for the
work to complete, but that's a very, very small chance, and likely a very
small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
new request, i.e. will effectively delay entering the guest, so the
remaining work is really just:

        trace_kvm_async_pf_completed(addr, cr2_or_gpa);

        __kvm_vcpu_wake_up(vcpu);

        mmput(mm);

and mmput() can't drop the last reference to the page tables if the vCPU is
still alive, i.e. the vCPU won't get stuck tearing down page tables.

Add a helper to do the flushing, specifically to deal with "wakeup all"
work items, as they aren't actually work items, i.e. are never placed in a
workqueue.  Trying to flush a bogus workqueue entry rightly makes
__flush_work() complain (kudos to whoever added that sanity check).
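
A sketch of such a helper (the exact name and body are assumptions based
on the description above):

  static void kvm_flush_and_free_async_pf_work(struct kvm_async_pf *work)
  {
          /*
           * "Wakeup all" items are never placed in the workqueue, so only
           * flush entries that were actually queued.
           */
          if (!work->wakeup_all)
                  flush_work(&work->work);
          kmem_cache_free(async_pf_cache, work);
  }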

Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
freed") *tried* to fix the module refcounting issue by having VMs grab a
reference to the module, but that only made the bug slightly harder to hit
as it gave async_pf_execute() a bit more time to complete before the KVM
module could be unloaded.

Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Reviewed-by: Xu Yilun <yilun.xu@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240110011533.503302-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: x86: Improve documentation of MSR_KVM_ASYNC_PF_EN
Xiaoyao Li [Wed, 25 Oct 2023 05:59:14 +0000 (01:59 -0400)]
KVM: x86: Improve documentation of MSR_KVM_ASYNC_PF_EN

Fix some incorrect statements in the MSR_KVM_ASYNC_PF_EN documentation and
state clearly that the token in 'struct kvm_vcpu_pv_apf_data' of the
'page ready' event is matched with the token in CR2 of the 'page not
present' event.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20231025055914.1201792-3-xiaoyao.li@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agox86/kvm: Use separate percpu variable to track the enabling of asyncpf
Xiaoyao Li [Wed, 25 Oct 2023 05:59:13 +0000 (01:59 -0400)]
x86/kvm: Use separate percpu variable to track the enabling of asyncpf

Refer to commit fd10cde9294f ("KVM paravirt: Add async PF initialization
to PV guest") and commit 344d9588a9df ("KVM: Add PV MSR to enable
asynchronous page faults delivery"). It turns out that when asyncpf was
introduced, the intent was to define the shared PV data 'struct
kvm_vcpu_pv_apf_data' with a size of 64 bytes. However, it was mistakenly
defined with a size of 68 bytes, which fails to fit in a cache line and
made the code inconsistent with the documentation.

Below justification quoted from Sean[*]

  KVM (the host side) has *never* read kvm_vcpu_pv_apf_data.enabled, and
  the documentation clearly states that enabling is based solely on the
  bit in the synthetic MSR.

  So rather than update the documentation, fix the goof by removing the
  enabled field and using a separate percpu variable instead.
  KVM-as-a-host obviously doesn't enforce anything or consume the size,
  and changing the header will only affect guests that are rebuilt against
  the new header, so there's no chance of ABI breakage between KVM and its
  guests. The only possible breakage is if some other hypervisor is
  emulating KVM's async #PF (LOL) and relies on the guest to set
  kvm_vcpu_pv_apf_data.enabled. But (a) I highly doubt such a hypervisor
  exists, (b) that would arguably be a violation of KVM's "spec", and
  (c) the worst case scenario is that the guest would simply lose async
  #PF functionality.

[*] https://lore.kernel.org/all/ZS7ERnnRqs8Fl0ZF@google.com/T/#u
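
The guest-side fix, sketched (the percpu variable's name is an assumption):

  /*
   * Track asyncpf enablement in a separate percpu variable instead of the
   * "enabled" field of the 68-byte shared structure.
   */
  static DEFINE_PER_CPU(bool, async_pf_enabled);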

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20231025055914.1201792-2-xiaoyao.li@intel.com
[sean: use true/false instead of 1/0 for booleans]
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: selftests: Don't assert on exact number of 4KiB in dirty log split test
Sean Christopherson [Wed, 31 Jan 2024 22:27:28 +0000 (14:27 -0800)]
KVM: selftests: Don't assert on exact number of 4KiB in dirty log split test

Drop dirty_log_page_splitting_test's assertion that the number of 4KiB
pages remains the same across dirty logging being enabled and disabled, as
the test doesn't guarantee that mappings outside of the memslots being
dirty logged are stable, e.g. KVM's mappings for code and pages in
memslot0 can be zapped by things like NUMA balancing.

To preserve the spirit of the check, assert that (a) the number of 4KiB
pages after splitting is _at least_ the number of 4KiB pages across all
memslots under test, and (b) the number of hugepages before splitting adds
up to the number of pages across all memslots under test.  (b) is a little
tenuous as it relies on memslot0 being incompatible with transparent
hugepages, but that holds true for now as selftests explicitly madvise()
MADV_NOHUGEPAGE for memslot0 (__vm_create() unconditionally specifies the
backing type as VM_MEM_SRC_ANONYMOUS).

Reported-by: Yi Lai <yi1.lai@intel.com>
Reported-by: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Tao Su <tao1.su@linux.intel.com>
Link: https://lore.kernel.org/r/20240131222728.4100079-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: selftests: Fix a semaphore imbalance in the dirty ring logging test
Sean Christopherson [Fri, 2 Feb 2024 23:18:31 +0000 (15:18 -0800)]
KVM: selftests: Fix a semaphore imbalance in the dirty ring logging test

When finishing the final iteration of dirty_log_test testcase, set
host_quit _before_ the final "continue" so that the vCPU worker doesn't
run an extra iteration, and delete the hack-a-fix of an extra "continue"
from the dirty ring testcase.  This fixes a bug where the extra post to
sem_vcpu_cont may not be consumed, which results in failures in subsequent
runs of the testcases.  The bug likely was missed during development as
x86 supports only a single "guest mode", i.e. there aren't any subsequent
testcases after the dirty ring test, because for_each_guest_mode() only
runs a single iteration.

For the regular dirty log testcases, letting the vCPU run one extra
iteration is a non-issue as the vCPU worker waits on sem_vcpu_cont if and
only if the worker is explicitly told to stop (vcpu_sync_stop_requested).
But for the dirty ring test, which needs to periodically stop the vCPU to
reap the dirty ring, letting the vCPU resume the guest _after_ the last
iteration means the vCPU will get stuck without an extra "continue".

However, blindly firing off a post to sem_vcpu_cont isn't guaranteed to
be consumed, e.g. if the vCPU worker sees host_quit==true before resuming
the guest.  This results in a dangling sem_vcpu_cont, which leads to
subsequent iterations getting out of sync, as the vCPU worker will
continue on before the main task is ready for it to resume the guest,
leading to a variety of asserts, e.g.

  ==== Test Assertion Failure ====
  dirty_log_test.c:384: dirty_ring_vcpu_ring_full
  pid=14854 tid=14854 errno=22 - Invalid argument
     1  0x00000000004033eb: dirty_ring_collect_dirty_pages at dirty_log_test.c:384
     2  0x0000000000402d27: log_mode_collect_dirty_pages at dirty_log_test.c:505
     3   (inlined by) run_test at dirty_log_test.c:802
     4  0x0000000000403dc7: for_each_guest_mode at guest_modes.c:100
     5  0x0000000000401dff: main at dirty_log_test.c:941 (discriminator 3)
     6  0x0000ffff9be173c7: ?? ??:0
     7  0x0000ffff9be1749f: ?? ??:0
     8  0x000000000040206f: _start at ??:?
  Didn't continue vcpu even without ring full

Alternatively, the test could simply reset the semaphores before each
testcase, but papering over hacks with more hacks usually ends in tears.

Reported-by: Shaoqin Huang <shahuang@redhat.com>
Fixes: 84292e565951 ("KVM: selftests: Add dirty ring buffer test")
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20240202231831.354848-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: x86: Fix broken debugregs ABI for 32 bit kernels
Mathias Krause [Sat, 3 Feb 2024 12:45:22 +0000 (13:45 +0100)]
KVM: x86: Fix broken debugregs ABI for 32 bit kernels

The ioctl()s to get and set KVM's debug registers are broken for 32 bit
kernels as they'd only copy half of the user register state because of a
UAPI and in-kernel type mismatch (__u64 vs. unsigned long; 8 vs. 4
bytes).

This makes it impossible for userland to set anything but DR0 without
resorting to bit folding tricks.

Switch to a loop for copying debug registers that'll implicitly do the
type conversion for us, if needed.
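
For illustration, the get-side copy becomes a loop of this shape (a sketch
of the described approach, not the verbatim patch):

  unsigned int i;

  for (i = 0; i < KVM_NR_DB_REGS; i++)
          dbgregs->db[i] = vcpu->arch.db[i]; /* unsigned long -> __u64 */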

There are likely no users (left) of 32-bit KVM, but fix the bug nonetheless.

Fixes: a1efbe77c1fd ("KVM: x86: Add support for saving&restoring debug registers")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240203124522.592778-4-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
14 months agoKVM: x86: Fix KVM_GET_MSRS stack info leak
Mathias Krause [Sat, 3 Feb 2024 12:45:20 +0000 (13:45 +0100)]
KVM: x86: Fix KVM_GET_MSRS stack info leak

Commit 6abe9c1386e5 ("KVM: X86: Move ignore_msrs handling upper the
stack") changed the 'ignore_msrs' handling, including sanitizing return
values to the caller. This was fine until commit 12bc2132b15e ("KVM:
X86: Do the same ignore_msrs check for feature msrs") which allowed
non-existing feature MSRs to be ignored, i.e. to not generate an error
on the ioctl() level. It even tried to preserve the sanitization of the
return value. However, the logic is flawed, as '*data' will be
overwritten again with the uninitialized stack value of msr.data.

Fix this by simplifying the logic and always initializing msr.data,
eliminating the need for an additional error exit path.
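
Sketch of the always-initialize approach (the struct and variable names
here are assumed):

  struct kvm_msr_entry msr = {
          .index = index,
          .data  = 0,     /* defined even when the MSR read is ignored */
  };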

Fixes: 12bc2132b15e ("KVM: X86: Do the same ignore_msrs check for feature msrs")
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20240203124522.592778-2-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
Mingwei Zhang [Tue, 23 Jan 2024 22:12:20 +0000 (22:12 +0000)]
KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl

Use a u64 instead of a u8 when taking a snapshot of pmu->fixed_ctr_ctrl
when reprogramming fixed counters, as truncating the value results in KVM
thinking fixed counter 2 is already disabled (the bug also affects fixed
counters 3+, but KVM doesn't yet support those).  As a result, if the
guest disables fixed counter 2, KVM will get a false negative and fail to
reprogram/disable emulation of the counter, which can lead to incorrect
counts and spurious PMIs in the guest.
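
The fix amounts to widening the snapshot's type, sketched as a diff
(variable name assumed):

  -       u8 old_fixed_ctr_ctrl = pmu->fixed_ctr_ctrl;
  +       u64 old_fixed_ctr_ctrl = pmu->fixed_ctr_ctrl;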

Fixes: 76d287b2342e ("KVM: x86/pmu: Drop "u8 ctrl, int idx" for reprogram_fixed_counter()")
Cc: stable@vger.kernel.org
Signed-off-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20240123221220.3911317-1-mizhang@google.com
[sean: rewrite changelog to call out the effects of the bug]
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Avoid CPL lookup if PMC enabling for USER and KERNEL is the same
Sean Christopherson [Fri, 10 Nov 2023 02:28:57 +0000 (18:28 -0800)]
KVM: x86/pmu: Avoid CPL lookup if PMC enabling for USER and KERNEL is the same

Don't bother querying the CPL if a PMC is (not) counting for both USER and
KERNEL, i.e. if the end result is guaranteed to be the same regardless of
the CPL.  Querying the CPL on Intel requires a VMREAD, i.e. isn't free,
and a single CMP+Jcc is cheap.
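
A sketch of the short-circuit (surrounding function context assumed):

  if (select_os == select_user)
          return select_os;       /* CPL can't change the answer */

  return (static_call(kvm_x86_get_cpl)(pmc->vcpu) == 0) ? select_os
                                                        : select_user;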

Link: https://lore.kernel.org/r/20231110022857.1273836-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Check eventsel first when emulating (branch) insns retired
Sean Christopherson [Fri, 10 Nov 2023 02:28:56 +0000 (18:28 -0800)]
KVM: x86/pmu: Check eventsel first when emulating (branch) insns retired

When triggering events, i.e. emulating PMC events in software, check for
a matching event selector before checking the event is allowed.  The "is
allowed" check *might* be cheap, but it could also be very costly, e.g. if
userspace has defined a large PMU event filter.  The event selector check
on the other hand is all but guaranteed to be <10 uops, e.g. looks
something like:

   0xffffffff8105e615 <+5>: movabs $0xf0000ffff,%rax
   0xffffffff8105e61f <+15>: xor    %rdi,%rsi
   0xffffffff8105e622 <+18>: test   %rax,%rsi
   0xffffffff8105e625 <+21>: sete   %al

Link: https://lore.kernel.org/r/20231110022857.1273836-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Expand the comment about what bits are checked when emulating events
Sean Christopherson [Fri, 10 Nov 2023 02:28:55 +0000 (18:28 -0800)]
KVM: x86/pmu: Expand the comment about what bits are checked when emulating events

Expand the comment about what bits are and aren't checked when emulating
PMC events in software.  As pointed out by Jim, AMD's mask includes bits
35:32, which on Intel overlap with the IN_TX and IN_TXCP bits (32 and 33)
as well as reserved bits (34 and 35).

Checking the IN_TX* bits is actually correct, as it's safe to assert that
the vCPU can't be in an HLE/RTM transaction if KVM is emulating an
instruction, i.e. KVM *shouldn't* count if either of those bits is set.

For the reserved bits, KVM has equal odds of being right if Intel adds
new behavior, i.e. ignoring them is just as likely to be correct as
checking them.

Opportunistically explain *why* the other flags aren't checked.

Suggested-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20231110022857.1273836-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Snapshot event selectors that KVM emulates in software
Sean Christopherson [Fri, 10 Nov 2023 02:28:54 +0000 (18:28 -0800)]
KVM: x86/pmu: Snapshot event selectors that KVM emulates in software

Snapshot the event selectors for the events that KVM emulates in software,
which is currently instructions retired and branch instructions retired.
The event selectors are tied to the underlying CPU, i.e. are constant for a
given platform even though perf doesn't manage the mappings as such.

Getting the event selectors from perf isn't exactly cheap, especially if
mitigations are enabled, as at least one indirect call is involved.

Snapshot the values in KVM instead of optimizing perf as working with the
raw event selectors will be required if KVM ever wants to emulate events
that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id"
entry.

Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Process only enabled PMCs when emulating events in software
Sean Christopherson [Fri, 10 Nov 2023 02:28:53 +0000 (18:28 -0800)]
KVM: x86/pmu: Process only enabled PMCs when emulating events in software

Mask off disabled counters based on PERF_GLOBAL_CTRL *before* iterating
over PMCs to emulate (branch) instruction retired events in software.  In
the common case where the guest isn't utilizing the PMU, pre-checking for
enabled counters turns a relatively expensive search into a few AND uops
and a Jcc.

Sadly, PMUs without PERF_GLOBAL_CTRL, e.g. most existing AMD CPUs, are out
of luck as there is no way to check that a PMC isn't being used without
checking the PMC's event selector.

Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Link: https://lore.kernel.org/r/20231110022857.1273836-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Add macros to iterate over all PMCs given a bitmap
Sean Christopherson [Fri, 10 Nov 2023 02:28:52 +0000 (18:28 -0800)]
KVM: x86/pmu: Add macros to iterate over all PMCs given a bitmap

Add and use kvm_for_each_pmc() to dedup a variety of open coded for-loops
that iterate over valid PMCs given a bitmap (and because seeing checkpatch
whine about bad macro style is always amusing).

No functional change intended.
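
The macro plausibly looks like this (a sketch; it pairs with the
kvm_pmc_idx_to_pmc() helper from the patch below, and the continue/else
dance is what lets it take a normal for-loop body):

  #define kvm_for_each_pmc(pmu, pmc, i, bitmap)                    \
          for_each_set_bit(i, bitmap, X86_PMC_IDX_MAX)             \
                  if (!(pmc = kvm_pmc_idx_to_pmc(pmu, i)))         \
                          continue;                                \
                  else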

Link: https://lore.kernel.org/r/20231110022857.1273836-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Snapshot and clear reprogramming bitmap before reprogramming
Sean Christopherson [Fri, 10 Nov 2023 02:28:51 +0000 (18:28 -0800)]
KVM: x86/pmu: Snapshot and clear reprogramming bitmap before reprogramming

Refactor the handling of the reprogramming bitmap to snapshot and clear
to-be-processed bits before doing the reprogramming, and then explicitly
set bits for PMCs that need to be reprogrammed (again).  This will allow
adding a macro to iterate over all valid PMCs without having to add
special handling for the reprogramming bit, which (a) can have bits set
for non-existent PMCs and (b) needs to clear such bits to avoid wasting
cycles in perpetuity.

Note, the existing behavior of clearing bits after reprogramming does NOT
have a race with kvm_vm_ioctl_set_pmu_event_filter().  Setting a new PMU
filter synchronizes SRCU _before_ setting the bitmap, i.e. guarantees that
the vCPU isn't in the middle of reprogramming with a stale filter prior to
setting the bitmap.

Link: https://lore.kernel.org/r/20231110022857.1273836-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Move pmc_idx => pmc translation helper to common code
Sean Christopherson [Fri, 10 Nov 2023 02:28:50 +0000 (18:28 -0800)]
KVM: x86/pmu: Move pmc_idx => pmc translation helper to common code

Add a common helper for *internal* PMC lookups, and delete the ops hook
and Intel's implementation.  Keep AMD's implementation, but rename it to
amd_pmu_get_pmc() to make it somewhat more obvious that it's suited for
both KVM-internal and guest-initiated lookups.

Because KVM tracks all counters in a single bitmap, getting a counter
when iterating over a bitmap, e.g. of all valid PMCs, requires a small
amount of math that, while simple, isn't super obvious and doesn't use the
same semantics as PMC lookups from RDPMC!  Although AMD doesn't support
fixed counters, the common PMU code still behaves as if there were a split, the
high half of which just happens to always be empty.
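
Sketched, the common lookup with the split-bitmap math (this assumes the
KVM_FIXED_PMC_BASE_IDX define added by the patch below):

  static inline struct kvm_pmc *kvm_pmc_idx_to_pmc(struct kvm_pmu *pmu, int idx)
  {
          if (idx < pmu->nr_arch_gp_counters)
                  return &pmu->gp_counters[idx];

          /* Fixed counters live in the high half of the bitmap. */
          idx -= KVM_FIXED_PMC_BASE_IDX;
          if (idx >= 0 && idx < pmu->nr_arch_fixed_counters)
                  return &pmu->fixed_counters[idx];

          return NULL;
  }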

Opportunistically add a comment to explain both what is going on, and why
KVM uses a single bitmap, e.g. the boilerplate for iterating over separate
bitmaps could be done via macros, so it's not (just) about deduplicating
code.

Link: https://lore.kernel.org/r/20231110022857.1273836-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Add common define to capture fixed counters offset
Sean Christopherson [Fri, 10 Nov 2023 02:28:49 +0000 (18:28 -0800)]
KVM: x86/pmu: Add common define to capture fixed counters offset

Add a common define to "officially" solidify KVM's split of counters,
i.e. to commit to using bits 31:0 to track general purpose counters and
bits 63:32 to track fixed counters (which only Intel supports).  KVM
already bleeds this behavior all over common PMU code, and adding a KVM-
defined macro allows clarifying that the value is a _base_, as opposed to
the _flag_ that is used to access fixed PMCs via RDPMC (which perf
confusingly calls INTEL_PMC_FIXED_RDPMC_BASE).

No functional change intended.
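
Plausibly (a sketch; perf's INTEL_PMC_IDX_FIXED is 32):

  /* GP counters use bits 31:0, fixed counters bits 63:32. */
  #define KVM_FIXED_PMC_BASE_IDX  INTEL_PMC_IDX_FIXED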

Link: https://lore.kernel.org/r/20231110022857.1273836-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled
Sean Christopherson [Fri, 10 Nov 2023 02:28:48 +0000 (18:28 -0800)]
KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled

Move the purging of common PMU metadata from intel_pmu_refresh() to
kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if
the VM is supposed to have a vPMU.

KVM already denies access to the PMU based on kvm->arch.enable_pmu, as
get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already
violates AMD's architecture by not virtualizing a PMU (kernels have long
since learned to not panic when the PMU is unavailable).  But configuring
the PMU as if it were enabled causes unwanted side effects, e.g. calls to
kvm_pmu_trigger_event() waste an absurd number of cycles due to the
all_valid_pmc_idx bitmap being non-zero.

Fixes: b1d66dad65dc ("KVM: x86/svm: Add module param to control PMU virtualization")
Reported-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Closes: https://lore.kernel.org/all/20231109180646.2963718-2-khorenko@virtuozzo.com
Link: https://lore.kernel.org/r/20231110022857.1273836-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86: Make gtod_is_based_on_tsc() return 'bool'
Vitaly Kuznetsov [Tue, 9 Jan 2024 14:11:21 +0000 (15:11 +0100)]
KVM: x86: Make gtod_is_based_on_tsc() return 'bool'

gtod_is_based_on_tsc() is boolean in nature, i.e. it returns '1' for good
clocksources and '0' otherwise. Moreover, its result is used raw by
kvm_get_time_and_clockread()/kvm_get_walltime_and_clockread(), which
return 'bool'.

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240109141121.1619463-6-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Make hyperv_clock require TSC based system clocksource
Vitaly Kuznetsov [Tue, 9 Jan 2024 14:11:20 +0000 (15:11 +0100)]
KVM: selftests: Make hyperv_clock require TSC based system clocksource

KVM sets up Hyper-V TSC page clocksource for its guests when system
clocksource is 'based on TSC' (see gtod_is_based_on_tsc()), running
hyperv_clock with any other clocksource leads to imminent failure.

Add the missing requirement to make the test skip gracefully.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240109141121.1619463-5-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Run clocksource dependent tests with hyperv_clocksource_tsc_page too
Vitaly Kuznetsov [Tue, 9 Jan 2024 14:11:19 +0000 (15:11 +0100)]
KVM: selftests: Run clocksource dependent tests with hyperv_clocksource_tsc_page too

KVM's 'gtod_is_based_on_tsc()' recognizes two clocksources: 'tsc' and
'hyperv_clocksource_tsc_page' and enables kvmclock in 'masterclock'
mode when either is in use. Transform 'sys_clocksource_is_tsc()' into
'sys_clocksource_is_based_on_tsc()' to support the latter. This affects
two tests: kvm_clock_test and vmx_nested_tsc_scaling_test, both seem
to work well when system clocksource is 'hyperv_clocksource_tsc_page'.
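
A sketch of the generalized helper (sysfs reads include a trailing newline,
hence the '\n' in the comparisons; the exact shape is assumed):

  bool sys_clocksource_is_based_on_tsc(void)
  {
          char *clk_name = sys_get_cur_clocksource();
          bool ret = !strcmp(clk_name, "tsc\n") ||
                     !strcmp(clk_name, "hyperv_clocksource_tsc_page\n");

          free(clk_name);
          return ret;
  }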

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240109141121.1619463-4-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Use generic sys_clocksource_is_tsc() in vmx_nested_tsc_scaling_test
Vitaly Kuznetsov [Tue, 9 Jan 2024 14:11:18 +0000 (15:11 +0100)]
KVM: selftests: Use generic sys_clocksource_is_tsc() in vmx_nested_tsc_scaling_test

Despite its name, system_has_stable_tsc() just checks that system
clocksource is 'tsc'; this can now be done with generic
sys_clocksource_is_tsc().

No functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240109141121.1619463-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Generalize check_clocksource() from kvm_clock_test
Vitaly Kuznetsov [Tue, 9 Jan 2024 14:11:17 +0000 (15:11 +0100)]
KVM: selftests: Generalize check_clocksource() from kvm_clock_test

Several existing x86 selftests need to check that the underlying system
clocksource is TSC or based on TSC but every test implements its own
check. As a first step towards unification, extract check_clocksource()
from kvm_clock_test and split it into two functions: arch-neutral
'sys_get_cur_clocksource()' and x86-specific 'sys_clocksource_is_tsc()'.
Fix a couple of pre-existing issues in kvm_clock_test: memory leakage in
check_clocksource() and using TEST_ASSERT() instead of TEST_REQUIRE().
The change also makes the test fail when system clocksource can't be read
from sysfs.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240109141121.1619463-2-vkuznets@redhat.com
[sean: eliminate if-elif pattern just to set a bool true]
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/mmu: Use KMEM_CACHE instead of kmem_cache_create()
Kunwu Chan [Tue, 16 Jan 2024 10:00:25 +0000 (18:00 +0800)]
KVM: x86/mmu: Use KMEM_CACHE instead of kmem_cache_create()

Use the new KMEM_CACHE() macro instead of calling kmem_cache_create()
directly to simplify the creation of SLAB caches.

Note, KMEM_CACHE() uses the required alignment of the struct, '8' as the
alignment, whereas KVM's existing code passes '0'.  In the end, the two
values yield the same result as x86's minimum slab alignment is also '8'
(which is not at all coincidental).
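
For illustration (a sketch of the pattern, not the exact hunks):

  /* Before: name, size, and alignment spelled out by hand. */
  cache = kmem_cache_create("pte_list_desc", sizeof(struct pte_list_desc),
                            0, SLAB_ACCOUNT, NULL);

  /* After: KMEM_CACHE() derives all three from the struct itself. */
  cache = KMEM_CACHE(pte_list_desc, SLAB_ACCOUNT);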

Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Link: https://lore.kernel.org/r/20240116100025.95702-1-chentao@kylinos.cn
[sean: call out alignment behavior]
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
Prasad Pandit [Wed, 3 Jan 2024 07:53:43 +0000 (13:23 +0530)]
KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu

The kvm_vcpu_ioctl_x86_set_vcpu_events() routine makes a 'KVM_REQ_NMI'
request for a vcpu even when its 'events->nmi.pending' is zero.
Ex:
    qemu_thread_start
     kvm_vcpu_thread_fn
      qemu_wait_io_event
       qemu_wait_io_event_common
        process_queued_cpu_work
         do_kvm_cpu_synchronize_post_init/_reset
          kvm_arch_put_registers
           kvm_put_vcpu_events (cpu, level=[2|3])

This leads vCPU threads in QEMU to constantly acquire & release the
global mutex lock, delaying the guest boot due to lock contention.
Add a check to make the KVM_REQ_NMI request only if the vcpu has an NMI
pending.
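
Sketched, the guarded request looks like this (shape inferred from the
description, not the verbatim hunk):

  if (events->flags & KVM_VCPUEVENT_VALID_NMI_PENDING) {
          vcpu->arch.nmi_pending = 0;
          atomic_set(&vcpu->arch.nmi_queued, events->nmi.pending);
          if (events->nmi.pending)
                  kvm_make_request(KVM_REQ_NMI, vcpu);
  }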

Fixes: bdedff263132 ("KVM: x86: Route pending NMIs from userspace through process_nmi()")
Cc: stable@vger.kernel.org
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Link: https://lore.kernel.org/r/20240103075343.549293-1-ppandit@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Extend PMU counters test to validate RDPMC after WRMSR
Sean Christopherson [Tue, 9 Jan 2024 23:02:49 +0000 (15:02 -0800)]
KVM: selftests: Extend PMU counters test to validate RDPMC after WRMSR

Extend the read/write PMU counters subtest to verify that RDPMC also reads
back the written value.  Opportunistically verify that attempting to use
the "fast" mode of RDPMC fails, as the "fast" flag is only supported by
non-architectural PMUs, which KVM doesn't virtualize.

Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Add helpers for safe and safe+forced RDMSR, RDPMC, and XGETBV
Sean Christopherson [Tue, 9 Jan 2024 23:02:48 +0000 (15:02 -0800)]
KVM: selftests: Add helpers for safe and safe+forced RDMSR, RDPMC, and XGETBV

Add helpers for safe and safe-with-forced-emulations versions of RDMSR,
RDPMC, and XGETBV.  Use macro shenanigans to eliminate the rather large
amount of boilerplate needed to get values in and out of registers.

Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Add a forced emulation variation of KVM_ASM_SAFE()
Sean Christopherson [Tue, 9 Jan 2024 23:02:47 +0000 (15:02 -0800)]
KVM: selftests: Add a forced emulation variation of KVM_ASM_SAFE()

Add KVM_ASM_SAFE_FEP() to allow forcing emulation on an instruction that
might fault.  Note, KVM skips RIP past the FEP prefix before injecting an
exception, i.e. the fixup needs to be on the instruction itself.  Do not
check for FEP support, that is firmly the responsibility of whatever code
wants to use KVM_ASM_SAFE_FEP().

Sadly, chaining variadic arguments that contain commas doesn't work, thus
the unfortunate amount of copy+paste.

Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Test PMC virtualization with forced emulation
Sean Christopherson [Tue, 9 Jan 2024 23:02:46 +0000 (15:02 -0800)]
KVM: selftests: Test PMC virtualization with forced emulation

Extend the PMC counters test to use forced emulation to verify that KVM
emulates counter events for instructions retired and branches retired.
Force emulation for only a subset of the measured code to test that KVM
does the right thing when mixing perf events with emulated events.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Move KVM_FEP macro into common library header
Sean Christopherson [Tue, 9 Jan 2024 23:02:45 +0000 (15:02 -0800)]
KVM: selftests: Move KVM_FEP macro into common library header

Move the KVM_FEP definition, a.k.a. the KVM force emulation prefix, into
processor.h so that it can be used for other tests besides the MSR filter
test.

Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Query module param to detect FEP in MSR filtering test
Sean Christopherson [Tue, 9 Jan 2024 23:02:44 +0000 (15:02 -0800)]
KVM: selftests: Query module param to detect FEP in MSR filtering test

Add a helper to detect KVM support for forced emulation by querying the
module param, and use the helper to detect support for the MSR filtering
test instead of throwing a noodle/NOP at KVM to see if it sticks.

Cc: Aaron Lewis <aaronlewis@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Add helpers to read integer module params
Sean Christopherson [Tue, 9 Jan 2024 23:02:43 +0000 (15:02 -0800)]
KVM: selftests: Add helpers to read integer module params

Add helpers to read integer module params, which is painfully non-trivial
because the pain of dealing with strings in C is exacerbated by the kernel
inserting a newline.

Don't bother differentiating between int, uint, short, etc.  They all fit
in an int, and KVM (thankfully) doesn't have any integer params larger
than an int.

Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Add a helper to query if the PMU module param is enabled
Sean Christopherson [Tue, 9 Jan 2024 23:02:42 +0000 (15:02 -0800)]
KVM: selftests: Add a helper to query if the PMU module param is enabled

Add a helper to probe KVM's "enable_pmu" param, open coding strings in
multiple places is just asking for false negatives and/or runtime errors
due to typos.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Expand PMU counters test to verify LLC events
Sean Christopherson [Tue, 9 Jan 2024 23:02:41 +0000 (15:02 -0800)]
KVM: selftests: Expand PMU counters test to verify LLC events

Expand the PMU counters test to verify that LLC references and misses have
non-zero counts when the code being executed while the LLC event(s) is
active is evicted via CLFLUSH{,OPT}.  Note, CLFLUSH{,OPT} requires a fence
of some kind to ensure the cache lines are flushed before execution
continues.  Use MFENCE for simplicity (performance is not a concern).

Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Add functional test for Intel's fixed PMU counters
Jinrong Liang [Tue, 9 Jan 2024 23:02:40 +0000 (15:02 -0800)]
KVM: selftests: Add functional test for Intel's fixed PMU counters

Extend the fixed counters test to verify that supported counters can
actually be enabled in the control MSRs, that unsupported counters cannot,
and that enabled counters actually count.

Co-developed-by: Like Xu <likexu@tencent.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
[sean: fold into the rd/wr access test, massage changelog]
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Test consistency of CPUID with num of fixed counters
Jinrong Liang [Tue, 9 Jan 2024 23:02:39 +0000 (15:02 -0800)]
KVM: selftests: Test consistency of CPUID with num of fixed counters

Extend the PMU counters test to verify KVM emulation of fixed counters in
addition to general purpose counters.  Fixed counters add an extra wrinkle
in the form of an extra supported bitmask.  Thus quoth the SDM:

  fixed-function performance counter 'i' is supported if ECX[i] || (EDX[4:0] > i)

Test that KVM handles a counter being available through either method.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Co-developed-by: Like Xu <likexu@tencent.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Test consistency of CPUID with num of gp counters
Jinrong Liang [Tue, 9 Jan 2024 23:02:38 +0000 (15:02 -0800)]
KVM: selftests: Test consistency of CPUID with num of gp counters

Add a test to verify that KVM correctly emulates MSR-based accesses to
general purpose counters based on guest CPUID, e.g. that accesses to
non-existent counters #GP and accesses to existent counters succeed.

Note, for compatibility reasons, KVM does not emulate #GP when
MSR_P6_PERFCTR[0|1] is not present (writes should be dropped).
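
A hedged sketch of the guest-side assertion, assuming the selftests'
wrmsr_safe()/GP_VECTOR helpers and MSR_IA32_PMC0 from msr-index.h:

  /* A write to the first counter beyond what guest CPUID enumerates
   * should take a #GP rather than succeed. */
  static void guest_test_oob_counter(uint8_t nr_gp_counters)
  {
          uint8_t vector = wrmsr_safe(MSR_IA32_PMC0 + nr_gp_counters, 0);

          GUEST_ASSERT(vector == GP_VECTOR);
  }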

Co-developed-by: Like Xu <likexu@tencent.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Test Intel PMU architectural events on fixed counters
Jinrong Liang [Tue, 9 Jan 2024 23:02:37 +0000 (15:02 -0800)]
KVM: selftests: Test Intel PMU architectural events on fixed counters

Extend the PMU counters test to validate architectural events using fixed
counters.  The core logic is largely the same, the biggest difference
being that if a fixed counter exists, its associated event is available
(the SDM doesn't explicitly state this to be true, but it's KVM's ABI and
letting software program a fixed counter that doesn't actually count would
be quite bizarre).

Note, fixed counters rely on PERF_GLOBAL_CTRL.

Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Co-developed-by: Like Xu <likexu@tencent.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Test Intel PMU architectural events on gp counters
Jinrong Liang [Tue, 9 Jan 2024 23:02:36 +0000 (15:02 -0800)]
KVM: selftests: Test Intel PMU architectural events on gp counters

Add test cases to verify that Intel's Architectural PMU events work as
expected when they are available according to guest CPUID.  Iterate over a
range of sane PMU versions, with and without full-width writes enabled,
and over interesting combinations of lengths/masks for the bit vector that
enumerates unavailable events.

Test up to vPMU version 5, i.e. the current architectural max.  KVM only
officially supports up to version 2, but the behavior of the counters is
backwards compatible, i.e. KVM shouldn't do something completely different
for a higher, architecturally-defined vPMU version.  Verify KVM behavior
against the effective vPMU version, e.g. advertising vPMU 5 when KVM only
supports vPMU 2 shouldn't magically unlock vPMU 5 features.

According to the Intel SDM, the number of architectural events is reported
through CPUID.0AH:EAX[31:24] and the architectural event x is supported
if EBX[x]=0 && EAX[31:24]>x.

Handcode the entirety of the measured section so that the test can
precisely assert on the number of instructions and branches retired.
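
The availability rule above, as a hedged C sketch (note the reversed
polarity: EBX enumerates *unavailable* events):

  #include <stdbool.h>
  #include <stdint.h>

  /* Architectural event 'x' is available if it is within the length
   * enumerated in EAX[31:24] AND its "unavailable" bit is clear. */
  static bool arch_event_available(uint32_t eax, uint32_t ebx, int x)
  {
          return (eax >> 24) > x && !(ebx & (1u << x));
  }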

Co-developed-by: Like Xu <likexu@tencent.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Add pmu.h and lib/pmu.c for common PMU assets
Jinrong Liang [Tue, 9 Jan 2024 23:02:35 +0000 (15:02 -0800)]
KVM: selftests: Add pmu.h and lib/pmu.c for common PMU assets

Add a PMU library for x86 selftests to help eliminate open-coded event
encodings, and to reduce the amount of copy+paste between PMU selftests.

Use the new common macro definitions in the existing PMU event filter test.

Cc: Aaron Lewis <aaronlewis@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Extend {kvm,this}_pmu_has() to support fixed counters
Sean Christopherson [Tue, 9 Jan 2024 23:02:34 +0000 (15:02 -0800)]
KVM: selftests: Extend {kvm,this}_pmu_has() to support fixed counters

Extend the kvm_x86_pmu_feature framework to allow querying for fixed
counters via {kvm,this}_pmu_has().  Like architectural events, checking
for a fixed counter annoyingly requires checking multiple CPUID fields, as
a fixed counter exists if:

  FxCtr[i]_is_supported := ECX[i] || (EDX[4:0] > i);

Note, KVM currently doesn't actually support exposing fixed counters via
the bitmask, but that will hopefully change sooner than later, and Intel's
SDM explicitly "recommends" checking both the number of counters and the
mask.

Rename the intermediate "anti_feature" field to simply 'f' since the fixed
counter bitmask (thankfully) doesn't have reversed polarity like the
architectural events bitmask.

Note, ideally the helpers would use BUILD_BUG_ON() to assert on the
incoming register, but the expected usage in PMU tests can't guarantee the
inputs are compile-time constants.

Opportunistically define macros for all of the known architectural events
and fixed counters.

Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Drop the "name" param from KVM_X86_PMU_FEATURE()
Sean Christopherson [Tue, 9 Jan 2024 23:02:33 +0000 (15:02 -0800)]
KVM: selftests: Drop the "name" param from KVM_X86_PMU_FEATURE()

Drop the "name" parameter from KVM_X86_PMU_FEATURE(), it's unused and
the name is redundant with the macro, i.e. it's truly useless.

Reviewed-by: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Add vcpu_set_cpuid_property() to set properties
Jinrong Liang [Tue, 9 Jan 2024 23:02:32 +0000 (15:02 -0800)]
KVM: selftests: Add vcpu_set_cpuid_property() to set properties

Add vcpu_set_cpuid_property() helper function for setting properties, and
use it instead of open coding an equivalent for MAX_PHY_ADDR.  Future vPMU
testcases will also need to stuff various CPUID properties.

Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Explicitly check for RDPMC of unsupported Intel PMC types
Sean Christopherson [Tue, 9 Jan 2024 23:02:31 +0000 (15:02 -0800)]
KVM: x86/pmu: Explicitly check for RDPMC of unsupported Intel PMC types

Explicitly check for attempts to read unsupported PMC types instead of
letting the bounds check fail.  Functionally, letting the check fail is
ok, but it's unnecessarily subtle and does a poor job of documenting the
architectural behavior that KVM is emulating.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Treat "fixed" PMU type in RDPMC as index as a value, not flag
Sean Christopherson [Tue, 9 Jan 2024 23:02:30 +0000 (15:02 -0800)]
KVM: x86/pmu: Treat "fixed" PMU type in RDPMC as index as a value, not flag

Refactor KVM's handling of ECX for RDPMC to treat the FIXED modifier as an
explicit value, not a flag (minus one wart).  While non-architectural PMUs
do use bit 31 as a flag (for "fast" reads), architectural PMUs use the
upper half of ECX to encode the type.  From the SDM:

  ECX[31:16] specifies type of PMC while ECX[15:0] specifies the index of
  the PMC to be read within that type

Note, the fact that the known supported types, 4000H and 2000H, look a lot
like flags doesn't contradict the above statement that ECX[31:16] holds
the type, at least not by any sane reading of the SDM.

Keep the explicit clearing of the FIXED "flag", as KVM subtly relies on
that behavior to disallow unsupported types while allowing the correct
indices for fixed counters.  This wart will be cleaned up in short order.

Opportunistically grab the per-type bitmask in the if-else blocks to
eliminate the one-off usage of the local "fixed" bool.
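
As a hedged sketch, the architectural decode looks like this (the macro
names are illustrative, not KVM's actual #defines):

  #include <stdint.h>

  #define RDPMC_TYPE_MASK    0xffff0000u  /* ECX[31:16]: PMC type */
  #define RDPMC_INDEX_MASK   0x0000ffffu  /* ECX[15:0]: index in type */
  #define RDPMC_TYPE_FIXED   (0x4000u << 16)

  static void rdpmc_decode(uint32_t ecx, uint32_t *type, uint32_t *idx)
  {
          *type = ecx & RDPMC_TYPE_MASK;
          *idx  = ecx & RDPMC_INDEX_MASK;
  }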

Reported-by: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Disallow "fast" RDPMC for architectural Intel PMUs
Sean Christopherson [Tue, 9 Jan 2024 23:02:29 +0000 (15:02 -0800)]
KVM: x86/pmu: Disallow "fast" RDPMC for architectural Intel PMUs

Inject #GP on RDPMC if the "fast" flag is set for architectural Intel
PMUs, i.e. if the PMU version is non-zero.  Per Intel's SDM, and confirmed
on bare metal, the "fast" flag is supported only for non-architectural
PMUs, and is reserved for architectural PMUs.

  If the processor does not support architectural performance monitoring
  (CPUID.0AH:EAX[7:0]=0), ECX[30:0] specifies the index of the PMC to be
  read. Setting ECX[31] selects “fast” read mode if supported. In this mode,
  RDPMC returns bits 31:0 of the PMC in EAX while clearing EDX to zero.

  If the processor does support architectural performance monitoring
  (CPUID.0AH:EAX[7:0] ≠ 0), ECX[31:16] specifies type of PMC while ECX[15:0]
  specifies the index of the PMC to be read within that type. The following
  PMC types are currently defined:
  — General-purpose counters use type 0. The index x (to read IA32_PMCx)
    must be less than the value enumerated by CPUID.0AH.EAX[15:8] (thus
    ECX[15:8] must be zero).
  — Fixed-function counters use type 4000H. The index x (to read
    IA32_FIXED_CTRx) can be used if either CPUID.0AH.EDX[4:0] > x or
    CPUID.0AH.ECX[x] = 1 (thus ECX[15:5] must be 0).
  — Performance metrics use type 2000H. This type can be used only if
    IA32_PERF_CAPABILITIES.PERF_METRICS_AVAILABLE[bit 15]=1. For this type,
    the index in ECX[15:0] is implementation specific.

Opportunistically WARN if KVM ever actually tries to complete RDPMC for a
non-architectural PMU, and drop the non-existent "support" for fast RDPMC,
as KVM doesn't support such PMUs, i.e. kvm_pmu_rdpmc() should reject the
RDPMC before getting to the Intel code.

Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests")
Fixes: 67f4d4288c35 ("KVM: x86: rdpmc emulation checks the counter incorrectly")
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Apply "fast" RDPMC only to Intel PMUs
Sean Christopherson [Tue, 9 Jan 2024 23:02:28 +0000 (15:02 -0800)]
KVM: x86/pmu: Apply "fast" RDPMC only to Intel PMUs

Move the handling of "fast" RDPMC instructions, which drop bits 63:32 of
the count, to Intel.  The "fast" flag, and all modifiers for that matter,
are Intel-only and aren't supported by AMD.

Opportunistically replace open coded bit crud with proper #defines, and
add comments to try and disentangle the flags vs. values mess for
non-architectural vs. architectural PMUs.
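
A hedged sketch of the Intel-only "fast" behavior (names are
illustrative):

  #include <stdbool.h>
  #include <stdint.h>

  #define RDPMC_FAST_FLAG (1u << 31)  /* non-architectural PMUs only */

  static uint64_t rdpmc_result(uint64_t count, uint32_t ecx, bool arch_pmu)
  {
          /* "Fast" reads return bits 31:0 in EAX and clear EDX. */
          if (!arch_pmu && (ecx & RDPMC_FAST_FLAG))
                  return (uint32_t)count;
          return count;
  }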

Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad index
Sean Christopherson [Tue, 9 Jan 2024 23:02:27 +0000 (15:02 -0800)]
KVM: x86/pmu: Prioritize VMX interception over #GP on RDPMC due to bad index

Apply the pre-intercepts RDPMC validity check only to AMD, and rename all
relevant functions to make it as clear as possible that the check is not a
standard PMC index check.  On Intel, the basic rule is that only invalid
opcodes and privilege/permission/mode checks have priority over VM-Exit,
i.e. RDPMC with an invalid index should VM-Exit, not #GP.  While the SDM
doesn't explicitly call out RDPMC, it _does_ explicitly use RDMSR of a
non-existent MSR as an example where VM-Exit has priority over #GP, and
RDPMC is effectively just a variation of RDMSR.

Manually testing on various Intel CPUs confirms this behavior, and the
inverted priority was introduced for SVM compatibility, i.e. was not an
intentional change for Intel PMUs.  On AMD, *all* exceptions on RDPMC have
priority over VM-Exit.

Check for a NULL kvm_pmu_ops.check_rdpmc_early instead of using a RET0
static call so as to provide a convenient location to document the
difference between Intel and AMD, and to again try to make it as obvious
as possible that the early check is a one-off thing, not a generic "is
this PMC valid?" helper.

Fixes: 8061252ee0d2 ("KVM: SVM: Add intercept checks for remaining twobyte instructions")
Cc: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Don't ignore bits 31:30 for RDPMC index on AMD
Sean Christopherson [Tue, 9 Jan 2024 23:02:26 +0000 (15:02 -0800)]
KVM: x86/pmu: Don't ignore bits 31:30 for RDPMC index on AMD

Stop stripping bits 31:30 prior to validating/consuming the RDPMC index on
AMD.  Per the APM's documentation of RDPMC, *values* greater than 27 are
reserved.  The behavior of upper bits being flags is firmly Intel-only.

Fixes: ca724305a2b0 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Get eventsel for fixed counters from perf
Sean Christopherson [Tue, 9 Jan 2024 23:02:25 +0000 (15:02 -0800)]
KVM: x86/pmu: Get eventsel for fixed counters from perf

Get the event selectors used to effectively request fixed counters for
perf events from perf itself instead of hardcoding them in KVM and hoping
that they match the underlying hardware.  While fixed counters 0 and 1 use
architectural events, as of ffbe4ab0beda ("perf/x86/intel: Extend the
ref-cycles event to GP counters") fixed counter 2 (reference TSC cycles)
may use a software-defined pseudo-encoding or a real hardware-defined
encoding.

Reported-by: Kan Liang <kan.liang@linux.intel.com>
Closes: https://lkml.kernel.org/r/4281eee7-6423-4ec8-bb18-c6aeee1faf2c%40linux.intel.com
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Setup fixed counters' eventsel during PMU initialization
Sean Christopherson [Tue, 9 Jan 2024 23:02:24 +0000 (15:02 -0800)]
KVM: x86/pmu: Setup fixed counters' eventsel during PMU initialization

Set the eventsel for all fixed counters during PMU initialization; the
eventsel is hardcoded and is consumed if and only if the counter is
supported, i.e. there is no reason to redo the setup every time the PMU is
refreshed.

Configuring all KVM-supported fixed counters also eliminates a potential
pitfall if/when KVM supports discontiguous fixed counters, in which case
configuring only nr_arch_fixed_counters will be insufficient (ignoring the
fact that KVM will need many other changes to support discontiguous fixed
counters).

Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Remove KVM's enumeration of Intel's architectural encodings
Sean Christopherson [Tue, 9 Jan 2024 23:02:23 +0000 (15:02 -0800)]
KVM: x86/pmu: Remove KVM's enumeration of Intel's architectural encodings

Drop KVM's enumeration of Intel's architectural event encodings, and
instead open code the three encodings (of which only two are real) that
KVM uses to emulate fixed counters.  Now that KVM doesn't incorrectly
enforce the availability of architectural encodings, there is no reason
for KVM to ever care about the encodings themselves, at least not in the
current format of an array indexed by the encoding's position in CPUID.

Opportunistically add a comment to explain why KVM cares about eventsel
values for fixed counters.

Suggested-by: Jim Mattson <jmattson@google.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Allow programming events that match unsupported arch events
Sean Christopherson [Tue, 9 Jan 2024 23:02:22 +0000 (15:02 -0800)]
KVM: x86/pmu: Allow programming events that match unsupported arch events

Remove KVM's bogus restriction that the guest can't program an event whose
encoding matches an unsupported architectural event.  The enumeration of
an architectural event only says that if a CPU supports an architectural
event, then the event can be programmed using the architectural encoding.
The enumeration does NOT say anything about the encoding when the CPU
doesn't report support for the architectural event.

Preventing the guest from counting events whose encoding happens to match
an architectural event breaks existing functionality whenever Intel adds
an architectural encoding that was *ever* used for a CPU that doesn't
enumerate support for the architectural event, even if the encoding is for
the exact same event!

E.g. the architectural encoding for Top-Down Slots is 0x01a4.  On
Broadwell CPUs, which do not support the Top-Down Slots architectural
event, 0x01a4 is a valid, model-specific event.  Denying guest usage of
0x01a4 if/when KVM adds support for Top-Down Slots would break any
Broadwell-based guest.

Reported-by: Kan Liang <kan.liang@linux.intel.com>
Closes: https://lore.kernel.org/all/2004baa6-b494-462c-a11f-8104ea152c6a@linux.intel.com
Fixes: a21864486f7e ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event")
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86/pmu: Always treat Fixed counters as available when supported
Sean Christopherson [Tue, 9 Jan 2024 23:02:21 +0000 (15:02 -0800)]
KVM: x86/pmu: Always treat Fixed counters as available when supported

Treat fixed counters as available when they are supported, i.e. don't
silently ignore an enabled fixed counter just because guest CPUID says the
associated general purpose architectural event is unavailable.

KVM originally treated fixed counters as always available, but that got
changed as part of a fix to avoid confusing REF_CPU_CYCLES, which does NOT
map to an architectural event, with the actual architectural event
associated with bit 7, TOPDOWN_SLOTS.

The commit justified the change with:

    If the event is marked as unavailable in the Intel guest CPUID
    0AH.EBX leaf, we need to avoid any perf_event creation, whether
    it's a gp or fixed counter.

but that justification doesn't mesh with reality.  The Intel SDM uses
"architectural events" to refer to both general purpose events (the ones
with the reverse polarity mask in CPUID.0xA.EBX) and the events for fixed
counters, e.g. the SDM makes statements like:

  Each of the fixed-function PMC can count only one architectural
  performance event.

but the fact that fixed counter 2 (TSC reference cycles) doesn't have an
associated general purpose architectural event makes trying to apply the mask
from CPUID.0xA.EBX impossible.

Furthermore, the lack of enumeration for an architectural event in CPUID
only means the CPU doesn't officially support the architectural encoding,
i.e. it doesn't mean using the architectural encoding _won't_ work, it
simply means there are no guarantees that it will work as expected.  E.g.
if KVM is running in a VM that advertises a fixed counter but not the
corresponding architectural event encoding, and perf decides to use a
general purpose counter instead of a fixed counter, odds are very good
that the underlying hardware actually does support the architectural
encoding, and that programming the encoding will count the right thing.

In other words, asking perf to count the event will probably work, whereas
intentionally doing nothing is obviously guaranteed to fail.

Note, at the time of the change, KVM didn't enforce hardware support, i.e.
didn't prevent userspace from enumerating support in guest CPUID.0xA.EBX
for architectural events that aren't supported in hardware.  I.e. silently
dropping the fixed counter didn't somehow protect against counting the
wrong event, it just enforced guest CPUID.  And practically speaking, this
issue is almost certainly limited to running KVM on a funky virtual CPU
model.  No known real hardware has an asymmetric PMU where a fixed counter
is supported but the associated architectural event is not.

Fixes: a21864486f7e ("KVM: x86/pmu: Fix available_event_types check for REF_CPU_CYCLES event")
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20240109230250.424295-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Fail tests when open() fails with !ENOENT
Vitaly Kuznetsov [Mon, 29 Jan 2024 08:58:47 +0000 (09:58 +0100)]
KVM: selftests: Fail tests when open() fails with !ENOENT

open_path_or_exit() is used for '/dev/kvm', '/dev/sev', and
'/sys/module/%s/parameters/%s', and skipping the test when the entry is missing
is completely reasonable. Other errors, however, may indicate a real issue
which is easy to miss. E.g. when the 'hyperv_features' test was entering an
infinite loop the output was:

./hyperv_features
Testing access to Hyper-V specific MSRs
1..0 # SKIP - /dev/kvm not available (errno: 24)

and this can easily get overlooked.

Keep the ENOENT case 'special' for skipping tests, and fail when open() results
in any other errno.
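
A hedged sketch of the resulting behavior (the exit codes and message
format are assumptions; KSFT_SKIP is conventionally 4):

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static int open_path_or_exit(const char *path, int flags)
  {
          int fd = open(path, flags);

          if (fd < 0) {
                  if (errno == ENOENT) {
                          printf("1..0 # SKIP - %s not available\n", path);
                          exit(4);  /* KSFT_SKIP */
                  }
                  fprintf(stderr, "open(%s): %s\n", path, strerror(errno));
                  exit(1);
          }
          return fd;
  }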

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240129085847.2674082-2-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Avoid infinite loop in hyperv_features when invtsc is missing
Vitaly Kuznetsov [Mon, 29 Jan 2024 08:58:46 +0000 (09:58 +0100)]
KVM: selftests: Avoid infinite loop in hyperv_features when invtsc is missing

When X86_FEATURE_INVTSC is missing, guest_test_msrs_access() was supposed
to skip testing the dependent Hyper-V invariant TSC feature. Unfortunately,
'continue' does not lead to that as stage is not incremented. Moreover,
'vm' allocated with vm_create_with_one_vcpu() is not freed and the test
runs out of available file descriptors very quickly.
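
A hedged sketch of the bug pattern (stage_uses_invtsc() is a hypothetical
stand-in for the real stage check):

  for (stage = 0; ; stage++) {
          struct kvm_vm *vm = vm_create_with_one_vcpu(&vcpu, guest_code);

          if (!has_invtsc && stage_uses_invtsc(stage)) {
                  kvm_vm_free(vm);  /* previously leaked */
                  continue;         /* 'stage++' still runs */
          }

          /* ... run the stage, break out when the guest is done ... */
          kvm_vm_free(vm);
  }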

Fixes: bd827bd77537 ("KVM: selftests: Test Hyper-V invariant TSC control")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240129085847.2674082-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Delete superfluous, unused "stage" variable in AMX test
Sean Christopherson [Tue, 9 Jan 2024 22:03:02 +0000 (14:03 -0800)]
KVM: selftests: Delete superfluous, unused "stage" variable in AMX test

Delete the AMX test's "stage" counter, as the counter is no longer used,
which makes clang unhappy:

  x86_64/amx_test.c:224:6: error: variable 'stage' set but not used
          int stage, ret;
              ^
  1 error generated.

Note, "stage" was never really used, it just happened to be dumped out by
a (failed) assertion on run->exit_reason, i.e. the AMX test has no concept
of stages; the code was likely copy+pasted from a different test.

Fixes: c96f57b08012 ("KVM: selftests: Make vCPU exit reason test assertion common")
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240109220302.399296-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: x86_64: Remove redundant newlines
Andrew Jones [Wed, 6 Dec 2023 17:02:47 +0000 (18:02 +0100)]
KVM: selftests: x86_64: Remove redundant newlines

TEST_* functions append their own newline. Remove newlines from
TEST_* callsites to avoid extra newlines in output.
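
E.g. (illustrative only):

  TEST_ASSERT(ret == 0, "ioctl failed");    /* good */
  TEST_ASSERT(ret == 0, "ioctl failed\n");  /* prints a blank line */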

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20231206170241.82801-12-ajones@ventanamicro.com
[sean: keep the newline in the "tsc\n" strncmp()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: s390x: Remove redundant newlines
Andrew Jones [Wed, 6 Dec 2023 17:02:46 +0000 (18:02 +0100)]
KVM: selftests: s390x: Remove redundant newlines

TEST_* functions append their own newline. Remove newlines from
TEST_* callsites to avoid extra newlines in output.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20231206170241.82801-11-ajones@ventanamicro.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: riscv: Remove redundant newlines
Andrew Jones [Wed, 6 Dec 2023 17:02:45 +0000 (18:02 +0100)]
KVM: selftests: riscv: Remove redundant newlines

TEST_* functions append their own newline. Remove newlines from
TEST_* callsites to avoid extra newlines in output.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20231206170241.82801-10-ajones@ventanamicro.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: aarch64: Remove redundant newlines
Andrew Jones [Wed, 6 Dec 2023 17:02:44 +0000 (18:02 +0100)]
KVM: selftests: aarch64: Remove redundant newlines

TEST_* functions append their own newline. Remove newlines from
TEST_* callsites to avoid extra newlines in output.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Acked-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20231206170241.82801-9-ajones@ventanamicro.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Remove redundant newlines
Andrew Jones [Wed, 6 Dec 2023 17:02:43 +0000 (18:02 +0100)]
KVM: selftests: Remove redundant newlines

TEST_* functions append their own newline. Remove newlines from
TEST_* callsites to avoid extra newlines in output.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://lore.kernel.org/r/20231206170241.82801-8-ajones@ventanamicro.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: selftests: Reword the NX hugepage test's skip message to be more helpful
Sean Christopherson [Wed, 29 Nov 2023 22:40:41 +0000 (14:40 -0800)]
KVM: selftests: Reword the NX hugepage test's skip message to be more helpful

Rework the NX hugepage test's skip message regarding the magic token to
provide all of the necessary magic, and to very explicitly recommend
using the wrapper shell script.

Opportunistically remove an overzealous newline; splitting the
recommendation message across two lines of ~45 characters makes it much
harder to read than running out a single line to 98 characters.

Link: https://lore.kernel.org/r/20231129224042.530798-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: x86: Use mutex guards to eliminate __kvm_x86_vendor_init()
Nikolay Borisov [Mon, 30 Oct 2023 14:17:28 +0000 (16:17 +0200)]
KVM: x86: Use mutex guards to eliminate __kvm_x86_vendor_init()

Use the recently introduced guard(mutex) infrastructure to acquire
vendor_module_lock and automatically release it when the guard goes out of
scope.
Drop the inner __kvm_x86_vendor_init(), its sole purpose was to simplify
releasing vendor_module_lock in error paths.

No functional change intended.
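
A hedged sketch of the pattern (the error check shown is illustrative):

  #include <linux/cleanup.h>
  #include <linux/mutex.h>

  static int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
  {
          /* Lock is dropped automatically on every return path. */
          guard(mutex)(&vendor_module_lock);

          if (kvm_x86_ops.hardware_enable)
                  return -EEXIST;

          /* ... the body of the old __kvm_x86_vendor_init() ... */
          return 0;
  }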

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20231030141728.1406118-1-nik.borisov@suse.com
[sean: rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoKVM: Harden against unpaired kvm_mmu_notifier_invalidate_range_end() calls
Sean Christopherson [Wed, 10 Jan 2024 00:42:39 +0000 (16:42 -0800)]
KVM: Harden against unpaired kvm_mmu_notifier_invalidate_range_end() calls

When handling the end of an mmu_notifier invalidation, WARN if
mn_active_invalidate_count is already 0 and do not decrement it further, i.e.
avoid causing mn_active_invalidate_count to underflow/wrap.  In the worst
case scenario, effectively corrupting mn_active_invalidate_count could
cause kvm_swap_active_memslots() to hang indefinitely.

end() calls are *supposed* to be paired with start(), i.e. underflow can
only happen if there is a bug elsewhere in the kernel, but due to lack of
lockdep assertions in the mmu_notifier helpers, it's all too easy for a
bug to go unnoticed for some time, e.g. see the recently introduced
PAGEMAP_SCAN ioctl().

Ideally, mmu_notifiers would incorporate lockdep assertions, but users of
mmu_notifiers aren't required to hold any one specific lock, i.e. adding
the necessary annotations to make lockdep aware of all locks that are
mutually exclusive with mm_take_all_locks() isn't trivial.
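
A hedged sketch of the hardening (modeled on the changelog, not
necessarily the exact diff):

  /* Refuse to decrement past zero; underflow means an unpaired end(). */
  if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count))
          kvm->mn_active_invalidate_count--;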

Link: https://lore.kernel.org/all/000000000000f6d051060c6785bc@google.com
Link: https://lore.kernel.org/r/20240110004239.491290-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
15 months agoLinux 6.8-rc2
Linus Torvalds [Mon, 29 Jan 2024 01:01:12 +0000 (17:01 -0800)]
Linux 6.8-rc2

15 months agoMerge tag 'cxl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Linus Torvalds [Sun, 28 Jan 2024 21:55:56 +0000 (13:55 -0800)]
Merge tag 'cxl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull cxl fixes from Dan Williams:
 "A build regression fix, a device compatibility fix, and an original
  bug preventing creation of large (16 device) interleave sets:

   - Fix unit test build regression fallout from global
     "missing-prototypes" change

   - Fix compatibility with devices that do not support interrupts

   - Fix overflow when calculating the capacity of large interleave sets"

* tag 'cxl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/region:Fix overflow issue in alloc_hpa()
  cxl/pci: Skip irq features if MSI/MSI-X are not supported
  tools/testing/nvdimm: Disable "missing prototypes / declarations" warnings
  tools/testing/cxl: Disable "missing prototypes / declarations" warnings

15 months agoMerge tag 'mips-fixes_6.8_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips...
Linus Torvalds [Sun, 28 Jan 2024 18:43:06 +0000 (10:43 -0800)]
Merge tag 'mips-fixes_6.8_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS fixes from Thomas Bogendoerfer:

 - fix boot issue on single core Lantiq Danube devices

 - fix boot issue on Loongson64 platforms

 - fix improper FPU setup

 - fix missing prototypes issues

* tag 'mips-fixes_6.8_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  mips: Call lose_fpu(0) before initializing fcr31 in mips_set_personality_nan
  MIPS: loongson64: set nid for reserved memblock region
  Revert "MIPS: loongson64: set nid for reserved memblock region"
  MIPS: lantiq: register smp_ops on non-smp platforms
  MIPS: loongson64: set nid for reserved memblock region
  MIPS: reserve exception vector space ONLY ONCE
  MIPS: BCM63XX: Fix missing prototypes
  MIPS: sgi-ip32: Fix missing prototypes
  MIPS: sgi-ip30: Fix missing prototypes
  MIPS: fw arc: Fix missing prototypes
  MIPS: sgi-ip27: Fix missing prototypes
  MIPS: Alchemy: Fix missing prototypes
  MIPS: Cobalt: Fix missing prototypes

15 months agoMerge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 Jan 2024 18:38:16 +0000 (10:38 -0800)]
Merge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

 - Prevent an inconsistent futex operation leading to stale state
   exposure

* tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Prevent the reuse of stale pi_state

15 months agoMerge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 Jan 2024 18:34:55 +0000 (10:34 -0800)]
Merge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Borislav Petkov:

 - Initialize the resend node of each IRQ descriptor, not only the first
   one

* tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Initialize resend_node hlist for all interrupt descriptors

15 months agoMerge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 Jan 2024 18:33:14 +0000 (10:33 -0800)]
Merge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Borislav Petkov:

 - Preserve the number of idle calls and sleep entries across CPU
   hotplug events in order to be able to compute correct averages

 - Limit the duration of the clocksource watchdog checking interval as
   too long intervals lead to wrongly marking the TSC as unstable

* tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/sched: Preserve number of idle sleeps across CPU hotplug events
  clocksource: Skip watchdog check for large watchdog intervals

15 months agoMerge tag 'x86_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 28 Jan 2024 17:45:11 +0000 (09:45 -0800)]
Merge tag 'x86_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Make sure 32-bit syscall registers are properly sign-extended

 - Add detection for AMD's Zen5 generation CPUs and Intel's Clearwater
   Forest CPU model number

 - Make a stub function export non-GPL because it is part of the
   paravirt alternatives and that can be used by non-GPL code

* tag 'x86_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/CPU/AMD: Add more models to X86_FEATURE_ZEN5
  x86/entry/ia32: Ensure s32 is sign extended to s64
  x86/cpu: Add model number for Intel Clearwater Forest processor
  x86/CPU/AMD: Add X86_FEATURE_ZEN5
  x86/paravirt: Make BUG_func() usable by non-GPL modules

15 months agoMerge tag 'fixes-2024-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt...
Linus Torvalds [Sun, 28 Jan 2024 17:41:39 +0000 (09:41 -0800)]
Merge tag 'fixes-2024-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull memblock fix from Mike Rapoport:
 "Fix crash when reserved memory is not added to memory.

  When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the initialization
  of reserved pages may cause access of NODE_DATA() with invalid nid and
  crash.

  Add a fall back to early_pfn_to_nid() in memmap_init_reserved_pages()
  to ensure a valid node id is always passed to init_reserved_page()"

* tag 'fixes-2024-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: fix crash when reserved memory is not added to memory

15 months agoMerge tag 'platform-drivers-x86-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 27 Jan 2024 17:48:55 +0000 (09:48 -0800)]
Merge tag 'platform-drivers-x86-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Hans de Goede:

 - WMI bus driver fixes

 - Second attempt (previously reverted) at P2SB PCI rescan deadlock fix

 - AMD PMF driver improvements

 - MAINTAINERS updates

 - Misc other small fixes and hw-id additions

* tag 'platform-drivers-x86-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: touchscreen_dmi: Add info for the TECLAST X16 Plus tablet
  platform/x86/intel/ifs: Call release_firmware() when handling errors.
  platform/x86/amd/pmf: Fix memory leak in amd_pmf_get_pb_data()
  platform/x86/amd/pmf: Get ambient light information from AMD SFH driver
  platform/x86/amd/pmf: Get Human presence information from AMD SFH driver
  platform/mellanox: mlxbf-pmc: Fix offset calculation for crspace events
  platform/mellanox: mlxbf-tmfifo: Drop Tx network packet when Tx TmFIFO is full
  MAINTAINERS: remove defunct acpi4asus project info from asus notebooks section
  MAINTAINERS: add Luke Jones as maintainer for asus notebooks
  MAINTAINERS: Remove Perry Yuan as DELL WMI HARDWARE PRIVACY SUPPORT maintainer
  platform/x86: silicom-platform: Add missing "Description:" for power_cycle sysfs attr
  platform/x86: intel-wmi-sbl-fw-update: Fix function name in error message
  platform/x86: p2sb: Use pci_resource_n() in p2sb_read_bar0()
  platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe
  platform/x86: intel-uncore-freq: Fix types in sysfs callbacks
  platform/x86: wmi: Fix wmi_dev_probe()
  platform/x86: wmi: Fix notify callback locking
  platform/x86: wmi: Decouple legacy WMI notify handlers from wmi_block_list
  platform/x86: wmi: Return immediately if an suitable WMI event is found
  platform/x86: wmi: Fix error handling in legacy WMI notify handler functions

15 months agoMerge tag 'loongarch-fixes-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 27 Jan 2024 17:44:40 +0000 (09:44 -0800)]
Merge tag 'loongarch-fixes-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:
 "Fix boot failure on machines with more than 8 nodes, and fix two build
  errors about KVM"

* tag 'loongarch-fixes-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Add returns to SIMD stubs
  LoongArch: KVM: Fix build due to API changes
  LoongArch/smp: Call rcutree_report_cpu_starting() at tlb_init()

15 months agoMerge tag 'xfs-6.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Linus Torvalds [Sat, 27 Jan 2024 17:17:01 +0000 (09:17 -0800)]
Merge tag 'xfs-6.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Chandan Babu:

 - Fix read only mounts when using fsopen mount API

* tag 'xfs-6.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: read only mounts with fsopen mount API are busted

15 months agoMerge tag 'bcachefs-2024-01-26' of https://evilpiepirate.org/git/bcachefs
Linus Torvalds [Sat, 27 Jan 2024 17:11:52 +0000 (09:11 -0800)]
Merge tag 'bcachefs-2024-01-26' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:

 - fix for REQ_OP_FLUSH usage; this fixes filesystems going read only
   with -EOPNOTSUPP from the block layer.

   (this really should have gone in with the block layer patch causing
   the -EOPNOTSUPP, or should have gone in before).

 - fix an allocation in non-sleepable context

 - fix one source of srcu lock latency, on devices with terrible discard
   latency

 - fix a reattach_inode() issue in fsck

* tag 'bcachefs-2024-01-26' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: __lookup_dirent() works in snapshot, not subvol
  bcachefs: discard path uses unlock_long()
  bcachefs: fix incorrect usage of REQ_OP_FLUSH
  bcachefs: Add gfp flags param to bch2_prt_task_backtrace()

15 months agoMerge tag '6.8-rc2-smb3-server-fixes' of git://git.samba.org/ksmbd
Linus Torvalds [Sat, 27 Jan 2024 17:06:56 +0000 (09:06 -0800)]
Merge tag '6.8-rc2-smb3-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

 - Fix netlink OOB

 - Minor kernel doc fix

* tag '6.8-rc2-smb3-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix global oob in ksmbd_nl_policy
  smb: Fix some kernel-doc comments

15 months agoMerge tag '6.8-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sat, 27 Jan 2024 17:02:42 +0000 (09:02 -0800)]
Merge tag '6.8-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Nine cifs/smb client fixes

   - Four network error fixes (three relating to replays of requests
     that need to be retried, and one fixing some places where we were
     returning the wrong rc up the stack on network errors)

   - Two multichannel fixes including locking fix and case where subset
     of channels need reconnect

   - netfs integration fixup: share remote i_size with netfslib

   - Two small cleanups (one for addressing a clang warning)"

* tag '6.8-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix stray unlock in cifs_chan_skip_or_disable
  cifs: set replay flag for retries of write command
  cifs: commands that are retried should have replay flag set
  cifs: helper function to check replayable error codes
  cifs: translate network errors on send to -ECONNABORTED
  cifs: cifs_pick_channel should try selecting active channels
  cifs: Share server EOF pos with netfslib
  smb: Work around Clang __bdos() type confusion
  smb: client: delete "true", "false" defines

15 months agomips: Call lose_fpu(0) before initializing fcr31 in mips_set_personality_nan
Xi Ruoyao [Fri, 26 Jan 2024 21:05:57 +0000 (05:05 +0800)]
mips: Call lose_fpu(0) before initializing fcr31 in mips_set_personality_nan

If we still own the FPU after initializing fcr31, when we are preempted
the dirty value in the FPU will be read out and stored into fcr31,
clobbering our setting.  This can cause an improper floating-point
environment after execve().  For example:

    zsh% cat measure.c
    #include <fenv.h>
    int main() { return fetestexcept(FE_INEXACT); }
    zsh% cc measure.c -o measure -lm
    zsh% echo $((1.0/3)) # raising FE_INEXACT
    0.33333333333333331
    zsh% while ./measure; do ; done
    (stopped in seconds)

Call lose_fpu(0) before setting fcr31 to prevent this.
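
A hedged sketch of the ordering (field names per the MIPS thread struct):

  /* Give up FPU ownership first so a later context switch cannot
   * write the stale hardware FCSR back over the new fcr31. */
  lose_fpu(0);
  current->thread.fpu.fcr31 = fcr31;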

Closes: https://lore.kernel.org/linux-mips/7a6aa1bbdbbe2e63ae96ff163fab0349f58f1b9e.camel@xry111.site/
Fixes: 9b26616c8d9d ("MIPS: Respect the ISA level in FCSR handling")
Cc: stable@vger.kernel.org
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
15 months agoMIPS: loongson64: set nid for reserved memblock region
Huang Pei [Sat, 27 Jan 2024 09:12:21 +0000 (17:12 +0800)]
MIPS: loongson64: set nid for reserved memblock region

Commit 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()") reveals
that reserved memblock regions have no valid node id set; set it properly,
since the loongson64 firmware makes it clear in the memory layout info.

This works around a boot failure on 3A1000+ introduced by commit
61167ad5fecd ("mm: pass nid to reserve_bootmem_region()") under
CONFIG_DEFERRED_STRUCT_PAGE_INIT.

Signed-off-by: Huang Pei <huangpei@loongson.cn>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
15 months agoRevert "MIPS: loongson64: set nid for reserved memblock region"
Thomas Bogendoerfer [Sat, 27 Jan 2024 10:07:49 +0000 (11:07 +0100)]
Revert "MIPS: loongson64: set nid for reserved memblock region"

This reverts commit ce7b1b97776ec0b068c4dd6b6dbb48ae09a23519.

Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>