]> xenbits.xensource.com Git - xen.git/log
xen.git
12 months agox86/MCE: switch some callback invocations to altcall
Jan Beulich [Mon, 22 Jan 2024 12:41:07 +0000 (13:41 +0100)]
x86/MCE: switch some callback invocations to altcall

While not performance critical, these hook invocations still would
better be converted: This way all pre-filled (and newly introduced)
struct mce_callback instances can become __initconst_cf_clobber, thus
allowing to eliminate another 9 ENDBR during the 2nd phase of
alternatives patching.

While this means registering callbacks a little earlier, doing so is
perhaps even advantageous, for having pointers be non-NULL earlier on.
Only one set of callbacks would only ever be registered anyway, and
neither of the respective initialization function can (subsequently)
fail.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 85ba4d050f9f3c4286164f21660ae88435b7e83c)

12 months agox86/MCE: separate BSP-only initialization
Jan Beulich [Mon, 22 Jan 2024 12:40:32 +0000 (13:40 +0100)]
x86/MCE: separate BSP-only initialization

Several function pointers are registered over and over again, when
setting them once on the BSP suffices. Arrange for this in the vendor
init functions and mark involved registration functions __init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 9f58616ddb1cc1870399de2202fafc7bf0d61694)

12 months agox86/PV: avoid indirect call for I/O emulation quirk hook
Jan Beulich [Mon, 22 Jan 2024 12:40:00 +0000 (13:40 +0100)]
x86/PV: avoid indirect call for I/O emulation quirk hook

This way ioemul_handle_proliant_quirk() won't need ENDBR anymore.

While touching this code, also
- arrange for it to not be built at all when !PV,
- add "const" to the last function parameter and bring the definition
  in sync with the declaration (for Misra).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1212af3e8c4d3a1350046d4fe0ca3b97b51e67de)

12 months agox86/MTRR: avoid several indirect calls
Jan Beulich [Mon, 22 Jan 2024 12:39:23 +0000 (13:39 +0100)]
x86/MTRR: avoid several indirect calls

The use of (supposedly) vendor-specific hooks is a relic from the days
when Xen was still possible to build as 32-bit binary. There's no
expectation that a new need for such an abstraction would arise. Convert
mttr_if to a mere boolean and all prior calls through it to direct ones,
thus allowing to eliminate 6 ENDBR from .text.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e9e0eb30d4d6565b411499ca826718b4b9acab68)

12 months agocore-parking: use alternative_call()
Jan Beulich [Mon, 22 Jan 2024 12:38:24 +0000 (13:38 +0100)]
core-parking: use alternative_call()

This way we can arrange for core_parking_{performance,power}()'s ENDBR
to also be zapped.

For the decision to be taken before the 2nd alternative patching pass,
the initcall needs to become a pre-SMP one, though.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1bc07ebcac3b1bb2a378732bc0f9a19940e76faf)

12 months agox86/HPET: avoid an indirect call
Jan Beulich [Wed, 17 Jan 2024 09:43:02 +0000 (10:43 +0100)]
x86/HPET: avoid an indirect call

When this code was written, indirect branches still weren't considered
much of a problem (besides being a little slower). Instead of a function
pointer, pass a boolean to _disable_pit_irq(), thus allowing to
eliminate two ENDBR (one of them in .text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 730d2637a8e5b98dc8e4e366179b4cedc496b3ad)

12 months agocpufreq: finish conversion to altcall
Jan Beulich [Wed, 17 Jan 2024 09:42:27 +0000 (10:42 +0100)]
cpufreq: finish conversion to altcall

Even functions used on infrequently executed paths want converting: This
way all pre-filled struct cpufreq_driver instances can become
__initconst_cf_clobber, thus allowing to eliminate another 15 ENDBR
during the 2nd phase of alternatives patching.

For acpi-cpufreq's optionally populated .get hook make sure alternatives
patching can actually see the pointer. See also the code comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 467ae515caee491e9b6ae1da8b9b98d094955822)

12 months agox86/APIC: finish genapic conversion to altcall
Jan Beulich [Wed, 17 Jan 2024 09:41:52 +0000 (10:41 +0100)]
x86/APIC: finish genapic conversion to altcall

While .probe() doesn't need fiddling with for being run only very early,
init_apic_ldr() wants converting too despite not being on a frequently
executed path: This way all pre-filled struct genapic instances can
become __initconst_cf_clobber, thus allowing to eliminate 15 more ENDBR
during the 2nd phase of alternatives patching.

While fiddling with section annotations here, also move "genapic" itself
to .data.ro_after_init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b1cc53753cba4c3253f2e1093a3a6a9a828314bf)

12 months agox86/spec-ctrl: Fix BTC/SRSO mitigations
Andrew Cooper [Tue, 26 Mar 2024 22:47:25 +0000 (22:47 +0000)]
x86/spec-ctrl: Fix BTC/SRSO mitigations

We were looking for SCF_entry_ibpb in the wrong variable in the top-of-stack
block, and xen_spec_ctrl won't have had bit 5 set because Xen doesn't
understand SPEC_CTRL_RRSBA_DIS_U yet.

This is XSA-455 / CVE-2024-31142.

Fixes: 53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
12 months agohypercall_xlat_continuation: Replace BUG_ON with domain_crash
Bjoern Doebel [Wed, 27 Mar 2024 18:30:55 +0000 (18:30 +0000)]
hypercall_xlat_continuation: Replace BUG_ON with domain_crash

Instead of crashing the host in case of unexpected hypercall parameters,
resort to only crashing the calling domain.

This is part of XSA-454 / CVE-2023-46842.

Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Bjoern Doebel <doebel@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 9926e692c4afc40bcd66f8416ff6a1e93ce402f6)

12 months agox86/HVM: clear upper halves of GPRs upon entry from 32-bit code
Jan Beulich [Wed, 27 Mar 2024 17:31:38 +0000 (17:31 +0000)]
x86/HVM: clear upper halves of GPRs upon entry from 32-bit code

Hypercalls in particular can be the subject of continuations, and logic
there checks updated state against incoming register values. If the
guest manufactured a suitable argument register with a non-zero upper
half before entering compatibility mode and issuing a hypercall from
there, checks in hypercall_xlat_continuation() might trip.

Since for HVM we want to also be sure to not hit a corner case in the
emulator, initiate the clipping right from the top of
{svm,vmx}_vmexit_handler(). Also rename the invoked function, as it no
longer does only invalidation of fields.

Note that architecturally the upper halves of registers are undefined
after a switch between compatibility and 64-bit mode (either direction).
Hence once having entered compatibility mode, the guest can't assume
the upper half of any register to retain its value.

This is part of XSA-454 / CVE-2023-46842.

Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 6a98383b0877bb66ebfe189da43bf81abe3d7909)

12 months agoxen/livepatch: Fix .altinstructions safety checks
Andrew Cooper [Thu, 13 Apr 2023 19:56:15 +0000 (20:56 +0100)]
xen/livepatch: Fix .altinstructions safety checks

The prior check has && vs || mixups, making it tautologically false and thus
providing no safety at all.  There are boundary errors too.

First start with a comment describing how the .altinstructions and
.altinstr_replacement sections interact, and perform suitable cross-checking.

Second, rewrite the alt_instr loop entirely from scratch.  Origin sites have
non-zero size, and must be fully contained within the livepatches .text
section(s).  Any non-zero sized replacements must be fully contained within
the .altinstr_replacement section.

Fixes: f8a10174e8b1 ("xsplice: Add support for alternatives")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
(cherry picked from commit e74360e4ba4a6b6827a44f8b1b22a0ec4311694a)

12 months agoarm/alternatives: Rename alt_instr fields which are used in common code
Andrew Cooper [Sun, 16 Apr 2023 00:10:43 +0000 (01:10 +0100)]
arm/alternatives: Rename alt_instr fields which are used in common code

Alternatives auditing for livepatches is currently broken.  To fix it, the
livepatch code needs to inspect more fields of alt_instr.

Rename ARM's fields to match x86's, because:

 * ARM already exposes alt_offset under the repl name via ALT_REPL_PTR().
 * "alt" is ambiguous in a structure entirely about alternatives already.
 * "repl", being the same width as orig leads to slightly neater code.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 418cf59c4e29451010d7efb3835b900690d19866)

13 months agotests/resource: Fix HVM guest in !SHADOW builds
Andrew Cooper [Tue, 2 Apr 2024 14:24:07 +0000 (16:24 +0200)]
tests/resource: Fix HVM guest in !SHADOW builds

Right now, test-resource always creates HVM Shadow guests.  But if Xen has
SHADOW compiled out, running the test yields:

  $./test-resource
  XENMEM_acquire_resource tests
  Test x86 PV
    Created d1
    Test grant table
  Test x86 PVH
    Skip: 95 - Operation not supported

and doesn't really test HVM guests, but doesn't fail either.

There's nothing paging-mode-specific about this test, so default to HAP if
possible and provide a more specific message if neither HAP or Shadow are
available.

As we've got physinfo to hand, also provide more specific message about the
absence of PV or HVM support.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0263dc9069ddb66335c72a159e09050b1600e56a
master date: 2024-03-01 20:14:19 +0000

13 months agox86/boot: Support the watchdog on newer AMD systems
Andrew Cooper [Tue, 2 Apr 2024 14:20:30 +0000 (16:20 +0200)]
x86/boot: Support the watchdog on newer AMD systems

The MSRs used by setup_k7_watchdog() are architectural in 64bit.  The Unit
Select (0x76, cycles not in halt state) isn't, but it hasn't changed in 25
years, making this a trend likely to continue.

Drop the family check.  If the Unit Select does happen to change meaning in
the future, check_nmi_watchdog() will still notice the watchdog not operating
as expected.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 131892e0dcc1265b621c2b7d844cb9e7c3a4404f
master date: 2024-03-19 18:29:37 +0000

13 months agox86/boot: Improve the boot watchdog determination of stuck cpus
Andrew Cooper [Tue, 2 Apr 2024 14:20:09 +0000 (16:20 +0200)]
x86/boot: Improve the boot watchdog determination of stuck cpus

Right now, check_nmi_watchdog() has two processing loops over all online CPUs
using prev_nmi_count as storage.

Use a cpumask_t instead (1/32th as much initdata) and have wait_for_nmis()
make the determination of whether it is stuck, rather than having both
functions needing to agree on how many ticks mean stuck.

More importantly though, it means we can use the standard cpumask
infrastructure, including turning this:

  (XEN) Brought up 512 CPUs
  (XEN) Testing NMI watchdog on all CPUs: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511} stuck

into the rather more manageable:

  (XEN) Brought up 512 CPUs
  (XEN) Testing NMI watchdog on all CPUs: {0-511} stuck

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 9e18f339830c828798aef465556d4029d83476a0
master date: 2024-03-19 18:29:37 +0000

13 months agox86/livepatch: Relax permissions on rodata too
Andrew Cooper [Tue, 2 Apr 2024 14:19:36 +0000 (16:19 +0200)]
x86/livepatch: Relax permissions on rodata too

This reinstates the capability to patch .rodata in load/unload hooks, which
was lost when we stopped using CR0.WP=0 to patch.

This turns out to be rather less of a large TODO than I thought at the time.

Fixes: 8676092a0f16 ("x86/livepatch: Fix livepatch application when CET is active")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: b083b1c393dc8961acf0959b1d2e0ad459985ae3
master date: 2024-03-07 14:24:42 +0000

13 months agoxen/virtual-region: Include rodata pointers
Andrew Cooper [Tue, 2 Apr 2024 14:19:11 +0000 (16:19 +0200)]
xen/virtual-region: Include rodata pointers

These are optional.  .init doesn't distinguish types of data like this, and
livepatches don't necesserily have any .rodata either.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: ef969144a425e39f5b214a875b5713d0ea8575fb
master date: 2024-03-07 14:24:42 +0000

13 months agoxen/virtual-region: Rename the start/end fields
Andrew Cooper [Tue, 2 Apr 2024 14:18:51 +0000 (16:18 +0200)]
xen/virtual-region: Rename the start/end fields

... to text_{start,end}.  We're about to introduce another start/end pair.

Despite it's name, struct virtual_region has always been a module-ish
description.  Call this out specifically.

As minor cleanup, replace ROUNDUP(x, PAGE_SIZE) with the more concise
PAGE_ALIGN() ahead of duplicating the example.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 989556c6f8ca080f5f202417af97d1188b9ba52a
master date: 2024-03-07 14:24:42 +0000

13 months agox86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max policies
Andrew Cooper [Tue, 2 Apr 2024 14:18:05 +0000 (16:18 +0200)]
x86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max policies

The block in recalculate_cpuid_policy() predates the proper split between
default and max policies, and was a "slightly max for a toolstack which knows
about it" capability.  It didn't get transformed properly in Xen 4.14.

Because Xen will accept a VM with HTT/CMP_LEGACY seen, they should be visible
in the max polices.  Keep the default policy matching host settings.

This manifested as an incorrectly-rejected migration across XenServer's Xen
4.13 -> 4.17 upgrade, as Xapi is slowly growing the logic to check a VM
against the target max policy.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: e2d8a652251660c3252d92b442e1a9c5d6e6a1e9
master date: 2024-03-01 20:14:19 +0000

13 months agox86/cpu-policy: Hide x2APIC from PV guests
Andrew Cooper [Tue, 2 Apr 2024 14:17:25 +0000 (16:17 +0200)]
x86/cpu-policy: Hide x2APIC from PV guests

PV guests can't write to MSR_APIC_BASE (in order to set EXTD), nor can they
access any of the x2APIC MSR range.  Therefore they mustn't see the x2APIC
CPUID bit saying that they can.

Right now, the host x2APIC flag filters into PV guests, meaning that PV guests
generally see x2APIC except on Zen1-and-older AMD systems.

Linux works around this by explicitly hiding the bit itself, and filtering
EXTD out of MSR_APIC_BASE reads.  NetBSD behaves more in the spirit of PV
guests, and entirely ignores the APIC when built as a PV guest.

Change the annotation from !A to !S.  This has a consequence of stripping it
out of both PV featuremasks.  However, as existing guests may have seen the
bit, set it back into the PV Max policy; a VM which saw the bit and is alive
enough to migrate will have ignored it one way or another.

Hiding x2APIC does change the contents of leaf 0xb, but as the information is
nonsense to begin with, this is likely an improvement on the status quo.

Xen's blind assumption that APIC_ID = vCPU_ID * 2 isn't interlinked with the
host's topology structure, where a PV guest may see real host values, and the
APIC_IDs are useless without an MADT to start with.  Dom0 is the only PV VM to
get an MADT but it's the host one, meaning the two sets of APIC_IDs are from
different address spaces.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5420aa165dfa5fe95dd84bb71cb96c15459935b1
master date: 2024-03-01 20:14:19 +0000

13 months agotools/oxenstored: Make Quota.t pure
Edwin Török [Wed, 31 Jan 2024 10:52:56 +0000 (10:52 +0000)]
tools/oxenstored: Make Quota.t pure

Now that we no longer have a hashtable inside we can make Quota.t pure, and
push the mutable update to its callers.  Store.t already had a mutable Quota.t
field.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
(cherry picked from commit 098d868e52ac0165b7f36e22b767ea70cef70054)

13 months agotools/oxenstored: Use Map instead of Hashtbl for quotas
Edwin Török [Wed, 31 Jan 2024 10:52:55 +0000 (10:52 +0000)]
tools/oxenstored: Use Map instead of Hashtbl for quotas

On a stress test running 1000 VMs flamegraphs have shown that
`oxenstored` spends a large amount of time in `Hashtbl.copy` and the GC.

Hashtable complexity:
 * read/write: O(1) average
 * copy: O(domains) -- copying the entire table

Map complexity:
 * read/write: O(log n) worst case
 * copy: O(1) -- a word copy

We always perform at least one 'copy' when processing each xenstore
packet (regardless whether it is a readonly operation or inside a
transaction or not), so the actual complexity per packet is:
  * Hashtbl: O(domains)
  * Map: O(log domains)

Maps are the clear winner, and a better fit for the immutable xenstore
tree.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
(cherry picked from commit b6cf604207fd0a04451a48f2ce6d05fb66c612ab)

13 months agox86/PoD: tie together P2M update and increment of entry count
Jan Beulich [Wed, 27 Mar 2024 11:29:33 +0000 (12:29 +0100)]
x86/PoD: tie together P2M update and increment of entry count

When not holding the PoD lock across the entire region covering P2M
update and stats update, the entry count - if to be incorrect at all -
should indicate too large a value in preference to a too small one, to
avoid functions bailing early when they find the count is zero. However,
instead of moving the increment ahead (and adjust back upon failure),
extend the PoD-locked region.

Fixes: 99af3cd40b6e ("x86/mm: Rework locking in the PoD layer")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>
master commit: cc950c49ae6a6690f7fc3041a1f43122c250d250
master date: 2024-03-21 09:48:10 +0100

13 months agox86/boot: Fix setup_apic_nmi_watchdog() to fail more cleanly
Andrew Cooper [Wed, 27 Mar 2024 11:29:11 +0000 (12:29 +0100)]
x86/boot: Fix setup_apic_nmi_watchdog() to fail more cleanly

Right now, if the user requests the watchdog on the command line,
setup_apic_nmi_watchdog() will blindly assume that setting up the watchdog
worked.  Reuse nmi_perfctr_msr to identify when the watchdog has been
configured.

Rearrange setup_p6_watchdog() to not set nmi_perfctr_msr until the sanity
checks are complete.  Turn setup_p4_watchdog() into a void function, matching
the others.

If the watchdog isn't set up, inform the user and override to NMI_NONE, which
will prevent check_nmi_watchdog() from claiming that all CPUs are stuck.

e.g.:

  (XEN) alt table ffff82d040697c38 -> ffff82d0406a97f0
  (XEN) Failed to configure NMI watchdog
  (XEN) Brought up 512 CPUs
  (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f658321374687c7339235e1ac643e0427acff717
master date: 2024-03-19 18:29:37 +0000

13 months agox86/mm: use block_lock_speculation() in _mm_write_lock()
Jan Beulich [Wed, 27 Mar 2024 11:28:24 +0000 (12:28 +0100)]
x86/mm: use block_lock_speculation() in _mm_write_lock()

I can only guess that using block_speculation() there was a leftover
from, earlier on, SPECULATIVE_HARDEN_LOCK depending on
SPECULATIVE_HARDEN_BRANCH.

Fixes: 197ecd838a2a ("locking: attempt to ensure lock wrappers are always inline")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 62018f08708a5ff6ef8fc8ff2aaaac46e5a60430
master date: 2024-03-18 13:53:37 +0100

13 months agotools: ipxe: update for fixing build with GCC12
Olaf Hering [Wed, 27 Mar 2024 11:27:03 +0000 (12:27 +0100)]
tools: ipxe: update for fixing build with GCC12

Use a snapshot which includes commit
b0ded89e917b48b73097d3b8b88dfa3afb264ed0 ("[build] Disable dangling
pointer checking for GCC"), which fixes build with gcc12.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 18a36b4a9b088875486cfe33a2d4a8ae7eb4ab47
master date: 2023-04-25 23:47:45 +0100

13 months agox86: protect conditional lock taking from speculative execution
Roger Pau Monné [Mon, 4 Mar 2024 15:24:21 +0000 (16:24 +0100)]
x86: protect conditional lock taking from speculative execution

Conditionally taken locks that use the pattern:

if ( lock )
    spin_lock(...);

Need an else branch in order to issue an speculation barrier in the else case,
just like it's done in case the lock needs to be acquired.

eval_nospec() could be used on the condition itself, but that would result in a
double barrier on the branch where the lock is taken.

Introduce a new pair of helpers, {gfn,spin}_lock_if() that can be used to
conditionally take a lock in a speculation safe way.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 03cf7ca23e0e876075954c558485b267b7d02406)

13 months agox86/mm: add speculation barriers to open coded locks
Roger Pau Monné [Mon, 4 Mar 2024 17:08:48 +0000 (18:08 +0100)]
x86/mm: add speculation barriers to open coded locks

Add a speculation barrier to the clearly identified open-coded lock taking
functions.

Note that the memory sharing page_lock() replacement (_page_lock()) is left
as-is, as the code is experimental and not security supported.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 42a572a38e22a97d86a4b648a22597628d5b42e4)

13 months agolocking: attempt to ensure lock wrappers are always inline
Roger Pau Monné [Mon, 4 Mar 2024 13:29:36 +0000 (14:29 +0100)]
locking: attempt to ensure lock wrappers are always inline

In order to prevent the locking speculation barriers from being inside of
`call`ed functions that could be speculatively bypassed.

While there also add an extra locking barrier to _mm_write_lock() in the branch
taken when the lock is already held.

Note some functions are switched to use the unsafe variants (without speculation
barrier) of the locking primitives, but a speculation barrier is always added
to the exposed public lock wrapping helper.  That's the case with
sched_spin_lock_double() or pcidevs_lock() for example.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 197ecd838a2aaf959a469df3696d4559c4f8b762)

13 months agopercpu-rwlock: introduce support for blocking speculation into critical regions
Roger Pau Monné [Tue, 13 Feb 2024 16:57:38 +0000 (17:57 +0100)]
percpu-rwlock: introduce support for blocking speculation into critical regions

Add direct calls to block_lock_speculation() where required in order to prevent
speculation into the lock protected critical regions.  Also convert
_percpu_read_lock() from inline to always_inline.

Note that _percpu_write_lock() has been modified the use the non speculation
safe of the locking primites, as a speculation is added unconditionally by the
calling wrapper.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f218daf6d3a3b847736d37c6a6b76031a0d08441)

13 months agorwlock: introduce support for blocking speculation into critical regions
Roger Pau Monné [Tue, 13 Feb 2024 15:08:52 +0000 (16:08 +0100)]
rwlock: introduce support for blocking speculation into critical regions

Introduce inline wrappers as required and add direct calls to
block_lock_speculation() in order to prevent speculation into the rwlock
protected critical regions.

Note the rwlock primitives are adjusted to use the non speculation safe variants
of the spinlock handlers, as a speculation barrier is added in the rwlock
calling wrappers.

trylock variants are protected by using lock_evaluate_nospec().

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a1fb15f61692b1fa9945fc51f55471ace49cdd59)

13 months agox86/spinlock: introduce support for blocking speculation into critical regions
Roger Pau Monné [Tue, 13 Feb 2024 12:08:05 +0000 (13:08 +0100)]
x86/spinlock: introduce support for blocking speculation into critical regions

Introduce a new Kconfig option to block speculation into lock protected
critical regions.  The Kconfig option is enabled by default, but the mitigation
won't be engaged unless it's explicitly enabled in the command line using
`spec-ctrl=lock-harden`.

Convert the spinlock acquire macros into always-inline functions, and introduce
a speculation barrier after the lock has been taken.  Note the speculation
barrier is not placed inside the implementation of the spin lock functions, as
to prevent speculation from falling through the call to the lock functions
resulting in the barrier also being skipped.

trylock variants are protected using a construct akin to the existing
evaluate_nospec().

This patch only implements the speculation barrier for x86.

Note spin locks are the only locking primitive taken care in this change,
further locking primitives will be adjusted by separate changes.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7ef0084418e188d05f338c3e028fbbe8b6924afa)

13 months agoxen: Swap order of actions in the FREE*() macros
Andrew Cooper [Fri, 2 Feb 2024 00:39:42 +0000 (00:39 +0000)]
xen: Swap order of actions in the FREE*() macros

Wherever possible, it is a good idea to NULL out the visible reference to an
object prior to freeing it.  The FREE*() macros already collect together both
parts, making it easy to adjust.

This has a marginal code generation improvement, as some of the calls to the
free() function can be tailcall optimised.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c4f427ec879e7c0df6d44d02561e8bee838a293e)

13 months agox86/paging: Delete update_cr3()'s do_locking parameter
Andrew Cooper [Wed, 20 Sep 2023 19:06:53 +0000 (20:06 +0100)]
x86/paging: Delete update_cr3()'s do_locking parameter

Nicola reports that the XSA-438 fix introduced new MISRA violations because of
some incidental tidying it tried to do.  The parameter is useless, so resolve
the MISRA regression by removing it.

hap_update_cr3() discards the parameter entirely, while sh_update_cr3() uses
it to distinguish internal and external callers and therefore whether the
paging lock should be taken.

However, we have paging_lock_recursive() for this purpose, which also avoids
the ability for the shadow internal callers to accidentally not hold the lock.

Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference")
Reported-by: Nicola Vetrini <nicola.vetrini@bugseng.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Henry Wang <Henry.Wang@arm.com>
(cherry picked from commit e71157d1ac2a7fbf413130663cf0a93ff9fbcf7e)

13 months agox86/spec-ctrl: Mitigation Register File Data Sampling
Andrew Cooper [Thu, 22 Jun 2023 22:32:19 +0000 (23:32 +0100)]
x86/spec-ctrl: Mitigation Register File Data Sampling

RFDS affects Atom cores, also branded E-cores, between the Goldmont and
Gracemont microarchitectures.  This includes Alder Lake and Raptor Lake hybrid
clien systems which have a mix of Gracemont and other types of cores.

Two new bits have been defined; RFDS_CLEAR to indicate VERW has more side
effets, and RFDS_NO to incidate that the system is unaffected.  Plenty of
unaffected CPUs won't be getting RFDS_NO retrofitted in microcode, so we
synthesise it.  Alder Lake and Raptor Lake Xeon-E's are unaffected due to
their platform configuration, and we must use the Hybrid CPUID bit to
distinguish them from their non-Xeon counterparts.

Like MD_CLEAR and FB_CLEAR, RFDS_CLEAR needs OR-ing across a resource pool, so
set it in the max policies and reflect the host setting in default.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fb5b6f6744713410c74cfc12b7176c108e3c9a31)

13 months agox86/spec-ctrl: VERW-handling adjustments
Andrew Cooper [Tue, 5 Mar 2024 19:33:37 +0000 (19:33 +0000)]
x86/spec-ctrl: VERW-handling adjustments

... before we add yet more complexity to this logic.  Mostly expanded
comments, but with three minor changes.

1) Introduce cpu_has_useful_md_clear to simplify later logic in this patch and
   future ones.

2) We only ever need SC_VERW_IDLE when SMT is active.  If SMT isn't active,
   then there's no re-partition of pipeline resources based on thread-idleness
   to worry about.

3) The logic to adjust HVM VERW based on L1D_FLUSH is unmaintainable and, as
   it turns out, wrong.  SKIP_L1DFL is just a hint bit, whereas opt_l1d_flush
   is the relevant decision of whether to use L1D_FLUSH based on
   susceptibility and user preference.

   Rewrite the logic so it can be followed, and incorporate the fact that when
   FB_CLEAR is visible, L1D_FLUSH isn't a safe substitution.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1eb91a8a06230b4b64228c9a380194f8cfe6c5e2)

13 months agox86/spec-ctrl: Rename VERW related options
Andrew Cooper [Mon, 12 Feb 2024 17:50:43 +0000 (17:50 +0000)]
x86/spec-ctrl: Rename VERW related options

VERW is going to be used for a 3rd purpose, and the existing nomenclature
didn't survive the Stale MMIO issues terribly well.

Rename the command line option from `md-clear=` to `verw=`.  This is more
consistent with other options which tend to be named based on what they're
doing, not which feature enumeration they use behind the scenes.  Retain
`md-clear=` as a deprecated alias.

Rename opt_md_clear_{pv,hvm} and opt_fb_clear_mmio to opt_verw_{pv,hvm,mmio},
which has a side effect of making spec_ctrl_init_domain() rather clearer to
follow.

No functional change.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f7603ca252e4226739eb3129a5290ee3da3f8ea4)

13 months agox86/spec-ctrl: Perform VERW flushing later in exit paths
Andrew Cooper [Sat, 27 Jan 2024 18:20:56 +0000 (18:20 +0000)]
x86/spec-ctrl: Perform VERW flushing later in exit paths

On parts vulnerable to RFDS, VERW's side effects are extended to scrub all
non-architectural entries in various Physical Register Files.  To remove all
of Xen's values, the VERW must be after popping the GPRs.

Rework SPEC_CTRL_COND_VERW to default to an CPUINFO_error_code %rsp position,
but with overrides for other contexts.  Identify that it clobbers eflags; this
is particularly relevant for the SYSRET path.

For the IST exit return to Xen, have the main SPEC_CTRL_EXIT_TO_XEN put a
shadow copy of spec_ctrl_flags, as GPRs can't be used at the point we want to
issue the VERW.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0a666cf2cd99df6faf3eebc81a1fc286e4eca4c7)

13 months agox86/vmx: Perform VERW flushing later in the VMExit path
Andrew Cooper [Fri, 23 Jun 2023 10:32:00 +0000 (11:32 +0100)]
x86/vmx: Perform VERW flushing later in the VMExit path

Broken out of the following patch because this change is subtle enough on its
own.  See it for the rational of why we're moving VERW.

As for how, extend the trick already used to hold one condition in
flags (RESUME vs LAUNCH) through the POPing of GPRs.

Move the MOV CR earlier.  Intel specify flags to be undefined across it.

Encode the two conditions we want using SF and PF.  See the code comment for
exactly how.

Leave a comment to explain the lack of any content around
SPEC_CTRL_EXIT_TO_VMX, but leave the block in place.  Sods law says if we
delete it, we'll need to reintroduce it.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 475fa20b7384464210f42bad7195f87bd6f1c63f)

13 months agox86: Resync intel-family.h from Linux
Andrew Cooper [Tue, 27 Feb 2024 16:07:39 +0000 (16:07 +0000)]
x86: Resync intel-family.h from Linux

From v6.8-rc6

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 195e75371b13c4f7ecdf7b5c50aed0d02f2d7ce8)

13 months agox86/entry: Introduce EFRAME_* constants
Andrew Cooper [Sat, 27 Jan 2024 17:52:09 +0000 (17:52 +0000)]
x86/entry: Introduce EFRAME_* constants

restore_all_guest() does a lot of manipulation of the stack after popping the
GPRs, and uses raw %rsp displacements to do so.  Also, almost all entrypaths
use raw %rsp displacements prior to pushing GPRs.

Provide better mnemonics, to aid readability and reduce the chance of errors
when editing.

No functional change.  The resulting binary is identical.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37541208f119a9c552c6c6c3246ea61be0d44035)

13 months agox86/mm: fix detection of last L1 entry in modify_xen_mappings_lite()
Roger Pau Monné [Tue, 12 Mar 2024 11:08:59 +0000 (12:08 +0100)]
x86/mm: fix detection of last L1 entry in modify_xen_mappings_lite()

The current logic to detect when to switch to the next L1 table is incorrectly
using l2_table_offset() in order to notice when the last entry on the current
L1 table has been reached.

It should instead use l1_table_offset() to check whether the index has wrapped
to point to the first entry, and so the next L1 table should be used.

Fixes: 8676092a0f16 ('x86/livepatch: Fix livepatch application when CET is active')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7c81558208de7858251b62f168a449be84305595
master date: 2024-03-11 11:09:42 +0000

13 months agohvmloader/PCI: skip huge BARs in certain calculations
Jan Beulich [Tue, 12 Mar 2024 11:08:48 +0000 (12:08 +0100)]
hvmloader/PCI: skip huge BARs in certain calculations

BARs of size 2Gb and up can't possibly fit below 4Gb: Both the bottom of
the lower 2Gb range and the top of the higher 2Gb range have special
purpose. Don't even have them influence whether to (perhaps) relocate
low RAM.

Reported-by: Neowutran <xen@neowutran.ovh>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 57acad12a09ffa490e870ebe17596aad858f0191
master date: 2024-03-06 10:19:29 +0100

14 months agox86/cpu-policy: Allow for levelling of VERW side effects
Andrew Cooper [Tue, 5 Mar 2024 11:01:22 +0000 (12:01 +0100)]
x86/cpu-policy: Allow for levelling of VERW side effects

MD_CLEAR and FB_CLEAR need OR-ing across a migrate pool.  Allow this, by
having them unconditinally set in max, with the host values reflected in
default.  Annotate the bits as having special properies.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: de17162cafd27f2865a3102a2ec0f386a02ed03d
master date: 2024-03-01 20:14:19 +0000

14 months agox86/altcall: always use a temporary parameter stashing variable
Roger Pau Monné [Tue, 5 Mar 2024 11:00:47 +0000 (12:00 +0100)]
x86/altcall: always use a temporary parameter stashing variable

The usage in ALT_CALL_ARG() on clang of:

register union {
    typeof(arg) e;
    const unsigned long r;
} ...

When `arg` is the first argument to alternative_{,v}call() and
const_vlapic_vcpu() is used results in clang 3.5.0 complaining with:

arch/x86/hvm/vlapic.c:141:47: error: non-const static data member must be initialized out of line
         alternative_call(hvm_funcs.test_pir, const_vlapic_vcpu(vlapic), vec) )

Workaround this by pulling `arg1` into a local variable, like it's done for
further arguments (arg2, arg3...)

Originally arg1 wasn't pulled into a variable because for the a1_ register
local variable the possible clobbering as a result of operators on other
variables don't matter:

https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables

Note clang version 3.8.1 seems to already be fixed and don't require the
workaround, but since it's harmless do it uniformly everywhere.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Fixes: 2ce562b2a413 ('x86/altcall: use a union as register type for function parameters on clang')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: c20850540ad6a32f4fc17bde9b01c92b0df18bf0
master date: 2024-02-29 08:21:49 +0100

14 months agolibxl: Fix segfault in device_model_spawn_outcome
Jason Andryuk [Tue, 5 Mar 2024 11:00:30 +0000 (12:00 +0100)]
libxl: Fix segfault in device_model_spawn_outcome

libxl__spawn_qdisk_backend() explicitly sets guest_config to NULL when
starting QEMU (the usual launch through libxl__spawn_local_dm() has a
guest_config though).

Bail early on a NULL guest_config/d_config.  This skips the QMP queries
for chardevs and VNC, but this xenpv QEMU instance isn't expected to
provide those - only qdisk (or 9pfs backends after an upcoming change).

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: d4f3d35f043f6ef29393166b0dd131c8102cf255
master date: 2024-02-29 08:18:38 +0100

14 months agoxen/livepatch: properly build the noapply and norevert tests
Roger Pau Monné [Tue, 5 Mar 2024 10:59:51 +0000 (11:59 +0100)]
xen/livepatch: properly build the noapply and norevert tests

It seems the build variables for those tests where copy-pasted from
xen_action_hooks_marker-objs and not adjusted to use the correct source files.

Fixes: 6047104c3ccc ('livepatch: Add per-function applied/reverted state tracking marker')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: e579677095782c7dec792597ba8b037b7d716b32
master date: 2024-02-28 16:57:25 +0000

14 months agoxen/livepatch: fix norevert test attempt to open-code revert
Roger Pau Monné [Tue, 5 Mar 2024 10:59:43 +0000 (11:59 +0100)]
xen/livepatch: fix norevert test attempt to open-code revert

The purpose of the norevert test is to install a dummy handler that replaces
the internal Xen revert code, and then perform the revert in the post-revert
hook.  For that purpose the usage of the previous common_livepatch_revert() is
not enough, as that just reverts specific functions, but not the whole state of
the payload.

Remove both common_livepatch_{apply,revert}() and instead expose
revert_payload{,_tail}() in order to perform the patch revert from the
post-revert hook.

Fixes: 6047104c3ccc ('livepatch: Add per-function applied/reverted state tracking marker')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: cdae267ce10d04d71d1687b5701ff2911a96b6dc
master date: 2024-02-28 16:57:25 +0000

14 months agoxen/livepatch: search for symbols in all loaded payloads
Roger Pau Monné [Tue, 5 Mar 2024 10:59:35 +0000 (11:59 +0100)]
xen/livepatch: search for symbols in all loaded payloads

When checking if an address belongs to a patch, or when resolving a symbol,
take into account all loaded livepatch payloads, even if not applied.

This is required in order for the pre-apply and post-revert hooks to work
properly, or else Xen won't detect the instruction pointer belonging to those
hooks as being part of the currently active text.

Move the RCU handling to be used for payload_list instead of applied_list, as
now the calls from trap code will iterate over the payload_list.

Fixes: 8313c864fa95 ('livepatch: Implement pre-|post- apply|revert hooks')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: d2daa40fb3ddb8f83e238e57854bd878924cde90
master date: 2024-02-28 16:57:25 +0000

14 months agoxen/livepatch: register livepatch regions when loaded
Roger Pau Monné [Tue, 5 Mar 2024 10:59:26 +0000 (11:59 +0100)]
xen/livepatch: register livepatch regions when loaded

Currently livepatch regions are registered as virtual regions only after the
livepatch has been applied.

This can lead to issues when using the pre-apply or post-revert hooks, as at
that point the livepatch is not in the virtual regions list.  If a livepatch
pre-apply hook contains a WARN() it would trigger an hypervisor crash, as the
code to handle the bug frame won't be able to find the instruction pointer that
triggered the #UD in any of the registered virtual regions, and hence crash.

Fix this by adding the livepatch payloads as virtual regions as soon as loaded,
and only remove them once the payload is unloaded.  This requires some changes
to the virtual regions code, as the removal of the virtual regions is no longer
done in stop machine context, and hence an RCU barrier is added in order to
make sure there are no users of the virtual region after it's been removed from
the list.

Fixes: 8313c864fa95 ('livepatch: Implement pre-|post- apply|revert hooks')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: a57b4074ab39bee78b6c116277f0a9963bd8e687
master date: 2024-02-28 16:57:25 +0000

14 months agox86/spec: do not print thunk option selection if not built-in
Roger Pau Monné [Tue, 5 Mar 2024 10:58:36 +0000 (11:58 +0100)]
x86/spec: do not print thunk option selection if not built-in

Since the thunk built-in enable is printed as part of the "Compiled-in
support:" line, avoid printing anything in "Xen settings:" if the thunk is
disabled at build time.

Note the BTI-Thunk option printing is also adjusted to print a colon in the
same way the other options on the line do.

Requested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 576528a2a742069af203e90c613c5c93e23c9755
master date: 2024-02-27 14:58:40 +0100

14 months agox86/spec: fix INDIRECT_THUNK option to only be set when build-enabled
Roger Pau Monné [Tue, 5 Mar 2024 10:58:04 +0000 (11:58 +0100)]
x86/spec: fix INDIRECT_THUNK option to only be set when build-enabled

Attempt to provide a more helpful error message when the user attempts to set
spec-ctrl=bti-thunk option but the support is build-time disabled.

While there also adjust the command line documentation to mention
CONFIG_INDIRECT_THUNK instead of INDIRECT_THUNK.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8441fa806a3b778867867cd0159fa1722e90397e
master date: 2024-02-27 14:58:20 +0100

14 months agox86/spec: print the built-in SPECULATIVE_HARDEN_* options
Roger Pau Monné [Tue, 5 Mar 2024 10:57:41 +0000 (11:57 +0100)]
x86/spec: print the built-in SPECULATIVE_HARDEN_* options

Just like it's done for INDIRECT_THUNK and SHADOW_PAGING.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6e9507f7d51fe49df8bc70f83e49ce06c92e4e54
master date: 2024-02-27 14:57:52 +0100

14 months agoxen/sched: Fix UB shift in compat_set_timer_op()
Andrew Cooper [Tue, 5 Mar 2024 10:57:02 +0000 (11:57 +0100)]
xen/sched: Fix UB shift in compat_set_timer_op()

Tamas reported this UBSAN failure from fuzzing:

  (XEN) ================================================================================
  (XEN) UBSAN: Undefined behaviour in common/sched/compat.c:48:37
  (XEN) left shift of negative value -2147425536
  (XEN) ----[ Xen-4.19-unstable  x86_64  debug=y ubsan=y  Not tainted ]----
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d040307c1c>] R ubsan.c#ubsan_epilogue+0xa/0xd9
  (XEN)    [<ffff82d040308afb>] F __ubsan_handle_shift_out_of_bounds+0x11a/0x1c5
  (XEN)    [<ffff82d040307758>] F compat_set_timer_op+0x41/0x43
  (XEN)    [<ffff82d04040e4cc>] F hvm_do_multicall_call+0x77f/0xa75
  (XEN)    [<ffff82d040519462>] F arch_do_multicall_call+0xec/0xf1
  (XEN)    [<ffff82d040261567>] F do_multicall+0x1dc/0xde3
  (XEN)    [<ffff82d04040d2b3>] F hvm_hypercall+0xa00/0x149a
  (XEN)    [<ffff82d0403cd072>] F vmx_vmexit_handler+0x1596/0x279c
  (XEN)    [<ffff82d0403d909b>] F vmx_asm_vmexit_handler+0xdb/0x200

Left-shifting any negative value is strictly undefined behaviour in C, and
the two parameters here come straight from the guest.

The fuzzer happened to choose lo 0xf, hi 0x8000e300.

Switch everything to be unsigned values, making the shift well defined.

As GCC documents:

  As an extension to the C language, GCC does not use the latitude given in
  C99 and C11 only to treat certain aspects of signed '<<' as undefined.
  However, -fsanitize=shift (and -fsanitize=undefined) will diagnose such
  cases.

this was deemed not to need an XSA.

Note: The unsigned -> signed conversion for do_set_timer_op()'s s_time_t
parameter is also well defined.  C makes it implementation defined, and GCC
defines it as reduction modulo 2^N to be within range of the new type.

Fixes: 2942f45e09fb ("Enable compatibility mode operation for HYPERVISOR_sched_op and HYPERVISOR_set_timer_op.")
Reported-by: Tamas K Lengyel <tamas@tklengyel.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ae6d4fd876765e6d623eec67d14f5d0464be09cb
master date: 2024-02-01 19:52:44 +0000

14 months agox86/HVM: hide SVM/VMX when their enabling is prohibited by firmware
Jan Beulich [Tue, 5 Mar 2024 10:56:31 +0000 (11:56 +0100)]
x86/HVM: hide SVM/VMX when their enabling is prohibited by firmware

... or we fail to enable the functionality on the BSP for other reasons.
The only place where hardware announcing the feature is recorded is the
raw CPU policy/featureset.

Inspired by https://lore.kernel.org/all/20230921114940.957141-1-pbonzini@redhat.com/.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0b5f149338e35a795bf609ce584640b0977f9e6c
master date: 2024-01-09 14:06:34 +0100

14 months agoxen/arm: Fix UBSAN failure in start_xen()
Michal Orzel [Thu, 8 Feb 2024 10:43:39 +0000 (11:43 +0100)]
xen/arm: Fix UBSAN failure in start_xen()

When running Xen on arm32, in scenario where Xen is loaded at an address
such as boot_phys_offset >= 2GB, UBSAN reports the following:

(XEN) UBSAN: Undefined behaviour in arch/arm/setup.c:739:58
(XEN) pointer operation underflowed 00200000 to 86800000
(XEN) Xen WARN at common/ubsan/ubsan.c:172
(XEN) ----[ Xen-4.19-unstable  arm32  debug=y ubsan=y  Not tainted ]----
...
(XEN) Xen call trace:
(XEN)    [<0031b4c0>] ubsan.c#ubsan_epilogue+0x18/0xf0 (PC)
(XEN)    [<0031d134>] __ubsan_handle_pointer_overflow+0xb8/0xd4 (LR)
(XEN)    [<0031d134>] __ubsan_handle_pointer_overflow+0xb8/0xd4
(XEN)    [<004d15a8>] start_xen+0xe0/0xbe0
(XEN)    [<0020007c>] head.o#primary_switched+0x4/0x30

The failure is reported for the following line:
(paddr_t)(uintptr_t)(_start + boot_phys_offset)

This occurs because the compiler treats (ptr + size) with size bigger than
PTRDIFF_MAX as undefined behavior. To address this, switch to macro
virt_to_maddr(), given the future plans to eliminate boot_phys_offset.

Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Tested-by: Luca Fancellu <luca.fancellu@arm.com>
Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit e11f5766503c0ff074b4e0f888bbfc931518a169)

14 months agox86: account for shadow stack in exception-from-stub recovery
Jan Beulich [Tue, 27 Feb 2024 13:12:11 +0000 (14:12 +0100)]
x86: account for shadow stack in exception-from-stub recovery

Dealing with exceptions raised from within emulation stubs involves
discarding return address (replaced by exception related information).
Such discarding of course also requires removing the corresponding entry
from the shadow stack.

Also amend the comment in fixup_exception_return(), to further clarify
why use of ptr[1] can't be an out-of-bounds access.

This is CVE-2023-46841 / XSA-451.

Fixes: 209fb9919b50 ("x86/extable: Adjust extable handling to be shadow stack compatible")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 91f5f7a9154919a765c3933521760acffeddbf28
master date: 2024-02-27 13:49:22 +0100

14 months agox86/spec: fix BRANCH_HARDEN option to only be set when build-enabled
Roger Pau Monné [Tue, 27 Feb 2024 13:11:40 +0000 (14:11 +0100)]
x86/spec: fix BRANCH_HARDEN option to only be set when build-enabled

The current logic to handle the BRANCH_HARDEN option will report it as enabled
even when build-time disabled. Fix this by only allowing the option to be set
when support for it is built into Xen.

Fixes: 2d6f36daa086 ('x86/nospec: Introduce CONFIG_SPECULATIVE_HARDEN_BRANCH')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 60e00f77a5cc671d30c5ef3318f5b8e9b74e4aa3
master date: 2024-02-26 16:06:42 +0100

14 months agox86/altcall: use a union as register type for function parameters on clang
Roger Pau Monné [Tue, 27 Feb 2024 13:11:04 +0000 (14:11 +0100)]
x86/altcall: use a union as register type for function parameters on clang

The current code for alternative calls uses the caller parameter types as the
types for the register variables that serve as function parameters:

uint8_t foo;
[...]
alternative_call(myfunc, foo);

Would expand roughly into:

register unint8_t a1_ asm("rdi") = foo;
register unsigned long a2_ asm("rsi");
[...]
asm volatile ("call *%c[addr](%%rip)"...);

However with -O2 clang will generate incorrect code, given the following
example:

unsigned int func(uint8_t t)
{
    return t;
}

static void bar(uint8_t b)
{
    int ret_;
    register uint8_t di asm("rdi") = b;
    register unsigned long si asm("rsi");
    register unsigned long dx asm("rdx");
    register unsigned long cx asm("rcx");
    register unsigned long r8 asm("r8");
    register unsigned long r9 asm("r9");
    register unsigned long r10 asm("r10");
    register unsigned long r11 asm("r11");

    asm volatile ( "call %c[addr]"
                   : "+r" (di), "=r" (si), "=r" (dx),
                     "=r" (cx), "=r" (r8), "=r" (r9),
                     "=r" (r10), "=r" (r11), "=a" (ret_)
                   : [addr] "i" (&(func)), "g" (func)
                   : "memory" );
}

void foo(unsigned int a)
{
    bar(a);
}

Clang generates the following assembly code:

func:                                   # @func
        movl    %edi, %eax
        retq
foo:                                    # @foo
        callq   func
        retq

Note the truncation of the unsigned int parameter 'a' of foo() to uint8_t when
passed into bar() is lost.  clang doesn't zero extend the parameters in the
callee when required, as the psABI mandates.

The above can be worked around by using a union when defining the register
variables, so that `di` becomes:

register union {
    uint8_t e;
    unsigned long r;
} di asm("rdi") = { .e = b };

Which results in following code generated for `foo()`:

foo:                                    # @foo
        movzbl  %dil, %edi
        callq   func
        retq

So the truncation is not longer lost.  Apply such workaround only when built
with clang.

Reported-by: Matthew Grooms <mgrooms@shrew.net>
Link: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277200
Link: https://github.com/llvm/llvm-project/issues/12579
Link: https://github.com/llvm/llvm-project/issues/82598
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 2ce562b2a413cbdb2e1128989ed1722290a27c4e
master date: 2024-02-26 10:18:01 +0100

14 months agoxen/cmdline: fix printf format specifier in no_config_param()
Roger Pau Monné [Tue, 27 Feb 2024 13:10:39 +0000 (14:10 +0100)]
xen/cmdline: fix printf format specifier in no_config_param()

'*' sets the width field, which is the minimum number of characters to output,
but what we want in no_config_param() is the precision instead, which is '.*'
as it imposes a maximum limit on the output.

Fixes: 68d757df8dd2 ('x86/pv: Options to disable and/or compile out 32bit PV support')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ef101f525173cf51dc70f4c77862f6f10a8ddccf
master date: 2024-02-26 10:17:40 +0100

14 months agoxen/livepatch: fix norevert test hook setup typo
Roger Pau Monné [Tue, 27 Feb 2024 13:10:24 +0000 (14:10 +0100)]
xen/livepatch: fix norevert test hook setup typo

The test code has a typo in using LIVEPATCH_APPLY_HOOK() instead of
LIVEPATCH_REVERT_HOOK().

Fixes: 6047104c3ccc ('livepatch: Add per-function applied/reverted state tracking marker')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: f0622dd4fd6ae6ddb523a45d89ed9b8f3a9a8f36
master date: 2024-02-26 10:13:46 +0100

14 months agox86emul: add missing EVEX.R' checks
Jan Beulich [Tue, 27 Feb 2024 13:09:55 +0000 (14:09 +0100)]
x86emul: add missing EVEX.R' checks

EVEX.R' is not ignored in 64-bit code when encoding a GPR or mask
register. While for mask registers suitable checks are in place (there
also covering EVEX.R), they were missing for the few cases where in
EVEX-encoded instructions ModR/M.reg encodes a GPR. While for VPEXTRW
the bit is replaced before an emulation stub is invoked, for
VCVT{,T}{S,D,H}2{,U}SI this actually would have led to #UD from inside
an emulation stub, in turn raising #UD to the guest, but accompanied by
log messages indicating something's wrong in Xen nevertheless.

Fixes: 001bd91ad864 ("x86emul: support AVX512{F,BW,DQ} extract insns")
Fixes: baf4a376f550 ("x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cb319824bfa8d3c9ea0410cc71daaedc3e11aa2a
master date: 2024-02-22 11:54:07 +0100

14 months agobuild: make sure build fails when running kconfig fails
Jan Beulich [Tue, 27 Feb 2024 13:09:37 +0000 (14:09 +0100)]
build: make sure build fails when running kconfig fails

Because of using "-include", failure to (re)build auto.conf (with
auto.conf.cmd produced as a secondary target) won't stop make from
continuing the build. Arrange for it being possible to drop the - from
Rules.mk, requiring that the include be skipped for tools-only targets.
Note that relying on the inclusion in those cases wouldn't be correct
anyway, as it might be a stale file (yet to be rebuilt) which would be
included, while during initial build, the file would be absent
altogether.

Fixes: 8d4c17a90b0a ("xen/build: silence make warnings about missing auto.conf*")
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: d34e5fa2e8db19f23081f46a3e710bb122130691
master date: 2024-02-22 11:52:47 +0100

14 months agolibxl: Disable relocating memory for qemu-xen in stubdomain too
Marek Marczykowski-Górecki [Tue, 27 Feb 2024 13:09:19 +0000 (14:09 +0100)]
libxl: Disable relocating memory for qemu-xen in stubdomain too

According to comments (and experiments) qemu-xen cannot handle memory
reolcation done by hvmloader. The code was already disabled when running
qemu-xen in dom0 (see libxl__spawn_local_dm()), but it was missed when
adding qemu-xen support to stubdomain. Adjust libxl__spawn_stub_dm() to
be consistent in this regard.

Reported-by: Neowutran <xen@neowutran.ovh>
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 97883aa269f6745a6ded232be3a855abb1297e0d
master date: 2024-02-22 11:48:22 +0100

14 months agobuild: Replace `which` with `command -v`
Anthony PERARD [Tue, 27 Feb 2024 13:08:50 +0000 (14:08 +0100)]
build: Replace `which` with `command -v`

The `which` command is not standard, may not exist on the build host,
or may not behave as expected by the build system. It is recommended
to use `command -v` to find out if a command exist and have its path,
and it's part of a POSIX shell standard (at least, it seems to be
mandatory since IEEE Std 1003.1-2008, but was optional before).

Fixes: c8a8645f1efe ("xen/build: Automatically locate a suitable python interpreter")
Fixes: 3b47bcdb6d38 ("xen/build: Use a distro version of figlet")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f93629b18b528a5ab1b1092949c5420069c7226c
master date: 2024-02-19 12:45:48 +0100

14 months agox86/HVM: tidy state on hvmemul_map_linear_addr()'s error path
Jan Beulich [Tue, 27 Feb 2024 13:08:20 +0000 (14:08 +0100)]
x86/HVM: tidy state on hvmemul_map_linear_addr()'s error path

While in the vast majority of cases failure of the function will not
be followed by re-invocation with the same emulation context, a few
very specific insns - involving multiple independent writes, e.g. ENTER
and PUSHA - exist where this can happen. Since failure of the function
only signals to the caller that it ought to try an MMIO write instead,
such failure also cannot be assumed to result in wholesale failure of
emulation of the current insn. Instead we have to maintain internal
state such that another invocation of the function with the same
emulation context remains possible. To achieve that we need to reset MFN
slots after putting page references on the error path.

Note that all of this affects debugging code only, in causing an
assertion to trigger (higher up in the function). There's otherwise no
misbehavior - such a "leftover" slot would simply be overwritten by new
contents in a release build.

Also extend the related unmap() assertion, to further check for MFN 0.

Fixes: 8cbd4fb0b7ea ("x86/hvm: implement hvmemul_write() using real mappings")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Paul Durrant <paul@xen.org>
master commit: e72f951df407bc3be82faac64d8733a270036ba1
master date: 2024-02-13 09:36:14 +0100

14 months agox86/hvm: Fix fast singlestep state persistence
Petr Beneš [Tue, 27 Feb 2024 13:07:45 +0000 (14:07 +0100)]
x86/hvm: Fix fast singlestep state persistence

This patch addresses an issue where the fast singlestep setting would persist
despite xc_domain_debug_control being called with XEN_DOMCTL_DEBUG_OP_SINGLE_STEP_OFF.
Specifically, if fast singlestep was enabled in a VMI session and that session
stopped before the MTF trap occurred, the fast singlestep setting remained
active even though MTF itself was disabled.  This led to a situation where, upon
starting a new VMI session, the first event to trigger an EPT violation would
cause the corresponding EPT event callback to be skipped due to the lingering
fast singlestep setting.

The fix ensures that the fast singlestep setting is properly reset when
disabling single step debugging operations.

Signed-off-by: Petr Beneš <w1benny@gmail.com>
Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 897def94b56175ce569673a05909d2f223e1e749
master date: 2024-02-12 09:37:58 +0100

14 months agoamd-vi: fix IVMD memory type checks
Roger Pau Monné [Tue, 27 Feb 2024 13:07:12 +0000 (14:07 +0100)]
amd-vi: fix IVMD memory type checks

The current code that parses the IVMD blocks is relaxed with regard to the
restriction that such unity regions should always fall into memory ranges
marked as reserved in the memory map.

However the type checks for the IVMD addresses are inverted, and as a result
IVMD ranges falling into RAM areas are accepted.  Note that having such ranges
in the first place is a firmware bug, as IVMD should always fall into reserved
ranges.

Fixes: ed6c77ebf0c1 ('AMD/IOMMU: check / convert IVMD ranges for being / to be reserved')
Reported-by: Ox <oxjo@proton.me>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: oxjo <oxjo@proton.me>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 83afa313583019d9f159c122cecf867735d27ec5
master date: 2024-02-06 11:56:13 +0100

14 months agotools/xentop: fix sorting bug for some columns
Cyril Rébert (zithro) [Tue, 27 Feb 2024 13:06:53 +0000 (14:06 +0100)]
tools/xentop: fix sorting bug for some columns

Sort doesn't work on columns VBD_OO, VBD_RD, VBD_WR and VBD_RSECT.
Fix by adjusting variables names in compare functions.
Bug fix only. No functional change.

Fixes: 91c3e3dc91d6 ("tools/xentop: Display '-' when stats are not available.")
Signed-off-by: Cyril Rébert (zithro) <slack@rabbit.lu>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 29f17d837421f13c0e0010802de1b2d51d2ded4a
master date: 2024-02-05 17:58:23 +0000

15 months agox86/p2m-pt: fix off by one in entry check assert
Roger Pau Monné [Fri, 2 Feb 2024 07:04:33 +0000 (08:04 +0100)]
x86/p2m-pt: fix off by one in entry check assert

The MMIO RO rangeset overlap check is bogus: the rangeset is inclusive so the
passed end mfn should be the last mfn to be mapped (not last + 1).

Fixes: 6fa1755644d0 ('amd/npt/shadow: replace assert that prevents creating 2M/1G MMIO entries')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>
master commit: 610775d0dd61c1bd2f4720c755986098e6a5bafd
master date: 2024-01-25 16:09:04 +0100

15 months agolib{fdt,elf}: move lib{fdt,elf}-temp.o and their deps to $(targets)
Michal Orzel [Fri, 2 Feb 2024 07:04:07 +0000 (08:04 +0100)]
lib{fdt,elf}: move lib{fdt,elf}-temp.o and their deps to $(targets)

At the moment, trying to run xencov read/reset (calling SYSCTL_coverage_op
under the hood) results in a crash. This is due to a profiler trying to
access data in the .init.* sections (libfdt for Arm and libelf for x86)
that are stripped after boot. Normally, the build system compiles any
*.init.o file without COV_FLAGS. However, these two libraries are
handled differently as sections will be renamed to init after linking.

To override COV_FLAGS to empty for these libraries, lib{fdt,elf}.o were
added to nocov-y. This worked until e321576f4047 ("xen/build: start using
if_changed") that added lib{fdt,elf}-temp.o and their deps to extra-y.
This way, even though these objects appear as prerequisites of
lib{fdt,elf}.o and the settings should propagate to them, make can also
build them as a prerequisite of __build, in which case COV_FLAGS would
still have the unwanted flags. Fix it by switching to $(targets) instead.

Also, for libfdt, append libfdt.o to nocov-y only if CONFIG_OVERLAY_DTB
is not set. Otherwise, there is no section renaming and we should be able
to run the coverage.

Fixes: e321576f4047 ("xen/build: start using if_changed")
Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 79519fcfa0605bbf19d8c02b979af3a2c8afed68
master date: 2024-01-23 12:02:44 +0100

15 months agox86/vmx: Disallow the use of inactivity states
Andrew Cooper [Fri, 2 Feb 2024 07:03:26 +0000 (08:03 +0100)]
x86/vmx: Disallow the use of inactivity states

Right now, vvmx will blindly copy L12's ACTIVITY_STATE into the L02 VMCS and
enter the vCPU.  Luckily for us, nested-virt is explicitly unsupported for
security bugs.

The inactivity states are HLT, SHUTDOWN and WAIT-FOR-SIPI, and as noted by the
SDM in Vol3 27.7 "Special Features of VM Entry":

  If VM entry ends with the logical processor in an inactive activity state,
  the VM entry generates any special bus cycle that is normally generated when
  that activity state is entered from the active state.

Also,

  Some activity states unconditionally block certain events.

I.e. A VMEntry with ACTIVITY=SHUTDOWN will initiate a platform reset, while a
VMEntry with ACTIVITY=WAIT-FOR-SIPI will really block everything other than
SIPIs.

Both of these activity states are for the TXT ACM to use, not for regular
hypervisors, and Xen doesn't support dropping the HLT intercept either.

There are two paths in Xen which operate on ACTIVITY_STATE.

1) The vmx_{get,set}_nonreg_state() helpers for VM-Fork.

   As regular VMs can't use any inactivity states, this is just duplicating
   the 0 from construct_vmcs().  Retain the ability to query activity_state,
   but crash the domain on any attempt to set an inactivity state.

2) Nested virt, because of ACTIVITY_STATE in vmcs_gstate_field[].

   Explicitly hide the inactivity states in the guest's view of MSR_VMX_MISC,
   and remove ACTIVITY_STATE from vmcs_gstate_field[].

   In virtual_vmentry(), we should trigger a VMEntry failure for the use of
   any inactivity states, but there's no support for that in the code at all
   so leave a TODO for when we finally start working on nested-virt in
   earnest.

Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 3643bb53a05b7c8fbac072c63bef1538f2a6d0d2
master date: 2024-01-18 20:59:06 +0000

15 months agox86/vmx: Fix IRQ handling for EXIT_REASON_INIT
Andrew Cooper [Fri, 2 Feb 2024 07:02:51 +0000 (08:02 +0100)]
x86/vmx: Fix IRQ handling for EXIT_REASON_INIT

When receiving an INIT, a prior bugfix tried to ignore the INIT and continue
onwards.

Unfortunately it's not safe to return at that point in vmx_vmexit_handler().
Just out of context in the first hunk is a local_irqs_enabled() which is
depended-upon by the return-to-guest path, causing the following checklock
failure in debug builds:

  (XEN) Error: INIT received - ignoring
  (XEN) CHECKLOCK FAILURE: prev irqsafe: 0, curr irqsafe 1
  (XEN) Xen BUG at common/spinlock.c:132
  (XEN) ----[ Xen-4.19-unstable  x86_64  debug=y  Tainted:     H  ]----
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d040238e10>] R check_lock+0xcd/0xe1
  (XEN)    [<ffff82d040238fe3>] F _spin_lock+0x1b/0x60
  (XEN)    [<ffff82d0402ed6a8>] F pt_update_irq+0x32/0x3bb
  (XEN)    [<ffff82d0402b9632>] F vmx_intr_assist+0x3b/0x51d
  (XEN)    [<ffff82d040206447>] F vmx_asm_vmexit_handler+0xf7/0x210

Luckily, this is benign in release builds.  Accidentally having IRQs disabled
when trying to take an IRQs-on lock isn't a deadlock-vulnerable pattern.

Drop the problematic early return.  In hindsight, it's wrong to skip other
normal VMExit steps.

Fixes: b1f11273d5a7 ("x86/vmx: Don't spuriously crash the domain when INIT is received")
Reported-by: Reima ISHII <ishiir@g.ecc.u-tokyo.ac.jp>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d1f8883aebe00f6a9632d77ab0cd5c6d02c9cbe4
master date: 2024-01-18 20:59:06 +0000

15 months agox86/intel: ensure Global Performance Counter Control is setup correctly
Roger Pau Monné [Fri, 2 Feb 2024 07:02:20 +0000 (08:02 +0100)]
x86/intel: ensure Global Performance Counter Control is setup correctly

When Architectural Performance Monitoring is available, the PERF_GLOBAL_CTRL
MSR contains per-counter enable bits that is ANDed with the enable bit in the
counter EVNTSEL MSR in order for a PMC counter to be enabled.

So far the watchdog code seems to have relied on the PERF_GLOBAL_CTRL enable
bits being set by default, but at least on some Intel Sapphire and Emerald
Rapids this is no longer the case, and Xen reports:

Testing NMI watchdog on all CPUs: 0 40 stuck

The first CPU on each package is started with PERF_GLOBAL_CTRL zeroed, so PMC0
doesn't start counting when the enable bit in EVNTSEL0 is set, due to the
relevant enable bit in PERF_GLOBAL_CTRL not being set.

Check and adjust PERF_GLOBAL_CTRL during CPU initialization so that all the
general-purpose PMCs are enabled.  Doing so brings the state of the package-BSP
PERF_GLOBAL_CTRL in line with the rest of the CPUs on the system.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 6bdb965178bbb3fc50cd4418d4770a7789956e2c
master date: 2024-01-17 10:40:52 +0100

15 months agoCirrusCI: drop FreeBSD 12
Roger Pau Monné [Fri, 2 Feb 2024 07:01:50 +0000 (08:01 +0100)]
CirrusCI: drop FreeBSD 12

Went EOL by the end of December 2023, and the pkg repos have been shut down.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c2ce3466472e9c9eda79f5dc98eb701bc6fdba20
master date: 2024-01-15 12:20:11 +0100

15 months agox86/amd: Extend CPU erratum #1474 fix to more affected models
Roger Pau Monné [Fri, 2 Feb 2024 07:01:09 +0000 (08:01 +0100)]
x86/amd: Extend CPU erratum #1474 fix to more affected models

Erratum #1474 has now been extended to cover models from family 17h ranges
00-2Fh, so the errata now covers all the models released under Family
17h (Zen, Zen+ and Zen2).

Additionally extend the workaround to Family 18h (Hygon), since it's based on
the Zen architecture and very likely affected.

Rename all the zen2 related symbols to fam17, since the errata doesn't
exclusively affect Zen2 anymore.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 23db507a01a4ec5259ec0ab43d296a41b1c326ba
master date: 2023-12-21 12:19:40 +0000

15 months agoVT-d: Fix "else" vs "#endif" misplacement
Andrew Cooper [Tue, 30 Jan 2024 13:38:38 +0000 (14:38 +0100)]
VT-d: Fix "else" vs "#endif" misplacement

In domain_pgd_maddr() the "#endif" is misplaced with respect to "else".  This
generates incorrect logic when CONFIG_HVM is compiled out, as the "else" body
is executed unconditionally.

Rework the logic to use IS_ENABLED() instead of explicit #ifdef-ary, as it's
clearer to follow.  This in turn involves adjusting p2m_get_pagetable() to
compile when CONFIG_HVM is disabled.

This is XSA-450 / CVE-2023-46840.

Fixes: 033ff90aa9c1 ("x86/P2M: p2m_{alloc,free}_ptp() and p2m_alloc_table() are HVM-only")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cc6ba68edf6dcd18c3865e7d7c0f1ed822796426
master date: 2024-01-30 14:29:15 +0100

15 months agopci: fail device assignment if phantom functions cannot be assigned
Roger Pau Monné [Tue, 30 Jan 2024 13:37:39 +0000 (14:37 +0100)]
pci: fail device assignment if phantom functions cannot be assigned

The current behavior is that no error is reported if (some) phantom functions
fail to be assigned during device add or assignment, so the operation succeeds
even if some phantom functions are not correctly setup.

This can lead to devices possibly being successfully assigned to a domU while
some of the device phantom functions are still assigned to dom0.  Even when the
device is assigned domIO before being assigned to a domU phantom functions
might fail to be assigned to domIO, and also fail to be assigned to the domU,
leaving them assigned to dom0.

Since the device can generate requests using the IDs of those phantom
functions, given the scenario above a device in such state would be in control
of a domU, but still capable of generating transactions that use a context ID
targeting dom0 owned memory.

Modify device assign in order to attempt to deassign the device if phantom
functions failed to be assigned.

Note that device addition is not modified in the same way, as in that case the
device is assigned to a trusted domain, and hence partial assign can lead to
device malfunction but not a security issue.

This is XSA-449 / CVE-2023-46839

Fixes: 4e9950dc1bd2 ('IOMMU: add phantom function support')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb4ecb3cc17b02c2814bc817efd05f3f3ba33d1e
master date: 2024-01-30 14:28:01 +0100

15 months agoupdate Xen version to 4.17.4-pre
Jan Beulich [Tue, 30 Jan 2024 13:36:44 +0000 (14:36 +0100)]
update Xen version to 4.17.4-pre

16 months agoupdate Xen version to 4.17.3 RELEASE-4.17.3
Jan Beulich [Mon, 18 Dec 2023 12:10:25 +0000 (13:10 +0100)]
update Xen version to 4.17.3

16 months agoxen/arm: page: Avoid pointer overflow on cache clean & invalidate
Michal Orzel [Tue, 12 Dec 2023 13:51:20 +0000 (14:51 +0100)]
xen/arm: page: Avoid pointer overflow on cache clean & invalidate

On Arm32, after cleaning and invalidating the last dcache line of the top
domheap page i.e. VA = 0xfffff000 (as a result of flushing the page to
RAM), we end up adding the value of a dcache line size to the pointer
once again, which results in a pointer arithmetic overflow (with 64B line
size, operation 0xffffffc0 + 0x40 overflows to 0x0). Such behavior is
undefined and given the wide range of compiler versions we support, it is
difficult to determine what could happen in such scenario.

Modify clean_and_invalidate_dcache_va_range() as well as
clean_dcache_va_range() and invalidate_dcache_va_range() due to similarity
of handling to prevent pointer arithmetic overflow. Modify the loops to
use an additional variable to store the index of the next cacheline.
Add an assert to prevent passing a region that wraps around which is
illegal and would end up in a page fault anyway (region 0-2MB is
unmapped). Lastly, return early if size passed is 0.

Note that on Arm64, we don't have this problem given that the max VA
space we support is 48-bits.

This is XSA-447 / CVE-2023-46837.

Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 190b7f49af6487a9665da63d43adc9d9a5fbd01e
master date: 2023-12-12 14:01:00 +0100

16 months agoxen/sched: fix sched_move_domain()
Juergen Gross [Tue, 12 Dec 2023 13:50:10 +0000 (14:50 +0100)]
xen/sched: fix sched_move_domain()

Do cleanup in sched_move_domain() in a dedicated service function,
which is called either in error case with newly allocated data, or in
success case with the old data to be freed.

This will at once fix some subtle bugs which sneaked in due to
forgetting to overwrite some pointers in the error case.

Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with different granularity")
Reported-by: René Winther Højgaard <renewin@proton.me>
Initial-fix-by: Jan Beulich <jbeulich@suse.com>
Initial-fix-by: George Dunlap <george.dunlap@cloud.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@cloud.com>
master commit: 23792cc0f22cff4e106d838b83aa9ae1cb6ffaf4
master date: 2023-12-07 13:37:25 +0000

16 months agoOnly compile the hypervisor with -Wdeclaration-after-statement
Julien Grall [Tue, 12 Dec 2023 13:49:55 +0000 (14:49 +0100)]
Only compile the hypervisor with -Wdeclaration-after-statement

Right now, all tools and hypervisor will be complied with the option
-Wdeclaration-after-statement. While most of the code in the hypervisor
is controlled by us, for tools we may import external libraries.

The build will fail if one of them are using the construct we are
trying to prevent. This is the case when building against Python 3.12
and Yocto:

| In file included from /srv/storage/alex/yocto/build-virt/tmp/work/core2-64-poky-linux/xen-tools/4.17+stable/recipe-sysroot/usr/include/python3.12/Python.h:44,
|                  from xen/lowlevel/xc/xc.c:8:
| /srv/storage/alex/yocto/build-virt/tmp/work/core2-64-poky-linux/xen-tools/4.17+stable/recipe-sysroot/usr/include/python3.12/object.h: In function 'Py_SIZE':
| /srv/storage/alex/yocto/build-virt/tmp/work/core2-64-poky-linux/xen-tools/4.17+stable/recipe-sysroot/usr/include/python3.12/object.h:233:5: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
|   233 |     PyVarObject *var_ob = _PyVarObject_CAST(ob);
|       |     ^~~~~~~~~~~
| In file included from /srv/storage/alex/yocto/build-virt/tmp/work/core2-64-poky-linux/xen-tools/4.17+stable/recipe-sysroot/usr/include/python3.12/Python.h:53:
| /srv/storage/alex/yocto/build-virt/tmp/work/core2-64-poky-linux/xen-tools/4.17+stable/recipe-sysroot/usr/include/python3.12/cpython/longintrepr.h: In function '_PyLong_CompactValue':
| /srv/storage/alex/yocto/build-virt/tmp/work/core2-64-poky-linux/xen-tools/4.17+stable/recipe-sysroot/usr/include/python3.12/cpython/longintrepr.h:121:5: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement]
|   121 |     Py_ssize_t sign = 1 - (op->long_value.lv_tag & _PyLong_SIGN_MASK);
|       |     ^~~~~~~~~~
| cc1: all warnings being treated as errors

Looking at the tools directory, a fair few directory already add
-Wno-declaration-after-statement to inhibit the default behavior.

We have always build the hypervisor with the flag, so for now remove
only the flag for anything but the hypervisor. We can decide at later
time whether we want to relax.

Also remove the -Wno-declaration-after-statement in some subdirectory
as the flag is now unnecessary.

Part of the commit message was take from Alexander's first proposal:

Link: https://lore.kernel.org/xen-devel/20231128174729.3880113-1-alex@linutronix.de/
Reported-by: Alexander Kanavin <alex@linutronix.de>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
xen/hypervisor: Don't use cc-option-add for -Wdeclaration-after-statement

Per Andrew's comment in [1] all the compilers we support should
recognize the flag.

I forgot to address the comment while committing.

[1] fcf00090-304a-49f7-8a61-a54347e90a3b@citrix.com

Signed-off-by: Julien Grall <jgrall@amazon.com>
master commit: 40be6307ec005539635e7b8fcef67e989dc441f6
master date: 2023-12-06 19:12:40 +0000
master commit: d4bfd3899886d0fbe259c20660dadb1e00170f2d
master date: 2023-12-06 19:19:59 +0000

16 months agox86/x2apic: introduce a mixed physical/cluster mode
Roger Pau Monné [Tue, 12 Dec 2023 13:47:02 +0000 (14:47 +0100)]
x86/x2apic: introduce a mixed physical/cluster mode

The current implementation of x2APIC requires to either use Cluster Logical or
Physical mode for all interrupts.  However the selection of Physical vs Logical
is not done at APIC setup, an APIC can be addressed both in Physical or Logical
destination modes concurrently.

Introduce a new x2APIC mode called Mixed, which uses Logical Cluster mode for
IPIs, and Physical mode for external interrupts, thus attempting to use the
best method for each interrupt type.

Using Physical mode for external interrupts allows more vectors to be used, and
interrupt balancing to be more accurate.

Using Logical Cluster mode for IPIs allows fewer accesses to the ICR register
when sending those, as multiple CPUs can be targeted with a single ICR register
write.

A simple test calling flush_tlb_all() 10000 times on a tight loop on AMD EPYC
9754 with 512 CPUs gives the following figures in nano seconds:

x mixed
+ phys
* cluster
    N           Min           Max        Median           Avg        Stddev
x  25 3.5131328e+08 3.5716441e+08 3.5410987e+08 3.5432659e+08     1566737.4
+  12  1.231082e+09  1.238824e+09 1.2370528e+09 1.2357981e+09     2853892.9
Difference at 95.0% confidence
8.81472e+08 +/- 1.46849e+06
248.774% +/- 0.96566%
(Student's t, pooled s = 2.05985e+06)
*  11 3.5099276e+08 3.5561459e+08 3.5461234e+08 3.5415668e+08     1415071.9
No difference proven at 95.0% confidence

So Mixed has no difference when compared to Cluster mode, and Physical mode is
248% slower when compared to either Mixed or Cluster modes with a 95%
confidence.

Note that Xen uses Cluster mode by default, and hence is already using the
fastest way for IPI delivery at the cost of reducing the amount of vectors
available system-wide.

Make the newly introduced mode the default one.

Note the printing of the APIC addressing mode done in connect_bsp_APIC() has
been removed, as with the newly introduced mixed mode this would require more
fine grained printing, or else would be incorrect.  The addressing mode can
already be derived from the APIC driver in use, which is printed by different
helpers.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Henry Wang <Henry.Wang@arm.com>
master commit: e3c409d59ac87ccdf97b8c7708c81efa8069cb31
master date: 2023-11-07 09:59:48 +0000

17 months agoxen/domain: fix error path in domain_create()
Stewart Hildebrand [Wed, 6 Dec 2023 09:49:54 +0000 (10:49 +0100)]
xen/domain: fix error path in domain_create()

If rangeset_new() fails, err would not be set to an appropriate error
code. Set it to -ENOMEM.

Fixes: 580c458699e3 ("xen/domain: Call arch_domain_create() as early as possible in domain_create()")
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ff1178062094837d55ef342070e58316c43a54c9
master date: 2023-12-05 10:00:51 +0100

17 months agoxen/sched: fix adding offline cpu to cpupool
Juergen Gross [Wed, 6 Dec 2023 09:49:29 +0000 (10:49 +0100)]
xen/sched: fix adding offline cpu to cpupool

Trying to add an offline cpu to a cpupool can crash the hypervisor,
as the probably non-existing percpu area of the cpu is accessed before
the availability of the cpu is being tested. This can happen in case
the cpupool's granularity is "core" or "socket".

Fix that by testing the cpu to be online.

Fixes: cb563d7665f2 ("xen/sched: support core scheduling for moving cpus to/from cpupools")
Reported-by: René Winther Højgaard <renewin@proton.me>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 06e8d65d33896aa90f5b6d9b2bce7f11433b33c9
master date: 2023-12-05 09:57:38 +0100

17 months agox86emul: avoid triggering event related assertions
Jan Beulich [Wed, 6 Dec 2023 09:48:56 +0000 (10:48 +0100)]
x86emul: avoid triggering event related assertions

The assertion at the end of x86_emulate_wrapper() as well as the ones
in x86_emul_{hw_exception,pagefault}() can trigger if we ignore
X86EMUL_EXCEPTION coming back from certain hook functions. Squash
exceptions when merely probing MSRs, plus on SWAPGS'es "best effort"
error handling path.

In adjust_bnd() add another assertion after the read_xcr(0, ...)
invocation, paralleling the one in x86emul_get_fpu() - XCR0 reads should
never fault when XSAVE is (implicitly) known to be available.

Also update the respective comment in x86_emulate_wrapper().

Fixes: 14a6be89ec04 ("x86emul: correct EFLAGS.TF handling")
Fixes: cb2626c75813 ("x86emul: conditionally clear BNDn for branches")
Fixes: 6eb43fcf8a0b ("x86emul: support SWAPGS")
Reported-by: AFL
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 787d11c5aaf4d3411d4658cff137cd49b0bd951b
master date: 2023-12-05 09:57:05 +0100

17 months agotools/xg: Fix potential memory leak in cpu policy getters/setters
Alejandro Vallejo [Wed, 6 Dec 2023 09:48:32 +0000 (10:48 +0100)]
tools/xg: Fix potential memory leak in cpu policy getters/setters

They allocate two different hypercall buffers, but leak the first
allocation if the second one failed due to an early return that bypasses
cleanup.

Remove the early exit and go through _post() instead. Invoking _post() is
benign even if _pre() failed.

Fixes: 6b85e427098c ('x86/sysctl: Implement XEN_SYSCTL_get_cpu_policy')
Fixes: 60529dfeca14 ('x86/domctl: Implement XEN_DOMCTL_get_cpu_policy')
Fixes: 14ba07e6f816 ('x86/domctl: Implement XEN_DOMCTL_set_cpumsr_policy')
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 1571ff7a987b88b20598a6d49910457f3b2c59f1
master date: 2023-12-01 10:53:07 +0100

17 months agoxen/x86: In x2APIC mode, derive LDR from APIC ID
Alejandro Vallejo [Wed, 6 Dec 2023 09:47:52 +0000 (10:47 +0100)]
xen/x86: In x2APIC mode, derive LDR from APIC ID

Both Intel and AMD manuals agree that in x2APIC mode, the APIC LDR and ID
registers are derivable from each other through a fixed formula.

Xen uses that formula, but applies it to vCPU IDs (which are sequential)
rather than x2APIC IDs (which are not, at the moment). As I understand it,
this is an attempt to tightly pack vCPUs into clusters so each cluster has
16 vCPUs rather than 8, but this is a spec violation.

This patch fixes the implementation so we follow the x2APIC spec for new
VMs, while preserving the behaviour (buggy or fixed) for migrated-in VMs.

While touching that area, remove the existing printk statement in
vlapic_load_fixup() (as the checks it performed didn't make sense in x2APIC
mode and wouldn't affect the outcome) and put another printk as an else
branch so we get warnings trying to load nonsensical LDR values we don't
know about.

Fixes: f9e0cccf7b35 ("x86/HVM: fix ID handling of x2APIC emulation")
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 90309854fd2440fb08b4c808f47d7670ba0d250d
master date: 2023-11-29 10:05:55 +0100

17 months agolivepatch: do not use .livepatch.funcs section to store internal state
Roger Pau Monné [Wed, 6 Dec 2023 09:46:47 +0000 (10:46 +0100)]
livepatch: do not use .livepatch.funcs section to store internal state

Currently the livepatch logic inside of Xen will use fields of struct
livepatch_func in order to cache internal state of patched functions.  Note
this is a field that is part of the payload, and is loaded as an ELF section
(.livepatch.funcs), taking into account the SHF_* flags in the section
header.

The flags for the .livepatch.funcs section, as set by livepatch-build-tools,
are SHF_ALLOC, which leads to its contents (the array of livepatch_func
structures) being placed in read-only memory:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
[...]
  [ 4] .livepatch.funcs  PROGBITS         0000000000000000  00000080
       0000000000000068  0000000000000000   A       0     0     8

This previously went unnoticed, as all writes to the fields of livepatch_func
happen in the critical region that had WP disabled in CR0.  After 8676092a0f16
however WP is no longer toggled in CR0 for patch application, and only the
hypervisor .text mappings are made write-accessible.  That leads to the
following page fault when attempting to apply a livepatch:

----[ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C    ]----
CPU:    4
RIP:    e008:[<ffff82d040221e81>] common/livepatch.c#apply_payload+0x45/0x1e1
[...]
Xen call trace:
   [<ffff82d040221e81>] R common/livepatch.c#apply_payload+0x45/0x1e1
   [<ffff82d0402235b2>] F check_for_livepatch_work+0x385/0xaa5
   [<ffff82d04032508f>] F arch/x86/domain.c#idle_loop+0x92/0xee

Pagetable walk from ffff82d040625079:
 L4[0x105] = 000000008c6c9063 ffffffffffffffff
 L3[0x141] = 000000008c6c6063 ffffffffffffffff
 L2[0x003] = 000000086a1e7063 ffffffffffffffff
 L1[0x025] = 800000086ca5d121 ffffffffffffffff

****************************************
Panic on CPU 4:
FATAL PAGE FAULT
[error_code=0003]
Faulting linear address: ffff82d040625079
****************************************

Fix this by moving the internal Xen function patching state out of
livepatch_func into an area not allocated as part of the ELF payload.  While
there also constify the array of livepatch_func structures in order to prevent
further surprises.

Note there's still one field (old_addr) that gets set during livepatch load.  I
consider this fine since the field is read-only after load, and at the point
the field gets set the underlying mapping hasn't been made read-only yet.

Fixes: 8676092a0f16 ('x86/livepatch: Fix livepatch application when CET is active')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
xen/livepatch: fix livepatch tests

The current set of in-tree livepatch tests in xen/test/livepatch started
failing after the constify of the payload funcs array, and the movement of the
status data into a separate array.

Fix the tests so they respect the constness of the funcs array and also make
use of the new location of the per-func state data.

Fixes: 82182ad7b46e ('livepatch: do not use .livepatch.funcs section to store internal state')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 82182ad7b46e0f7a3856bb12c7a9bf2e2a4570bc
master date: 2023-11-27 15:16:01 +0100
master commit: 902377b690f42ddf44ae91c4b0751d597f1cd694
master date: 2023-11-29 10:46:42 +0000

17 months agox86/mem_sharing: Release domain if we are not able to enable memory sharing
Frediano Ziglio [Wed, 6 Dec 2023 09:46:01 +0000 (10:46 +0100)]
x86/mem_sharing: Release domain if we are not able to enable memory sharing

In case it's not possible to enable memory sharing (mem_sharing_control
fails) we just return the error code without releasing the domain
acquired some lines above by rcu_lock_live_remote_domain_by_id().

Fixes: 72f8d45d69b8 ("x86/mem_sharing: enable mem_sharing on first memop")
Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: fbcec32d6d3ea0ac329301925b317478316209ed
master date: 2023-11-27 12:06:13 +0000

17 months agoxen/sched: fix sched_move_domain()
Juergen Gross [Thu, 23 Nov 2023 11:24:12 +0000 (12:24 +0100)]
xen/sched: fix sched_move_domain()

When moving a domain out of a cpupool running with the credit2
scheduler and having multiple run-queues, the following ASSERT() can
be observed:

(XEN) Xen call trace:
(XEN)    [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
(XEN)    [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
(XEN)    [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
(XEN)    [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
(XEN)    [<ffff82d040206513>] S domain_kill+0xa5/0x116
(XEN)    [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
(XEN)    [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
(XEN)    [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
(XEN)    [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at common/sched/credit2.c:1159
(XEN) ****************************************

This is happening as sched_move_domain() is setting a different cpu
for a scheduling unit without telling the scheduler. When this unit is
removed from the scheduler, the ASSERT() will trigger.

In non-debug builds the result is usually a clobbered pointer, leading
to another crash a short time later.

Fix that by swapping the two involved actions (setting another cpu and
removing the unit from the scheduler).

Link: https://github.com/Dasharo/dasharo-issues/issues/488
Fixes: 70fadc41635b ("xen/cpupool: support moving domain between cpupools with different granularity")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>
master commit: 4709ec82917668c2df958ef91b4f21c049c76bee
master date: 2023-11-20 10:49:29 +0100

17 months agoxen/grant: Fix build in PV_SHIM
Andrew Cooper [Tue, 14 Nov 2023 17:20:41 +0000 (17:20 +0000)]
xen/grant: Fix build in PV_SHIM

There was a variable name changed which wasn't accounted for in the backport.

Fixes: 267ac3c5921e ("x86/pv-shim: fix grant table operations for 32-bit guests")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
17 months agox86/spec-ctrl: Add SRSO whitepaper URL
Andrew Cooper [Tue, 14 Nov 2023 13:01:59 +0000 (14:01 +0100)]
x86/spec-ctrl: Add SRSO whitepaper URL

... now that it exists in public.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 78a86b26868c12ae1cc3dd2a8bb9aa5eebaa41fd
master date: 2023-11-07 17:47:34 +0000

17 months agox86/i8259: do not assume interrupts always target CPU0
Roger Pau Monné [Tue, 14 Nov 2023 13:01:33 +0000 (14:01 +0100)]
x86/i8259: do not assume interrupts always target CPU0

Sporadically we have seen the following during AP bringup on AMD platforms
only:

microcode: CPU59 updated from revision 0x830107a to 0x830107a, date = 2023-05-17
microcode: CPU60 updated from revision 0x830104d to 0x830107a, date = 2023-05-17
CPU60: No irq handler for vector 27 (IRQ -2147483648)
microcode: CPU61 updated from revision 0x830107a to 0x830107a, date = 2023-05-17

This is similar to the issue raised on Linux commit 36e9e1eab777e, where they
observed i8259 (active) vectors getting delivered to CPUs different than 0.

On AMD or Hygon platforms adjust the target CPU mask of i8259 interrupt
descriptors to contain all possible CPUs, so that APs will reserve the vector
at startup if any legacy IRQ is still delivered through the i8259.  Note that
if the IO-APIC takes over those interrupt descriptors the CPU mask will be
reset.

Spurious i8259 interrupt vectors however (IRQ7 and IRQ15) can be injected even
when all i8259 pins are masked, and hence would need to be handled on all CPUs.

Continue to reserve PIC vectors on CPU0 only, but do check for such spurious
interrupts on all CPUs if the vendor is AMD or Hygon.  Note that once the
vectors get used by devices detecting PIC spurious interrupts will no longer be
possible, however the device driver should be able to cope with spurious
interrupts.  Such PIC spurious interrupts occurring when the vector is in use
by a local APIC routed source will lead to an extra EOI, which might
unintentionally clear a different vector from ISR.  Note this is already the
current behavior, so assume it's infrequent enough to not cause real issues.

Finally, adjust the printed message to display the CPU where the spurious
interrupt has been received, so it looks like:

microcode: CPU1 updated from revision 0x830107a to 0x830107a, date = 2023-05-17
cpu1: spurious 8259A interrupt: IRQ7
microcode: CPU2 updated from revision 0x830104d to 0x830107a, date = 2023-05-17

Amends: 3fba06ba9f8b ('x86/IRQ: re-use legacy vector ranges on APs')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 87f37449d586b4d407b75235bb0a171e018e25ec
master date: 2023-11-02 10:50:59 +0100

17 months agox86/x2apic: remove usage of ACPI_FADT_APIC_CLUSTER
Roger Pau Monné [Tue, 14 Nov 2023 13:01:07 +0000 (14:01 +0100)]
x86/x2apic: remove usage of ACPI_FADT_APIC_CLUSTER

The ACPI FADT APIC_CLUSTER flag mandates that when the interrupt delivery is
Logical mode APIC must be configured for Cluster destination model.  However in
apic_x2apic_probe() such flag is incorrectly used to gate whether Physical mode
can be used.

Since Xen when in x2APIC mode only uses Logical mode together with Cluster
model completely remove checking for ACPI_FADT_APIC_CLUSTER, as Xen always
fulfills the requirement signaled by the flag.

Fixes: eb40ae41b658 ('x86/Kconfig: add option for default x2APIC destination mode')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 26a449ce32cef33f2cb50602be19fcc0c4223ba9
master date: 2023-11-02 10:50:26 +0100

17 months agox86/pv-shim: fix grant table operations for 32-bit guests
David Woodhouse [Tue, 14 Nov 2023 13:00:37 +0000 (14:00 +0100)]
x86/pv-shim: fix grant table operations for 32-bit guests

When switching to call the shim functions from the normal handlers, the
compat_grant_table_op() function was omitted, leaving it calling the
real grant table operations in !PV_SHIM_EXCLUSIVE builds. This leaves a
32-bit shim guest failing to set up its real grant table with the parent
hypervisor.

Fixes: e7db635f4428 ("x86/pv-shim: Don't modify the hypercall table")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93ec30bc545f15760039c23ee4b97b80c0b3b3b3
master date: 2023-10-31 16:10:14 +0000

17 months agox86/mem_sharing: add missing m2p entry when mapping shared_info page
Tamas K Lengyel [Tue, 14 Nov 2023 13:00:20 +0000 (14:00 +0100)]
x86/mem_sharing: add missing m2p entry when mapping shared_info page

When mapping in the shared_info page to a fork the m2p entry wasn't set
resulting in the shared_info being reset even when the fork reset was called
with only reset_state and not reset_memory. This results in an extra
unnecessary TLB flush.

Fixes: 1a0000ac775 ("mem_sharing: map shared_info page to same gfn during fork")
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 23eb39acf011ef9bbe02ed4619c55f208fbcd39b
master date: 2023-10-31 16:10:14 +0000

17 months agodocs: Fix IOMMU command line docs some more
Andrew Cooper [Tue, 14 Nov 2023 12:59:40 +0000 (13:59 +0100)]
docs: Fix IOMMU command line docs some more

Make the command line docs match the actual implementation, and state that the
default behaviour is selected at compile time.

Fixes: 980d6acf1517 ("IOMMU: make DMA containment of quarantined devices optional")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 850382254b78e07e7ccbf80010c3b43897a158f9
master date: 2023-10-31 15:48:07 +0000