Wei Liu [Tue, 29 Jan 2019 12:54:48 +0000 (12:54 +0000)]
x86/mm: change pl*e to l*t in virt_to_xen_l*e
We will need to have a variable named pl*e when we rewrite
virt_to_xen_l*e. Change pl*e to l*t to reflect better its purpose.
This will make reviewing later patches easier.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyax@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Mon, 28 Jan 2019 18:41:26 +0000 (18:41 +0000)]
x86/mm: make sure there is one exit path for modify_xen_mappings
We will soon need to handle dynamically mapping / unmapping page
tables in the said function.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
---
Changed since v3:
- remove asserts on rc since it never gets changed to anything else.
Wei Liu [Mon, 28 Jan 2019 18:30:47 +0000 (18:30 +0000)]
x86/mm: map_pages_to_xen would better have one exit path
We will soon rewrite the function to handle dynamically mapping and
unmapping of page tables.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
---
Changed since v3:
- remove asserts on rc since rc never gets changed to anything else
- reword commit message
Wei Liu [Mon, 28 Jan 2019 18:10:10 +0000 (18:10 +0000)]
x86/mm: introduce l{1,2}t local variables to modify_xen_mappings
The pl2e and pl1e variables are heavily (ab)used in that function. It
is fine at the moment because all page tables are always mapped so
there is no need to track the life time of each variable.
We will soon have the requirement to map and unmap page tables. We
need to track the life time of each variable to avoid leakage.
Introduce some l{1,2}t variables with limited scope so that we can
track life time of pointers to xen page tables more easily.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Mon, 28 Jan 2019 17:54:24 +0000 (17:54 +0000)]
x86/mm: introduce l{1,2}t local variables to map_pages_to_xen
The pl2e and pl1e variables are heavily (ab)used in that function. It
is fine at the moment because all page tables are always mapped so
there is no need to track the life time of each variable.
We will soon have the requirement to map and unmap page tables. We
need to track the life time of each variable to avoid leakage.
Introduce some l{1,2}t variables with limited scope so that we can
track life time of pointers to xen page tables more easily.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Wed, 23 Jan 2019 15:33:07 +0000 (15:33 +0000)]
x86: introduce a new set of APIs to manage Xen page tables
We are going to switch to using domheap page for page tables.
A new set of APIs is introduced to allocate, map, unmap and free pages
for page tables.
The allocation and deallocation work on mfn_t but not page_info,
because they are required to work even before frame table is set up.
Implement the old functions with the new ones. We will rewrite, site
by site, other mm functions that manipulate page tables to use the new
APIs.
Note these new APIs still use xenheap page underneath and no actual
map and unmap is done so that we don't break xen half way. They will
be switched to use domheap and dynamic mappings when usage of old APIs
is eliminated.
No functional change intended in this patch.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
---
Changed since v3:
- const qualify unmap_xen_pagetable_new().
- remove redundant parentheses.
Wei Liu [Wed, 23 Jan 2019 15:17:41 +0000 (15:17 +0000)]
x86: move some xen mm function declarations
They were put into page.h but mm.h is more appropriate.
The real reason is that I will be adding some new functions which
take mfn_t. It turns out it is a bit difficult to do in page.h.
No functional change.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changed since v3:
- move Xen PTE API declarations next to do_page_walk().
Roger Pau Monne [Tue, 3 Dec 2019 10:33:52 +0000 (11:33 +0100)]
automation: increase tests maximum time from 10s to 30s
10s is too low for the clang tests, this is the output from a clang
test:
(XEN) [ 6.512748] ***************************************************
(XEN) [ 6.513323] SELFTEST FAILURE: CORRECT BEHAVIOR CANNOT BE GUARANTEED
(XEN) [ 6.513891] ***************************************************
(XEN) [ 6.514469] 3... 2... 1...
(XEN) [ 9.520011] *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) [ 9.544319] Freed 488kB init memory
--- Xen Test Framework ---
Environment: HVM 32bit (PAE 3 levels)
Hello World
Test result: SUCCESS
(XEN) [ 9.610977] Hardware Dom0 halted: halting machine
As can be seen from the output above booting Xen and the XTF test
takes ~10s, without accounting for the time it takes for QEMU to
initialize.
Increase the timeout to 30s to be on the safe side.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monne [Tue, 3 Dec 2019 10:33:51 +0000 (11:33 +0100)]
automation: add timestamps to Xen tests
Enable Xen timestamps in the automated Xen tests, this is helpful in
order to figure out if Xen is stuck or just slow in the automated
tests.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monné [Tue, 3 Dec 2019 13:15:35 +0000 (14:15 +0100)]
x86/tlbflush: do not toggle the PGE CR4 bit unless necessary
When PCID is not available Xen does a full tlbflush by toggling the
PGE bit in CR4. This is not necessary if PGE is not enabled, since a
flush can be performed by writing to CR3 in that case.
Change the code in do_tlb_flush to only toggle the PGE bit in CR4 if
it's already enabled, otherwise do the tlb flush by writing to CR3.
This is relevant when running virtualized, since hypervisors don't
usually trap accesses to CR3 when using hardware assisted paging, but
do trap accesses to CR4, especially on AMD hardware, which makes such
accesses much more expensive.
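As a minimal sketch of the decision described above (not the actual do_tlb_flush code), the choice of flush method for the non-PCID case can be written as a pure function of the current CR4 value. The enum and function names are hypothetical; the only architectural fact relied on is that PGE is bit 7 of CR4.

```c
#include <stdint.h>

#define X86_CR4_PGE (UINT64_C(1) << 7) /* CR4.PGE: global pages enabled */

/* Hypothetical names, used only for this illustration. */
enum flush_method {
    FLUSH_VIA_CR3_WRITE,      /* reload CR3; sufficient when PGE is off */
    FLUSH_VIA_CR4_PGE_TOGGLE, /* clear and restore CR4.PGE */
};

/* Pick the full-flush method for the case where PCID is not in use. */
static enum flush_method pick_flush_method(uint64_t cr4)
{
    /* Only touch CR4 when global pages are actually enabled. */
    return (cr4 & X86_CR4_PGE) ? FLUSH_VIA_CR4_PGE_TOGGLE
                               : FLUSH_VIA_CR3_WRITE;
}
```

With PGE clear no global TLB entries can exist, so the CR3 write alone is enough and the more expensive CR4 accesses are avoided.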
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
"Some Coffee Lake platforms have a skewed HPET timer once the SoCs entered
PC10, which in consequence marks TSC as unstable because HPET is used as
watchdog clocksource for TSC."
Follow this for Xen as well. Looking at its patch context made me notice
they have a pre-existing quirk for Bay Trail as well. The comment there,
however, points at a Cherry Trail document. Looking at the datasheets of
both, there appear to be similar issues, so go beyond Linux's coverage
and exclude both. Also key the disable on the PCI IDs of the actual
affected devices, rather than those of 00:00.0.
Apply the workarounds only when the use of HPET was not explicitly
requested on the command line and when use of (deep) C-states was not
disabled.
Adjust a few types in touched or nearby code at the same time.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 3 Dec 2019 13:13:40 +0000 (14:13 +0100)]
gnttab: make sure grant map operations don't skip their IOMMU part
Two almost simultaneous mapping requests need to make sure that at the
completion of the earlier one IOMMU mappings (established explicitly
here in the PV case) have been put in place. Forever since the splitting
of the grant table lock a violation of this has been possible (using
simplified pin counts, as it doesn't matter whether we talk about read
or write mappings here):
initial state: act->pin = 0
vCPU A: progress the operation past the dropping of the locks after the
act->pin updates (act->pin = 1, old_pin = 0, act_pin = 1)
vCPU B: progress the operation past the dropping of the locks after the
act->pin updates (act->pin = 2, old_pin = 1, act_pin = 2)
vCPU B: (re-)acquire both gt locks, mapkind() returns 0, but both
iommu_legacy_map() invocations get skipped due to non-zero
old_pin
With the locks dropped intermediately, whether to invoke
iommu_legacy_map() must depend on only the return value of mapkind()
and of course the kind of mapping request being processed, just like
is already the case in unmap_common().
Also fix the style of the adjacent comment, and correct a nearby one
still referring to a prior name of what is now mapkind().
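The rule stated above can be sketched as a pure function: the IOMMU action is derived from the kind of request and from what mapkind() reports, and never from the pin counts. This is an illustration only; the MAPKIND_* bit values, the enum and the function name are assumptions made for the sketch, not taken from the patch.

```c
#include <stdbool.h>

/* Assumed bit values for the mapping kinds mapkind() may report. */
#define MAPKIND_READ  1u
#define MAPKIND_WRITE 2u

/* Hypothetical result type, used only for this illustration. */
enum iommu_action { IOMMU_NONE, IOMMU_MAP_RO, IOMMU_MAP_RW };

/*
 * Decide what IOMMU mapping a grant map request needs, given the kinds
 * of mapping that already exist for the frame.
 */
static enum iommu_action iommu_action_for_map(bool readonly_req,
                                              unsigned int kind)
{
    if ( !readonly_req )
        /* A writable request needs a r/w mapping unless one exists. */
        return (kind & MAPKIND_WRITE) ? IOMMU_NONE : IOMMU_MAP_RW;

    /* A read-only request needs a mapping only if none exists at all. */
    return kind ? IOMMU_NONE : IOMMU_MAP_RO;
}
```

In the vCPU B step of the scenario above, kind is 0, so a mapping is requested regardless of old_pin being non-zero.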
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jeff Kubascik [Mon, 25 Nov 2019 20:58:00 +0000 (15:58 -0500)]
xen/arm: initialize vpl011 flag register
The tx/rx fifo flags were not set when the vpl011 is initialized. This
is a problem for certain guests that are operating in polled mode, as a
guest will generally check the rx fifo empty flag to determine if there
is data before doing a read. The result is a continuous spam of the
message "vpl011: Unexpected IN ring buffer empty" before the first valid
character is received. This initializes the flag status register to the
default specified in the PL011 technical reference manual.
Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com> Acked-by: Julien Grall <julien@xen.org>
Yi Sun [Mon, 2 Dec 2019 07:24:48 +0000 (15:24 +0800)]
x86/psr: fix bug which may cause crash
During testing, we found a crash in Xen with the trace below.
(XEN) Xen call trace:
(XEN) [<ffff82d0802a065a>] R psr.c#l3_cdp_write_msr+0x1e/0x22
(XEN) [<ffff82d0802a0858>] F psr.c#do_write_psr_msrs+0x6d/0x109
(XEN) [<ffff82d08023e000>] F smp_call_function_interrupt+0x5a/0xac
(XEN) [<ffff82d0802a2b89>] F call_function_interrupt+0x20/0x34
(XEN) [<ffff82d080282c64>] F do_IRQ+0x175/0x6ae
(XEN) [<ffff82d08038b8ba>] F common_interrupt+0x10a/0x120
(XEN) [<ffff82d0802ec616>] F cpu_idle.c#acpi_idle_do_entry+0x9d/0xb1
(XEN) [<ffff82d0802ecc01>] F cpu_idle.c#acpi_processor_idle+0x41d/0x626
(XEN) [<ffff82d08027353b>] F domain.c#idle_loop+0xa5/0xa7
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 20:
(XEN) GENERAL PROTECTION FAULT
(XEN) [error_code=0000]
(XEN) ****************************************
The bug happens when CDP and MBA co-exist and MBA COS_MAX is bigger
than CDP COS_MAX. E.g. MBA has 8 COS registers but CDP only has 6.
When setting MBA throttling value for the 7th guest, the value array
would be:
+------------------+------------------+--------------+
| Data default val | Code default val | MBA throttle |
+------------------+------------------+--------------+
Then, COS id 7 will be selected for writing the values. We should
avoid writing CDP data/code values to COS id 7 MSR because it
exceeds the CDP COS_MAX.
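A minimal sketch of the guard described above, assuming each feature records its own highest valid COS id. The structure and function names are hypothetical and do not reflect the layout used in psr.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-feature description, for illustration only. */
struct psr_feat {
    const char *name;
    unsigned int cos_max; /* highest valid COS id for this feature */
};

/* A feature's MSR may only be written if the COS id is within its range. */
static bool cos_writable(const struct psr_feat *feat, unsigned int cos)
{
    return cos <= feat->cos_max;
}

/* Count how many features would actually get an MSR write for this COS id. */
static unsigned int count_msr_writes(const struct psr_feat *feats, size_t n,
                                     unsigned int cos)
{
    unsigned int writes = 0;

    for ( size_t i = 0; i < n; i++ )
        if ( cos_writable(&feats[i], cos) )
            writes++;

    return writes;
}
```

With 6 CDP registers (ids 0 to 5) and 8 MBA registers (ids 0 to 7), COS id 7 passes the check for MBA only, so the CDP data/code MSRs are left alone.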
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Mon, 2 Dec 2019 11:29:46 +0000 (12:29 +0100)]
x86: re-order clang no integrated assembler tests
The tests to check whether the integrated assembler is capable of
building Xen should be performed before testing any assembler
features, or else the feature specific tests would be stale if the
integrated assembler is disabled afterwards.
Fixes: ef286f67787a ('x86: move and fix clang .skip check') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Reported-by: Doug Goldstein <cardoe@cardoe.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julian Tuminaro [Sat, 30 Nov 2019 08:11:18 +0000 (03:11 -0500)]
Fix the KDD_LOG statements to use appropriate format specifier for printing uint64_t
Previous commit in kdd.c had a small issue which led to a warning/error while compiling
on 32-bit systems due to a mismatch of type size when casting from uint64_t to
void *
George Dunlap [Fri, 29 Nov 2019 17:24:45 +0000 (17:24 +0000)]
Rationalize max_grant_frames and max_maptrack_frames handling
Xen used to have single, system-wide limits for the number of grant
frames and maptrack frames a guest was allowed to create. Increasing
or decreasing this single limit on the Xen command-line would change
the limit for all guests on the system.
Later, per-domain limits for these values were created. The system-wide
limits became strict limits: domains could not be created with higher
limits, but could be created with lower limits. However, that change
also introduced a range of different "default" values into various
places in the toolstack:
- The python libxc bindings hard-coded these values to 32 and 1024,
respectively
- The libxl default values are 32 and 1024 respectively.
- xl will use the libxl default for maptrack, but does its own default
calculation for grant frames: either 32 or 64, based on the max
possible mfn.
These defaults interact poorly with the hypervisor command-line limit:
- The hypervisor command-line limit cannot be used to raise the limit
for all guests anymore, as the default in the toolstack will
effectively override this.
- If you use the hypervisor command-line limit to *reduce* the limit,
then the "default" values generated by the toolstack are too high,
and all guest creations will fail.
In other words, the toolstack defaults require any change to be
effected by having the admin explicitly specify a new value in every
guest.
In order to address this, have grant_table_init treat negative values
for max_grant_frames and max_maptrack_frames as instructions to use the
system-wide default, and have all the above toolstacks default to passing
-1 unless a different value is explicitly configured.
This restores the old behavior in that changing the hypervisor command-line
option can change the behavior for all guests, while retaining the ability
to set per-guest values. It also removes the bug that reducing the
system-wide max will cause all domains without explicit limits to fail.
NOTE: - The Ocaml bindings require the caller to always specify a value,
and the code to start a xenstored stubdomain hard-codes these to 4
and 128 respectively; this behaviour will not be modified.
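The intended handling can be sketched as follows. This is an illustration under stated assumptions rather than the grant_table_init code: the function name is hypothetical, and a single system-wide value is assumed to serve both as the default and as the strict upper bound.

```c
/*
 * Resolve a per-domain frame limit.
 *
 * requested  < 0 : use the system-wide value (what the toolstacks now pass
 *                  as -1 when nothing is configured explicitly)
 * requested >= 0 : use it, unless it exceeds the system-wide value
 *
 * Returns 0 and stores the result in *limit, or -1 if the request is
 * above the system-wide value.
 */
static int resolve_frames_limit(int requested, unsigned int system_max,
                                unsigned int *limit)
{
    if ( requested < 0 )
    {
        *limit = system_max;
        return 0;
    }

    if ( (unsigned int)requested > system_max )
        return -1;

    *limit = (unsigned int)requested;
    return 0;
}
```

Lowering the system-wide value then lowers the limit of every guest that did not ask for a specific one, instead of making their creation fail.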
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Paul Durrant <pdurrant@amazon.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
kdd.c: Add support for initial handshake in KD protocol for Win 7, 8 and 10 (64 bit)
Current implementation of find_os is based on the hard-coded values for
different Windows versions. It uses the value to get the address to
start looking for DOS header in the given specified range. However, this
is not scalable to all versions of Windows as it will require us to keep
adding new entries and also due to KASLR, chances of not hitting the PE
header are significant. We implement a way for 64-bit systems to use IDT
entry to get a valid exception/interrupt handler and then move back into
the memory to find the valid DOS header. Since IDT entries are protected
by PatchGuard, we think our assumption that IDT entries will not be
corrupted is valid for our purpose. Once we have the image base, we
search for the DBGKD_GET_VERSION64 structure type in .data section to
get information required for handshake.
Currently, this is a work in progress feature and current patch only
supports the handshake and memory read/write on 64-bit systems.
NOTE: This is the Updated version of the previous patch submitted
NOTE: This has currently been only tested when debugging was not enabled
on the guest Windows.
Signed-off-by: Jenish Rakholiya <rjenish@cmu.edu> Signed-off-by: Julian Tuminaro <jtuminar@andrew.cmu.edu> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Paul Durrant <paul@xen.org>
Paul Durrant [Fri, 15 Nov 2019 18:59:30 +0000 (18:59 +0000)]
passthrough: simplify locking and logging
Dropping the pcidevs lock between calling device_assigned() and
assign_device() means that the latter has to do the same check as the
former for no obvious gain. Also, since long running operations under
pcidevs lock already drop the lock and return -ERESTART periodically there
is little point in immediately failing an assignment operation with
-ERESTART just because the pcidevs lock could not be acquired (for the
second time, having already blocked on acquiring the lock in
device_assigned()).
This patch instead acquires the lock once for assignment (or test assign)
operations directly in iommu_do_pci_domctl() and thus can remove the
duplicate domain ownership check in assign_device(). Whilst in the
neighbourhood, the patch also removes some debug logging from
assign_device() and deassign_device() and replaces it with proper error
logging, which allows error logging in iommu_do_pci_domctl() to be
removed.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 26 Nov 2019 14:37:27 +0000 (14:37 +0000)]
AMD/IOMMU: Render IO_PAGE_FAULT errors in a more useful manner
Print the PCI coordinates in its common format and use d%u notation for the
domain. As well as printing flags, decode them. IO_PAGE_FAULT is used for
interrupt remapping errors as well as DMA remapping errors.
Paul Durrant [Wed, 27 Nov 2019 17:11:43 +0000 (17:11 +0000)]
x86 / iommu: set up a scratch page in the quarantine domain
This patch introduces a new iommu_op to facilitate a per-implementation
quarantine set up, and then further code for x86 implementations
(amd and vtd) to set up a read-only scratch page to serve as the source
for DMA reads whilst a device is assigned to dom_io. DMA writes will
continue to fault as before.
The reason for doing this is that some hardware may continue to re-try
DMA (despite FLR) in the event of an error, or even BME being cleared, and
will fail to deal with DMA read faults gracefully. Having a scratch page
mapped will allow pending DMA reads to complete and thus such buggy
hardware will eventually be quiesced.
NOTE: These modifications are restricted to x86 implementations only as
the buggy h/w I am aware of is only used with Xen in an x86
environment. ARM may require similar code but, since I am not
aware of the need, this patch does not modify any ARM implementation.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Julien Grall [Thu, 28 Nov 2019 09:38:28 +0000 (09:38 +0000)]
xen/x86: vpmu: Unmap per-vCPU PMU page when the domain is destroyed
A guest will set up a shared page with the hypervisor for each vCPU via
XENPMU_init. The page will then get mapped in the hypervisor and only
released when XENPMU_finish is called.
This means that if the guest fails to invoke XENPMU_finish, e.g. if it is
destroyed rather than cleanly shut down, the page will stay mapped in the
hypervisor. One of the consequences is the domain can never be fully
destroyed as a page reference is still held.
As Xen should never rely on the guest to correctly clean-up any
allocation in the hypervisor, we should also unmap such pages during the
domain destruction if there are any left.
We can re-use the same logic as in pvpmu_finish(). To avoid
duplication, move the logic in a new function that can also be called
from vpmu_destroy().
NOTE: - The call to vpmu_destroy() must also be moved from
arch_vcpu_destroy() into domain_relinquish_resources() such that
the reference on the mapped page does not prevent domain_destroy()
(which calls arch_vcpu_destroy()) from being called.
- Whilst it appears that vpmu_arch_destroy() is idempotent it is
by no means obvious. Hence make sure the VPMU_CONTEXT_ALLOCATED
flag is cleared at the end of vpmu_arch_destroy().
- This is not an XSA because vPMU is not security supported (see
XSA-163).
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Fri, 29 Nov 2019 16:10:26 +0000 (17:10 +0100)]
x86: move and fix clang .skip check
.skip is only used by x86 code, so place the clang .skip with labels
check in x86/Rules.mk instead of the top level Rules.mk. While there
also fix an issue with it by removing the '\n' which triggers the
following error:
<stdin>:1:31: error: missing terminating '"' character [-Werror,-Winvalid-pp-token]
void _(void) { asm volatile ( ".L0:
^
<stdin>:1:31: error: expected string literal in 'asm'
<stdin>:3:18: error: missing terminating '"' character [-Werror,-Winvalid-pp-token]
.skip (.L1 - .L0)" ); }
^
<stdin>:3:24: error: expected ')'
.skip (.L1 - .L0)" ); }
^
<stdin>:1:29: note: to match this '('
void _(void) { asm volatile ( ".L0:
^
<stdin>:3:24: error: expected '}'
.skip (.L1 - .L0)" ); }
^
<stdin>:1:14: note: to match this '{'
void _(void) { asm volatile ( ".L0:
^
5 errors generated.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Roger Pau Monné <roger.pau@citrix.com> [On FreeBSD and Debian 9.5] Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 Nov 2019 16:10:00 +0000 (17:10 +0100)]
x86: fix clang .macro retention check
There were two problems here: The first closing parenthesis got parsed
by make to end the $(call invocation, and the escaping of the quotes
wasn't right either, as there's nowhere they would get un-escaped.
Furthermore there appears to be a puzzling problem with \n getting
expanded to an actual newline too early in some environments. Convert
these to semicolons at the same time.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Roger Pau Monné <roger.pau@citrix.com> [On FreeBSD and Debian 9.5] Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 Nov 2019 16:09:16 +0000 (17:09 +0100)]
console: avoid buffer overrun in guest_console_write()
conring_puts() has been requiring a nul-terminated string, which the
local kbuf[] doesn't get set for anymore. Add a length parameter to the
function, just like was done for others, thus allowing embedded nul to
also be read through XEN_SYSCTL_readconsole.
While there drop a stray cast: Both operands of - are already uint32_t.
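A minimal sketch of a length-based ring write, to illustrate why an explicit length removes the need for a terminator. The ring layout, its size and the function name are assumptions for this example and are not Xen's conring implementation.

```c
#include <stddef.h>

#define RING_SIZE 16u /* must be a power of two for the index mask below */

/* Hypothetical ring buffer, for illustration only. */
struct ring {
    char buf[RING_SIZE];
    unsigned int prod; /* free-running producer index */
};

/*
 * Copy exactly len bytes into the ring. The source is never scanned for
 * a terminating nul, so embedded nul bytes are stored like any other
 * byte and nothing past s[len - 1] is read.
 */
static void ring_put(struct ring *r, const char *s, size_t len)
{
    for ( size_t i = 0; i < len; i++ )
        r->buf[r->prod++ & (RING_SIZE - 1)] = s[i];
}
```

A caller holding a buffer without a terminator passes the number of valid bytes, which is the situation of kbuf[] in guest_console_write().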
Jan Beulich [Fri, 29 Nov 2019 16:08:20 +0000 (17:08 +0100)]
console: avoid buffer overflow in guest_console_write()
The switch of guest_console_write()'s second parameter from plain to
unsigned int has caused the function's main loop header to no longer
guard the min_t() use within the function against effectively negative
values, due to the casts hidden inside the macro. Replace by a plain
min(), casting one of the arguments as necessary.
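The pitfall can be reproduced in isolation. The two helpers below are hypothetical stand-ins: the first mimics what a min_t(int, ...) style macro does by converting both operands to int before comparing, the second compares the unsigned values directly. The sketch assumes the usual two's complement conversion, where a large unsigned value becomes negative when converted to int.

```c
#include <limits.h>

/* Mimics min_t(int, a, b): both operands are converted to int first. */
static int min_cast_to_int(unsigned int a, unsigned int b)
{
    int x = (int)a, y = (int)b;

    return x < y ? x : y;
}

/* Plain comparison in the operands' own unsigned type. */
static unsigned int min_unsigned(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}
```

For a count of UINT_MAX and a buffer size of 128, the first helper yields a negative value which, used as a copy length, overflows the buffer, while the second yields 128.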
Fixes: ea601ec9995b ("xen/console: Rework HYPERCALL_console_io interface") Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien@xen.org>
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/svm: Write the correct %eip into the outgoing task
The TASK_SWITCH vmexit has fault semantics, and doesn't provide any NRIPs
assistance with instruction length. As a result, any instruction-induced task
switch has the outgoing task's %eip pointing at the instruction which caused
the switch, rather than after it.
This causes callers of task gates to livelock (repeatedly execute the call/jmp
to enter the task), and any restartable task to become a nop after its first
use (the (re)entry state points at the iret used to exit the task).
32bit Windows in particular is known to use task gates for NMI handling, and
to use NMI IPIs.
In the task switch handler, distinguish instruction-induced from
interrupt/exception-induced task switches, and decode the instruction under
%rip to calculate its length.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Mon, 25 Nov 2019 19:33:36 +0000 (19:33 +0000)]
x86/svm: Always intercept ICEBP
ICEBP isn't handled well by SVM.
The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it. As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.
To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.
We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).
Unconditionally intercept ICEBP. This does have NRIPs support as it is an
instruction intercept, which allows us to move %rip forwards appropriately
before the TASK_SWITCH intercept is hit. This makes #DB-vectored switches
have consistent behaviour however the ICEBP #DB came about, and avoids special
cases in the TASK_SWITCH intercept.
This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
will now always be submitted for monitoring checks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/vtx: Fix fault semantics for early task switch failures
The VT-x task switch handler adds inst_len to %rip before calling
hvm_task_switch(), which is problematic in two ways:
1) Early faults (i.e. ones delivered in the context of the old task) get
delivered with trap semantics, and break restartability.
2) The addition isn't truncated to 32 bits. In the corner case of a task
switch instruction crossing the 4G->0 boundary taking an early fault (with
trap semantics), a VMEntry failure will occur due to %rip being out of
range.
Instead, pass the instruction length into hvm_task_switch() and write it into
the outgoing TSS only, leaving %rip in its original location.
For now, pass 0 on the SVM side. This highlights a separate preexisting bug
which will be addressed in the following patch.
While adjusting call sites, drop the unnecessary uint16_t cast.
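The truncation issue in 2) can be illustrated with plain integer arithmetic. The function names are hypothetical and the sketch only models the address computation, not the task switch handler itself.

```c
#include <stdint.h>

/* Addition on the full-width register value: may leave the 32-bit range. */
static uint64_t advance_rip_untruncated(uint64_t rip, unsigned int inst_len)
{
    return rip + inst_len;
}

/* Value suitable for the 32-bit EIP field of the outgoing TSS. */
static uint32_t tss_eip_after_insn(uint64_t rip, unsigned int inst_len)
{
    /* The conversion wraps modulo 2^32, matching the 4G->0 boundary. */
    return (uint32_t)(rip + inst_len);
}
```

For an instruction of length 4 at 0xfffffffe, the untruncated sum is 0x100000002, which is out of range for a 32-bit task, while the truncated value is 2.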
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Thu, 28 Nov 2019 16:47:25 +0000 (17:47 +0100)]
build: provide option to disambiguate symbol names
The .file assembler directives generated by the compiler do not include
any path components (gcc) or just the ones specified on the command line
(clang, at least version 5), and hence multiple identically named source
files (in different directories) may produce identically named static
symbols (in their kallsyms representation). The binary diffing algorithm
used by xen-livepatch, however, depends on having unique symbols.
Make the ENFORCE_UNIQUE_SYMBOLS Kconfig option control the (build)
behavior, and if enabled use objcopy to prepend the (relative to the
xen/ subdirectory) path to the compiler invoked STT_FILE symbols. Note
that this build option is made no longer depend on LIVEPATCH, but merely
defaults to its setting now.
Conditionalize explicit .file directive insertion in C files where it
exists just to disambiguate names in a less generic manner; note that
at the same time the redundant emission of STT_FILE symbols gets
suppressed for clang. Assembler files as well as multiply compiled C
ones using __OBJECT_FILE__ are left alone for the time being.
Since we now expect there not to be any duplicates anymore, also don't
force the selection of the option to 'n' anymore in allrandom.config.
Similarly COVERAGE no longer suppresses duplicate symbol warnings if
enforcement is in effect, which in turn allows
SUPPRESS_DUPLICATE_SYMBOL_WARNINGS to simply depend on
!ENFORCE_UNIQUE_SYMBOLS.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 28 Nov 2019 14:14:03 +0000 (15:14 +0100)]
x86/IRQ: make internally used IRQs also honor the pending EOI stack
At the time the pending EOI stack was introduced there were no
internally used IRQs which would have the LAPIC EOI issued from the
->end() hook. This had then changed with the introduction of IOMMUs,
but the interaction issue was presumably masked by
irq_guest_eoi_timer_fn() frequently EOI-ing interrupts way too early
(which got fixed by 359cf6f8a0ec ["x86/IRQ: don't keep EOI timer
running without need"]).
The problem is that with us re-enabling interrupts across handler
invocation, a higher priority (guest) interrupt may trigger while
handling a lower priority (internal) one. The EOI issued from
->end() (for ACKTYPE_EOI kind interrupts) would then mistakenly
EOI the higher priority (guest) interrupt, breaking (among other
things) pending EOI stack logic's assumptions.
Notes:
- In principle we could get away without the check_eoi_deferral flag.
I've introduced it just to make sure there's as little change as
possible to unaffected paths.
- Similarly the cpu_has_pending_apic_eoi() check in do_IRQ() isn't
strictly necessary.
- The new function's name isn't very helpful with its use in
end_level_ioapic_irq_new(). I did also consider eoi_APIC_irq() (to
parallel ack_APIC_irq()), but then liked this even less.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Diagnosed-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monné [Thu, 28 Nov 2019 10:58:25 +0000 (11:58 +0100)]
x86/vmx: always sync PIR to IRR before vmentry
When using posted interrupts on Intel hardware it's possible that the
vCPU resumes execution with a stale local APIC IRR register because
depending on the interrupts to be injected vlapic_has_pending_irq
might not be called, and thus PIR won't be synced into IRR.
Fix this by making sure PIR is always synced to IRR in
hvm_vcpu_has_pending_irq regardless of what interrupts are pending.
Reported-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Joe Jin <joe.jin@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Sergey Dyasli [Wed, 27 Nov 2019 10:04:30 +0000 (10:04 +0000)]
x86/microcode: refuse to load the same revision ucode
Currently if a user tries to live-load the same or older ucode revision
than the CPU already has, they will get a single message in Xen log like:
(XEN) 128 cores are to update their microcode
No actual ucode loading will happen and this situation can be quite
confusing. Fix this by starting ucode update only when the provided
ucode revision is higher than the currently cached one (if any).
This is based on the property that if microcode_cache exists, all CPUs
in the system should have at least that ucode revision.
Additionally, print a user friendly message if no matching or newer
ucode can be found in the provided blob. This also requires ignoring
-ENODATA in AMD-side code, otherwise the message given to the user is:
(XEN) Parsing microcode blob error -61
Which actually means that a ucode blob was parsed fine, but no matching
ucode was found.
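A minimal sketch of the revision check described above, assuming a single cached patch and a plain integer revision. The struct and function names are hypothetical and do not reflect Xen's microcode code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached microcode patch. */
struct ucode_patch {
    uint32_t revision;
};

/*
 * Decide whether a live update should start. With no cache, any matching
 * candidate is accepted; otherwise the candidate must be strictly newer.
 */
static bool should_start_update(const struct ucode_patch *cached,
                                const struct ucode_patch *candidate)
{
    if (candidate == NULL)
        return false;           /* nothing matching in the blob */
    if (cached == NULL)
        return true;            /* no cached revision to compare against */
    return candidate->revision > cached->revision;
}
```

With this shape, loading the same revision twice is refused on the second attempt, because the comparison is strict.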
Igor Druzhinin [Tue, 26 Nov 2019 17:08:19 +0000 (17:08 +0000)]
AMD/IOMMU: honour IR setting while pre-filling DTEs
IV bit shouldn't be set in DTE if interrupt remapping is not
enabled. It's a regression in behavior of "iommu=no-intremap"
option which otherwise would keep interrupt requests untranslated
for all of the devices in the system regardless of whether it's
described as valid in IVRS or not.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
George Dunlap [Tue, 26 Nov 2019 15:49:20 +0000 (15:49 +0000)]
docs/xl: Document pci-assignable state
Changesets 319f9a0ba9 ("passthrough: quarantine PCI devices") and ba2ab00bbb ("IOMMU: default to always quarantining PCI devices")
introduced PCI device "quarantine" behavior, but did not document how
the pci-assignable-add and -remove functions act in regard to this.
Rectify this.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Paul Durrant <paul@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 26 Nov 2019 13:17:45 +0000 (14:17 +0100)]
EFI: fix "efi=attr=" handling
Commit 633a40947321 ("docs: Improve documentation and parsing for efi=")
failed to honor the strcmp()-like return value convention of
cmdline_strcmp().
Reported-by: Roman Shaposhnik <roman@zededa.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
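The strcmp()-like convention mentioned in the EFI fix above is easy to get backwards. The sketch below uses the standard strcmp() as a stand-in (the signature of cmdline_strcmp() is not shown in the commit message) and hypothetical helper names to contrast the correct test with the inverted one.

```c
#include <stdbool.h>
#include <string.h>

/*
 * strcmp()-like functions return 0 on a match, so the result must be
 * compared against zero (or negated) rather than used as a boolean.
 */
static bool option_matches(const char *arg, const char *name)
{
    return strcmp(arg, name) == 0;   /* correct: 0 means "equal" */
}

/* The buggy pattern: non-zero (i.e. "different") is treated as a match. */
static bool option_matches_buggy(const char *arg, const char *name)
{
    return strcmp(arg, name);
}
```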
There are two mappings active in the middle of do_recalc(), and hence
commit 0d0f4d78e5d1 ("p2m: change write_p2m_entry to return an error
code") should have added (or otherwise invoked) unmapping code just
like it did in p2m_next_level(), despite us not expecting any errors
here. Arrange for the existing unmap invocation to take effect in all
cases.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Tue, 26 Nov 2019 13:16:09 +0000 (14:16 +0100)]
x86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible
This hypercall can take a long time to finish because it attempts to
grab the `hostp2m' lock up to 1024 times. The accumulated wait for the
lock can take several seconds.
This can easily happen with a guest with 32 vcpus and plenty of RAM,
during localhost migration.
While the patch doesn't fix the problem with the lock contention and
the fact that the `hostp2m' lock is currently global (and not on a
single page), it is still an improvement to the hypercall. It will in
particular, down the road, allow dropping the arbitrary limit of 1024
entries per request.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
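As a rough illustration of what making a request preemptible means, the sketch below processes a batch of entries but stops early when a preemption check fires, returning the index to resume from. The preemption predicate, the per-entry work and all names are hypothetical; the real hypercall uses Xen's continuation machinery instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical preemption check; here it simply fires every 4 entries. */
static bool preempt_needed(size_t done)
{
    return done != 0 && done % 4 == 0;
}

/*
 * Process entries [start, count) and return the index to resume from.
 * A return value equal to count means the whole request completed.
 */
static size_t process_batch(int *entries, size_t start, size_t count)
{
    size_t i;

    for (i = start; i < count; i++) {
        /* Always make progress on the first entry of each call. */
        if (i != start && preempt_needed(i))
            break;              /* caller re-issues the request from i */
        entries[i] *= 2;        /* placeholder for the per-entry work */
    }
    return i;
}
```

Because progress is reported back to the caller, nothing in this shape depends on a fixed upper bound on the number of entries per request.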
Jan Beulich [Tue, 26 Nov 2019 13:15:01 +0000 (14:15 +0100)]
IOMMU: default to always quarantining PCI devices
XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.
Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.
Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).
This is XSA-306.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
George Dunlap [Tue, 26 Nov 2019 10:32:42 +0000 (10:32 +0000)]
x86: Don't increase ApicIdCoreSize past 7
Changeset ca2eee92df44 ("x86, hvm: Expose host core/HT topology to HVM
guests") attempted to "fake up" a topology which would induce guest
operating systems to not treat vcpus as sibling hyperthreads. This
involved actually reporting hyperthreading as available, but giving
vcpus every other ApicId; which in turn led to doubling the ApicIds
per core by bumping the ApicIdCoreSize by one. In particular, Ryzen
3xxx series processors -- and reportedly EPYC "Rome" cpus -- have an
ApicIdCoreSize of 7; the "fake" topology increases this to 8.
Unfortunately, Windows running on modern AMD hardware -- including
Ryzen 3xxx series processors, and reportedly EPYC "Rome" cpus --
doesn't seem to cope with this value being higher than 7. (Linux
guests have so far continued to cope.)
A "proper" fix is complicated and it's too late to fix it either for
4.13, or to backport to supported branches. As a short-term fix,
limit this value to 7.
This does mean that a Linux guest, booted on such a system without
this change, and then migrating to a system with this change, with
more than 64 vcpus, would see an apparent topology change. This is a
low enough risk in practice that enabling this limit unilaterally, to
allow other guests to boot without manual intervention, is worth it.
Reported-by: Steven Haigh <netwiz@crc.id.au> Reported-by: Andreas Kinzler <hfp@posteo.de> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
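The short-term fix described above amounts to a clamp. A minimal sketch, assuming the value is a small unsigned field; the helper name and constant name are hypothetical:

```c
#include <stdint.h>

#define APIC_ID_CORE_SIZE_MAX 7u  /* limit taken from the commit message */

/*
 * Hypothetical helper: bump the host's ApicIdCoreSize by one for the
 * "fake" topology, but never report a value above the limit.
 */
static uint8_t guest_apic_id_core_size(uint8_t host_size)
{
    uint8_t size = host_size + 1;

    return size > APIC_ID_CORE_SIZE_MAX ? APIC_ID_CORE_SIZE_MAX : size;
}
```

A host value of 7 therefore stays at 7 instead of becoming 8, while smaller values are still bumped by one.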
George Dunlap [Fri, 22 Nov 2019 18:52:02 +0000 (18:52 +0000)]
x86/mm: Adjust linear uses / entries when a page loses validation
"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or at another pagetable of the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).
XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to by another page
table, to prevent both from happening at the same time.
Additionally, XSA-299 introduced a mode whereby if a page was known to
have been only partially validated, _put_page_type() would be called
with PTF_partial_set, indicating that if the page had been
de-validated by someone else, the type count should be left alone.
Unfortunately, this change did not account for the required accounting
for linear page table uses and entries; in the case that a previously
partially-devalidated pagetable was fully-devalidated by someone else,
the linear_pt_counts are not updated.
This could happen in one of two places:
1. In the case a partially-devalidated page was re-validated by
someone else
2. During domain tear-down, when pages are force-invalidated while
leaving the type count intact.
The second could be ignored, since at that point the pages can no
longer be abused; but the first requires handling. Note however that
this would not be a security issue: having the counts be too high is
overly strict (i.e., will prevent a page from being used in a way
which is perfectly safe), but shouldn't cause any other issues.
Fix this by adjusting the linear counts when a page loses validation,
regardless of whether the de-validation completed or was only partial.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
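The "point to or be pointed to, but not both" rule from the commit above can be sketched with one signed counter per page. This is a single-threaded illustration with hypothetical names and an assumed sign convention; real code would need atomic updates.

```c
#include <stdbool.h>

/*
 * Hypothetical per-page counter:
 *   > 0  the page holds that many linear entries (it points to others)
 *   < 0  the page is pointed to that many times by same-level pages
 *   == 0 neither
 */
struct page {
    int linear_count;
};

/* Record that this page points to another same-level page. */
static bool inc_linear_entries(struct page *pg)
{
    if (pg->linear_count < 0)
        return false;       /* already a target: refuse to also be a source */
    pg->linear_count++;
    return true;
}

/* Record that this page is pointed to by another same-level page. */
static bool inc_linear_uses(struct page *pg)
{
    if (pg->linear_count > 0)
        return false;       /* already a source: refuse to also be a target */
    pg->linear_count--;
    return true;
}

static void dec_linear_entries(struct page *pg) { pg->linear_count--; }
static void dec_linear_uses(struct page *pg)    { pg->linear_count++; }
```

If a decrement is skipped when a page loses validation, the counter stays non-zero and later requests in the other direction are refused. This matches the "overly strict but safe" outcome the commit message describes.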
libxl: make default path to add/remove all PV devices
Adding/removing device is handled for specific devices only: VBD, VIF,
QDISK. This commit adds default case to handle adding/removing for all PV
devices by default, except QDISK device, which requires special handling.
If any other device requires special handling, it should be done by
implementing a separate case (similar to the QDISK device). The default
behaviour for adding a device is to wait until the backend goes to
XenbusStateInitWait and the default behaviour on removing a device is to
start the generic device remove procedure.
This commit also fixes the guest removal function: previously the guest was
removed when all VIF and VBD devices were removed. The fix removes the
guest when all created devices are removed. This is done by checking the
guest device list instead of checking num_vifs and num_vbds. num_vifs and
num_vbds variables are removed as redundant in this case.
Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wl@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
There are two kinds of VKBD devices: with a QEMU backend and with a user space PV
backend. In the current implementation they can't be distinguished as both use
the VKBD backend type. As a result, the user space PV KBD backend is started and
stopped as a QEMU backend. This commit adds a new device kind, VINPUT, to be
used as the backend type for the user space PV KBD backend.
Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 21 Nov 2019 18:21:49 +0000 (18:21 +0000)]
x86/vvmx: Fix livelock with XSA-304 fix
It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.
Nested virt is experimental and not security supported, but is useful for
development purposes. In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.
Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting. Take the opportunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.
When nested virt is enabled on a domain, force it to use executable
superpages and rebuild the p2m.
Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Tue, 5 Nov 2019 19:08:14 +0000 (19:08 +0000)]
x86/livepatch: Prevent patching with active waitqueues
The safety of livepatching depends on every stack having been unwound, but
there is one corner case where this is not true. The Sharing/Paging/Monitor
infrastructure may use waitqueues, which copy the stack frame sideways and
longjmp() to a different vcpu.
This case is rare, and can be worked around by pausing the offending
domain(s), waiting for their rings to drain, then performing a livepatch.
In the case that there is an active waitqueue, fail the livepatch attempt with
-EBUSY, which is preferable to the fireworks which occur from trying to unwind
the old stack frame at a later point.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monné [Fri, 22 Nov 2019 16:52:59 +0000 (17:52 +0100)]
x86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode
Current code unconditionally prevents setting APIC_SPIV_FOCUS_DISABLED
regardless of the processor model, which is not correct according to
the specification.
This issue was discovered while trying to boot a pvshim with x2APIC
enabled.
Always allow setting APIC_SPIV_FOCUS_DISABLED: the local APIC
provided to guests is emulated by Xen, and as such doesn't depend on
the features found on the hardware processor. Note for example that
Xen offers x2APIC support to guests even when the underlying hardware
doesn't have such feature.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Julien Grall [Wed, 20 Nov 2019 13:37:51 +0000 (13:37 +0000)]
xen: Add missing va_end() in hypercall_create_continuation()
The documentation requires va_start() to always be matched with a
corresponding va_end(). However, this is not the case in the path used
for bad format.
This was introduced by XSA-296.
Coverity-ID: 1488727 Fixes: 0bf9f8d3e3 ("xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()") Signed-off-by: Julien Grall <julien@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
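The va_start()/va_end() pairing rule from the commit above is commonly handled with a single exit label, so that the bad-format path also reaches va_end(). The function below is a self-contained illustration with a made-up format ('i' for int, 'l' for long); it is not the code from hypercall_create_continuation().

```c
#include <stdarg.h>

/*
 * Sum the arguments described by a format string: 'i' is an int, 'l' a
 * long. Any other character is a bad format and yields -1. Every return
 * path goes through va_end().
 */
static long sum_args(const char *fmt, ...)
{
    va_list args;
    long total = 0;

    va_start(args, fmt);
    for (const char *p = fmt; *p != '\0'; p++) {
        switch (*p) {
        case 'i':
            total += va_arg(args, int);
            break;
        case 'l':
            total += va_arg(args, long);
            break;
        default:
            total = -1;
            goto out;       /* bad format: still reach va_end() below */
        }
    }
 out:
    va_end(args);
    return total;
}
```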
Sergey Dyasli [Wed, 30 Oct 2019 14:54:47 +0000 (14:54 +0000)]
x86/e820: fix 640k - 1M region reservation logic
Converting a guest from PV to PV-in-PVH makes the guest have 384k
less memory, which may confuse the guest's balloon driver. This happens
because Xen unconditionally reserves the 640k - 1M region in E820 despite
the fact that it's really usable RAM in PVH boot mode.
Fix this by skipping region type change in virtualised environments,
trusting whatever memory map our hypervisor has provided.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Mon, 9 Sep 2019 10:43:33 +0000 (11:43 +0100)]
x86/boot: Remove cached CPUID data from the trampoline
We have a cached cpuid_ext_features in the trampoline which is kept in sync by
various pieces of boot logic. This is complicated, and all it is actually
used for is to derive whether NX is safe to use.
Replace it with a canned value to load into EFER.
trampoline_setup() and efi_arch_cpu() now tweak trampoline_efer at the point
that they are stashing the main copy of CPUID data. Similarly,
early_init_intel() needs to tweak if it has re-enabled the use of NX.
This simplifies the AP boot and S3 resume paths by using trampoline_efer
directly, rather than locally turning FEATURE_NX into EFER_NX.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
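A sketch of the "canned EFER value" idea from the commit above: rather than caching CPUID feature bits and deriving NX later, the final EFER value is kept directly and adjusted once. The EFER bit positions are architectural; the baseline value and the helper name are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural EFER bit positions. */
#define EFER_SCE (UINT32_C(1) << 0)   /* SYSCALL enable */
#define EFER_LME (UINT32_C(1) << 8)   /* long mode enable */
#define EFER_NX  (UINT32_C(1) << 11)  /* no-execute enable */

/* Hypothetical canned value; the baseline bits are an assumption. */
static uint32_t trampoline_efer = EFER_LME | EFER_SCE;

/* Called once, where the CPUID data is stashed during boot. */
static void set_trampoline_nx(bool cpu_has_nx)
{
    if (cpu_has_nx)
        trampoline_efer |= EFER_NX;
    else
        trampoline_efer &= ~EFER_NX;
}
```

Later boot paths can then load trampoline_efer as-is, without translating a feature flag into an EFER bit locally.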
Anthony PERARD [Wed, 20 Nov 2019 16:12:12 +0000 (17:12 +0100)]
x86/Makefile: remove $(guard) use from $(TARGET).efi target
Following the patch 65d104984c04 ("x86: fix race to build
arch/x86/efi/relocs-dummy.o"), the error message
nm: 'efi/relocs-dummy.o': No such file"
started to appear on systems which can't build the .efi target. This is
because relocs-dummy.o isn't built anymore.
The error is printed by the evaluation of VIRT_BASE and ALT_BASE, which
aren't used anyway.
But we don't need that file as we don't want to build `$(TARGET).efi'
anyway. On such systems, $(guard) evaluates to the shell builtin ':',
which prevents any of the shell commands in `$(TARGET).efi' from being
executed.
Even if $(guard) is evaluated upon use, it depends on $(XEN_BUILD_PE)
which is evaluated at the assignment. So, we can replace $(guard) in
$(TARGET).efi by having two different rules depending on
$(XEN_BUILD_PE) instead.
The change with this patch is that none of the dependencies of
$(TARGET).efi will be built if the linker doesn't support PE
and VIRT_BASE and ALT_BASE don't get evaluated anymore, so nm will not
complain about the missing relocs-dummy.o file anymore.
Since prelink-efi.o isn't built on systems that can't build
$(TARGET).efi anymore, we can remove the $(guard) variable everywhere.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
efi: do not use runtime services table with efi=no-rs
Before dfcccc6631 "efi: use directmap to access runtime services table"
all usages of efi_rs pointer were guarded by efi_rs_enter(), which
implicitly refused to operate with efi=no-rs (by checking if
efi_l4_pgtable is NULL - which is the case for efi=no-rs). The said
commit (re)moved that call as unneeded for just reading content of
efi_rs structure - to avoid unnecessary page tables switch. But it
neglected to check if efi_rs access is legal.
Fix this by adding an explicit check for runtime services being enabled in
the cases that do not use efi_rs_enter().
Reported-by: Roman Shaposhnik <roman@zededa.com> Fixes: dfcccc6631 "efi: use directmap to access runtime services table" Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
c/s ff66ccefe5 "x86/CPUID: adjust SSEn dependencies" made SSE4A depend on
SSSE3, but these processors really do have SSE4A without SSSE3.
This manifests as an upgrade regression, as the SSE4A feature disappears from
view.
Adjust the SSE4A feature to depend on SSE3 rather than SSSE3.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
> To embed Python into an application, a new --embed option must be
> passed to python3-config --libs --embed to get -lpython3.8 (link the
> application to libpython). To support both 3.8 and older, try
> python3-config --libs --embed first and fallback to python3-config
> --libs (without --embed) if the previous command fails.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Wei Liu <wl@xen.org>
[ wei: rerun autogen.sh ]
Ian Jackson [Tue, 29 Oct 2019 17:45:30 +0000 (17:45 +0000)]
tools/configure: Honour XEN_COMPILE_ARCH and _TARGET_ for shim
The pvshim can only be built 64-bit because the hypervisor is only
64-bit nowadays. The hypervisor build supports XEN_COMPILE_ARCH and
XEN_TARGET_ARCH which override the information from uname. The pvshim
build runs out of the tools/ directory but calls the hypervisor build
system.
If one runs in a Linux 32-bit userland with a 64-bit kernel, one used
to be able to set XEN_COMPILE_ARCH. But nowadays this does not work.
configure sees the target cpu as 64-bit and tries to build pvshim.
The build prints
echo "*** Xen x86/32 target no longer supported!"
and doesn't build anything. Then the subsequent Makefiles try to
install the non-built pieces.
Fix this anomaly by causing configure to honour the Xen hypervisor way
of setting the target architecture.
In principle this user behaviour is not handled quite right, because
configure will still see 64-bit and so all the autoconf-based
architecture testing will see 64-bit rather than 32-bit x86. But the
tools are in fact generally quite portable: this particular location
in configure{.ac,} is the only place in tools/ where 64-bit x86 is
treated differently from 32-bit x86, so the fix is sufficient and
correct for this use case.
It remains the case that setting XEN_COMPILE_ARCH or XEN_TARGET_ARCH to a
non-x86 architecture, when configure thinks things are x86, or vice
versa, will not work right.
(This is a bugfix to 8845155c831c
pvshim: make PV shim build selectable from configure
which inadvertently deleted the logic to only build the shim for
XEN_TARGET_ARCH != x86_32.)
I have rerun autogen.sh, so this patch contains the fix to configure
as well as the source fix to configure.ac.
Fixes: 8845155c831c59e867ee3dd31ee63e0cc6c7dcf2 Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> CC: Olaf Hering <olaf@aepfle.de> CC: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Jürgen Groß <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
libxl: gentypes: initialise array elements in json
Currently, array elements are initialised with calloc, which means
all element fields are initialised with zero values. If an entry is not
present in the json (which is entirely permitted), the element will be
all-bits-zero instead of the default value (which is wrong).
The fix is to initialise each array element before parsing it, using
the new libxl_C_type_do_init function.
With existing types this results in a lot of new calls like this:
(indentation adjusted). This looks right. To check what happens with
types which have nontrivial defaults but don't have init functions (of
which we currently have none in arrays), I (Ian) experimentally added:
("pnode", uint32), # physical node of this node
("vcpus", libxl_bitmap), # vcpus in this node
+ ("sporks", Array(MemKB, "num_sporks")),
])
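The array-element initialisation problem described above can be illustrated in plain C. The type, the default value and the function names below are hypothetical and are not the code generated by gentypes.py. The point is only that each element gets its init call before any parsing, so absent fields keep their defaults rather than zero.

```c
#include <stdint.h>
#include <stdlib.h>

#define NODE_UNSET UINT32_MAX   /* hypothetical non-zero default */

struct node {
    uint32_t pnode;
};

/* Hypothetical per-type init function setting the default value. */
static void node_init(struct node *n)
{
    n->pnode = NODE_UNSET;
}

/*
 * Allocate an array and initialise every element before any parsing
 * happens, so fields absent from the input keep their defaults rather
 * than the all-bits-zero left by calloc().
 */
static struct node *node_array_alloc(size_t count)
{
    struct node *arr = calloc(count, sizeof(*arr));

    if (arr == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        node_init(&arr[i]);
    return arr;
}
```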
Ian Jackson [Tue, 29 Oct 2019 15:19:33 +0000 (15:19 +0000)]
libxl: gentypes.py: Break out libxl_C_type_do_init
This is going to be the common way to initialise things.
_libxl_C_type_init remains the thing for generating the body of the
init function, and for some special cases.
No functional change with existing types: C output is identical.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Ian Jackson [Tue, 29 Oct 2019 15:17:58 +0000 (15:17 +0000)]
libxl: gentypes.py: Break out field_pass in ..._copy_deprecated
We are going to want this in a moment.
No functional change with existing types: C output is identical.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Ian Jackson [Tue, 29 Oct 2019 15:00:35 +0000 (15:00 +0000)]
tools/libxl: gentypes.py: Prefer init_val to init_fn
When both are provided, init_val is likely to be more direct.
No functional change with existing types: C output is identical.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Anthony PERARD [Thu, 31 Oct 2019 12:17:27 +0000 (12:17 +0000)]
libxl_pci: Don't hold QMP connection while waiting
After sending the 'device_del' command for a PCI passthrough device,
we wait until QEMU has effectively deleted the device; this involves
executing more QMP commands. While waiting, libxl holds the connection.
It isn't necessary to hold the connection and it prevents others from
making progress, so this patch releases the QMP connection.
For background:
e.g., when a guest is created with several pci passthrough devices
attached, on `xl destroy` all the devices need to be detached, and
this is usually what happens:
- 'device_del' called for the 1st pci device
- 'query-pci' checking if pci still there, it is
- wait 1s
- 'query-pci' checking again, and it's gone
-> now the same can be done for the second pci device, so
plenty of waiting on others when pci detach can be done in
parallel.
On shutdown, libxl usually keeps waiting because QEMU never
releases the device because the guest kernel never responds to QEMU's
unplug queries. So detaching of the 1st device waits until a
timeout stops it, and since the same timeout is set up at the same
time for the other devices to detach, the 'device_del' command is
never sent for those.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 17:13:08 +0000 (17:13 +0000)]
libxl_qmp: Have a lock for QMP socket access
This patch works around the fact that it's not possible to connect
multiple times to a single QMP socket. QEMU listens on the socket with
a backlog value of 1, which means that on Linux when concurrent threads
call connect() on the socket, they get EAGAIN.
Background:
This happens when attempting to create a guest with multiple
passed-through pci devices: libxl creates one connection per device to
attach and executes connect() on all of them at once, before any single
connection has finished.
To work around this, we use a new lock.
Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 18:10:14 +0000 (18:10 +0000)]
libxl: Introduce libxl__ev_immediate
This new ev allows arranging for a non-reentrant callback to be called.
This happens immediately after the current event is processed and after
any other ev_immediates that have already been registered.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 17:13:06 +0000 (17:13 +0000)]
libxl: libxl__ev_qmp_send now takes an egc
No functional changes.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 17:13:05 +0000 (17:13 +0000)]
libxl: Introduce libxl__ev_slowlock_dispose
This allows cancelling the lock operation while it is in the Active state.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 17:13:04 +0000 (17:13 +0000)]
libxl: Rename ev_devlock to ev_slowlock
We are going to introduce a different lock based on the same
implementation as the ev_devlock but with a different path. The
different slowlocks will be differentiated by calling different _init()
functions.
So we rename libxl__ev_devlock to libxl__ev_slowlock, but keep
libxl__ev_devlock_init().
Some log messages produced by ev_slowlock are changed to print the
name of the lock file (userdata_userid).
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 17:13:03 +0000 (17:13 +0000)]
libxl: Move libxl__ev_devlock declaration
We are going to want to include libxl__ev_devlock into libxl__ev_qmp.
No functional changes.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Mon, 18 Nov 2019 17:13:02 +0000 (17:13 +0000)]
libxl: Introduce libxl__ev_child_kill_deregister
Allow deregistering the callback associated with a child death event.
The death isn't immediate and will need to be collected later, so the
ev_child machinery registers its own callback.
libxl__ev_child_kill_deregister() might be called by an AO operation
that is finishing/cleaning up without a chance for libxl to be
notified of the child death (via SIGCHLD). So it is possible that the
application calls libxl_ctx_free() while there are still children around.
To avoid the application getting unexpected SIGCHLD, the libxl__ao
responsible for killing a child will have to wait until it has been
properly reaped.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Fri, 15 Nov 2019 13:18:16 +0000 (14:18 +0100)]
x86: fix race to build arch/x86/efi/relocs-dummy.o
With $(TARGET).efi depending on efi/relocs-dummy.o, arch/x86/Makefile
will attempt to build that object. This may result in a dependency file
being generated that has relocs-dummy.o depending on efi/relocs-dummy.S.
Then, when arch/x86/efi/Makefile tries to build relocs-dummy.o, well
efi/relocs-dummy.S doesn't exist.
Have only one makefile responsible for building relocs-dummy.o.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Fri, 15 Nov 2019 13:17:26 +0000 (14:17 +0100)]
AMD/IOMMU: restore DTE fields in amd_iommu_setup_domain_device()
Commit 1b00c16bdf ("AMD/IOMMU: pre-fill all DTEs right after table
allocation") moved ourselves into a more secure default state, but
didn't take sufficient care to also undo the effects when handing a
previously disabled device back to a(nother) domain. Put the fields
that may have been changed elsewhere back to their intended values
(some fields amd_iommu_disable_domain_device() touches don't
currently get written anywhere else, and hence don't need modifying
here).
Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Igor Druzhinin <igor.druzhinin@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Fri, 15 Nov 2019 13:15:31 +0000 (14:15 +0100)]
x86emul: 16-bit XBEGIN does not truncate rIP
SDM rev 071 points out this fact explicitly.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Wed, 13 Nov 2019 18:11:17 +0000 (18:11 +0000)]
xen/sched: Render sibling/core masks with %pbl to improve 'r' debugkey
For systems with large numbers of CPUs, the 'r' debugkey is unwieldy. Sibling
and core masks are a single block of adjacent bits, so are vastly shorter to
render with %pbl.