Tamas K Lengyel [Thu, 25 Apr 2019 15:32:50 +0000 (09:32 -0600)]
x86/mem_sharing: aquire extra references for pages with correct domain
Patch 0502e0adae2 "x86: correct instances of PGC_allocated clearing" introduced
grabbing extra references for pages that drop references tied to PGC_allocated.
However, these pages are actually owned by dom_cow, resulting both sharing and
unsharing breaking.
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 23 Apr 2019 15:18:29 +0000 (16:18 +0100)]
xen/timers: Fix memory leak with cpu unplug/plug (take 2)
Previous attempts to fix this leak didn't identify the root cause, and
ultimately failed. The cause is actually the CPU_UP_PREPARE case
(re)initialising ts->heap back to dummy_heap, which leaks the previous
allocation.
Rearrange the logic to only initialise ts once. This also avoids the
redundant (but benign, due to ts->inactive always being empty) initialising of
the other ts fields.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 24 Apr 2019 17:53:15 +0000 (18:53 +0100)]
xen/domain: Block more speculative out-of-bound accesses
c/s f8303458 restricted speculative access for do_vcpu_op(), but neglected its
compat counterpart, which is reachable by guests using the 32bit ABI.
Make an identical adjustment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Norbert Manthey <nmanthey@amazon.de> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 26 May 2016 16:37:30 +0000 (17:37 +0100)]
x86/shadow: Drop incorrect diagnostic when shadowing TSS.RSP0
During development of the XTF pagewalk tests, I reliably encountered this
message exactly once per run. It occurs when the first action to touch
TSS.RSP0 is an interrupt/exception taken in userspace, and the processor tries
to push the IRET frame.
Subsequently, OSSTest has demonstrated that it triggers frequently for a
KPTI-enabled kernel.
(XEN) multi.c:3324:d1v1 write to pagetable during event injection: cr2=0xffffad2646687f38, mfn=0x2415a1
[ 1411.949155] systemd-logind[2683]: New session 73 of user root.
(XEN) multi.c:3324:d1v1 write to pagetable during event injection: cr2=0xffffad264671ff38, mfn=0x240a41
(XEN) multi.c:3324:d1v1 write to pagetable during event injection: cr2=0xffffad2646837f38, mfn=0x2415c5
(XEN) multi.c:3324:d1v1 write to pagetable during event injection: cr2=0xffffad26468a7f38, mfn=0x2414e7
[ 1442.207473] systemd-logind[2683]: New session 74 of user root.
[ 1471.452206] systemd-logind[2683]: New session 75 of user root.
(XEN) multi.c:3324:d1v1 write to pagetable during event injection: cr2=0xffffad2646d17f08, mfn=0x2417c5
[ 1501.698971] systemd-logind[2683]: New session 76 of user root.
The actions performed by the shadow code are correct, and the guest continues
without error, but the emitted error is misleading. Tweak the comment to more
clearly identify why the condition exists, but drop the message.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
Andrew Cooper [Fri, 1 Feb 2019 14:48:48 +0000 (14:48 +0000)]
x86/svm: Fix handling of ICEBP intercepts
c/s 9338a37d "x86/svm: implement debug events" added support for introspecting
ICEBP debug exceptions, but didn't account for the fact that
svm_get_insn_len() (previously __get_instruction_length) can fail and may
already have raised #GP with the guest.
If svm_get_insn_len() fails, return back to guest context rather than
continuing and mistaking a trap-style VMExit for a fault-style one.
Spotted by Coverity.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Brian Woods <brian.woods@amd.com>
xen/sched: we never get into context_switch() with prev==next
In schedule(), if we pick, as the next vcpu to run (next) the same one
that is running already (prev), we never get to call context_switch().
We can, therefore, get rid of all the `if`-s testing prev and next being
different, trading them with an ASSERT() (on ARM, the ASSERT() was even
already there!)
Suggested-by: Juergen Gross <jgross@suse.com> Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrii Anisov <andrii_anisov@epam.com>
Andrew Cooper [Fri, 5 Apr 2019 12:26:30 +0000 (13:26 +0100)]
x86/boot: Detect the firmware SMT setting correctly on Intel hardware
While boot_cpu_data.x86_num_siblings is an accurate value to use on AMD
hardware, it isn't on Intel when the user has disabled Hyperthreading in the
firmware. As a result, a user which has chosen to disable HT still gets
nagged on L1TF-vulnerable hardware when they haven't chosen an explicit
smt=<bool> setting.
Make use of the largely-undocumented MSR_INTEL_CORE_THREAD_COUNT which in
practice exists since Nehalem, when booting on real hardware. Fall back to
using the ACPI table APIC IDs.
While adjusting this logic, fix a latent bug in amd_get_topology(). The
thread count field in CPUID.0x8000001e.ebx is documented as 8 bits wide,
rather than 2 bits wide.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 5 Apr 2019 12:26:30 +0000 (12:26 +0000)]
x86/msr: Definitions for MSR_INTEL_CORE_THREAD_COUNT
This is a model specific register which details the current configuration
cores and threads in the package. Because of how Hyperthread and Core
configuration works works in firmware, the MSR it is de-facto constant and
will remain unchanged until the next system reset.
It is a read only MSR (so unilaterally reject writes), but for now retain its
leaky-on-read properties. Further CPUID/MSR work is required before we can
start virtualising a consistent topology to the guest, and retaining the old
behaviour is the safest course of action.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 5 Apr 2019 15:58:44 +0000 (15:58 +0000)]
x86/boot: Don't leak the module_map allocation in __start_xen()
Ever since its introducion in c/s 436fb462 "x86/microcode: enable boot
time (pre-Dom0) loading", the allocation has gone un-freed, and has its final
use as part of constructing dom0.
Xen already consideres it an error to have more than a single unaccounted-for
module (again, logic from the same change), and will only pass the first one
to dom0 as the initrd.
Instead of having an 8 byte pointer to a bitmap which won't exceed 4 bits wide
in any production scenario (dom0 kernel, initrd, XSM blob and microcode blob),
allocate module_map[] on the stack and add a sanity bound for mbi->mods_count.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
The limit 1900x1200 do not match real world devices (1900 looks like a
typo, should be 1920). But in practice the limits are arbitrary and do
not serve any real purpose. As discussed in "Increase framebuffer size
to todays standards" thread, drop them completely.
This fixes graphic console on device with 3840x2160 native resolution.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
MAX_BPP, MAX_FONT_W, MAX_FONT_H are not used in the code at all.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
When bitmap_fill(..., 0) is called, do not try to write anything. Before
this patch, it tried to write almost LONG_MAX, surely overwriting
something.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 13 May 2019 07:58:57 +0000 (09:58 +0200)]
page-alloc: detect double free earlier
Right now this goes unnoticed until some subsequent page allocator
operation stumbles across the thus corrupted list. We can do better:
Only PGC_state_inuse and PGC_state_offlining pages can legitimately be
passed to free_heap_pages().
Take the opportunity and also restrict the PGC_broken check to the
PGC_state_offlining case, as only pages of that type or
PGC_state_offlined may have this flag set on them. Similarly, since
PGC_state_offlined is not a valid input state, the setting of "tainted"
can be restricted to just this case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Eslam Elnikety [Mon, 13 May 2019 07:58:08 +0000 (09:58 +0200)]
mm: option to _always_ scrub freed domheap pages
Give the administrator further control on when to scrub domheap pages by adding
an option to always scrub. This is a safety feature that, when enabled,
prevents a (buggy) domain from leaking secrets if it accidentally frees a page
without proper scrubbing.
Signed-off-by: Eslam Elnikety <elnikety@amazon.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Tamas K Lengyel [Mon, 13 May 2019 07:55:59 +0000 (09:55 +0200)]
x86/vmx: correctly gather gs_shadow value for current vCPU
Currently the gs_shadow value is only cached when the vCPU is being scheduled
out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
since it doesn't represent the actual state of the vCPU at the time the event
was recorded. This prevents vm_event subscribers from correctly finding kernel
structures in the guest when it is trapped while in ring3.
Refresh shadow_gs value when the context being saved is for the current vCPU.
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Alexandru Isaila [Mon, 13 May 2019 07:55:24 +0000 (09:55 +0200)]
x86/altp2m: aggregate get entry and populate into common funcs
The code for getting the entry and then populating was repeated in
p2m_change_altp2m_gfn() and in p2m_set_altp2m_mem_access().
The code is now in one place with a bool param that lets the caller choose
if it populates after get_entry().
If remapping is being done then both the old and new gfn's should be
unshared in the hostp2m for keeping things consistent. The page type
of old_gfn was already checked whether it's p2m_ram_rw and bail if it
wasn't so functionality-wise this just simplifies things as a user
doesn't have to request unsharing manually before remapping.
Now, if the new_gfn is invalid it shouldn't query the hostp2m as
that is effectively a request to remove the entry from the altp2m.
But provided that scenario is used only when removing entries that
were previously remapped/copied to the altp2m, those entries already
went through P2M_ALLOC | P2M_UNSHARE before, so it won't have an
affect so the core function get_altp2m_entry() is calling
__get_gfn_type_access() with P2M_ALLOC | P2M_UNSHARE.
altp2m_get_entry_direct() is also called in p2m_set_suppress_ve()
because on a new altp2m view the function will fail with invalid mfn if
p2m->set_entry() was not called before.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Igor Druzhinin [Mon, 13 May 2019 07:54:45 +0000 (09:54 +0200)]
x86/mtrr: recalculate P2M type for domains with iocaps
This change reflects the logic in epte_get_entry_emt() and allows
changes in guest MTTRs to be reflected in EPT for domains having
direct access to certain hardware memory regions but without IOMMU
context assigned (e.g. XenGT).
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 13 May 2019 07:52:43 +0000 (09:52 +0200)]
AMD/IOMMU: disable previously enabled IOMMUs upon init failure
If any IOMMUs were successfully initialized before encountering failure,
the successfully enabled ones should be disabled again before cleaning
up their resources.
Move disable_iommu() next to enable_iommu() to avoid a forward
declaration, and take the opportunity to remove stray blank lines ahead
of both functions' final closing braces.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
Jan Beulich [Mon, 13 May 2019 07:51:23 +0000 (09:51 +0200)]
trace: fix build with gcc9
While I've not observed this myself, gcc 9 (imo validly) reportedly may
complain
trace.c: In function '__trace_hypercall':
trace.c:826:19: error: taking address of packed member of 'struct <anonymous>' may result in an unaligned pointer value [-Werror=address-of-packed-member]
826 | uint32_t *a = d.args;
and the fix is rather simple - remove the __packed attribute. Introduce
a BUILD_BUG_ON() as replacement, for the unlikely case that Xen might
get ported to an architecture where array alignment higher that that of
its elements.
Reported-by: Martin Liška <martin.liska@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Ian Jackson [Mon, 15 Apr 2019 16:13:09 +0000 (17:13 +0100)]
build system: make install-stubdom depend on install-tools again
In d290e325179ccee966cd679d0fed48be6f4cc1b7
"build system: don't let install-stubdom depend on install-tools"
the dependency of install-stubdom on install-tools was removed.
However, this was not correct. stubdom/Makefile contains this:
With recursive make, it is necessary for the overall structure of the
makefiles to sequence things so that each directory is entered exactly
once, before its dependent directories are entered. (It is possible
to violate this rule without creating races but it is tricky and
inadvisable.)
Since d290e325179c, it can happen that the command for the
qemu-xen-traditional-dir-find rule is run twice simultaneously - once
as a result of $(MAKE) -C tools install, and once as a result of
$(MAKE) -C stubdom install. If you get unlucky, this causes lossage.
(This just happened to me in an osstest flight.)
In principle we could alternatively fix this by lifting the commands
in the qemu-xen-traditional-dir-find target (and perhaps other things
too) into the toplevel Makefile, as was done for mini-os.
But that seems overkill given how bad the stubdom build system is, and
the fact that we think at some point this qemu-trad will go away
entirely. Adding the tools dependency back to the stubdom build is
by and large good enough.
(Someone who really wants to build stubdom without tools is welcome to
do this separation if they really want to.)
CC: Juergen Gross <jgross@suse.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
* Fix the shim build by providing a !CONFIG_HVM declaration for
hvm_get_guest_bndcfgs(), and removing the introduced
ASSERT(is_hvm_domain(d))'s. They are needed for DCE to keep the build
working. Furthermore, in this way, the risk of runtime type confusion is
removed.
* Revert the de-const'ing of the vcpu pointer in vmx_get_guest_bndcfgs().
vmx_vmcs_enter() really does mutate the vcpu, and may cause it to undergo a
full de/reschedule, which is contrary to the programmers expectation of
hvm_get_guest_bndcfgs(). guest_rdmsr() was always going to need to lose
its const parameter, and this was the correct time for it to happen.
* The MSRs in vcpu_msrs are in numeric order. Re-position XSS to match.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 11 Apr 2019 10:45:41 +0000 (04:45 -0600)]
timers: move back migrate_timers_from_cpu() invocation
Commit 597fbb8be6 ("xen/timers: Fix memory leak with cpu unplug/plug")
went a little too far: Migrating timers away from a CPU being offlined
needs to heppen independent of whether it get parked or fully offlined.
Jan Beulich [Thu, 11 Apr 2019 08:25:05 +0000 (10:25 +0200)]
x86: fix build race when generating temporary object files
The rules to generate xen-syms and xen.efi may run in parallel, but both
recursively invoke $(MAKE) to build symbol/relocation table temporary
object files. These recursive builds would both re-generate the .*.d2
files (where needed). Both would in turn invoke the same rule, thus
allowing for a race on the .*.d2.tmp intermediate files.
The dependency files of the temporary .xen*.o files live in xen/ rather
than xen/arch/x86/ anyway, so won't be included no matter what. Take the
opportunity and delete them, as the just re-generated .xen*.S files will
trigger a proper re-build of the .xen*.o ones anyway.
Empty the DEPS variable in case the set of goals consists of just those
temporary object files, thus eliminating the race.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/mm: Clean up p2m_finish_type_change return value
In the case of any errors, finish_type_change() passes values returned
from p2m->recalc() up the stack (with some exceptions in the case where
an error is expected); this eventually ends up being returned to the
XEN_DOMOP_map_mem_type_to_ioreq_server hypercall.
However, on Intel processors (but not on AMD processor), p2m->recalc()
can also return '1' as well as '0'. This case is handled very
inconsistently: finish_type_change() will return the value of the final
entry it attempts, discarding results for other entries;
p2m_finish_type_change() will attempt to accumulate '1's, so that it
returns '1' if any of the calls to finish_type_change() returns '1'; and
dm_op() will again return '1' only if the very last call to
p2m_finish_type_change() returns '1'. The result is that the
XEN_DMOP_map_mem_type_to_ioreq_server() hypercall will sometimes return
0 and sometimes return 1 on success, in an unpredictable manner.
The hypercall documentation doesn't mention return values; but it's not
clear what the caller could do with the information about whether
entries had been changed or not. At the moment it's always 0 on AMD
boxes, and *usually* 1 on Intel boxes; so nothing can be relying on a
'1' return value for correctness (or if it is, it's broken).
Make the return value on success consistently '0' by only returning
0/-ERROR from finish_type_change(). Also remove the accumulation code
from p2m_finish_type_change().
Suggested-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 5 Apr 2019 14:59:27 +0000 (15:59 +0100)]
x86/hvm: Fix altp2m_op hypercall continuations
c/s 9383de210 "x86/altp2m: support for setting restrictions for an array of
pages" introduced this logic, but do_hvm_op() was already capable of handling
-ERESTART correctly.
More problematic however is a continuation from compat_altp2m_op(). The arg
written back into register state points into the hypercall XLAT area, not at
the original parameter passed by the guest. It may be truncated by the
vmentry, but definitely won't be correct on the next invocation.
Delete the hypercall_create_continuation() call, and return -ERESTART, which
will cause the compat case to start working correctly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 28 Mar 2019 14:37:00 +0000 (14:37 +0000)]
x86/smt: Support for enabling/disabling SMT at runtime
Currently, a user can in principle combine the output of `xl info -n`, the
APCI tables, and some manual CPUID data to figure out which CPU numbers to
feed into `xen-hptool cpu-offline` to effectively disable SMT at runtime.
A more convenient option is to teach Xen how to perform this action.
Extend XEN_SYSCTL_cpu_hotplug with two new operations. Introduce a new
smt_up_down_helper() which wraps the cpu_{up,down}_helper() helpers with logic
which understands siblings based on their APIC_ID.
Add libxc stubs, and extend xen-hptool with smt-{enable,disable} options.
These are intended to be shorthands for a loop over cpu-{online,offline}.
To simplify the implemention, they will strictly enable/disable secondary
siblings (those with a non-zero thread id). This functionality is intended
for use in production scenarios where debugging options such as `maxcpus=` or
other manual plug/unplug configuration has not been used.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Igor Druzhinin [Tue, 9 Apr 2019 12:01:58 +0000 (13:01 +0100)]
tools/xl: use libxl_domain_info to get domain type for vcpu-pin
Parsing the config seems to be an overkill for this particular task
and the config might simply be absent. Type returned from libxl_domain_info
should be either LIBXL_DOMAIN_TYPE_HVM or LIBXL_DOMAIN_TYPE_PV but in
that context distinction between PVH and HVM should be irrelevant.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Anthony PERARD [Fri, 5 Apr 2019 17:58:11 +0000 (18:58 +0100)]
libxl: Document device_add_domain_config
Commit 03e1a56d81c16eece735e4d0ef74bfb10eaaba07 replaced DEVICE_ADD()
calls by device_add_domain_config() calls but also removed the comment
of DEVICE_ADD(). Copy the useful part of that comment to
device_add_domain_config().
Also, rename the parameter `type` to `dev`, because that parameter isn't
used as a type but as the device we want to add/update to d_config.
Also, constify `dev` because it isn't modified.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
This can't be folded into the resume hook, as that runs before bringing
back up APs, but the affinity adjustment wants to happen with all CPUs
back online. Hence a separate hook is needed such that AMD can then
leverage it as well.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Initially I had just noticed the unnecessary indirection in the call
from pi_update_irte(). The generic wrapper having an iommu_intremap
conditional made me look at the setup code though. So first of all
enforce the necessary dependency.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Thu, 14 Mar 2019 13:55:00 +0000 (14:55 +0100)]
x86: stop handling MSR_IA32_XSS save/restore in implementation code
Saving and restoring the value of this MSR is currently handled by
implementation-specific code despite it being architectural. This patch
moves handling of accesses to this MSR from hvm.c into the msr.c, thus
allowing the common MSR save/restore code to handle it.
This patch also adds proper checks of CPUID policy in the new get/set code.
NOTE: MSR_IA32_XSS is the last MSR to be saved and restored by
implementation-specific code. This patch therefore removes the
(VMX) definitions and of the init_msr(), save_msr() and
load_msr() hvm_funcs, as they are no longer necessary. The
declarations of and calls to those hvm_funcs will be cleaned up
by a subsequent patch.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Paul Durrant [Thu, 14 Mar 2019 13:56:00 +0000 (14:56 +0100)]
x86: stop handling MSR_IA32_BNDCFGS save/restore in implementation code
Saving and restoring the value of this MSR is currently handled by
implementation-specific code despite it being architectural. This patch
moves handling of accesses to this MSR from hvm.c into the msr.c, thus
allowing the common MSR save/restore code to handle it.
NOTE: Because vmx_get/set_guest_bndcfgs() call vmx_vmcs_enter(), the
struct vcpu pointer passed in, and hence the vcpu pointer passed to
guest_rdmsr() cannot be const.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Julien Grall [Wed, 27 Mar 2019 18:45:23 +0000 (18:45 +0000)]
xen/arm: memaccess: Initialize correctly *access in __p2m_get_mem_access
The commit 8d84e701fd "xen/arm: initialize access" initializes
*access using the wrong enumeration type. This result to a warning
using clang:
mem_access.c:50:20: error: implicit conversion from enumeration type
'p2m_access_t' to different enumeration type 'xenmem_access_t'
[-Werror,-Wenum-conversion]
*access = p2m->default_access;
~ ~~~~~^~~~~~~~~~~~~~
The correct solution is to use the array memaccess that will do the
conversion between the 2 enums.
xen/console: Properly buffer domU output when using CONSOLEIO_write
The output will be buffered if the buffer provided by the DomU does not
contain a newline. This can also happen if buffer provided by DomU is
split in multiple part (Xen can only process 127 characters at the time).
As Xen will remove any non-printable characters, the output buffer may
be smaller than the buffer provided. However, Xen will buffer using the
original length. This means that the NUL character and garbagge will be
copied in the internal buffer.
Once the newline is found or the internal buffer is full, only part of
the internal buffer will end up to be printed.
Igor Druzhinin [Thu, 4 Apr 2019 16:25:10 +0000 (17:25 +0100)]
x86/vmx: Fixup removals of MSR load/save list entries
Commit 540d5422 ("x86/vmx: Support removing MSRs from the host/guest
load/save lists") introduced infrastructure finally exposed by
commit fd32dcfe ("x86/vmx: Don't leak EFER.NXE into guest context")
that led to a functional regression on Harpertown and earlier cores
(Gen 1 VT-x) due to MSR count being incorrectly set in VMCS.
As the result, as soon as guest EFER becomes equal to Xen EFER
(which eventually happens in almost every 64-bit VM) and its MSR
entry is supposed to be removed, a stale version of EFER is loaded
into a guest instead causing almost immediate guest failure.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
c/s 597fbb8 "xen/timers: Fix memory leak with cpu unplug/plug" broke the ARM
build by being the first patch to add park_offline_cpus to common code.
While it is currently specific to Intel hardware (for reasons of being able to
handle machine check exceptions without an immediate system reset), it isn't
inherently architecture specific, so define it to be false on ARM for now.
Add a comment in both smp.h headers explaining the intended behaviour of the
option.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
This is the ABI for the two halves of a para-virtualized
camera driver which extends Xen's reach multimedia capabilities even
farther enabling it for video conferencing, In-Vehicle Infotainment,
high definition maps etc.
The initial goal is to support most needed functionality with the
final idea to make it possible to extend the protocol if need be:
1. Provide means for base virtual device configuration:
- pixel formats
- resolutions
- frame rates
2. Support basic camera controls:
- contrast
- brightness
- hue
- saturation
3. Support streaming control
Jan Beulich [Mon, 8 Apr 2019 11:08:05 +0000 (13:08 +0200)]
x86/IOMMU: initialize iommu_ops in vendor-independent code
Move this into iommu_hardware_setup() and make that function non-
inline. Move its declaration into common code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
Jan Beulich [Mon, 8 Apr 2019 11:04:23 +0000 (13:04 +0200)]
x86/IOMMU: introduce init-ops structure
Do away with the CPU vendor dependency, and set the init ops pointer
based on which ACPI tables have been found.
Also take the opportunity and add __read_mostly to iommu_ops.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Brian Woods <brian.woods@amd.com>
Jan Beulich [Mon, 8 Apr 2019 11:03:07 +0000 (13:03 +0200)]
x86/ACPI: also parse AMD IOMMU tables early
In order to be able to initialize x2APIC mode we need to parse
respective ACPI tables early. Split amd_iov_detect() into two parts for
this purpose, and call the initial part earlier on.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
xen/arm: Cap the number of interrupt lines for dom0
Dom0 vGIC will use the same number of interrupt lines as the hardware GIC.
While the hardware GIC can support up to 1020 interrupt lines,
the vGIC is only supporting up to 992 interrupt lines.
This means that Xen will not be able to boot on platforms where the hardware
GIC supports more than 992 interrupt lines.
While it would make sense to increase the limits in the vGICs, this is not
trivial because of the design choices.
At the moment, only models seem to report the maximum of interrupt lines.
They also do not have any interrupt wired above the 992 limit.
So it should be fine to cap the number of interrupt lines for dom0 to 992 lines.
Andrew Cooper [Fri, 29 Mar 2019 16:17:24 +0000 (16:17 +0000)]
xen/timers: Fix memory leak with cpu unplug/plug
timer_softirq_action() realloc's itself a larger timer heap whenever
necessary, which includes bootstrapping from the empty dummy_heap. Nothing
ever freed this allocation.
CPU plug and unplug has the side effect of zeroing the percpu data area, which
clears ts->heap. This in turn causes new timers to be put on the list rather
than the heap, and for timer_softirq_action() to bootstrap itself again.
This in practice leaks ts->heap every time a CPU is unplugged and replugged.
Implement free_percpu_timers() which includes freeing ts->heap when
appropriate, and update the notifier callback with the recent cpu parking
logic and free-avoidance across suspend.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 26 Mar 2019 11:54:34 +0000 (11:54 +0000)]
docs/hypervisor-guide: Code Coverage
During a discussion in person, it was identified that Coverage doesn't
currently work for ARM yet. Also, there are a number of errors with the
existing coverage document.
Take the opportunity to rewrite it in RST, making it easier to follow for a
non-expert user.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Fri, 5 Apr 2019 15:27:13 +0000 (17:27 +0200)]
x86emul: don't read mask register on AVX512F-incapable platforms
Nor when register state isn't sufficiently enabled.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Fri, 5 Apr 2019 11:21:57 +0000 (12:21 +0100)]
automation: fix "build each commit" test
An error was introduced while rebasing 9b8b3f30. The new test
shouldn't depend on anything, otherwise artefacts will be downloaded
from build stage and cause the script to abort.
Jan Beulich [Fri, 5 Apr 2019 14:28:31 +0000 (16:28 +0200)]
x86/entry: drop unused header inclusions
I'm in particular after getting rid of asm/apicdef.h, but there are more
no longer (or perhaps never having been) used ones.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Make XEN_VM_EVENT_RESUME return 0 in case of success, instead of
-EINVAL.
Remove vm_event_resume form vm_event.h header and set the function's
visibility to static as is used only in vm_event.c.
Move the vm_event_check_ring test inside vm_event_resume in order to
simplify the code.
Jan Beulich [Fri, 5 Apr 2019 13:41:24 +0000 (15:41 +0200)]
x86/gnttab: relax a get_gfn() invocation
In the case here only a query is intended, i.e. without populating a
possible PoD or paged out entry, as the intention is to replace the
current (grant) entry anyway. Use get_gfn_query() there instead.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 5 Apr 2019 13:40:42 +0000 (15:40 +0200)]
x86: don't allow clearing of TF_kernel_mode for other than 64-bit PV
The flag is really only meant for those, both HVM and 32-bit PV tell
kernel from user mode based on CPL/RPL. Remove the all-question-marks
comment and let's be on the safe side here and also suppress clearing
for 32-bit PV (this isn't a fast path after all).
Remove no longer necessary is_pv_32bit_*() from sh_update_cr3() and
sh_walk_guest_tables(). Note that shadow_one_bit_disable() already
assumes the new behavior.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
When issuing a vcpu_op hypercall, guests have control over the
vcpuid variable. In the old code, this allowed to perform
speculative out-of-bound accesses. To block this, we make use
of the domain_vcpu function.
This is part of the speculative hardening effort.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Norbert Manthey [Thu, 14 Mar 2019 12:56:00 +0000 (13:56 +0100)]
x86/hvm: add nospec to hvmop param
The params array in hvm can be accessed with get and set functions.
As the index is guest controlled, make sure no out-of-bound accesses
can be performed.
As we cannot influence how future compilers might modify the
instructions that enforce the bounds, we furthermore block speculation,
so that the update is visible in the architectural state.
This is part of the speculative hardening effort.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de> Acked-by: Jan Beulich <jbeulich@suse.com>
The get_page_from_gfn method returns a pointer to a page that belongs
to a gfn. Before returning the pointer, the gfn is checked for being
valid. Under speculation, these checks can be bypassed, so that
the function get_page is still executed partially. Consequently, the
function page_get_owner_and_reference might be executed partially as
well. In this function, the computed pointer is accessed, resulting in
a speculative out-of-bound address load. As the gfn can be controlled by
a guest, this access is problematic.
To mitigate the root cause, an lfence instruction is added via the
evaluate_nospec macro. To make the protection generic, we do not
introduce the lfence instruction for this single check, but add it to
the mfn_valid function. This way, other potentially problematic accesses
are protected as well.
This is part of the speculative hardening effort.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de> Acked-by: Jan Beulich <jbeulich@suse.com>
Norbert Manthey [Thu, 14 Mar 2019 12:56:00 +0000 (13:56 +0100)]
is_hvm/pv_domain: block speculation
When checking for being an hvm domain, or PV domain, we have to make
sure that speculation cannot bypass that check, and eventually access
data that should not end up in cache for the current domain type.
This is part of the speculative hardening effort.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de> Acked-by: Jan Beulich <jbeulich@suse.com>
Norbert Manthey [Thu, 14 Mar 2019 12:56:00 +0000 (13:56 +0100)]
is_control_domain: block speculation
Checks of domain properties, such as is_hardware_domain or is_hvm_domain,
might be bypassed by speculatively executing these instructions. A reason
for bypassing these checks is that these macros access the domain
structure via a pointer, and check a certain field. Since this memory
access is slow, the CPU assumes a returned value and continues the
execution.
In case an is_control_domain check is bypassed, for example during a
hypercall, data that should only be accessible by the control domain could
be loaded into the cache.
This is part of the speculative hardening effort.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de> Acked-by: Jan Beulich <jbeulich@suse.com>
Norbert Manthey [Thu, 14 Mar 2019 12:55:00 +0000 (13:55 +0100)]
nospec: introduce evaluate_nospec
Since the L1TF vulnerability of Intel CPUs, loading hypervisor data into
L1 cache is problematic, because when hyperthreading is used as well, a
guest running on the sibling core can leak this potentially secret data.
To prevent these speculative accesses, we block speculation after
accessing the domain property field by adding lfence instructions. This
way, the CPU continues executing and loading data only once the condition
is actually evaluated.
As this protection is typically used in if statements, the lfence has to
come in a compatible way. Therefore, a function that returns true after an
lfence instruction is introduced. To protect both branches after a
conditional, an lfence instruction has to be added for the two branches.
To be able to block speculation after several evaluations, the generic
barrier macro block_speculation is also introduced.
As the L1TF vulnerability is only present on the x86 architecture, there is
no need to add protection for other architectures. Hence, the introduced
functions are defined but empty.
On the x86 architecture, by default, the lfence instruction is not present
either. Only when a L1TF vulnerable platform is detected, the lfence
instruction is patched in via alternative patching. Similarly, PV guests
are protected wrt L1TF by default, so that the protection is furthermore
disabled in case HVM is exclueded via the build configuration.
Introducing the lfence instructions catches a lot of potential leaks with
a simple unintrusive code change. During performance testing, we did not
notice performance effects.
Norbert Manthey [Thu, 14 Mar 2019 12:55:00 +0000 (13:55 +0100)]
spec: add l1tf-barrier
To control the runtime behavior on L1TF vulnerable platforms better, the
command line option l1tf-barrier is introduced. This option controls
whether on vulnerable x86 platforms the lfence instruction is used to
prevent speculative execution from bypassing the evaluation of
conditionals that are protected with the evaluate_nospec macro.
By now, Xen is capable of identifying L1TF vulnerable hardware. However,
this information cannot be used for alternative patching, as a CPU feature
is required. To control alternative patching with the command line option,
a new x86 feature "X86_FEATURE_SC_L1TF_VULN" is introduced. This feature
is used to patch the lfence instruction into the arch_barrier_nospec_true
function. The feature is enabled only if L1TF vulnerable hardware is
detected and the command line option does not prevent using this feature.
The status of hyperthreading is considered when automatically enabling
adding the lfence instruction. Since platforms without hyperthreading can
still be vulnerable to L1TF in case the L1 cache is not flushed properly,
the additional lfence instructions are patched in if either hyperthreading
is enabled, or L1 cache flushing is missing.
This is part of the speculative hardening effort.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 1 Apr 2019 10:08:28 +0000 (11:08 +0100)]
x86/msr: Fix handling of MSR_AMD_PATCHLEVEL/MSR_IA32_UCODE_REV
There are a number of bugs. There are no read/write hooks on the HVM side, so
guest accesses fall into the "read/write-discard" defaults, which bypass the
correct faulting behaviour and the Intel special case.
For the PV side, writes are discarded (again, bypassing proper faulting),
except for a pinned dom0, which is permitted to actually write the values
other than 0. This is pointless with read hook implementing the Intel special
case.
However, implementing the Intel special case is itself pointless. First of
all, OS software can't guarentee to read back 0 in the first place, because a)
this behaviour isn't guarenteed in the SDM, and b) there are SMM handlers
which use the CPUID instruction. Secondly, when a guest executes CPUID, this
doesn't typically result in Xen executing a CPUID instruction in practice.
With the dom0 special case removed, there are now no writes to this MSR other
than Xen's microcode loading facilities, which means that the value held in
the MSR will be properly up-to-date. Forward it directly, without jumping
through any hoops.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 4 Apr 2019 18:39:08 +0000 (19:39 +0100)]
x86/cpu: Renumber X86_VENDOR_* to form a bitmap
CPUs from different vendors sometimes share characteristics. All users of
X86_VENDOR_* are now direct equal/not-equal comparisons. By expressing the
X86_VENDOR_* constants in a bitmap fashon, we can more concicely and
efficiently test whether a vendor is one of a group.
Update all parts of the code which can already benefit from this improvement.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 4 Apr 2019 18:19:20 +0000 (19:19 +0100)]
x86/cpu: Introduce x86_cpuid_vendor_to_str() and drop cpu_dev.c_vendor[]
cpu_dev.c_vendor[] is a char[8] array which is printed using %s in two
locations. This leads to subtle lack-of-NUL bugs when using an 8 character
vendor name.
Introduce x86_cpuid_vendor_to_str() to turn an x86_vendor into a printable
string, use it in the two locations that c_vendor is used, and drop c_vendor.
This drops the final user of X86_VENDOR_NUM, so drop that as well.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 4 Apr 2019 14:51:25 +0000 (15:51 +0100)]
x86/cpu: Drop cpu_devs[] and $VENDOR_init_cpu() hooks
These helpers each fill in a single cpu_devs[] pointer, and since c/s 00b4f4d0f "x86/cpuid: Drop get_cpu_vendor() completely", this array is read
exactly once on boot.
Delete the hooks and cpu_devs[], and have early_cpu_detect() pick the
appropriate cpu_dev structure directly.
As early_cpu_init() is empty now other than a call to early_cpu_detect(), and
this isn't expected to change moving forwards, rename the latter and delete
the former.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 5 Apr 2019 08:42:39 +0000 (10:42 +0200)]
x86emul: support AVX512{F,BW} down conversion moves
Note that the vpmov{,s,us}{d,q}w table entries in evex-disp8.c are
slightly different from what one would expect, due to them requiring
EVEX.W to be zero.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 5 Apr 2019 08:41:59 +0000 (10:41 +0200)]
x86emul: support AVX512{F,BW} zero- and sign-extending moves
Note that the testing in simd.c doesn't really follow the ISA extension
pattern - to fit the scheme, extensions from byte and word granular
vectors can (currently) sensibly only happen in the AVX512BW case (and
hence respective abstraction macros will be added there rather than
here).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Tue, 19 Mar 2019 15:29:00 +0000 (16:29 +0100)]
viridian: add implementation of the HvSendSyntheticClusterIpi hypercall
This patch adds an implementation of the hypercall as documented in the
specification [1], section 10.5.2. This enlightenment, as with others, is
advertised by CPUID leaf 0x40000004 and is under control of a new
'hcall_ipi' option in libxl.
If used, this enlightenment should mean the guest only takes a single VMEXIT
to issue IPIs to multiple vCPUs rather than the multiple VMEXITs that would
result from using the emulated local APIC.
Paul Durrant [Tue, 19 Mar 2019 15:29:00 +0000 (16:29 +0100)]
viridian: add implementation of synthetic timers
This patch introduces an implementation of the STIMER0-15_CONFIG/COUNT MSRs
and hence a the first SynIC message source.
The new (and documented) 'stimer' viridian enlightenment group may be
specified to enable this feature.
While in the neighbourhood, this patch adds a missing check for an
attempt to write the time reference count MSR, which should result in an
exception (but not be reported as an unimplemented MSR).
NOTE: It is necessary for correct operation that timer expiration and
message delivery time-stamping use the same time source as the guest.
The specification is ambiguous but testing with a Windows 10 1803
guest has shown that using the partition reference counter as a
source whilst the guest is using RDTSC and the reference tsc page
does not work correctly. Therefore the time_now() function is used.
This implements the algorithm for acquiring partition reference time
that is documented in the specifiction.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Tue, 19 Mar 2019 15:25:00 +0000 (16:25 +0100)]
viridian: add implementation of synthetic interrupt MSRs
This patch introduces an implementation of the SCONTROL, SVERSION, SIEFP,
SIMP, EOM and SINT0-15 SynIC MSRs. No message source is added and, as such,
nothing will yet generate a synthetic interrupt. A subsequent patch will
add an implementation of synthetic timers which will need the infrastructure
added by this patch to deliver expiry messages to the guest.
NOTE: A 'synic' option is added to the toolstack viridian enlightenments
enumeration but is deliberately not documented as enabling these
SynIC registers without a message source is only useful for
debugging.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
A subsequent patch will introduce an implementaion of synthetic timers
which will also need freeze/thaw hooks, so make the exported hooks more
generic and call through to (re-named and static) time_ref_count_freeze/thaw
functions.
NOTE: This patch also introduces a new time_ref_count() helper to return
the current counter value. This is currently only used by the MSR
read handler but the synthetic timer code will also need to use it.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Tue, 19 Mar 2019 15:25:00 +0000 (16:25 +0100)]
viridian: use viridian_map/unmap_guest_page() for reference tsc page
Whilst the reference tsc page does not currently need to be kept mapped
after it is initially set up (or updated after migrate), the code can
be simplified by using the common guest page map/unmap and dump functions.
New functionality added by a subsequent patch will also require the page to
kept mapped for the lifetime of the domain.
NOTE: Because the reference tsc page is per-domain rather than per-vcpu
this patch also changes viridian_map_guest_page() to take a domain
pointer rather than a vcpu pointer. The domain pointer cannot be
const, unlike the vcpu pointer.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>