xenbits.xensource.com Git - people/iwj/xen.git/log

fixup! libxl: pvh: Add accessor macros for domain-type and -version dependent fields

tools/libxc: Multi modules support

Signed-off-by: Jonathan Ludlam <jonathan.ludlam@citrix.com>
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>

tools/libelf: fix elf notes check for PVH guest

PVH only requires PHYS32_ENTRY to be set. Return immediately if that's
the case.

Also remove the printk in pvh_load_kernel.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
[remove PVH dom0 builder chunk when backporting]
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

xen/x86: report domain id on cpuid

Use the ECX register of the hypervisor leaf 5. The EAX register on
this leaf is a flags field that can be used to notice the presence of
the domain id in ECX. Note that this is only available to HVM guests.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
- Use leaf 5 instead.

x86/upcall: inject a spurious event after setting upcall vector

In case the vCPU has pending events to inject. This fixes a bug that
happened if the guest mapped the vcpu info area using
VCPUOP_register_vcpu_info without having setup the event channel
upcall, and then setup the upcall vector.

In this scenario the guest would not receive any upcalls, because the
call to VCPUOP_register_vcpu_info would have marked the vCPU as having
pending events, but the vector could not be injected because it was
not yet setup.

This has not caused issues so far because all the consumers first
setup the vector callback and then map the vcpu info page, but there's
no limitation that prevents doing it in the inverse order.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xl: Don't warn on using 'deprecated' mode selection

We generally support old config formats indefinintely (see the disk
format) without emitting warnings.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Ported over xl_cmdimpl.c -> xl_parse.c movement

libxl: remove device model "none" from IDL

And the xl.cfg man page documentation.

It should be possible to re-introduce it in the future with a proper
implementation, in order to create a HVM guest without a device model,
which is slightly different from a PVHv2 guest.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

xl: add PVH as a guest type

And remove device model "none".

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ported over xl_cmdimpl.c -> xl_parse.c movement

libxl: add PVH support to x86 functions

This also includes the x86 ACPI related functions. Remove support for
device model "none"

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Port over changes to libxl_x86.c

libxl: add PVH support to USB

Add PVH support to usb related functions.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxl: remove device model "none" support from stream functions

Remove the usage of device model "none" in the migration stream
related functions.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxl: PVH guests use PV nics

Remove device model "none" support from the nic functions.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxl: add PVH support to memory functions

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ported ovet libxl.c -> libxl_mem.c split

libxl: add PVH support to vpcu hotplug, domain destruction/pause and domain configuration

And remove support for device model "none".

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ported over spliting of libxl.c into libxl_domain.c

libxl: add PVH support to domain save/suspend

And remove the device model "none" support.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Backport over minor changes to libxl_dom_suspend.c

libxl: add PVH support to domain building

And remove device model "none" support.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ported over changes in libxl_dom.c

libxl: set device model for PVH guests

PVH guests use the same device model selection as PV guests, because
PVH guests only use the device model for the PV backends.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxl: remove device model "none" support from disk related functions

CD-ROM backend selection was partially based on the device model, this
is no longer needed since the device model "none" is now removed, so
HVM guests always have a device model.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ported over splitting libxl.c into libxl_disk.c

libxl: add PVH support to domain creation

Remove the device model "none" support from domain creation and
introduce support for PVH.

This requires changing some of the HVM checks to be applied for both
HVM and PVH.

Setting device model to none was never supported since it was an
unstable interface used while transitioning from PVHv1 to PVHv2.

Now that PVHv1 has been finally removed and that a supported
interface for PVHv2 is being added this option is no longer necessary,
hence it's removed.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ported over changes to libxl_create.c

libxl: set PVH guests to use the PV console

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ported over move to libxl_console.c

libxl: allow PVH guests to use a bootloader

Allow PVH guests to boot using a bootloader.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

libxl: introduce a PVH guest type

The new guest type is introduced to the libxl IDL. libxl__domain_make
is also modified to save the guest type, and libxl__domain_type is
expanded to fetch that information when detecting guest type.

This is required because the hypervisor only differentiates between PV
and HVM guests, so libxl needs some extra information in order to
differentiate between a HVM and a PVH guest.

The new PVH guest type and its options are documented on the xl.cfg
man page.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Xen 4.8 backport: pvh struct contains copies of the fields that in
4.10 were added to the main libxl_domain_build_info, and the accessor
macros are updated to be use them when appropriate.

libxl: don't segfault when creating domain with invalid pvusb device

Creating a domain with an invalid controller specification for a pvusb
device will currently segfault.

Avoid this by bailing out early in case of a mandatory xenstore path
not existing.

Signed-of-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xl: introduce a firmware option

The new firmware option aims to provide a coherent way to set the
firmware for the different kind of guests Xen supports.

For PV guests the available firmwares are pvgrub{32|64}, and for HVM
the following are supported: bios, uefi, seabios, rombios and ovmf.
Note that uefi maps to ovmf, and bios maps to the default firmware for
each device model.

The xl.cfg man page is updated to document the new feature.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ported over changes to xl_parse.c

xl: introduce a domain type option

Introduce a new type option to xl configuration files in order to
specify the domain type. This supersedes the current builder option.

The new option is documented in the xl.cfg man page, and the previous
builder option is marked as deprecated.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

MAYBE FIX libxl/xl: use the new location of domain_build_info fields

This is required because those options will be used by the new PVH
guest type, and thus need to be shared between PV and HVM.

Defines are added in order to signal consumers that the fields are
available.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Rebased over changes to libxl.h
4.8: Rebased over lots of changes

xl: pvh: Use accessor macros

Again, no functional change.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

libxl: pvh: Use accessor macros internally

This prepares the way for the pvh guest type, which will duplicate
some of the fields in the hvm libxl_domain_build_info. (In 4.10 these
fields were extracted into the top level but that is an ABI break so
is not suitable for 4.8/4.9.)

No functional change.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

libxl: pvh: Add accessor macros for domain-type and -version dependent fields

We are going to have some fields whose location now depends in a
complicated way on the domain type, the Xen release, and the Meltdown
patch status.

Introduce some macros to help access these fields. Currently they
don't do very much interesting because we don't have the PVH guest
type.

And nothing uses them yet, so no functional change in this patch.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

xl: parsing code movement

Code movement in preparation for making the bootloader,
bootloader_args, nested_hvm and timer_mode fields shared between all
guests types. While moving the code, limit the line-length to 80
columns.

No functional change.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ported to modify xl_cmdimpl.c (which was refactored in 4.9).

libxl: add is_default checkers for string and timer_mode types

Those types are missing a helper to check whether a definition of the
type holds the default value. This will be required by a later patch
that will implement deprecation of fields inside of a libxl type.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Rebased over changes to libxl_internal.h

libxl: Update xenstore on VCPU hotplug for all guest types

Currently HVM guests that use upstream qemu do not update xenstore's
availability entry for VCPUs. While it is not strictly necessary for
hotplug to work, xenstore ends up not reflecting actual status of
VCPUs. We should fix this.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/pmtimer: move ACPI registers from PMTState to hvm_domain

These registers (pm1a specifically) are not all specific to pm timer
and are accessed by non-pmtimer code (for example, sleep/power button
emulation).

The public name for save state structure is kept as 'pmtimer' to avoid
code churn with the expected changes in migration code. hvm_hw_acpi
name is introduced for internal use but when migration code is updated
hvm_hw_pmtimer will be renamed to hvm_hw_acpi.

No functional changes are introduced.

(While this file is being modified, also add emacs mode style rune)

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

firmware/rombios: fix after update to libacpi

Fix a build breakage after the libacpi changes, this is due to rombios using the
libacpi headers in order to parse the ACPI tables.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

libacpi: announce that PVHv2 has no CMOS RTC in FADT

At the moment this flag is unconditionally set for PVHv2 domains. Note that
using this boot flag requires that the FADT table revision is at least 5.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

libacpi: update FADT layout to support version 5

Update the structure of the FADT table to version 5, and use that version for
PVHv2 guests. Note that HVM guests will continue to use FADT 4. In order to do
this, add a new field to acpi_config that contains the ACPI revision to use by
libacpi. Note that currently this only applies to the FADT.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

libacpi: don't announce a 8042 controller in the FADT for PVHv2 guests

There's no such controler available for PVHv2 guests.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

libacpi: set FADT boot flag to notify lack of VGA for PVHv2 guests

PVHv2 guests don't have any VGA card, and as so it must be notified in the FADT.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

libacpi: add _FADT_ to the FADT boot flags definitions

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

acpi: PVH guests need _E02 method

This is the method that will get invoked on an SCI.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

acpi: power and sleep ACPI buttons are not emulated for PVH guests

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

acpi: make pmtimer optional in FADT

PM timer is not supported by PVH guests.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

libxl: fix gentypes call in Makefile

From the make documentation:

"$* [...] If the target is `dir/a.foo.b' and the target pattern is
`a.%.b' then the stem is `dir/foo'. In a static pattern rule, the
stem is part of the file name that matched the `%' in the target
pattern."

The rule generating the c types files from the idl ones is not
a static pattern rule, but rather an implicit rule. Thus the value
of $* is preceded by the file path, instead of only what matches %.

In order to get this fixed, drop the path using a $(notdir $*).

Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

gnttab: also validate PTE permissions upon destroy/replace

In order for PTE handling to match up with the reference counting done
by common code, presence and writability of grant mapping PTEs must
also be taken into account; validating just the frame number is not
enough. This is in particular relevant if a guest fiddles with grant
PTEs via non-grant hypercalls.

Note that the flags being passed to replace_grant_host_mapping()
already happen to be those of the existing mapping, so no new function
parameter is needed.

This is CVE-2017-14319 / XSA-234.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 16b1414de91b5a82a0996c67f6db3af7d7e32873
master date: 2017-09-12 14:45:13 +0200

tools/xenstore: dont unlink connection object twice

A connection object of a domain with associated stubdom has two
parents: the domain and the stubdom. When cleaning up the list of
active domains in domain_cleanup() make sure not to unlink the
connection twice from the same domain. This could happen when the
domain and its stubdom are being destroyed at the same time leading
to the domain loop being entered twice.

Additionally don't use talloc_free() in this case as it will remove
a random parent link, leading eventually to a memory leak. Use
talloc_unlink() instead specifying the context from which the
connection object should be removed.

This is CVE-2017-14317 / XSA-233.

Reported-by: Eric Chanudet <chanudete@ainfosec.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 562a1c0f7ef3fbf3c122c3dfa4f2ad9dd51da9fe
master date: 2017-09-12 14:44:56 +0200

grant_table: fix GNTTABOP_cache_flush handling

Don't fall over a NULL grant_table pointer when the owner of the domain
is a system domain (DOMID_{XEN,IO} etc).

This is CVE-2017-14318 / XSA-232.

Reported-by: Matthew Daley <mattd@bugfuzz.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c3d830b244998b3686e2eb64db95996be5eb5e5c
master date: 2017-09-12 14:44:11 +0200

xen/mm: make sure node is less than MAX_NUMNODES

The output of MEMF_get_node(memflags) can be as large as nodeid_t can
hold (currently 255). This is then used as an index to arrays of size
MAX_NUMNODE, which is 64 on x86 and 1 on ARM, can be passed in by an
untrusted guest (via memory_exchange and increase_reservation) and is
not currently bounds-checked.

Check the value in page_alloc.c before using it, and also check the
value in the hypercall call sites and return -EINVAL if appropriate.
Don't permit domains other than the hardware or control domain to
allocate node-constrained memory.

This is CVE-2017-14316 / XSA-231.

Reported-by: Matthew Daley <mattd@bugfuzz.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2fece35303529395bfea6b03d2268380ef682c93
master date: 2017-09-12 14:43:16 +0200

update Xen version to 4.8.3-pre

update Xen version to 4.8.2

gnttab: avoid spurious maptrack handle allocation failures

When no memory is available in the hypervisor, rather than immediately
failing the request, try to steal a handle from another vCPU.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d02f1a0b7576bafb2fba903c7e6e7221ab0d2847
master date: 2017-08-17 14:41:01 +0200

cpufreq: only stop ondemand governor if already started

On CPUFREQ_GOV_STOP in cpufreq_governor_dbs, shortcut to
return success if the governor is already stopped.

Avoid executing dbs_timer_exit, to prevent tripping an assertion
within a call to kill_timer on a timer that has not been prepared
with init_timer, if the CPUFREQ_GOV_START case has not
run beforehand.

kill_timer validates timer state:
* itself, via BUG_ON(this_cpu(timers).running == timer);
* within active_timer, ASSERTing timer->status is within bounds;
* within list_del, which ASSERTs timer inactive list membership.

Patch is synonymous to an OpenXT patch produced at Citrix prior to
June 2014.

Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e7ec6f5f32cd2d0f723083cde3d7761c4e675f2c
master date: 2017-08-10 12:35:50 +0200

VT-d PI: disable VT-d PI when CPU-side PI isn't enabled

From the context calling pi_desc_init(), we can conclude the current
implementation of VT-d PI depends on CPU-side PI. If we enable VT-d PI
and disable CPU-side PI by disabling APICv explicitly in xen boot
command line, we would get an assertion failure.

This patch clears iommu_intpost once finding CPU-side PI won't be enabled.
It is safe for this is done before this flag starts taking effect. Also
take this chance to remove the useless check of "acknowledge interrupt on
exit", which is a minimal requirement which has been checked earlier.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: e489eb6138e7efe4214a7e9ba0d21f54fc5b7d35
master date: 2017-08-10 12:32:16 +0200

VT-d: don't panic/warn on iommu=no-igfx

When operating on an Intel graphics device, iommu_enable_translation()
panicked (force_iommu==1) or warned (force_iommu==0) about the BIOS if
is_igd_vt_enabled_quirk() returned 0. That's good if the actual BIOS
problem has been detected. But since commit 1463411, returning 0 could
also happen if the user simply passed "iommu=no-igfx", in which case
bailing out with an info message (instead of a panic/warning) would be
more appropriate.

The panic broke the combination "iommu=force,no-igfx", and also the case
where "iommu=no-igfx" is passed but force_iommu=1 is set automatically
by x2apic_bsp_setup().

Move the iommu_igfx check from is_igd_vt_enabled_quirk() into its only
caller iommu_enable_translation(), and tweak the logic.

Signed-off-by: Rusty Bird <rustybird@openmailbox.org>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: dbf2a768565d8b79c65471a3d3b982b2874d6492
master date: 2017-08-03 12:40:25 +0200

docs: replace xm with xl in xen-tscmode

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 763267e315a93e2b6d66a0afdcda96db939e09b6
master date: 2017-07-24 10:17:21 +0100

x86/hvm: Fixes to hvmemul_insn_fetch()

Force insn_off to a single byte, as offset can wrap around or truncate with
respect to sh_ctxt->insn_buf_eip under a number of normal circumstances.

Furthermore, don't use an ASSERT() for bounds checking the write into
hvmemul_ctxt->insn_buf[].

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/hvm: Fix boundary check in hvmemul_insn_fetch()

c/s 0943a03037 added some extra protection for overflowing the emulation
instruction cache, but Coverity points out that boundary condition is off by
one when memcpy()'ing out of the buffer.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
x86/HVM: fix boundary check in hvmemul_insn_fetch() (again)

Commit 5a992b670b ("x86/hvm: Fix boundary check in
hvmemul_insn_fetch()") went a little too far in its correction to
commit 0943a03037 ("x86/hvm: Fixes to hvmemul_insn_fetch()"): Keep the
start offset check, but restore the original end offset one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 0943a03037418e6e40cdd420f2472bbf9afae7a2
master date: 2017-07-19 10:25:18 +0100
master commit: 5a992b670bff697c40b513c9e037598ba35ca7d4
master date: 2017-07-27 11:39:57 +0100
master commit: 58e8986267d976b00c60e0089baa2e5f66f16d3e
master date: 2017-08-10 12:37:24 +0200

rombios: prevent building with PIC/PIE

If the default compiler silently defaults to to -fPIC/-fPIE building
rombios fails:

ld -melf_i386 -s -r 32bitbios.o tcgbios/tcgbiosext.o util.o pmm.o -o 32bitbios_all.o
There are undefined symbols in the BIOS:
U _GLOBAL_OFFSET_TABLE_
make[10]: *** [Makefile:26: 32bitbios_all.o] Error 11

Prevent the failure by enforcing non-PIC/PIE mode.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 672949d6c61d9cba01c5b414eed9d522082f04d3
master date: 2017-06-26 14:32:46 +0100

xen/livepatch: Don't crash on encountering STN_UNDEF relocations

A symndx of STN_UNDEF is special, and means a symbol value of 0. While
legitimate in the ELF standard, its existance in a livepatch is questionable
at best. Until a plausible usecase presents itself, reject such a relocation
with -EOPNOTSUPP.

Additionally, fix an off-by-one error while range checking symndx, and perform
a safety check on elf->sym[symndx].sym before derefencing it, to avoid
tripping over a NULL pointer when calculating val.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [x86 and arm32]
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 2ff229643b739e2fd0cd0536ee9fca506cfa92f8
master date: 2017-06-23 15:00:37 +0100

xen/livepatch: Use zeroed memory allocations for arrays

Each of these arrays is sparse. Use zeroed allocations to cause uninitialised
array elements to contain deterministic values, most importantly for the
embedded pointers.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [x86 and arm32]
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: bd53b85156aaf0330181ab9b82d9a6c52fb30f8c
master date: 2017-06-23 15:00:37 +0100

arm/mm: release grant lock on xenmem_add_to_physmap_one() error paths

Commit 55021ff9ab ("xen/arm: add_to_physmap_one: Avoid to map mfn 0 if
an error occurs") introduced error paths not releasing the grant table
lock. Replace them by a suitable check after the lock was dropped.

This is XSA-235.

Reported-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
master commit: 59546c1897a90fe9af5ebbbb05ead8d98b4d17b9
master date: 2017-08-23 17:45:45 +0200

travis: install ghostscript

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Doug Goldstein <cardoe@cardoe.com>
master commit: f7826750ebb7f104cc9ccee06042fbbef4a777e2
master date: 2017-06-27 16:58:17 +0100

gnttab: fix "don't use possibly unbounded tail calls"

The compat mode code also needs adjustment to deal with the changed
return value from gnttab_copy().

This is part of XSA-226.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ca617570542e1d7d8de636d5396959bbf1dabab7
master date: 2017-08-21 15:43:36 +0200

gnttab: fix transitive grant handling

Processing of transitive grants must not use the fast path, or else
reference counting breaks due to the skipped recursive call to
__acquire_grant_for_copy() (its __release_grant_for_copy()
counterpart occurs independent of original pin count). Furthermore
after re-acquiring temporarily dropped locks we need to verify no grant
properties changed if the original pin count was non-zero; checking
just the pin counts is sufficient only for well-behaved guests. As a
result, __release_grant_for_copy() needs to mirror that new behavior.

Furthermore a __release_grant_for_copy() invocation was missing on the
retry path of __acquire_grant_for_copy(), and gnttab_set_version() also
needs to bail out upon encountering a transitive grant.

This is part of XSA-226.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad48fb963dbff02762d2db5396fa655ac0c432c7
master date: 2017-08-17 14:40:31 +0200

gnttab: don't use possibly unbounded tail calls

There is no guarantee that the compiler would actually translate them
to branches instead of calls, so only ones with a known recursion limit
are okay:
- __release_grant_for_copy() can call itself only once, as
  __acquire_grant_for_copy() won't permit use of multi-level transitive
  grants,
- __acquire_grant_for_copy() is fine to call itself with the last
  argument false, as that prevents further recursion,
- __acquire_grant_for_copy() must not call itself to recover from an
  observed change to the active entry's pin count

This is part of XSA-226.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 999d2ccb7f73408aa22656e1ba2f98b077eaa1c2
master date: 2017-08-17 14:39:18 +0200

gnttab: correct pin status fixup for copy

Regardless of copy operations only setting GNTPIN_hst*, GNTPIN_dev*
also need to be taken into account when deciding whether to clear
_GTF_{read,writ}ing. At least for consistency with code elsewhere the
read part better doesn't use any mask at all.

This is XSA-230.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6e2a4c73564ab907b732059adb317d6ca2d138a2
master date: 2017-08-15 15:08:03 +0200

gnttab: split maptrack lock to make it fulfill its purpose again

The way the lock is currently being used in get_maptrack_handle(), it
protects only the maptrack limit: The function acts on current's list
only, so races on list accesses are impossible even without the lock.

Otoh list access races are possible between __get_maptrack_handle() and
put_maptrack_handle(), due to the invocation of the former for other
than current from steal_maptrack_handle(). Introduce a per-vCPU lock
for list accesses to become race free again. This lock will be
uncontended except when it becomes necessary to take the steal path,
i.e. in the common case there should be no meaningful performance
impact.

When in get_maptrack_handle adds a stolen entry to a fresh, empty,
freelist, we think that there is probably no concurrency. However,
this is not a fast path and adding the locking there makes the code
clearly correct.

Also, while we are here: the stolen maptrack_entry's tail pointer was
not properly set. Set it.

This is CVE-2017-12136 / XSA-228.

Reported-by: Ian Jackson <ian.jackson@eu.citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
master commit: 02cbeeb6207508b0f04a2c6181445c8eb3f1e117
master date: 2017-08-15 15:07:25 +0200

x86/grant: disallow misaligned PTEs

Pagetable entries must be aligned to function correctly. Disallow attempts
from the guest to have a grant PTE created at a misaligned address, which
would result in corruption of the L1 table with largely-guest-controlled
values.

This is CVE-2017-12137 / XSA-227.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ce442926c2530da9376199dcc769436376ad2386
master date: 2017-08-15 15:06:45 +0200

arm: p2m: Prevent redundant icache flushes

When toolstack requests flushing the caches, flush_page_to_ram() is
called for each page of the requested domain. This needs to unnecessary
icache invalidation operations.

Let's take the responsibility of performing icache operations and use
the recently introduced flag to prevent redundant icache operations by
flush_page_to_ram().

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 1a0c3e3e28d6cd072734990efcaaec608bf152b1)

Allow control of icache invalidations when calling flush_page_to_ram()

flush_page_to_ram() unconditionally drops the icache. In certain
situations this leads to execessive icache flushes when
flush_page_to_ram() ends up being repeatedly called in a loop.

Introduce a parameter to allow callers of flush_page_to_ram() to take
responsibility of synchronising the icache. This is in preparations for
adding logic to make the callers perform the necessary icache
maintenance operations.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 54b8651066e82f04db9d9e5b0cc02c26d39ae763)

xen/arm: Properly map the FDT in the boot page table

Currently, Xen is assuming the FDT will always fit in a 2MB section.
Recently, I noticed an early crash on Xen when using GRUB with the
following call trace:

(XEN) Hypervisor Trap. HSR=0x96000006 EC=0x25 IL=1 Syndrome=0x6
(XEN) CPU0: Unexpected Trap: Hypervisor
(XEN) ----[ Xen-4.9-unstable  arm64  debug=y   Not tainted ]----
(XEN) CPU:    0
(XEN) PC:     0000000000264140 strlen+0x10/0x84
(XEN) LR:     00000000002401c0
(XEN) SP:     00000000002cfc20
(XEN) CPSR:   400003c9 MODE:64-bit EL2h (Hypervisor, handler)
(XEN)      X0: 0000000000801230  X1: 0000000000801230  X2: 0000000000005230
(XEN)      X3: 0000000000000030  X4: 0000000000000030  X5: 0000000000000038
(XEN)      X6: 0000000000000034  X7: 0000000000000000  X8: 7f7f7f7f7f7f7f7f
(XEN)      X9: 64622c6479687222 X10: 7f7f7f7f7f7f7f7f X11: 0101010101010101
(XEN)     X12: 0000000000000030 X13: ffffff00ff000000 X14: 0800000003000000
(XEN)     X15: ffffffffffffffff X16: 00000000fefff610 X17: 00000000000000f0
(XEN)     X18: 0000000000000004 X19: 0000000000000008 X20: 00000000007fc040
(XEN)     X21: 00000000007fc000 X22: 000000000000000e X23: 0000000000000000
(XEN)     X24: 00000000002a9f58 X25: 0000000000801230 X26: 00000000002a9f68
(XEN)     X27: 00000000002a9f58 X28: 0000000000298910  FP: 00000000002cfc20
(XEN)
(XEN)   VTCR_EL2: 80010c40
(XEN)  VTTBR_EL2: 0000082800203000
(XEN)
(XEN)  SCTLR_EL2: 30c5183d
(XEN)    HCR_EL2: 000000000038663f
(XEN)  TTBR0_EL2: 00000000f4912000
(XEN)
(XEN)    ESR_EL2: 96000006
(XEN)  HPFAR_EL2: 00000000e8071000
(XEN)    FAR_EL2: 0000000000801230
(XEN)
(XEN) Xen stack trace from sp=00000000002cfc20:
(XEN)    00000000002cfc70 0000000000240254 00000000002a9f58 00000000007fc000
(XEN)    0000000000000000 0000000000000000 0000000000000000 00000000007fc03c
(XEN)    00000000002cfd78 0000000000000000 00000000002cfca0 00000000002986fc
(XEN)    0000000000000000 00000000007fc000 0000000000000000 0000000000000000
(XEN)    00000000002cfcc0 0000000000298f1c 0000000000000000 00000000007fc000
(XEN)    00000000002cfdc0 000000000029904c 00000000f47fc000 00000000f4604000
(XEN)    00000000f47fc000 00000000007fc000 0000000000400000 0000000000000100
(XEN)    00000000f4604000 0000000000000001 0000000000000001 8000000000000002
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 00000000002cfdc0 0000000000299038
(XEN)    00000000f47fc000 00000000f4604000 00000000f47fc000 0000000000000000
(XEN)    00000000002cfe20 000000000029c420 00000000002d8000 00000000f4604000
(XEN)    00000000f47fc000 0000000000000000 0000000000400000 0000000000000100
(XEN)    00000000f4604000 0000000000000001 00000000f47fc000 000000000029c404
(XEN)    00000000fefff510 0000000000200624 00000000f4804000 00000000f4604000
(XEN)    00000000f47fc000 0000000000000000 0000000000400000 0000000000000100
(XEN)    0000000000000001 0000000000000001 0000000000000001 8000000000000002
(XEN)    00000000f47fc000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<0000000000264140>] strlen+0x10/0x84 (PC)
(XEN)    [<00000000002401c0>] fdt_get_property_namelen+0x9c/0xf0 (LR)
(XEN)    [<0000000000240254>] fdt_get_property+0x40/0x50
(XEN)    [<00000000002986fc>] bootfdt.c#device_tree_get_u32+0x18/0x5c
(XEN)    [<0000000000298f1c>] device_tree_for_each_node+0x84/0x144
(XEN)    [<000000000029904c>] boot_fdt_info+0x70/0x23c
(XEN)    [<000000000029c420>] start_xen+0x9c/0xd30
(XEN)    [<0000000000200624>] arm64/head.o#paging+0x84/0xbc
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Hypervisor
(XEN)
(XEN) ****************************************

Indeed, the booting documentation for AArch32 and AArch64 only requires
the FDT to be placed on a 8-byte boundary. This means the Device-Tree can
cross a 2MB boundary.

Given that Xen limits the size of the FDT to 2MB, it will always fit in
a 4MB slot. So extend the fixmap slot for FDT from 2MB to 4MB.

The second 2MB superpage will only be mapped if the FDT is cross the 2MB
boundary.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit de22e0dd0df7244052b978fdd939aae7a0937077)

xen/arm: Check if the FDT passed by the bootloader is valid

There is currently no sanity check on the FDT passed by the bootloader.
Whilst they are stricly not necessary, it will avoid us to spend hours
to try to find out why it does not work.

>From the booting documentation for AArch32 [1] and AArch64 [2] must :
- be placed on 8-byte boundary
- not exceed 2MB (only on AArch64)

Even if AArch32 does not seem to limit the size, Xen is not currently
able to support more the 2MB FDT. It is better to crash rather with a nice
error message than claiming we are supporting any size of FDT.

The checks are mostly borrowed from the Linux code (see fixmap_remap_fdt
in arch/arm64/mm/mmu.c).

[1] Section 2 in linux/Documentation/arm64/booting.txt
[2] Section 4b in linux/Documentation/arm/Booting

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 4a9d402746fd6fbd1f319eff29fa768f7d7aa62b)

xen/arm: Move the code to map FDT in the boot tables from assembly to C

The FDT will not be accessed before start_xen (begining of C code) is
called and it will be easier to maintain as the code could be common
between AArch32 and AArch64.

A new function early_fdt_map is introduced to map the FDT in the boot
page table.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit f7d93cee1160bf18f77750da2d5900aee9ca15cc)

xen/arm: mm: Move create_mappings function earlier in the file

This function will be called by other function later one. This will
avoid forward declaration and keep the new function close to sibling
ones.

This was moved just after *_fixmap helpers as they are page table
handling functions too.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit caa4742f2f6a883ce169cd55de35fc101ac3e0a2)

memory: don't suppress P2M update in populate_physmap()

Commit d18627583d ("memory: don't hand MFN info to translated guests")
wrongly added a null-handle check there - just like stated in its
description for memory_exchange(), the array is also an input for
populate_physmap() (and hence can't reasonably be null). I have no idea
how I've managed to overlook this.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b964e3106d2cdaa11cc4524181ff14607d110ae4
master date: 2017-06-20 14:51:53 +0200

livepatch: Wrong usage of spinlock on debug console.

If we have a large amount of livepatches and want to print them
on the console using 'xl debug-keys x' we eventually hit
the preemption check:

if ( i && !(i % 64) )
{
spin_unlock(&payload_lock);
process_pending_softirqs();
if ( spin_trylock(&payload_lock) )
return

<facepalm> The effect is that we have just effectively
taken the lock and returned without unlocking!

Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: 75dfe7c566c36e0af4714557a666827f49b69191
master date: 2017-06-12 08:59:33 -0400

Revert "x86/hvm: disable pkeys for guests in non-paging mode"

This reverts commit c41e0266dd59ab50b7a153157e9bd2a3ad114b53.

When determining Access Rights, Protection Keys only take effect when CR4.PKE
it set, and 4-level paging is active. All other circumstances (notibly, 32bit
PAE paging) skip the Protection Key control mechanism.

Therefore, we do not need to clear CR4.PKE behind the back of a guest which is
not using paging, as such a guest is necesserily running with EFER.LMA
disabled.

The {RD,WR}PKRU instructions are specified as being legal for use in any
operating mode, but only if CR4.PKE is set. By clearing CR4.PKE behind the
back of an unpaged guest, these instructions yield #UD despite the guest
correctly seeing PKE set if it reads CR4, and OSPKE being visible in CPUID.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Huaitong Han <huaitong.han@intel.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 224acdd04a9f6ffe44d2f716287cac74787899ec
master date: 2017-06-01 14:13:57 +0100

x86/pv: Fix the handling of `int $x` for vectors which alias exceptions

The claim at the top of c/s 2e426d6eecf "x86/traps: Drop use_error_code
parameter from do_{,guest_}trap()" is only actually true for hardware
exceptions. It is not true for `int $x` instructions (which never push error
code), irrespective of whether the vector aliases an exception or not.

Furthermore, c/s 6480cc6280e "x86/traps: Fix failed ASSERT() in
do_guest_trap()" really should have helped highlight that a regression had
been introduced.

Modify pv_inject_event() to understand event types other than
X86_EVENTTYPE_HW_EXCEPTION, and introduce pv_inject_sw_interrupt() for the
`int $x` handling code.

Add further assertions to pv_inject_event() concerning the type of events
passed in, which in turn requires that do_guest_trap() set its type
appropriately (which is now used exclusively for hardware exceptions).

This is logically a backport of c/s 5c4f579e0ee4f38cad5636bbf8ce700a394338d0
from Xen 4.9, but disentangled from the other injection work.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/test/Makefile: Fix clean target, broken by pattern rule

In "xen/test/livepatch: Regularise Makefiles" we reworked
xen/test/Makefile to use a pattern rule.  However, there are two
problems with this.  Both are related to the way that xen/Rules.mk is
implicitly part of this Makefile because of the way that Makefiles
under xen/ are invoked by their parent directory Makefiles.

Firstly, the Rules.mk `clean' target overrides the pattern rule in
xen/test/Makefile.  The result is that `make -C xen clean' does not
actually run the livepatch clean target.

The Rules.mk clean target does have provision for recursing into
subdirectories, but that feature is tangled up with complex object
file iteration machinery which is not desirable here.  However, we can
extend the Rules.mk clean target since it is a double-colon rule.

Sadly this involves duplicating the SUBDIR iteration boilerplate.  (A
make function could be used but the cure would be worse than the
disease.)

Secondly, Rules.mk has a number of -include directives.  make likes to
try to (re)build files mentioned in includes.  With the % pattern
rule, this applies to those files too.

As a result, make -C xen clean would try to build `.*.d' (for example)
in xen/test.  This would fail with an error message.  The error would
be ignored because of the `-', but it's annoying and ugly.

Solve this by limiting the % pattern rule to the targets we expect it
to handle.  These are those listed in the top-level Makefile help
message, apart from: those which are subdir- or component-qualified;
clean targets (which are handled specially, even distclean); and dist,
src-tarball-*, etc. (which are converted to install by an earlier
Makefile).

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit 592e834522086009975bd48d59386094771bd06b)
(cherry picked from commit b38b1479a532f08fedd7f3b761673bc78b66739d)

x86: avoid leaking PKRU and BND* between vCPU-s

PKRU is explicitly "XSAVE-managed but not XSAVE-enabled", so guests
might access the register (via {RD,WR}PKRU) without setting XCR0.PKRU.
Force context switching as well as migrating the register as soon as
CR4.PKE is being set the first time.

For MPX (BND<n>, BNDCFGU, and BNDSTATUS) the situation is less clear,
and the SDM has not entirely consistent information for that case.
While experimentally the instructions don't change register state as
long as the two XCR0 bits aren't both 1, be on the safe side and enable
both if BNDCFGS.EN is being set the first time.

This is XSA-220.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: de20bb6c4f65c4161e0931402613f9ffac86302d
master date: 2017-06-20 14:36:51 +0200

xen/arm: vgic: Sanitize target mask used to send SGI

The current function vgic_to_sgi does not sanitize the target mask and
may therefore get an invalid vCPU ID. This will result to an out of
bound access of d->vcpu[...] as there is no check whether the vCPU ID is
within the maximum supported by the guest.

This was introduced by commit ea37fd2111 "xen/arm: split vgic driver
into generic and vgic-v2 driver".

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 6fb94196730f30929f1e617fd1d05daf55376664
master date: 2017-06-20 14:47:07 +0200

gnttab: __gnttab_unmap_common_complete() is all-or-nothing

All failures have to be detected in __gnttab_unmap_common(), the
completion function must not skip part of its processing. In particular
the GNTMAP_device_map related putting of page references and adjustment
of pin count must not occur if __gnttab_unmap_common() signaled an
error. Furthermore the function must not make adjustments to global
state (here: clearing GNTTAB_device_map) before all possibly failing
operations have been performed.

There's one exception for IOMMU related failures: As IOMMU manipulation
occurs after GNTMAP_*_map have been cleared already, the related page
reference and pin count adjustments need to be done nevertheless. A
fundamental requirement for the correctness of this is that
iommu_{,un}map_page() crash any affected DomU in case of failure.

The version check appears to be pointless (or could perhaps be a
BUG_ON() or ASSERT()), but for the moment also move it.

This is part of XSA-224.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 11fc7ccb7217ccfb79edb727d1349618bcb0602d
master date: 2017-06-20 14:46:47 +0200

gnttab: correct logic to get page references during map requests

The rules for reference counting are somewhat complicated:

* Each of GNTTAB_host_map and GNTTAB_device_map need their own
reference count

* If the mapping is writeable:
- GNTTAB_host_map needs a type count under only some conditions
- GNTTAB_device_map always needs a type count

If the mapping succeeds, we need to keep all of these; if the mapping
fails, we need to release whatever references we have acquired so far.

Additionally, the code that does a lot of this calculation "inherits"
a reference as part of the process of finding out who the owner is.

Finally, if the grant is mapped as writeable (without the
GNTMAP_readonly flag), but the hypervisor cannot grab a
PGT_writeable_page type, the entire operation should fail.

Unfortunately, the current code has several logic holes:

* If a grant is mapped only GNTTAB_device_map, and with a writeable
  mapping, but in conditions where a *host* type count is not
  necessary, the code will fail to grab the necessary type count.

* If a grant is mapped both GNTTAB_device_map and GNTTAB_host_map,
  with a writeable mapping, in conditions where the host type count is
  not necessary, *and* where the page cannot be changed to type
  PGT_writeable, the condition will not be detected.

In both cases, this means that on success, the type count will be
erroneously reduced when the grant is unmapped.  In the second case,
the type count will be erroneously reduced on the failure path as
well.  (In the first case the failure path logic has the same hole
as the reference grabbing logic.)

Additionally, the return value of get_page() is not checked; but this
may fail even if the first get_page() succeeded due to a reference
counting overflow.

First of all, simplify the restoration logic by explicitly counting
the reference and type references acquired.

Consider each mapping type separately, explicitly marking the
'incoming' reference as used so we know when we need to grab a second
one.

Finally, always check the return value of get_page[_type]() and go to
the failure path if appropriate.

This is part of XSA-224.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 75b384ece635adc55c2bafbdc2d8959c10542c31
master date: 2017-06-20 14:46:21 +0200

gnttab: never create host mapping unless asked to

We shouldn't create a host mapping unless asked to even in the case of
mapping a granted MMIO page. In particular the mapping wouldn't be torn
down when processing the matching unmap request.

This is part of XSA-224.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 56f2ab5b970f1b18cf2019df4bf27db544cda6ea
master date: 2017-06-20 14:46:01 +0200

gnttab: fix handling of dev_bus_addr during unmap

If a grant has been mapped with the GNTTAB_device_map flag, calling
grant_unmap_ref() with dev_bus_addr set to zero should cause the
GNTTAB_device_map part of the mapping to be left alone.

Unfortunately, at the moment, op->dev_bus_addr is implicitly checked
before clearing the map and adjusting the pin count, but only the bits
above 12; and it is not checked at all before dropping page
references. This means a guest can repeatedly make such a call to
cause the reference count to drop to zero, causing the page to be
freed and re-used, even though it's still mapped in its pagetables.

To fix this, always check op->dev_bus_addr explicitly for being
non-zero, as well as op->flag & GNTMAP_device_map, before doing
operations on the device_map.

While we're here, make the logic a bit cleaner:

* Always initialize op->frame to zero and set it from act->frame, to reduce the
chance of untrusted input being used

* Explicitly check the full dev_bus_addr against act->frame <<
PAGE_SHIFT, rather than ignoring the lower 12 bits

This is part of XSA-224.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 8fdfcb2b6bcd074776560e76843815f124d587f1
master date: 2017-06-20 14:45:33 +0200

arm: vgic: Don't update the LR when the IRQ is not enabled

gic_raise_inflight_irq will be called if the IRQ is already inflight
(i.e the IRQ is injected to the guest). If the IRQ is already already in
the LRs, then the associated LR will be updated.

To know if the interrupt is already in the LR, the function check if the
interrupt is queued. However, if the interrupt is not enabled then the
interrupt may not be queued nor in the LR. So gic_update_one_lr may be
called (if we inject on the current vCPU) and read the LR.

Because the interrupt is not in the LR, Xen will either read:
    * LR 0 if the interrupt was never injected before
    * LR 255 (GIC_INVALID_LR) if the interrupt was injected once. This
    is because gic_update_one_lr will reset p->lr.

Reading LR 0 will result to potentially update the wrong interrupt and
not keep the LRs in sync with Xen.

Reading LR 255 will result to:
    * Crash Xen on GICv3 as the LR index is bigger than supported (see
    gicv3_ich_read_lr).
    * Read/write always GICH_LR + 255 * 4 that is not part of the memory
    mapped.

The problem can be prevented by checking whether the interrupt is
enabled in gic_raise_inflight_irq before calling gic_update_one_lr.

A follow-up of this patch is expected to mitigate the issue in the
future.

This is XSA-223.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: c84e4b2dd4050ef3eecc13fcfa6842373ba4519c
master date: 2017-06-20 14:41:55 +0200

guest_physmap_remove_page() needs its return value checked

Callers, namely such subsequently freeing the page, must not blindly
assume success - the function may namely fail when needing to shatter a
super page, but there not being memory available for the then needed
intermediate page table.

As it happens, guest_remove_page() callers now also all check the
return value.

Furthermore a missed put_gfn() on an error path in gnttab_transfer() is
also being taken care of.

This is part of XSA-222.

Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a0cce6048d010a30ac82f8db7787bbf9aada64f4
master date: 2017-06-20 14:41:16 +0200

memory: fix return value handing of guest_remove_page()

Despite the description in mm.h, guest_remove_page() previously returned 0 for
paging errors.

Switch guest_remove_page() to having regular 0/-error semantics, and propagate
the return values from clear_mmio_p2m_entry() and mem_sharing_unshare_page()
to the callers (although decrease_reservation() is the only caller which
currently cares).

This is part of XSA-222.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b614f642c35da5184416787352f51a6379a92628
master date: 2017-06-20 14:39:56 +0200

evtchn: avoid NULL derefs

Commit fbbd5009e6 ("evtchn: refactor low-level event channel port ops")
added a de-reference of the struct evtchn pointer for a port without
first making sure the bucket pointer is non-NULL. This de-reference is
actually entirely unnecessary, as all relevant callers (beyond the
problematic do_poll()) already hold the port number in their hands, and
the actual leaf functions need nothing else.

For FIFO event channels there's a second problem in that the ordering
of reads and updates to ->num_evtchns and ->event_array[] was so far
undefined (the read side isn't always holding the domain's event lock).
Add respective barriers.

This is XSA-221.

Reported-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: e7719a0dfac7a20cb7da5529e09773d8271bb78b
master date: 2017-06-20 14:37:47 +0200

x86/shadow: hold references for the duration of emulated writes

The (misnamed) emulate_gva_to_mfn() function translates a linear address to an
mfn, but releases its page reference before returning the mfn to its caller.

sh_emulate_map_dest() uses the results of one or two translations to construct
a virtual mapping to the underlying frames, completes an emulated
write/cmpxchg, then unmaps the virtual mappings.

The page references need holding until the mappings are unmapped, or the
frames can change ownership before the writes occurs.

This is XSA-219.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 26217aff67ae1538d4e1b2226afab6993cdbe772
master date: 2017-06-20 14:36:11 +0200

gnttab: correct maptrack table accesses

In order to observe a consistent (limit,pointer-table) pair, the reader
needs to either hold the maptrack lock (in line with documentation) or
both sides need to order their accesses suitably (the writer side
barrier was removed by commit dff515dfea ["gnttab: use per-VCPU
maptrack free lists"], and a read side barrier has never been there).

Make the writer publish a new table page before limit (for bounds
checks to work), and new list head last (for racing maptrack_entry()
invocations to work). At the same time add read barriers to lockless
readers.

Additionally get_maptrack_handle() must not assume ->maptrack_head to
not change behind its back: Another handle may be put (updating only
->maptrack_tail) and then got or stolen (updating ->maptrack_head).

This is part of XSA-218.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4b78efa91c8ae3c42e14b8eaeaad773c5eb3b71a
master date: 2017-06-20 14:34:34 +0200

gnttab: Avoid potential double-put of maptrack entry

Each grant mapping for a particular domain is tracked by an in-Xen
"maptrack" entry.  This entry is is referenced by a "handle", which is
given to the guest when it calls gnttab_map_grant_ref().

There are two types of mapping a particular handle can refer to:
GNTMAP_host_map and GNTMAP_device_map.  A given
gnttab_unmap_grant_ref() call can remove either only one or both of
these entries.  When a particular handle has no entries left, it must
be freed.

gnttab_unmap_grant_ref() loops through its grant unmap request list
twice.  It first removes entries from any host pagetables and (if
appropraite) iommus; then it does a single domain TLB flush; then it
does the clean-up, including telling the granter that entries are no
longer being used (if appropriate).

At the moment, it's during the first pass that the maptrack flags are
cleared, but the second pass that the maptrack entry is freed.

Unfortunately this allows the following race, which results in a
double-free:

A: (pass 1) clear host_map
B: (pass 1) clear device_map
A: (pass 2) See that maptrack entry has no mappings, free it
B: (pass 2) See that maptrack entry has no mappings, free it #

Unfortunately, unlike the active entry pinning update, we can't simply
move the maptrack flag changes to the second half, because the
maptrack flags are used to determine if iommu entries need to be
added: a domain's iommu must never have fewer permissions than the
maptrack flags indicate, or a subsequent map_grant_ref() might fail to
add the necessary iommu entries.

Instead, free the maptrack entry in the first pass if there are no
further mappings.

This is part of XSA-218.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: b7f6cbb9d43f7384e1f38f8764b9a48216c8a525
master date: 2017-06-20 14:33:13 +0200

gnttab: fix unmap pin accounting race

Once all {writable} mappings of a grant entry have been unmapped, the
hypervisor informs the guest that the grant entry has been released by
clearing the _GTF_{reading,writing} usage flags in the guest's grant
table as appropriate.

Unfortunately, at the moment, the code that updates the accounting
happens in a different critical section than the one which updates the
usage flags; this means that under the right circumstances, there may be
a window in time after the hypervisor reported the grant as being free
during which the grant referee still had access to the page.

Move the grant accounting code into the same critical section as the
reporting code to make sure this kind of race can't happen.

This is part of XSA-218.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 9a0bd460cfc28564d39fa23541bb872b13e7f7ea
master date: 2017-06-20 14:32:03 +0200

x86/mm: disallow page stealing from HVM domains

The operation's success can't be controlled by the guest, as the device
model may have an active mapping of the page. If we nevertheless
permitted this operation, we'd have to add further TLB flushing to
prevent scenarios like

"Domains A (HVM), B (PV), C (PV); B->target==A
Steps:
1. B maps page X from A as writable
2. B unmaps page X without a TLB flush
3. A sends page X to C via GNTTABOP_transfer
4. C maps page X as pagetable (potentially causing a TLB flush in C,
but not in B)

At this point, X would be mapped as a pagetable in C while being
writable through a stale TLB entry in B."

A similar scenario could be constructed for A using XENMEM_exchange and
some arbitrary PV domain C then having this page allocated.

This is XSA-217.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: fae7d5be8bb8b7a5b7005c4f3b812a47661a721e
master date: 2017-06-20 14:29:51 +0200

Makefile: Provide way to ship livepatch test files

In the toplevel Makefile, provide build-tests and install-tests
targets which descend into xen/test.  (dist-tests is provided
automatically by the pattern rule, as is the convention here.)

We have to set BASEDIR ourselves, and use these curious runes, because
the convention in Makefiles under xen/ is to "make -f Rules.mk" with
BASEDIR set and to expect Rules.mk to reinvoke the per-directory
Makefile.  (This is really very strange.)  Normally this invocation
pattern is organised by the machinery in xen/Makefile (which sets
BASEDIR) and Rules.mk, but we need to invoke it from outside that
context.

In theory it would be nice to have a pattern rule %-tests.  But this
is not the style in the rest of the toplevel Makefile; and doing that
might interfere with the dist-% pattern rule.

None of this is invoked by default.  If install-tests or dist-tests is
requested, the livepatches (the only current output from xen/tests)
are shipped in DESTDIR/usr/lib/debug/xen-livepatch/.

This allows CI systems such as osstest which are trying to consume
this to arrange for the files to be built, and output, without them
having to have special knowledge of the details of Xen's build system.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit c55667bd0ad8f04688abfd5c6317709dc00f88ab)
(cherry picked from commit 2fb72b56bf0423e4b625cebc40e2dc7ea7734986)

xen/test/livepatch: Add xen_nop.livepatch to .gitignore

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit 5225e70ccbb02f786647eddfeb65d6db1e230782)
(cherry picked from commit c40237fb500be05aee6869f5c2da5aa6f1a86f76)

xen/test/livepatch: Regularise Makefiles

In xen/test/livepatch/Makefile:

  Provide a `build' target, as most of the
  subdir-invoking Makefiles elsewhere expect.

In xen/test/Makefile:

  Replace the two open-coded targets with a generalised pattern rule
  which descends into each of SUBDIRS.  This allows `install' to work
  too (it is already supported by xen/test/livepatch/Makefile).

  Provide an explicit default target of `tests', and an `all' target
  (which is conventional).

  Suppress entry into the xen/test/livepatch subdir when we are
  building for i386, since the 32-bit hypervisor is not supported any
  more and we can't build livepatches for it either.

After this, the xen/test subdirectory is somewhere were make can be
invoked in the way which is conventional for xen.git/xen/ subdirs.

None of this is yet invoked from the top-level Makefile.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit e541982dc21dcc5be61279d22d477ed5c0bc41c5)
(cherry picked from commit 8c60e5f4327583b1c90cf3641388ef618c1727ec)

xen/test/livepatch/Makefile: Install in DESTDIR/usr/lib/debug/xen-livepatch

Dumping these patch files in /usr/lib/debug/xen-*.livepatch is a bit
ugly.

Also, refactor the Makefile to have a LIVEPATCHES variable, to reduce
repetition.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit a38d1af5fb02bee68c9a30e38b97c6129815f943)
(cherry picked from commit d32607042bdc944aa23b7eaf17cd68c8536d3a11)

xen/arm: p2m: Fix incorrect mapping of superpages

The same set of functions is used to set as well as to clean P2M
entries, except for clean operations (INVALID_MFN ~0UL) is passed as a
parameter. Unfortunately, when calculating an appropriate target order
for a particular mapping INVALID_MFN is taken into account which leads
to 4K page target order being set each time even for 2MB and 1GB
mappings.

This will result to break down the superpage into 4K mappings and leave
empty tables allocated.

This was introduced by commit 2ef3e36ec7 "xen/arm: p2m: Introduce
p2m_set_entry and __p2m_set_entry".

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master-commit-id: 3fafdc28eb98dc1cb686379d83270516fc38049d

vgic: refuse irq migration when one is already in progress

When an irq migration is already in progress, but not yet completed
(GIC_IRQ_GUEST_MIGRATING is set), refuse any other irq migration
requests for the same irq.

This patch implements this approach by returning success or failure from
vgic_migrate_irq, and avoiding irq target changes on failure. It prints
a warning in case the irq migration fails.

It also moves the clear_bit of GIC_IRQ_GUEST_MIGRATING to after the
physical irq affinity has been changed so that all operations regarding
irq migration are completed.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
master-commit-id: 91ef7364933037b6a8d825405a1c809a72e6152f

arm: remove irq from inflight, then change physical affinity

This patch fixes a potential race that could happen when
gic_update_one_lr and vgic_vcpu_inject_irq run simultaneously.

When GIC_IRQ_GUEST_MIGRATING is set, we must make sure that the irq has
been removed from inflight before changing physical affinity, to avoid
concurrent accesses to p->inflight, as vgic_vcpu_inject_irq will take a
different vcpu lock.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
master-commit-id: 31bc6a93a096bab21211e0e2c7c284ee4aec5402

xen/arm: Survive unknown traps from guests

Currently we crash Xen if we see an ESR_EL2.EC value we don't recognise.
As configurable disables/enables are added to the architecture
(controlled by RES1/RESO bits respectively), with associated synchronous
exceptions, it may be possible for a guest to trigger exceptions with
classes that we don't recognise.

While we can't service these exceptions in a manner useful to the guest,
we can avoid bringing down the host. Per ARM DDI 0487A.k_iss10775, page
D7-1937, EC values within the range 0x00 - 0x2c are reserved for future
use with synchronous exceptions, and EC within the range 0x2d - 0x3f may
be used for either synchronous or asynchronous exceptions.

The patch makes Xen handle any unknown EC by injecting an UNDEFINED
exception into the guest, with a corresponding (ratelimited) warning in
the log.

This patch is based on Linux commit f050fe7a9164 "arm: KVM: Survive unknown
traps from the guest".

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master-commit-id: baf2950213e6a50801940643e2549a5baa21ad71