xenbits.xensource.com Git - people/sstabellini/xen-unstable.git/.git/log
5 years ago xen/arm: allow domUs to iomap reserved-memory regions (stable-4.13)
Stefano Stabellini [Tue, 3 Dec 2019 02:32:19 +0000 (18:32 -0800)]
xen/arm: allow domUs to iomap reserved-memory regions

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Izhar Ameer Shaikh [Tue, 3 Dec 2019 02:40:40 +0000 (18:40 -0800)]
platform: versal: add EEMI layer support

This patch adds support for the PM EEMI API mediation layer.

Mapping between device, clock and reset nodes and corresponding base
addresses is derived from topology information. As on ZU+, certain
operations are not allowed on some device nodes, such as turning off
the ACPU cores or the LPD.

Because the handling of PM commands changes significantly for Versal
(node value representations, added and removed commands, etc.), the
Versal platform has a separate handler.

Signed-off-by: Izhar Ameer Shaikh <izhar.ameer.shaikh@xilinx.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Izhar Ameer Shaikh [Tue, 3 Dec 2019 02:37:14 +0000 (18:37 -0800)]
platform: zynqmp: add a common EEMI header

This patch adds a new common header to be used for generic PM
EEMI definitions. In addition, header guards are also added to
xilinx-zynqmp-mm.h and xilinx-zynqmp-eemi.h files.

The following unused enums are also removed:
 - pm_node_id
 - pm_request_ack
 - pm_abort_reason
 - pm_suspend_reason
 - pm_ram_state
 - pm_opchar_type

Signed-off-by: Izhar Ameer Shaikh <izhar.ameer.shaikh@xilinx.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Izhar Ameer Shaikh [Fri, 30 Aug 2019 23:32:33 +0000 (16:32 -0700)]
platform: zynqmp: correct typos in comments

Fixed minor typos in comments.

Signed-off-by: Izhar Ameer Shaikh <izhar.ameer.shaikh@xilinx.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Izhar Ameer Shaikh [Fri, 30 Aug 2019 23:32:32 +0000 (16:32 -0700)]
platform: zynqmp: rename clock node macros

To maintain future compatibility, rename clock node macros to have
PM_CLK_* prefix instead of previously used PM_CLOCK_* prefix.

Signed-off-by: Izhar Ameer Shaikh <izhar.ameer.shaikh@xilinx.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Izhar Ameer Shaikh [Fri, 30 Aug 2019 23:32:31 +0000 (16:32 -0700)]
platform: zynqmp: rename reset node macros

To maintain future compatibility, rename reset node macros to have
PM_RST_* prefix instead of previously used PM_RESET_* prefix.

Signed-off-by: Izhar Ameer Shaikh <izhar.ameer.shaikh@xilinx.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Izhar Ameer Shaikh [Fri, 30 Aug 2019 23:32:30 +0000 (16:32 -0700)]
platform: zynqmp: rename device node macros

To maintain future compatibility, rename device node macros to have
PM_DEV_* prefix instead of previously used NODE_* prefix.

Signed-off-by: Izhar Ameer Shaikh <izhar.ameer.shaikh@xilinx.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Stefano Stabellini [Mon, 15 Jul 2019 19:39:59 +0000 (12:39 -0700)]
xen: add a separate platform file for Versal

Let all EEMI calls go through for Dom0. Block access for domUs.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Stewart Hildebrand [Fri, 31 May 2019 20:26:13 +0000 (13:26 -0700)]
xen: mediate EEMI TCM calls

It is necessary to allow a DomU to issue EEMI power management
operations on TCM nodes when running OpenAMP in a DomU. Introduce the
TCM nodes in xilinx-zynqmp-eemi.c, so that a DomU is allowed to do so
when the TCM regions are assigned to it (subject to the usual
permission checks).

Signed-off-by: Stewart Hildebrand <Stewart.Hildebrand@dornerworks.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Tejas Patel [Mon, 25 Mar 2019 08:59:42 +0000 (01:59 -0700)]
platform: zynqmp: Map missing clocks to respective node

Dom0 requires access to the AMS_REF, TOPSW_LSBUS and LPD_LSBUS clocks.
Map these clocks to their respective nodes so that access is granted
if Dom0 has permission to access those nodes.

Signed-off-by: Tejas Patel <tejas.patel@xilinx.com>
Reviewed-by: Stefano Stabellini <stefanos@xilinx.com>
Stefano Stabellini [Wed, 13 Mar 2019 19:33:08 +0000 (12:33 -0700)]
s/xen,shared-memory/xen,shared-memory-v1/g

The shared memory device tree binding went upstream as
"xen,shared-memory-v1". So, rename all occurrences of
"xen,shared-memory" to "xen,shared-memory-v1" in the docs.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Stefano Stabellini [Thu, 7 Mar 2019 19:27:00 +0000 (11:27 -0800)]
xen/docs: improve reserved-memory doc

Extend the device tree snippet example in the docs to have a memory
node that covers the reserved-memory range as required by the device
tree spec.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Stefano Stabellini [Fri, 1 Mar 2019 17:28:28 +0000 (09:28 -0800)]
xen/libxc: don't change xc_domain_memory_mapping

Although libxc doesn't promise compatibility, xc_domain_memory_mapping
has been used by QEMU for years. Instead of changing the signature of
the function, introduce a new xc_domain_memory_mapping_cache which takes
the additional cacheability parameter. Leave the original
xc_domain_memory_mapping unmodified.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Stefano Stabellini [Tue, 26 Feb 2019 23:00:44 +0000 (15:00 -0800)]
xen/docs: how to map a page between dom0 and domU using iomem

Document how to use the iomem option to share a page between Dom0 and a
DomU.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Stefano Stabellini [Tue, 26 Feb 2019 23:00:28 +0000 (15:00 -0800)]
libxl/xl: add cacheability option to iomem

Parse a new cacheability option for the iomem parameter. It can be
"devmem" for device memory mappings, which is the default, or "memory"
for normal memory mappings.

Store the parameter in a new field in libxl_iomem_range.

Pass the cacheability option to xc_domain_memory_mapping.
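
Illustrative sketch, not part of the patch: the keyword handling described
above could look roughly like this in Python. The values 0 and 1 and the name
CACHEABILITY_DEVMEM follow the related libxc commit message; the name
CACHEABILITY_MEMORY and the helper parse_cacheability are assumptions made
for this example, and the real libxl code is C.

```python
from typing import Optional

# Values follow the encoding described in the libxc commit message:
# 0 is device memory, 1 is normal memory.
CACHEABILITY_DEVMEM = 0
CACHEABILITY_MEMORY = 1  # assumed name, by analogy with CACHEABILITY_DEVMEM

_KEYWORDS = {
    "devmem": CACHEABILITY_DEVMEM,
    "memory": CACHEABILITY_MEMORY,
}


def parse_cacheability(option: Optional[str]) -> int:
    """Hypothetical helper: map the iomem cacheability keyword to its value.

    A missing or empty option falls back to device memory, the default.
    """
    if not option:
        return CACHEABILITY_DEVMEM
    try:
        return _KEYWORDS[option]
    except KeyError:
        raise ValueError(f"unknown cacheability option: {option!r}") from None
```

Rejecting unknown keywords, rather than silently using the default, is a
design choice of this sketch only.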

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
CC: ian.jackson@eu.citrix.com
CC: wei.liu2@citrix.com
Stefano Stabellini [Tue, 26 Feb 2019 23:00:25 +0000 (15:00 -0800)]
libxc: xc_domain_memory_mapping, handle cacheability

Add an additional parameter to xc_domain_memory_mapping to pass
cacheability information. The parameter values are the same as for the
XEN_DOMCTL_memory_mapping hypercall (0 is device memory, 1 is normal
memory). Pass CACHEABILITY_DEVMEM by default -- no changes in behavior.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
CC: ian.jackson@eu.citrix.com
CC: wei.liu2@citrix.com
Stefano Stabellini [Fri, 4 Jan 2019 20:47:02 +0000 (12:47 -0800)]
xen: extend XEN_DOMCTL_memory_mapping to handle cacheability

Reuse the existing padding field to pass cacheability information about
the memory mapping, specifically, whether the memory should be mapped as
normal memory or as device memory (this is what we have today).

Add a cacheability parameter to map_mmio_regions. 0 means device
memory, which is what we have today.

On ARM, map device memory as p2m_mmio_direct_dev (as it is already done
today) and normal memory as p2m_ram_rw.

On x86, return error if the cacheability requested is not device memory.
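
Illustrative sketch, not part of the patch: the per-architecture behavior
described above, modeled in Python. The p2m type names are taken from this
commit message; the function name, the string-based arch argument and the
use of None for the unchanged x86 path are assumptions of the example.

```python
from typing import Optional

CACHEABILITY_DEVMEM = 0  # device memory (today's behavior)
CACHEABILITY_MEMORY = 1  # normal memory


def p2m_type_for_mapping(arch: str, cacheability: int) -> Optional[str]:
    """Hypothetical model of how a mapping request is classified.

    Returns the ARM p2m type name, or None on x86 where the existing
    device-memory path is used unchanged.
    """
    if arch == "arm":
        if cacheability == CACHEABILITY_DEVMEM:
            return "p2m_mmio_direct_dev"
        if cacheability == CACHEABILITY_MEMORY:
            return "p2m_ram_rw"
        raise ValueError(f"invalid cacheability: {cacheability}")
    if arch == "x86":
        # Only device memory is accepted on x86; anything else is an error.
        if cacheability != CACHEABILITY_DEVMEM:
            raise ValueError("x86 supports device memory mappings only")
        return None
    raise ValueError(f"unknown architecture: {arch}")
```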

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Stefano Stabellini [Tue, 29 Jan 2019 18:58:06 +0000 (10:58 -0800)]
xen/arm: export shared memory regions as reserved-memory on device tree

Shared memory regions need to be advertised to the guest. Fortunately, a
device tree binding for special memory regions already exists:
reserved-memory.

Add a reserved-memory node for each shared memory region, for both
owners and borrowers.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Mirela Simonovic [Tue, 23 Oct 2018 14:51:24 +0000 (16:51 +0200)]
xen/arm: zynqmp: Add RPLL and VPLL-related clocks to pm_clock2node map

The current clock driver in Linux for Zynq MPSoC controls the PLLs as if
they were clocks (using the clock rather than the PLL EEMI API). Only RPLL
and VPLL can be directly controlled by a guest that owns the display port,
because the display port driver in Linux requires special clock frequencies
for video and audio, which in turn require VPLL and RPLL to be locked in
fractional mode (for video and audio respectively). Therefore, we need to
allow a guest that owns the display port to directly control these
PLL-related clocks.

In the future, the Linux driver should switch to using the PLL EEMI API for
controlling PLLs; support for that has already been added to the EEMI
mediator in Xen. Once that happens, this patch can be reverted.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Stefano Stabellini <stefanos@xilinx.com>
Mirela Simonovic [Tue, 23 Oct 2018 14:51:23 +0000 (16:51 +0200)]
xen/arm: zynqmp: Remove direct accesses to PLLs and their resets

Only a limited number of PLLs can be controlled by guests, and that
has to be done using PLL EEMI APIs. Clean up the direct access options
for PLLs and their resets.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Saeed Nowshadi <saeedn@xilinx.com>
Mirela Simonovic [Tue, 23 Oct 2018 14:51:22 +0000 (16:51 +0200)]
xen/arm: zynqmp: Remove MMIO r/w accesses to clock and PLL control

Guests need to use clock/PLL EEMI API calls to query and control
the states of clocks/PLLs rather than MMIO read/write accesses. Therefore,
the gate for MMIO read/write accesses to clock/PLL control registers
has to be closed (done in this patch).

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Saeed Nowshadi <saeedn@xilinx.com>
Acked-by: Stefano Stabellini <stefanos@xilinx.com>
Mirela Simonovic [Tue, 23 Oct 2018 14:51:21 +0000 (16:51 +0200)]
xen/arm: zynqmp: Add PLL set mode/parameter EEMI API

PLL set mode/parameter should be allowed only for VPLL and RPLL, and only
for a guest which uses the display port. This is because the display
port driver requires some very specific frequencies for video and audio,
so it relies on configuring VPLL and RPLL in fractional mode (for video
and audio respectively). These two PLLs are reserved for the exclusive use
of the display port; more specifically, the clock framework of the guest
that owns the display port needs to directly control the modes of these
two PLLs, and the power management framework should allow that.
The check is implemented using the domain_has_node_access() function, which
covers this use case because access to NODE_VPLL and NODE_RPLL is granted to
a guest which has access to the display port via the newly added entries
in the pm_node_access map.
If a guest is allowed to control a PLL, the request is passed through to
EL3. Otherwise, an error is returned.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Saeed Nowshadi <saeedn@xilinx.com>
Reviewed-by: Stefano Stabellini <stefanos@xilinx.com>
Mirela Simonovic [Tue, 23 Oct 2018 14:51:20 +0000 (16:51 +0200)]
xen/arm: zynqmp: Add PLL EEMI API definitions and passthrough get functions

PLL get functions should be allowed to every guest because guests may
need to use these APIs to calculate the PLL output frequency. Therefore,
allow passthrough of get functions to every guest.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Saeed Nowshadi <saeedn@xilinx.com>
Reviewed-by: Stefano Stabellini <stefanos@xilinx.com>
Mirela Simonovic [Tue, 23 Oct 2018 14:51:19 +0000 (16:51 +0200)]
xen/arm: zynqmp: Implement checking and passthrough for clock control APIs

The clock enable, disable, set parent and set divider EEMI APIs affect
the frequency of the target clock, so a permission check is needed
to filter out the calls that a guest should not be permitted to make. To
implement the check, a clock-to-node map is introduced, implemented as the
pm_clock2node array (note that a clock can drive several nodes). Elements
of the pm_clock2node array have to be defined in order of increasing clock
ID, because the search algorithm relies on this ordering. Only clocks that
a guest could be allowed to control need to be represented in the
pm_clock2node array. Clocks that are not represented in the array are not
controllable by any guest.
A guest is granted permission to control a clock only if all the nodes
driven by the target clock are owned by the guest. If permission is granted,
the call is passed through to EL3. Otherwise, an error is returned.
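
Illustrative sketch, not part of the patch: a Python model of the
clock-to-node permission check. The commit only says that the search relies
on increasing clock IDs; the use of binary search here is an assumption, and
the clock IDs, node names and function names are made up for the example.

```python
from bisect import bisect_left
from typing import List, Optional, Set, Tuple

# (clock ID, nodes driven by that clock), sorted by increasing clock ID.
# IDs and node names are placeholders, not real EEMI values.
PM_CLOCK2NODE: List[Tuple[int, List[str]]] = [
    (10, ["NODE_A"]),
    (20, ["NODE_B", "NODE_C"]),  # one clock can drive several nodes
    (30, ["NODE_D"]),
]
_CLOCK_IDS = [clock_id for clock_id, _ in PM_CLOCK2NODE]


def lookup_nodes(clock_id: int) -> Optional[List[str]]:
    """Binary search for clock_id; only correct if the table stays sorted."""
    i = bisect_left(_CLOCK_IDS, clock_id)
    if i < len(_CLOCK_IDS) and _CLOCK_IDS[i] == clock_id:
        return PM_CLOCK2NODE[i][1]
    return None  # clock not in the table: no guest may control it


def domain_can_control_clock(owned_nodes: Set[str], clock_id: int) -> bool:
    """Grant control only if the guest owns every node the clock drives."""
    nodes = lookup_nodes(clock_id)
    if nodes is None:
        return False
    return all(node in owned_nodes for node in nodes)
```

A guest owning only NODE_B is refused control of clock 20 in this model,
because that clock also drives NODE_C.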

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Saeed Nowshadi <saeedn@xilinx.com>
Reviewed-by: Stefano Stabellini <stefanos@xilinx.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Mirela Simonovic [Tue, 23 Oct 2018 14:51:18 +0000 (16:51 +0200)]
xen/arm: zynqmp: Clock get EEMI API functions are allowed to each guest

Each guest is allowed to query clock related information (get divisor
value, current clock parent or clock status). Guests may need to use
these APIs to construct the information about the partial clock tree
that they control or depend on - e.g. although a guest may not control
a clock it may need to calculate its frequency and these APIs are
necessary to enable the calculation.
If the provided clock ID is valid, the call is passed through to
EL3. Otherwise, an error is returned.

The clock ID definitions are added in this patch. Although this patch
requires only the clock ID min and max values to check whether a clock ID is
valid, the clock ID definitions are needed in Xen in general. This is because
the Xilinx clock driver implementation in Linux queries the clock tree
topology at runtime from firmware (ATF). The device tree does contain some
information about the clocks, but only leaf clock ID numbers and their
binding to device interfaces. The underlying software layers need to know
everything else about the clock tree. Since the clock topology resides in ATF
and querying calls are passed through by Xen, Xen at least needs to know
about clock IDs to be able to map them to nodes in order to determine
clock-control permissions. This clock-control permission checking will be
added in a following patch.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Saeed Nowshadi <saeedn@xilinx.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Mirela Simonovic [Tue, 23 Oct 2018 14:51:17 +0000 (16:51 +0200)]
xen/arm: zynqmp: Return not supported error for clock get/set rate API

Guests should implement the clock get/set rate EEMI API themselves by
mapping it to the clock divisor, multiplexer, and gate related EEMI APIs,
so Xen returns a not-supported error for these calls.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Saeed Nowshadi <saeedn@xilinx.com>
Reviewed-by: Stefano Stabellini <stefanos@xilinx.com>
Mirela Simonovic [Tue, 23 Oct 2018 14:51:16 +0000 (16:51 +0200)]
xen/arm: zynqmp: Fix power management status/error codes

Power management error codes were recently fixed in ATF and aligned
with PMU-FW definitions. Do the same for Xen.

Signed-off-by: Mirela Simonovic <mirela.simonovic@aggios.com>
Reviewed-by: Saeed Nowshadi <saeedn@xilinx.com>
Acked-by: Stefano Stabellini <stefanos@xilinx.com>
Stefano Stabellini [Mon, 24 Sep 2018 23:07:33 +0000 (16:07 -0700)]
xen/eemi: proper bounds checks

ARRAY_SIZE(pm_node_access) and ARRAY_SIZE(pm_reset_access) are out of
bounds when used as indexes. addr == end in pm_mmio_access is also not valid.
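
Illustrative sketch, not part of the patch: the two off-by-one conditions
described above, written out in Python. The helper names are hypothetical;
the point is only that both upper bounds are exclusive.

```python
def index_in_bounds(index: int, array_size: int) -> bool:
    """Valid indexes are 0 .. array_size - 1; array_size itself is not."""
    return 0 <= index < array_size


def addr_in_region(addr: int, start: int, end: int) -> bool:
    """The region is half-open: start is included, end is excluded."""
    return start <= addr < end
```

A check written with `<=` on the upper bound would wrongly accept both
index == array_size and addr == end.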

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Stefano Stabellini [Thu, 6 Sep 2018 17:52:41 +0000 (10:52 -0700)]
xen: match VTCR_EL2 SL0 attribute in TTBCR

The SL0 attribute in TTBCR, which specifies the entry level in the page
table lookup, should be the same as the SL0 attribute in VTCR_EL2, given
that pagetables are shared between MMU and SMMU.

Make it so, by reading the value from VTCR_EL2, and setting reg
accordingly.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Tested-by: Upender Cherukupally <upender@xilinx.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Tejas Patel [Sat, 24 Feb 2018 15:47:11 +0000 (07:47 -0800)]
xen: platform: zynqmp: Add new eemi api IDs

New EEMI API IDs are added in ATF and Linux.
Sync the EEMI API IDs in Xen with Linux and ATF.

Signed-off-by: Tejas Patel <tejasp@xilinx.com>
Acked-by: Jolly Shah <jollys@xilinx.com>
Reviewed-by: Alistair Francis <alistair.francis@xilinx.com>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Alistair Francis [Thu, 16 Nov 2017 22:27:15 +0000 (14:27 -0800)]
arch/arm64: zynqmp: Allow MMIO access to the CRF audio register

Allow the guest to access the R_CRF_DP_AUDIO_REF_CTRL register. This
fixes warm reboot issues with newer kernels.

Signed-off-by: Alistair Francis <alistair.francis@xilinx.com>
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Edgar E. Iglesias [Sun, 26 Mar 2017 21:37:13 +0000 (23:37 +0200)]
xen/arm: zynqmp: Use the USB XHCI areas to determine EEMI perms

Use the USB XHCI areas to determine EEMI perms.

Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Edgar E. Iglesias [Sun, 26 Mar 2017 21:00:38 +0000 (23:00 +0200)]
xen/arm64: zynqmp: Regenerate LPD memmap

Regenerate LPD memmap.

Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Edgar E. Iglesias [Mon, 30 Jan 2017 16:36:42 +0000 (17:36 +0100)]
xen/arm: zynqmp: Forward platform specific firmware calls

Implement an EEMI mediator and forward platform specific
firmware calls from guests to firmware.

The EEMI mediator is responsible for implementing access
controls, modifying or blocking calls that try to operate
on the setup of devices that are not under the calling guest's
control.

EEMI:
https://www.xilinx.com/support/documentation/user_guides/ug1200-eemi-api.pdf

Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Zhongze Liu [Wed, 5 Dec 2018 22:16:02 +0000 (14:16 -0800)]
docs: documentation about static shared memory regions

Author: Zhongze Liu <blackskygg@gmail.com>

Add docs to document the motivation, usage, use cases and other
relevant information about the static shared memory feature.

This is for the proposal "Allow setting up shared memory areas between VMs
from xl config file". See:

  https://lists.xen.org/archives/html/xen-devel/2017-08/msg03242.html

The corresponding device tree binding is described by
Documentation/devicetree/bindings/reserved-memory/xen,shared-memory.txt.

Signed-off-by: Zhongze Liu <blackskygg@gmail.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: xen-devel@lists.xen.org
Zhongze Liu [Wed, 5 Dec 2018 22:16:01 +0000 (14:16 -0800)]
libxl:xl: add parsing code to parse "libxl_static_sshm" from xl config files

Add the parsing utils for the newly introduced libxl_static_sshm struct
to the libxl/libxlu_* family. Also add related parsing code in xl to
parse the struct from xl config files. This is for the proposal "Allow
setting up shared memory areas between VMs from xl config file" (see [1]).

[1] https://lists.xen.org/archives/html/xen-devel/2017-08/msg03242.html

Signed-off-by: Zhongze Liu <blackskygg@gmail.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: xen-devel@lists.xen.org
Zhongze Liu [Wed, 5 Dec 2018 22:16:00 +0000 (14:16 -0800)]
libxl: support unmapping static shared memory areas during domain destruction

Add libxl__sshm_del to unmap static shared memory areas mapped by
libxl__sshm_add during domain creation. The unmapping process is:

* For an owner: decrease the refcount of the sshm region; if the refcount
  reaches 0, clean up the whole sshm path.

* For a borrower:
  1) Unmap the shared pages and clean up the related xs entries. If the
     system works normally, all the shared pages will be unmapped, so there
     won't be page leaks. In case of errors, the unmapping process will go
     on and unmap all the other pages that can be unmapped, so the other
     pages won't be leaked, either.
  2) Decrease the refcount of the sshm region; if the refcount reaches
     0, clean up the whole sshm path.
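
Illustrative sketch, not part of the patch: the teardown steps above modeled
in Python. A plain dict stands in for the refcounts that the real code keeps
in xenstore, and the function names and the unmap_page callback are
assumptions of the example.

```python
from typing import Callable, Dict, List, Tuple


def release_region(refcounts: Dict[str, int], region_id: str) -> bool:
    """Drop one reference; remove the region entry when it reaches 0.

    Returns True if the region was cleaned up.
    """
    refcounts[region_id] -= 1
    if refcounts[region_id] == 0:
        del refcounts[region_id]  # stands in for removing the sshm path
        return True
    return False


def borrower_teardown(
    refcounts: Dict[str, int],
    region_id: str,
    pages: List[int],
    unmap_page: Callable[[int], None],
) -> Tuple[List[int], bool]:
    """Unmap every page that can be unmapped, then drop the reference."""
    failed: List[int] = []
    for page in pages:
        try:
            unmap_page(page)
        except OSError:
            # Keep going so that one failure does not leak the other pages.
            failed.append(page)
    cleaned_up = release_region(refcounts, region_id)
    return failed, cleaned_up
```

An owner would call only release_region in this model; a borrower goes
through borrower_teardown.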

This is for the proposal "Allow setting up shared memory areas between VMs
from xl config file" (see [1]).

[1] https://lists.xen.org/archives/html/xen-devel/2017-08/msg03242.html

Signed-off-by: Zhongze Liu <blackskygg@gmail.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: xen-devel@lists.xen.org
Zhongze Liu [Tue, 29 Jan 2019 18:57:11 +0000 (10:57 -0800)]
libxl: support mapping static shared memory areas during domain creation

Add libxl__sshm_add to map shared pages from one DomU to another. The
mapping process involves the following steps:

  * Set defaults and check for further errors in the static_shm configs:
    overlapping areas, invalid ranges, duplicated owner domain,
    not page aligned, no owner domain etc.
  * Use xc_domain_add_to_physmap_batch to map the shared pages to borrowers
  * When some of the pages can't be successfully mapped, roll back any
    successfully mapped pages so that the system stays in a consistent state.
  * Write information about static shared memory areas into the appropriate
    xenstore paths and set the refcount of the shared region accordingly.
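
Illustrative sketch, not part of the patch: the roll-back step from the list
above, modeled in Python. The real code maps pages with
xc_domain_add_to_physmap_batch; the per-page map_page and unmap_page
callbacks and the function name are assumptions made to keep the example
self-contained.

```python
from typing import Callable, List


def map_with_rollback(
    pages: List[int],
    map_page: Callable[[int], None],
    unmap_page: Callable[[int], None],
) -> List[int]:
    """Map all pages, or leave none mapped if any mapping fails."""
    mapped: List[int] = []
    try:
        for page in pages:
            map_page(page)
            mapped.append(page)
    except OSError:
        # Undo the mappings made so far, newest first, then report the error.
        for page in reversed(mapped):
            unmap_page(page)
        raise
    return mapped
```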

Temporarily mark this as unsupported on x86 because calling p2m_add_foreign on
two DomUs is currently not allowed on x86 (see the comments in
x86/mm/p2m.c:p2m_add_foreign for more details).

This is for the proposal "Allow setting up shared memory areas between VMs
from xl config file" (see [1]).

[1] https://lists.xen.org/archives/html/xen-devel/2017-08/msg03242.html

Signed-off-by: Zhongze Liu <blackskygg@gmail.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Zhongze Liu [Tue, 29 Jan 2019 18:56:08 +0000 (10:56 -0800)]
libxl: introduce a new structure to represent static shared memory regions

Add a new structure to the IDL family to represent static shared memory regions
as described in the proposal "Allow setting up shared memory areas between VMs
from xl config file" (see [1]).

[1] https://lists.xen.org/archives/html/xen-devel/2017-08/msg03242.html

Signed-off-by: Zhongze Liu <blackskygg@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: xen-devel@lists.xen.org
Zhongze Liu [Wed, 5 Dec 2018 22:15:57 +0000 (14:15 -0800)]
xen: xsm: flask: introduce XENMAPSPACE_gmfn_share for memory sharing

The existing XENMAPSPACE_gmfn_foreign subop of XENMEM_add_to_physmap forbids
Dom0 from mapping memory pages from one DomU to another, which rules out some
useful yet not dangerous use cases, such as sharing pages among DomUs so that
they can do shm-based communication.

This patch introduces XENMAPSPACE_gmfn_share to address this inconvenience,
which is mostly the same as XENMAPSPACE_gmfn_foreign but has its own xsm check.

Specifically, the patch:

* Introduces a new av permission MMU__SHARE_MEM to denote if two domains can
  share memory by using the new subop;
* Introduces xsm_map_gmfn_share() to check if (current) has proper permission
  over (t) AND MMU__SHARE_MEM is allowed between (d) and (t);
* Modifies the default xen.te to allow MMU__SHARE_MEM for normal domains that
  allow grant mapping/event channels.
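
Illustrative sketch, not part of the patch: the two-part condition checked by
the new hook, modeled in Python. The domain handles, the callback-based
policy and the function name are assumptions of the example and do not
reflect the FLASK API.

```python
from typing import Callable

Domain = str  # stand-in for a domain handle in this sketch


def map_gmfn_share_allowed(
    current: Domain,
    d: Domain,
    t: Domain,
    has_permission: Callable[[Domain, Domain], bool],
    share_mem_allowed: Callable[[Domain, Domain], bool],
) -> bool:
    """Both conditions must hold: current has permission over t, AND
    memory sharing is allowed between d and t."""
    return has_permission(current, t) and share_mem_allowed(d, t)
```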

The new subop is marked unsupported for x86 because calling p2m_add_foreign
on two DomUs is currently not supported on x86.

This is for the proposal "Allow setting up shared memory areas between VMs
from xl config file" (see [1]).

[1] https://lists.xen.org/archives/html/xen-devel/2017-08/msg03242.html

Signed-off-by: Zhongze Liu <blackskygg@gmail.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
Cc: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tim Deegan <tim@xen.org>
Cc: xen-devel@lists.xen.org
Roger Pau Monne [Tue, 3 Dec 2019 10:33:52 +0000 (11:33 +0100)]
automation: increase tests maximum time from 10s to 30s

10s is too low for the clang tests. This is the output from a clang
test:

  (XEN) [    6.512748] ***************************************************
  (XEN) [    6.513323] SELFTEST FAILURE: CORRECT BEHAVIOR CANNOT BE GUARANTEED
  (XEN) [    6.513891] ***************************************************
  (XEN) [    6.514469] 3... 2... 1...
  (XEN) [    9.520011] *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
  (XEN) [    9.544319] Freed 488kB init memory
  --- Xen Test Framework ---
  Environment: HVM 32bit (PAE 3 levels)
  Hello World
  Test result: SUCCESS
  (XEN) [    9.610977] Hardware Dom0 halted: halting machine

As can be seen from the output above, booting Xen and running the XTF test
takes ~10s, without accounting for the time it takes for QEMU to
initialize.

Increase the timeout to 30s to be on the safe side.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monne [Tue, 3 Dec 2019 10:33:51 +0000 (11:33 +0100)]
automation: add timestamps to Xen tests

Enable Xen timestamps in the automated Xen tests. This helps
to figure out whether Xen is stuck or just slow in the automated
tests.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
Yi Sun [Mon, 2 Dec 2019 07:24:48 +0000 (15:24 +0800)]
x86/psr: fix bug which may cause crash

During testing, we found a Xen crash with the trace below:
(XEN) Xen call trace:
(XEN)    [<ffff82d0802a065a>] R psr.c#l3_cdp_write_msr+0x1e/0x22
(XEN)    [<ffff82d0802a0858>] F psr.c#do_write_psr_msrs+0x6d/0x109
(XEN)    [<ffff82d08023e000>] F smp_call_function_interrupt+0x5a/0xac
(XEN)    [<ffff82d0802a2b89>] F call_function_interrupt+0x20/0x34
(XEN)    [<ffff82d080282c64>] F do_IRQ+0x175/0x6ae
(XEN)    [<ffff82d08038b8ba>] F common_interrupt+0x10a/0x120
(XEN)    [<ffff82d0802ec616>] F cpu_idle.c#acpi_idle_do_entry+0x9d/0xb1
(XEN)    [<ffff82d0802ecc01>] F cpu_idle.c#acpi_processor_idle+0x41d/0x626
(XEN)    [<ffff82d08027353b>] F domain.c#idle_loop+0xa5/0xa7
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 20:
(XEN) GENERAL PROTECTION FAULT
(XEN) [error_code=0000]
(XEN) ****************************************

The bug happens when CDP and MBA co-exist and MBA COS_MAX is bigger
than CDP COS_MAX. E.g. MBA has 8 COS registers but CDP only has 6.
When setting MBA throttling value for the 7th guest, the value array
would be:
    +------------------+------------------+--------------+
    | Data default val | Code default val | MBA throttle |
    +------------------+------------------+--------------+

Then, COS id 7 will be selected for writing the values. We should
avoid writing CDP data/code values to the COS id 7 MSR because it
exceeds the CDP COS_MAX.
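
Illustrative sketch, not part of the patch: a Python model of the guard
described above. The Feature tuple, the write_msr callback and the
assumption that a feature with N COS registers has a maximum COS id of N - 1
are all made up for the example; the real code in psr.c is C and is
structured differently.

```python
from typing import Callable, List, NamedTuple


class Feature(NamedTuple):
    name: str
    cos_max: int     # highest COS id this feature has a register for
    num_values: int  # entries of the value array that belong to it


def write_cos_values(
    features: List[Feature],
    values: List[int],
    cos: int,
    write_msr: Callable[[str, int, int, int], None],
) -> None:
    """Write each feature's values for COS id `cos`, skipping features
    whose COS_MAX is smaller than `cos`."""
    offset = 0
    for feat in features:
        chunk = values[offset:offset + feat.num_values]
        offset += feat.num_values  # always advance past this feature
        if cos > feat.cos_max:
            continue  # no MSR exists for this COS id: do not write
        for slot, val in enumerate(chunk):
            write_msr(feat.name, cos, slot, val)
```

With CDP modeled as cos_max 5 and MBA as cos_max 7, a write for COS id 7
touches only the MBA register in this sketch.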

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
George Dunlap [Fri, 29 Nov 2019 17:24:45 +0000 (17:24 +0000)]
Rationalize max_grant_frames and max_maptrack_frames handling

Xen used to have single, system-wide limits for the number of grant
frames and maptrack frames a guest was allowed to create. Increasing
or decreasing this single limit on the Xen command-line would change
the limit for all guests on the system.

Later, per-domain limits for these values were created. The system-wide
limits became strict limits: domains could not be created with higher
limits, but could be created with lower limits. However, that change
also introduced a range of different "default" values into various
places in the toolstack:

- The python libxc bindings hard-coded these values to 32 and 1024,
  respectively
- The libxl default values are 32 and 1024 respectively.
- xl will use the libxl default for maptrack, but does its own default
  calculation for grant frames: either 32 or 64, based on the max
  possible mfn.

These defaults interact poorly with the hypervisor command-line limit:

- The hypervisor command-line limit cannot be used to raise the limit
  for all guests anymore, as the default in the toolstack will
  effectively override this.
- If you use the hypervisor command-line limit to *reduce* the limit,
  then the "default" values generated by the toolstack are too high,
  and all guest creations will fail.

In other words, the toolstack defaults require any change to be
effected by having the admin explicitly specify a new value in every
guest.

In order to address this, have grant_table_init treat negative values
for max_grant_frames and max_maptrack_frames as instructions to use the
system-wide default, and have all the above toolstacks default to passing
-1 unless a different value is explicitly configured.
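
The rule can be sketched as follows. `resolve_limit` is a hypothetical helper, not the actual grant_table_init code, and the error value is an assumption made for illustration.

```c
/*
 * Resolve a per-domain request against the system-wide limit.
 *  - negative request: use the system-wide value
 *  - request above the system-wide limit: reject (returns -1)
 *  - otherwise: use the request as given
 */
static int resolve_limit(int requested, unsigned int system_max)
{
    if ( requested < 0 )
        return (int)system_max;

    if ( (unsigned int)requested > system_max )
        return -1;

    return requested;
}
```

A toolstack that passes -1 by default therefore follows whatever the hypervisor command line says, whether that limit was raised or lowered.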

This restores the old behavior in that changing the hypervisor command-line
option can change the behavior for all guests, while retaining the ability
to set per-guest values.  It also removes the bug that reducing the
system-wide max will cause all domains without explicit limits to fail.

NOTE: - The Ocaml bindings require the caller to always specify a value,
        and the code to start a xenstored stubdomain hard-codes these to 4
        and 128 respectively; this behavior will not be modified.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit f2ae59bc4b9b5c3f12de86aa42cdf413d2c3ffbf)

5 years agox86 / iommu: set up a scratch page in the quarantine domain
Paul Durrant [Wed, 27 Nov 2019 17:11:43 +0000 (17:11 +0000)]
x86 / iommu: set up a scratch page in the quarantine domain

This patch introduces a new iommu_op to facilitate a per-implementation
quarantine set up, and then further code for x86 implementations
(amd and vtd) to set up a read-only scratch page to serve as the source
for DMA reads whilst a device is assigned to dom_io. DMA writes will
continue to fault as before.

The reason for doing this is that some hardware may continue to re-try
DMA (despite FLR) in the event of an error, or even BME being cleared, and
will fail to deal with DMA read faults gracefully. Having a scratch page
mapped will allow pending DMA reads to complete and thus such buggy
hardware will eventually be quiesced.

NOTE: These modifications are restricted to x86 implementations only as
      the buggy h/w I am aware of is only used with Xen in an x86
      environment. ARM may require similar code but, since I am not
      aware of the need, this patch does not modify any ARM implementation.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoconsole: avoid buffer overrun in guest_console_write()
Jan Beulich [Fri, 29 Nov 2019 16:20:06 +0000 (17:20 +0100)]
console: avoid buffer overrun in guest_console_write()

conring_puts() has been requiring a nul-terminated string, which the
local kbuf[] doesn't get set for anymore. Add a length parameter to the
function, just like was done for others, thus allowing embedded nul to
also be read through XEN_SYSCTL_readconsole.
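
As an illustration only, a length-based ring write might look like the sketch below. The ring size and the names `ring`, `ring_prod` and `ring_puts` are hypothetical; this is not the real conring_puts().

```c
#include <stddef.h>

#define RING_SIZE 16u   /* hypothetical size; must be a power of two */

static char ring[RING_SIZE];
static unsigned int ring_prod;

/*
 * Copy exactly len bytes into the ring.  No terminator is consulted, so
 * the source need not be nul-terminated and embedded nul bytes are kept.
 */
static void ring_puts(const char *str, size_t len)
{
    while ( len-- )
        ring[ring_prod++ & (RING_SIZE - 1)] = *str++;
}
```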

While there drop a stray cast: Both operands of - are already uint32_t.

Fixes: ea601ec9995b ("xen/console: Rework HYPERCALL_console_io interface")
Reported-by: Jürgen Groß <jgross@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <julien@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
master commit: 0ef3ad971275c30355245299998faddfada51726
master date: 2019-11-29 17:09:16 +0100

5 years agoconsole: avoid buffer overflow in guest_console_write()
Jan Beulich [Fri, 29 Nov 2019 16:18:33 +0000 (17:18 +0100)]
console: avoid buffer overflow in guest_console_write()

The switch of guest_console_write()'s second parameter from plain to
unsigned int has caused the function's main loop header to no longer
guard the min_t() use within the function against effectively negative
values, due to the casts hidden inside the macro. Replace by a plain
min(), casting one of the arguments as necessary.
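
The effect of the hidden cast can be reproduced with a simplified `min_t()`. The macro and the two helper functions below are assumptions made for illustration (Xen's real macros add type checking), and the conversion of an out-of-range unsigned value to int is implementation-defined, although common compilers wrap it to a negative number.

```c
#include <stddef.h>

/* Simplified macro: both operands are cast to 'type' before comparing. */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

/* Chunk size computed through a hidden cast to int. */
static int chunk_min_t(unsigned int count, size_t bufsz)
{
    return min_t(int, count, bufsz - 1);
}

/* Chunk size computed entirely in unsigned arithmetic. */
static unsigned int chunk_min(unsigned int count, size_t bufsz)
{
    unsigned int limit = (unsigned int)(bufsz - 1);

    return count < limit ? count : limit;
}
```

For a count with the top bit set, the first variant yields a negative chunk size while the second is clamped to the buffer limit.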

Fixes: ea601ec9995b ("xen/console: Rework HYPERCALL_console_io interface")
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
master commit: aaf8839fdf8b9b1a93a3837b82f680adea1b297c
master date: 2019-11-29 17:08:20 +0100

5 years agoTurn off debug in preparation for 4.13 release.
Ian Jackson [Fri, 29 Nov 2019 14:55:45 +0000 (14:55 +0000)]
Turn off debug in preparation for 4.13 release.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
5 years agox86/svm: Write the correct %eip into the outgoing task
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/svm: Write the correct %eip into the outgoing task

The TASK_SWITCH vmexit has fault semantics, and doesn't provide any NRIPs
assistance with instruction length.  As a result, any instruction-induced task
switch has the outgoing task's %eip pointing at the instruction which caused
the switch, rather than after it.

This causes callers of task gates to livelock (repeatedly execute the call/jmp
to enter the task), and any restartable task to become a nop after its first
use (the (re)entry state points at the iret used to exit the task).

32bit Windows in particular is known to use task gates for NMI handling, and
to use NMI IPIs.

In the task switch handler, distinguish instruction-induced from
interrupt/exception-induced task switches, and decode the instruction under
%rip to calculate its length.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/svm: Always intercept ICEBP
Andrew Cooper [Mon, 25 Nov 2019 19:33:36 +0000 (19:33 +0000)]
x86/svm: Always intercept ICEBP

ICEBP isn't handled well by SVM.

The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it.  As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.

To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.

We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).

Unconditionally intercept ICEBP.  This does have NRIPs support as it is an
instruction intercept, which allows us to move %rip forwards appropriately
before the TASK_SWITCH intercept is hit.  This makes #DB-vectored switches
have consistent behaviour however the ICEBP #DB came about, and avoids special
cases in the TASK_SWITCH intercept.

This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
will now always be submitted for monitoring checks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/vtx: Fix fault semantics for early task switch failures
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/vtx: Fix fault semantics for early task switch failures

The VT-x task switch handler adds inst_len to %rip before calling
hvm_task_switch(), which is problematic in two ways:

 1) Early faults (i.e. ones delivered in the context of the old task) get
    delivered with trap semantics, and break restartability.

 2) The addition isn't truncated to 32 bits.  In the corner case of a task
    switch instruction crossing the 4G->0 boundary taking an early fault (with
    trap semantics), a VMEntry failure will occur due to %rip being out of
    range.

Instead, pass the instruction length into hvm_task_switch() and write it into
the outgoing TSS only, leaving %rip in its original location.
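
The truncation concern in point 2 can be sketched as below; `outgoing_eip` is a hypothetical helper, not the hvm_task_switch() interface.

```c
#include <stdint.h>

/*
 * Value to record in the outgoing TSS: the address of the next
 * instruction, wrapped at the 4G boundary.  The conversion to uint32_t
 * on return discards any carry out of bit 31.
 */
static uint32_t outgoing_eip(uint32_t eip, unsigned int inst_len)
{
    return (uint32_t)(eip + inst_len);
}
```

An instruction of length 3 starting at 0xfffffffe thus resumes at 1 rather than at an address above 4G.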

For now, pass 0 on the SVM side.  This highlights a separate preexisting bug
which will be addressed in the following patch.

While adjusting call sites, drop the unnecessary uint16_t cast.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agobuild: provide option to disambiguate symbol names
Jan Beulich [Thu, 28 Nov 2019 16:47:25 +0000 (17:47 +0100)]
build: provide option to disambiguate symbol names

The .file assembler directives generated by the compiler do not include
any path components (gcc) or just the ones specified on the command line
(clang, at least version 5), and hence multiple identically named source
files (in different directories) may produce identically named static
symbols (in their kallsyms representation). The binary diffing algorithm
used by xen-livepatch, however, depends on having unique symbols.

Make the ENFORCE_UNIQUE_SYMBOLS Kconfig option control the (build)
behavior, and if enabled use objcopy to prepend the (relative to the
xen/ subdirectory) path to the compiler invoked STT_FILE symbols. Note
that this build option is made no longer depend on LIVEPATCH, but merely
defaults to its setting now.

Conditionalize explicit .file directive insertion in C files where it
exists just to disambiguate names in a less generic manner; note that
at the same time the redundant emission of STT_FILE symbols gets
suppressed for clang. Assembler files as well as multiply compiled C
ones using __OBJECT_FILE__ are left alone for the time being.

Since we now expect there not to be any duplicates anymore, also don't
force the selection of the option to 'n' anymore in allrandom.config.
Similarly COVERAGE no longer suppresses duplicate symbol warnings if
enforcement is in effect, which in turn allows
SUPPRESS_DUPLICATE_SYMBOL_WARNINGS to simply depend on
!ENFORCE_UNIQUE_SYMBOLS.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Wei Liu <wl@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/IRQ: make internally used IRQs also honor the pending EOI stack
Jan Beulich [Thu, 28 Nov 2019 14:14:03 +0000 (15:14 +0100)]
x86/IRQ: make internally used IRQs also honor the pending EOI stack

At the time the pending EOI stack was introduced there were no
internally used IRQs which would have the LAPIC EOI issued from the
->end() hook. This had then changed with the introduction of IOMMUs,
but the interaction issue was presumably masked by
irq_guest_eoi_timer_fn() frequently EOI-ing interrupts way too early
(which got fixed by 359cf6f8a0ec ["x86/IRQ: don't keep EOI timer
running without need"]).

The problem is that with us re-enabling interrupts across handler
invocation, a higher priority (guest) interrupt may trigger while
handling a lower priority (internal) one. The EOI issued from
->end() (for ACKTYPE_EOI kind interrupts) would then mistakenly
EOI the higher priority (guest) interrupt, breaking (among other
things) pending EOI stack logic's assumptions.

Notes:

- In principle we could get away without the check_eoi_deferral flag.
  I've introduced it just to make sure there's as little change as
  possible to unaffected paths.
- Similarly the cpu_has_pending_apic_eoi() check in do_IRQ() isn't
  strictly necessary.
- The new function's name isn't very helpful with its use in
  end_level_ioapic_irq_new(). I did also consider eoi_APIC_irq() (to
  parallel ack_APIC_irq()), but then liked this even less.

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Diagnosed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/vmx: always sync PIR to IRR before vmentry
Roger Pau Monné [Thu, 28 Nov 2019 10:58:25 +0000 (11:58 +0100)]
x86/vmx: always sync PIR to IRR before vmentry

When using posted interrupts on Intel hardware it's possible that the
vCPU resumes execution with a stale local APIC IRR register because
depending on the interrupts to be injected vlapic_has_pending_irq
might not be called, and thus PIR won't be synced into IRR.

Fix this by making sure PIR is always synced to IRR in
hvm_vcpu_has_pending_irq regardless of what interrupts are pending.

Reported-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoMAINTAINERS: Update path to the livepatch documentation
Julien Grall [Tue, 26 Nov 2019 13:30:23 +0000 (13:30 +0000)]
MAINTAINERS: Update path to the livepatch documentation

Commit d661611d08 "docs/markdown: Switch to using pandoc, and fix
underscore escaping" converted the livepatch documentation from markdown
to pandoc.

Update MAINTAINERS to reflect the change so the correct maintainers are
CCed to the patches.

Fixes: d661611d08 ("docs/markdown: Switch to using pandoc, and fix underscore escaping")
Signed-off-by: Julien Grall <julien@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/microcode: refuse to load the same revision ucode
Sergey Dyasli [Wed, 27 Nov 2019 10:04:30 +0000 (10:04 +0000)]
x86/microcode: refuse to load the same revision ucode

Currently if a user tries to live-load the same or older ucode revision
than the CPU already has, they will get a single message in Xen log like:

    (XEN) 128 cores are to update their microcode

No actual ucode loading will happen and this situation can be quite
confusing. Fix this by starting ucode update only when the provided
ucode revision is higher than the currently cached one (if any).
This is based on the property that if microcode_cache exists, all CPUs
in the system should have at least that ucode revision.
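
A sketch of the decision, assuming revisions compare as plain unsigned integers; `struct ucode` and `should_update` are hypothetical names, not the microcode loader's API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical patch descriptor. */
struct ucode {
    uint32_t rev;
};

/*
 * Start an update only if the blob contains a matching patch that is
 * strictly newer than the cached one.  With no cache, any match is used.
 */
static bool should_update(const struct ucode *cached,
                          const struct ucode *blob)
{
    if ( !blob )
        return false;   /* nothing matching in the provided blob */

    return !cached || blob->rev > cached->rev;
}
```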

Additionally, print a user friendly message if no matching or newer
ucode can be found in the provided blob. This also requires ignoring
-ENODATA in AMD-side code, otherwise the message given to the user is:

    (XEN) Parsing microcode blob error -61

Which actually means that a ucode blob was parsed fine, but no matching
ucode was found.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoAMD/IOMMU: honour IR setting while pre-filling DTEs
Igor Druzhinin [Tue, 26 Nov 2019 17:08:19 +0000 (17:08 +0000)]
AMD/IOMMU: honour IR setting while pre-filling DTEs

IV bit shouldn't be set in DTE if interrupt remapping is not
enabled. It's a regression in behavior of "iommu=no-intremap"
option which otherwise would keep interrupt requests untranslated
for all of the devices in the system regardless of whether it's
described as valid in IVRS or not.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agodocs/xl: Document pci-assignable state
George Dunlap [Tue, 26 Nov 2019 15:49:20 +0000 (15:49 +0000)]
docs/xl: Document pci-assignable state

Changesets 319f9a0ba9 ("passthrough: quarantine PCI devices") and
ba2ab00bbb ("IOMMU: default to always quarantining PCI devices")
introduced PCI device "quarantine" behavior, but did not document how
the pci-assignable-add and -remove functions act in regard to this.
Rectify this.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Reviewed-by: Paul Durrant <paul@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoEFI: fix "efi=attr=" handling
Jan Beulich [Tue, 26 Nov 2019 13:17:45 +0000 (14:17 +0100)]
EFI: fix "efi=attr=" handling

Commit 633a40947321 ("docs: Improve documentation and parsing for efi=")
failed to honor the strcmp()-like return value convention of
cmdline_strcmp().
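
The convention in question, shown with the standard strcmp() rather than cmdline_strcmp(): a return value of 0 means the strings match, so a match test must compare against 0. The helper name and option strings are illustrative only.

```c
#include <stdbool.h>
#include <string.h>

/*
 * strcmp()-style functions return 0 on a match.  Writing
 * "if ( strcmp(arg, name) )" therefore tests for a MISMATCH.
 */
static bool option_is(const char *arg, const char *name)
{
    return strcmp(arg, name) == 0;
}
```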

Reported-by: Roman Shaposhnik <roman@zededa.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/p2m-pt: fix (latent) page table mapping leak on do_recalc() error paths
Jan Beulich [Tue, 26 Nov 2019 13:17:11 +0000 (14:17 +0100)]
x86/p2m-pt: fix (latent) page table mapping leak on do_recalc() error paths

There are two mappings active in the middle of do_recalc(), and hence
commit 0d0f4d78e5d1 ("p2m: change write_p2m_entry to return an error
code") should have added (or otherwise invoked) unmapping code just
like it did in p2m_next_level(), despite us not expecting any errors
here. Arrange for the existing unmap invocation to take effect in all
cases.
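
The general pattern, a single exit path that performs the unmapping for both success and failure, might be sketched like this. The mapping helpers only count outstanding mappings and are hypothetical; this is not the do_recalc() code.

```c
/* Hypothetical mapping helpers that only track outstanding mappings. */
static int nr_mapped;

static void *map_table(void *p)  { nr_mapped++; return p; }
static void unmap_table(void *p) { (void)p; nr_mapped--; }

static int recalc(int fail_midway)
{
    int outer_tbl, inner_tbl, rc = 0;
    void *outer = map_table(&outer_tbl);
    void *inner = map_table(&inner_tbl);

    if ( fail_midway )
    {
        rc = -1;
        goto out;       /* error path still reaches the unmaps below */
    }

    /* ... normal processing with both mappings active ... */

 out:
    unmap_table(inner);
    unmap_table(outer);
    return rc;
}
```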

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible
Anthony PERARD [Tue, 26 Nov 2019 13:16:09 +0000 (14:16 +0100)]
x86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible

This hypercall can take a long time to finish because it attempts to
grab the `hostp2m' lock up to 1024 times. The accumulated wait for the
lock can take several seconds.

This can easily happen with a guest with 32 vcpus and plenty of RAM,
during localhost migration.

While the patch doesn't fix the problem with the lock contention and
the fact that the `hostp2m' lock is currently global (and not on a
single page), it is still an improvement to the hypercall. It will in
particular, down the road, allow dropping the arbitrary limit of 1024
entries per request.
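
A sketch of what preemptibility means for a batched operation. The preemption check and the function names are hypothetical; in Xen the interrupted call would be resumed through a hypercall continuation rather than a caller loop.

```c
#include <stdbool.h>

/* Hypothetical preemption check: pretend work is pending every 4 entries. */
static bool preempt_pending(unsigned int done)
{
    return done && !(done % 4);
}

/*
 * Process entries [*start, nr).  Returns 1 when interrupted, with *start
 * updated so that a re-issued call resumes where this one stopped, and 0
 * once all entries are done.  At least one entry is handled per call, so
 * forward progress is guaranteed.
 */
static int process_batch(unsigned int *start, unsigned int nr)
{
    for ( unsigned int i = *start; i < nr; i++ )
    {
        if ( i != *start && preempt_pending(i) )
        {
            *start = i;
            return 1;
        }

        /* ... take the lock, handle entry i, drop the lock ... */
    }

    *start = nr;
    return 0;
}
```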

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoIOMMU: default to always quarantining PCI devices
Jan Beulich [Tue, 26 Nov 2019 13:15:01 +0000 (14:15 +0100)]
IOMMU: default to always quarantining PCI devices

XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.

Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.

Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).

This is XSA-306.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
5 years agox86: Don't increase ApicIdCoreSize past 7
George Dunlap [Tue, 26 Nov 2019 10:32:42 +0000 (10:32 +0000)]
x86: Don't increase ApicIdCoreSize past 7

Changeset ca2eee92df44 ("x86, hvm: Expose host core/HT topology to HVM
guests") attempted to "fake up" a topology which would induce guest
operating systems to not treat vcpus as sibling hyperthreads.  This
involved actually reporting hyperthreading as available, but giving
vcpus every other ApicId; which in turn led to doubling the ApicIds
per core by bumping the ApicIdCoreSize by one.  In particular, Ryzen
3xxx series processors -- and reportedly EPYC "Rome" cpus -- have an
ApicIdCoreSize of 7; the "fake" topology increases this to 8.

Unfortunately, Windows running on modern AMD hardware -- including
Ryzen 3xxx series processors, and reportedly EPYC "Rome" cpus --
doesn't seem to cope with this value being higher than 7.  (Linux
guests have so far continued to cope.)

A "proper" fix is complicated and it's too late to fix it either for
4.13, or to backport to supported branches.  As a short-term fix,
limit this value to 7.
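
A sketch of the cap, assuming the field is taken from bits 15:12 of CPUID leaf 0x80000008 ECX. The helper name is hypothetical and the real policy code may treat other cases (such as a zero field) differently.

```c
#include <stdint.h>

#define APIC_ID_SIZE_SHIFT 12
#define APIC_ID_SIZE_MASK  (0xfu << APIC_ID_SIZE_SHIFT)

/* Bump the size field by one, but never produce a value above 7. */
static uint32_t bump_apic_id_size(uint32_t ecx)
{
    uint32_t size = (ecx & APIC_ID_SIZE_MASK) >> APIC_ID_SIZE_SHIFT;

    if ( size < 7 )
        size++;

    return (ecx & ~APIC_ID_SIZE_MASK) | (size << APIC_ID_SIZE_SHIFT);
}
```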

This does mean that a Linux guest, booted on such a system without
this change, and then migrating to a system with this change, with
more than 64 vcpus, would see an apparent topology change.  This is a
low enough risk in practice that enabling this limit unilaterally, to
allow other guests to boot without manual intervention, is worth it.

Reported-by: Steven Haigh <netwiz@crc.id.au>
Reported-by: Andreas Kinzler <hfp@posteo.de>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/mm: Adjust linear uses / entries when a page loses validation
George Dunlap [Fri, 22 Nov 2019 18:52:02 +0000 (18:52 +0000)]
x86/mm: Adjust linear uses / entries when a page loses validation

"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable of the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).

XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both.  To implement this, we keep track of the number of
outstanding times a page points to or is pointed to by another page
table, to prevent both from happening at the same time.
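
One way to picture this accounting is a single signed counter per page. The sign convention and the names below are assumptions made for illustration and may not match Xen's actual fields.

```c
#include <stdbool.h>

/*
 * Hypothetical counter: positive means this page holds entries pointing
 * at other same-level pages; negative means other same-level pages point
 * at this one.  The two states are mutually exclusive.
 */
struct pt {
    int linear_count;
};

/* Record that 'p' points at another same-level page. */
static bool inc_linear_entries(struct pt *p)
{
    if ( p->linear_count < 0 )
        return false;   /* already pointed to: refuse to also point */
    p->linear_count++;
    return true;
}

/* Record that another same-level page points at 'p'. */
static bool inc_linear_uses(struct pt *p)
{
    if ( p->linear_count > 0 )
        return false;   /* already points elsewhere: refuse */
    p->linear_count--;
    return true;
}
```

If the counter is not adjusted when a page loses validation, it stays non-zero and later legitimate uses are refused, which is the overly strict behaviour described in this message.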

Additionally, XSA-299 introduced a mode whereby if a page was known to
have been only partially validated, _put_page_type() would be called
with PTF_partial_set, indicating that if the page had been
de-validated by someone else, the type count should be left alone.

Unfortunately, this change did not account for the required accounting
for linear page table uses and entries; in the case that a previously
partially-devalidated pagetable was fully-devalidated by someone else,
the linear_pt_counts are not updated.

This could happen in one of two places:

1. In the case a partially-devalidated page was re-validated by
someone else

2. During domain tear-down, when pages are force-invalidated while
leaving the type count intact.

The second could be ignored, since at that point the pages can no
longer be abused; but the first requires handling.  Note however that
this would not be a security issue: having the counts be too high is
overly strict (i.e., will prevent a page from being used in a way
which is perfectly safe), but shouldn't cause any other issues.

Fix this by adjusting the linear counts when a page loses validation,
regardless of whether the de-validation completed or was only partial.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: make default path to add/remove all PV devices
Oleksandr Grytsov [Thu, 21 Nov 2019 18:13:00 +0000 (20:13 +0200)]
libxl: make default path to add/remove all PV devices

Adding/removing a device is handled for specific devices only: VBD, VIF,
QDISK. This commit adds a default case to handle adding/removing for all PV
devices by default, except the QDISK device, which requires special handling.
If any other device requires special handling, it should be done by
implementing a separate case (similar to the QDISK device). The default
behaviour for adding a device is to wait until the backend goes to
XenbusStateInitWait, and the default behaviour on removing a device is to
start the generic device remove procedure.

This commit also fixes the guest removal function: previously the guest was
removed once all VIF and VBD devices had been removed. With the fix, the
guest is removed once all created devices have been removed. This is done by
checking the guest device list instead of checking num_vifs and num_vbds. The
num_vifs and num_vbds variables are removed as redundant in this case.

Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: introduce new backend type VINPUT
Oleksandr Grytsov [Thu, 21 Nov 2019 18:12:58 +0000 (20:12 +0200)]
libxl: introduce new backend type VINPUT

There are two kinds of VKBD devices: with QEMU backend and user space PV
backend. In the current implementation they can't be distinguished as both
use the VKBD backend type. As a result, the user space PV KBD backend is
started and stopped as a QEMU backend. This commit adds a new device kind,
VINPUT, to be used as the backend type for the user space PV KBD backend.

Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/vvmx: Fix livelock with XSA-304 fix
Andrew Cooper [Thu, 21 Nov 2019 18:21:49 +0000 (18:21 +0000)]
x86/vvmx: Fix livelock with XSA-304 fix

It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.

Nested virt is experimental and not security supported, but is useful for
development purposes.  In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.

Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting.  Take the opportunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.

When nested virt is enabled on a domain, force it to use executable
superpages and rebuild the p2m.

Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/livepatch: Prevent patching with active waitqueues
Andrew Cooper [Tue, 5 Nov 2019 19:08:14 +0000 (19:08 +0000)]
x86/livepatch: Prevent patching with active waitqueues

The safety of livepatching depends on every stack having been unwound, but
there is one corner case where this is not true.  The Sharing/Paging/Monitor
infrastructure may use waitqueues, which copy the stack frame sideways and
longjmp() to a different vcpu.

This case is rare, and can be worked around by pausing the offending
domain(s), waiting for their rings to drain, then performing a livepatch.

In the case that there is an active waitqueue, fail the livepatch attempt with
-EBUSY, which is preferable to the fireworks which occur from trying to unwind
the old stack frame at a later point.
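
The check can be sketched as a scan that refuses to proceed while any waitqueue is in use. `struct dom` and `livepatch_precheck` are hypothetical names, not the livepatch implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-domain state. */
struct dom {
    bool waitqueue_active;
};

/* Return -EBUSY if any domain still has a vcpu parked on a waitqueue. */
static int livepatch_precheck(const struct dom *doms, size_t nr)
{
    for ( size_t i = 0; i < nr; i++ )
        if ( doms[i].waitqueue_active )
            return -EBUSY;

    return 0;
}
```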

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode
Roger Pau Monné [Fri, 22 Nov 2019 16:52:59 +0000 (17:52 +0100)]
x86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode

Current code unconditionally prevents setting APIC_SPIV_FOCUS_DISABLED
regardless of the processor model, which is not correct according to
the specification.

This issue was discovered while trying to boot a pvshim with x2APIC
enabled.

Always allow setting APIC_SPIV_FOCUS_DISABLED: the local APIC
provided to guests is emulated by Xen, and as such doesn't depend on
the features found on the hardware processor. Note for example that
Xen offers x2APIC support to guests even when the underlying hardware
doesn't have such feature.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoxen: Add missing va_end() in hypercall_create_continuation()
Julien Grall [Wed, 20 Nov 2019 13:37:51 +0000 (13:37 +0000)]
xen: Add missing va_end() in hypercall_create_continuation()

The documentation requires va_start() to always be matched with a
corresponding va_end(). However, this is not the case in the path used
for bad format.
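
A self-contained illustration of the pairing rule, using a hypothetical variadic function rather than hypercall_create_continuation(): the bad-format path jumps to a common exit so that va_end() runs on every return.

```c
#include <stdarg.h>

/* Sum the int arguments tagged 'i'; reject any other format character. */
static int sum_args(const char *fmt, ...)
{
    va_list args;
    int rc = 0;

    va_start(args, fmt);

    for ( ; *fmt; fmt++ )
    {
        if ( *fmt != 'i' )
        {
            rc = -1;
            goto out;   /* bad format: still falls through to va_end() */
        }
        rc += va_arg(args, int);
    }

 out:
    va_end(args);
    return rc;
}
```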

This was introduced by XSA-296.

Coverity-ID: 1488727
Fixes: 0bf9f8d3e3 ("xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()")
Signed-off-by: Julien Grall <julien@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/e820: fix 640k - 1M region reservation logic
Sergey Dyasli [Wed, 30 Oct 2019 14:54:47 +0000 (14:54 +0000)]
x86/e820: fix 640k - 1M region reservation logic

Converting a guest from PV to PV-in-PVH leaves the guest with 384k
less memory, which may confuse the guest's balloon driver. This happens
because Xen unconditionally reserves the 640k - 1M region in E820 despite
the fact that it's really usable RAM in PVH boot mode.

Fix this by skipping region type change in virtualised environments,
trusting whatever memory map our hypervisor has provided.
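
The 384k figure is simply 1M minus 640k. A sketch of the decision, with a hypothetical helper name:

```c
#include <stdbool.h>
#include <stdint.h>

#define KB(x) ((uint64_t)(x) << 10)
#define MB(x) ((uint64_t)(x) << 20)

/* Bytes of RAM hidden by reserving the legacy 640k - 1M region. */
static uint64_t legacy_hole_reserved(bool virtualised)
{
    /* When running under a hypervisor, trust the provided memory map. */
    return virtualised ? 0 : MB(1) - KB(640);
}
```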

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/boot: Cache cpu_has_hypervisor very early on boot
Andrew Cooper [Fri, 1 Nov 2019 20:07:31 +0000 (20:07 +0000)]
x86/boot: Cache cpu_has_hypervisor very early on boot

We cache Long Mode and No Execute early on boot, so take the opportunity to
cache HYPERVISOR early as well.

Replace opencoded early access to the feature bit.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/boot: Remove cached CPUID data from the trampoline
Andrew Cooper [Mon, 9 Sep 2019 10:43:33 +0000 (11:43 +0100)]
x86/boot: Remove cached CPUID data from the trampoline

We have a cached cpuid_ext_features in the trampoline which is kept in sync by
various pieces of boot logic.  This is complicated, and all it is actually
used for is to derive whether NX is safe to use.

Replace it with a canned value to load into EFER.

trampoline_setup() and efi_arch_cpu() now tweak trampoline_efer at the point
that they are stashing the main copy of CPUID data.  Similarly,
early_init_intel() needs to tweak if it has re-enabled the use of NX.

This simplifies the AP boot and S3 resume paths by using trampoline_efer
directly, rather than locally turning FEATURE_NX into EFER_NX.
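
A minimal sketch of the "canned value" approach, for illustration only.
The EFER bit positions are architectural; the variable and function names
are hypothetical stand-ins, and the initial contents of the canned value
are an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFER_SCE (1u << 0)   /* SYSCALL enable */
#define EFER_LME (1u << 8)   /* long mode enable */
#define EFER_NXE (1u << 11)  /* no-execute enable */

/* Hypothetical stand-in for the value the trampoline loads into EFER. */
static uint32_t canned_efer = EFER_LME | EFER_SCE;

/* Called at the point where boot code learns whether NX is usable. */
static void efer_set_nx(bool nx_usable)
{
    if (nx_usable)
        canned_efer |= EFER_NXE;
    else
        canned_efer &= ~EFER_NXE;
}

/* Secondary boot paths load the canned value as-is, with no CPUID lookup. */
static uint32_t efer_for_secondary_boot(void)
{
    return canned_efer;
}
```

The point of the design is that only the code which discovers the NX
state has to know about CPUID; every later consumer just loads a word.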

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/Makefile: remove $(guard) use from $(TARGET).efi target
Anthony PERARD [Wed, 20 Nov 2019 16:12:12 +0000 (17:12 +0100)]
x86/Makefile: remove $(guard) use from $(TARGET).efi target

Following the patch 65d104984c04 ("x86: fix race to build
arch/x86/efi/relocs-dummy.o"), the error message
  nm: 'efi/relocs-dummy.o': No such file
started to appear on systems which can't build the .efi target. This is
because relocs-dummy.o isn't built anymore.
The error is printed by the evaluation of VIRT_BASE and ALT_BASE which
aren't used anyway.

But we don't need that file, as we don't want to build `$(TARGET).efi'
anyway.  On such systems, $(guard) evaluates to the shell builtin ':',
which prevents any of the shell commands in `$(TARGET).efi' from being
executed.

Even if $(guard) is evaluated upon use, it depends on $(XEN_BUILD_PE)
which is evaluated at the assignment. So, we can replace $(guard) in
$(TARGET).efi by having two different rules depending on
$(XEN_BUILD_PE) instead.

The change with this patch is that none of the dependencies of
$(TARGET).efi will be built if the linker doesn't support PE,
and VIRT_BASE and ALT_BASE don't get evaluated anymore, so nm will not
complain about the missing relocs-dummy.o file anymore.

Since prelink-efi.o isn't built on systems that can't build
$(TARGET).efi anymore, we can remove the $(guard) variable everywhere.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoefi: do not use runtime services table with efi=no-rs
Marek Marczykowski-Górecki [Wed, 20 Nov 2019 16:10:59 +0000 (17:10 +0100)]
efi: do not use runtime services table with efi=no-rs

Before dfcccc6631 "efi: use directmap to access runtime services table"
all usages of efi_rs pointer were guarded by efi_rs_enter(), which
implicitly refused to operate with efi=no-rs (by checking if
efi_l4_pgtable is NULL - which is the case for efi=no-rs). The said
commit (re)moved that call as unneeded for just reading content of
efi_rs structure - to avoid unnecessary page tables switch. But it
neglected to check if efi_rs access is legal.

Fix this by adding explicit check for runtime service being enabled in
the cases that do not use efi_rs_enter().

Reported-by: Roman Shaposhnik <roman@zededa.com>
Fixes: dfcccc6631 "efi: use directmap to access runtime services table"
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/cpuid: Fix Lisbon/Magny-Cours Opterons WRT SSSE3/SSE4A
Andrew Cooper [Tue, 19 Nov 2019 16:40:26 +0000 (16:40 +0000)]
x86/cpuid: Fix Lisbon/Magny-Cours Opterons WRT SSSE3/SSE4A

c/s ff66ccefe5 "x86/CPUID: adjust SSEn dependencies" made SSE4A depend on
SSSE3, but these processors really do have SSE4A without SSSE3.

This manifests as an upgrade regression, as the SSE4A feature disappears from
view.

Adjust the SSE4A feature to depend on SSE3 rather than SSSE3.
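
To illustrate why the dependency matters, here is a small sketch of
dependency-based feature sanitising. The enum, the table and the function
are hypothetical and far simpler than the real CPUID machinery; only the
SSE4A -> SSE3 edge reflects this patch.

```c
#include <stdint.h>

/* Hypothetical three-feature universe, for illustration only. */
enum feat { F_SSE3, F_SSSE3, F_SSE4A, F_COUNT };

/* deps[f] is a bitmask of the features which f requires. */
static const uint32_t deps[F_COUNT] = {
    [F_SSE3]  = 0,
    [F_SSSE3] = 1u << F_SSE3,
    [F_SSE4A] = 1u << F_SSE3,   /* previously 1u << F_SSSE3 */
};

/* Clear every feature whose requirements are not all present. */
static uint32_t sanitise(uint32_t fs)
{
    uint32_t prev;

    do {
        prev = fs;
        for (int f = 0; f < F_COUNT; f++)
            if ((fs & (1u << f)) && (deps[f] & ~fs))
                fs &= ~(1u << f);
    } while (fs != prev);

    return fs;
}
```

With the old edge, a CPU reporting SSE3 and SSE4A but not SSSE3 would
have SSE4A stripped by such a pass, which is the regression described
above.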

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoconfigure: Fix test for python 3.8
Anthony PERARD [Fri, 15 Nov 2019 16:15:32 +0000 (16:15 +0000)]
configure: Fix test for python 3.8

https://docs.python.org/3.8/whatsnew/3.8.html#debug-build-uses-the-same-abi-as-release-build

> To embed Python into an application, a new --embed option must be
> passed to python3-config --libs --embed to get -lpython3.8 (link the
> application to libpython). To support both 3.8 and older, try
> python3-config --libs --embed first and fallback to python3-config
> --libs (without --embed) if the previous command fails.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
[ wei: rerun autogen.sh ]

5 years agotools/configure: Honour XEN_COMPILE_ARCH and _TARGET_ for shim
Ian Jackson [Tue, 29 Oct 2019 17:45:30 +0000 (17:45 +0000)]
tools/configure: Honour XEN_COMPILE_ARCH and _TARGET_ for shim

The pvshim can only be built 64-bit because the hypervisor is only
64-bit nowadays.  The hypervisor build supports XEN_COMPILE_ARCH and
XEN_TARGET_ARCH which override the information from uname.  The pvshim
build runs out of the tools/ directory but calls the hypervisor build
system.

If one runs in a Linux 32-bit userland with a 64-bit kernel, one used
to be able to set XEN_COMPILE_ARCH.  But nowadays this does not work.
configure sees the target cpu as 64-bit and tries to build pvshim.
The build prints
  echo "*** Xen x86/32 target no longer supported!"
and doesn't build anything.  Then the subsequent Makefiles try to
install the non-built pieces.

Fix this anomaly by causing configure to honour the Xen hypervisor way
of setting the target architecture.

In principle this user behaviour is not handled quite right, because
configure will still see 64-bit and so all the autoconf-based
architecture testing will see 64-bit rather than 32-bit x86.  But the
tools are in fact generally quite portable: this particular location
in configure{.ac,} is the only place in tools/ where 64-bit x86 is
treated differently from 32-bit x86, so the fix is sufficient and
correct for this use case.

It remains the case that setting XEN_COMPILE_ARCH or XEN_TARGET_ARCH to a
non-x86 architecture, when configure thinks things are x86, or vice
versa, will not work right.

(This is a bugfix to 8845155c831c
  pvshim: make PV shim build selectable from configure
which inadvertently deleted the logic to only build the shim for
XEN_TARGET_ARCH != x86_32.)

I have rerun autogen.sh, so this patch contains the fix to configure
as well as the source fix to configure.ac.

Fixes: 8845155c831c59e867ee3dd31ee63e0cc6c7dcf2
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
CC: Olaf Hering <olaf@aepfle.de>
CC: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Jürgen Groß <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
5 years agolibxl: gentypes: initialise array elements in json
Oleksandr Grytsov [Mon, 28 Oct 2019 18:22:16 +0000 (18:22 +0000)]
libxl: gentypes: initialise array elements in json

Currently, array elements are initialised with calloc, which means that
all element fields are initialised with zero values.  If an entry is not
present in the json (which is entirely permitted), the element will be
all-bits-zero instead of the default value (which is wrong).

The fix is to initialise each array element before parsing it, using
the new libxl_C_type_do_init function.

With existing types this results in a lot of new calls like this:

      for (i=0; (t=libxl__json_array_get(x,i)); i++) {
 +            libxl_sched_params_init(&p->vcpus[i]);
              rc = libxl__sched_params_parse_json(gc, t, &p->vcpus[i]);

(indentation adjusted).  This looks right.  To check what happens with
types which have nontrivial defaults but don't have init functions (of
which we currently have none in arrays), I (Ian) experimentally added:

      ("pnode", uint32), # physical node of this node
      ("vcpus", libxl_bitmap), # vcpus in this node
 +    ("sporks", Array(MemKB, "num_sporks")),
      ])

The result was this:

          for (i=0; (t=libxl__json_array_get(x,i)); i++) {
 +                p->sporks[i] = LIBXL_MEMKB_DEFAULT;
                  rc = libxl__uint64_parse_json(gc, t, &p->sporks[i]);

where the context was added by adding "sporks" and "+" indicates a
line added by this patch, "initialise array elements in json".

Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
---
v2 [iwj]: Use libxl_C_type_do_init.
          Reword commit message and discuss spork testing.

5 years agolibxl: gentypes.py: Break out libxl_C_type_do_init
Ian Jackson [Tue, 29 Oct 2019 15:19:33 +0000 (15:19 +0000)]
libxl: gentypes.py: Break out libxl_C_type_do_init

This is going to be the common way to initialise things.
_libxl_C_type_init remains the thing for generating the body of the
init function, and for some special cases.

No functional change with existing types: C output is identical.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
5 years agolibxl: gentypes.py: Break out field_pass in ..._copy_deprecated
Ian Jackson [Tue, 29 Oct 2019 15:17:58 +0000 (15:17 +0000)]
libxl: gentypes.py: Break out field_pass in ..._copy_deprecated

We are going to want this in a moment.

No functional change with existing types: C output is identical.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
5 years agotools/libxl: gentypes.py: Prefer init_val to init_fn
Ian Jackson [Tue, 29 Oct 2019 15:00:35 +0000 (15:00 +0000)]
tools/libxl: gentypes.py: Prefer init_val to init_fn

When both are provided, init_val is likely to be more direct.

No functional change with existing types: C output is identical.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
5 years agolibxl_pci: Don't hold QMP connection while waiting
Anthony PERARD [Thu, 31 Oct 2019 12:17:27 +0000 (12:17 +0000)]
libxl_pci: Don't hold QMP connection while waiting

After sending the 'device_del' command for a PCI passthrough device,
we wait until QEMU has effectively deleted the device; this involves
executing more QMP commands. While waiting, libxl holds the connection.

It isn't necessary to hold the connection and it prevents others from
making progress, so this patch releases the QMP connection.

For background:
    e.g., when a guest is created with several pci passthrough devices
    attached, on `xl destroy` all the devices need to be detached, and
    this is usually what happens:
    - 'device_del' called for the 1st pci device
    - 'query-pci' checking if pci still there, it is
    - wait 1s
    - 'query-pci' checking again, and it's gone
    -> now the same can be done for the second pci device, so
       plenty of waiting on others when pci detach can be done in
       parallel.

    On shutdown, libxl usually keeps waiting because QEMU never
    releases the device because the guest kernel never responds to QEMU's
    unplug queries. So detaching of the 1st device waits until a
    timeout stops it, and since the same timeout is set up at the same
    time for the other devices to detach, the 'device_del' command is
    never sent for those.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl_qmp: Have a lock for QMP socket access
Anthony PERARD [Mon, 18 Nov 2019 17:13:08 +0000 (17:13 +0000)]
libxl_qmp: Have a lock for QMP socket access

This patch works around the fact that it's not possible to connect
multiple times to a single QMP socket. QEMU listens on the socket with
a backlog value of 1, which means that on Linux when concurrent threads
call connect() on the socket, they get EAGAIN.

Background:
    This happens when attempting to create a guest with multiple
    passed-through pci devices: libxl creates one connection per device
    to attach and executes connect() on all at once before any single
    connection has finished.

To work around this, we use a new lock.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: Introduce libxl__ev_immediate
Anthony PERARD [Mon, 18 Nov 2019 18:10:14 +0000 (18:10 +0000)]
libxl: Introduce libxl__ev_immediate

This new ev allows arranging for a non-reentrant callback to be called.
This happens immediately after the current event is processed and after
other ev_immediates that have already been registered.
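
The ordering described above can be pictured as a FIFO of pending
callbacks which is drained once the current event handler has returned.
The sketch below only illustrates that idea; every name in it is
hypothetical and it is not the libxl implementation.

```c
#include <stddef.h>

/* Hypothetical queued-callback record. */
struct immediate {
    void (*callback)(struct immediate *);
    struct immediate *next;
};

/* FIFO with a tail pointer so registration is O(1). */
struct imm_queue {
    struct immediate *head;
    struct immediate **tail;
};

static void imm_queue_init(struct imm_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* Registering never calls the callback itself, hence no re-entrancy. */
static void imm_register(struct imm_queue *q, struct immediate *ei)
{
    ei->next = NULL;
    *q->tail = ei;
    q->tail = &ei->next;
}

/* Run after the current event; entries fire in registration order. */
static void imm_run_all(struct imm_queue *q)
{
    struct immediate *ei;

    while ((ei = q->head) != NULL) {
        q->head = ei->next;
        if (q->head == NULL)
            q->tail = &q->head;
        ei->callback(ei);
    }
}

/* Example callback: records the order in which entries fire. */
struct tagged { struct immediate ei; int id; };
static int fired[8];
static size_t nr_fired;

static void record(struct immediate *ei)
{
    fired[nr_fired++] = ((struct tagged *)ei)->id;
}
```

Because each entry is unlinked before its callback runs, a callback may
register further entries; they are appended and run in the same drain.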

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: libxl__ev_qmp_send now takes an egc
Anthony PERARD [Mon, 18 Nov 2019 17:13:06 +0000 (17:13 +0000)]
libxl: libxl__ev_qmp_send now takes an egc

No functional changes.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: Introduce libxl__ev_slowlock_dispose
Anthony PERARD [Mon, 18 Nov 2019 17:13:05 +0000 (17:13 +0000)]
libxl: Introduce libxl__ev_slowlock_dispose

This allows the lock operation to be cancelled while it is in the Active
state.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: Rename ev_devlock to ev_slowlock
Anthony PERARD [Mon, 18 Nov 2019 17:13:04 +0000 (17:13 +0000)]
libxl: Rename ev_devlock to ev_slowlock

We are going to introduce a different lock based on the same
implementation as the ev_devlock but with a different path. The
different slowlocks will be differentiated by calling different _init()
functions.

So we rename libxl__ev_devlock to libxl__ev_slowlock, but keep
libxl__ev_devlock_init().

Some log messages produced by ev_slowlock are changed to print the
name of the lock file (userdata_userid).

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: Move libxl__ev_devlock declaration
Anthony PERARD [Mon, 18 Nov 2019 17:13:03 +0000 (17:13 +0000)]
libxl: Move libxl__ev_devlock declaration

We are going to want to include libxl__ev_devlock into libxl__ev_qmp.

No functional changes.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: Introduce libxl__ev_child_kill_deregister
Anthony PERARD [Mon, 18 Nov 2019 17:13:02 +0000 (17:13 +0000)]
libxl: Introduce libxl__ev_child_kill_deregister

Allow deregistering the callback associated with a child death event.

The death isn't immediate and will need to be collected later, so the
ev_child machinery registers its own callback.

libxl__ev_child_kill_deregister() might be called by an AO operation
that is finishing/cleaning up without a chance for libxl to be
notified of the child death (via SIGCHLD). So it is possible that the
application calls libxl_ctx_free() while there are still children around.
To avoid the application getting unexpected SIGCHLD, the libxl__ao
responsible for killing a child will have to wait until it has been
properly reaped.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agodocs: adjust xen release cycle text
Wei Liu [Fri, 15 Nov 2019 14:27:20 +0000 (14:27 +0000)]
docs: adjust xen release cycle text

Fix text about release cycle. Drop the conjured up example that's no
longer applicable.

Signed-off-by: Wei Liu <wl@xen.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: fix race to build arch/x86/efi/relocs-dummy.o
Anthony PERARD [Fri, 15 Nov 2019 13:18:16 +0000 (14:18 +0100)]
x86: fix race to build arch/x86/efi/relocs-dummy.o

With $(TARGET).efi depending on efi/relocs-dummy.o, arch/x86/Makefile
will attempt to build that object. This may result in a dependency file
being generated that has relocs-dummy.o depending on efi/relocs-dummy.S.

Then, when arch/x86/efi/Makefile tries to build relocs-dummy.o, well
efi/relocs-dummy.S doesn't exist.

Have only one makefile responsible for building relocs-dummy.o.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoAMD/IOMMU: restore DTE fields in amd_iommu_setup_domain_device()
Jan Beulich [Fri, 15 Nov 2019 13:17:26 +0000 (14:17 +0100)]
AMD/IOMMU: restore DTE fields in amd_iommu_setup_domain_device()

Commit 1b00c16bdf ("AMD/IOMMU: pre-fill all DTEs right after table
allocation") moved ourselves into a more secure default state, but
didn't take sufficient care to also undo the effects when handing a
previously disabled device back to a(nother) domain. Put the fields
that may have been changed elsewhere back to their intended values
(some fields amd_iommu_disable_domain_device() touches don't
currently get written anywhere else, and hence don't need modifying
here).

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86emul: 16-bit XBEGIN does not truncate rIP
Jan Beulich [Fri, 15 Nov 2019 13:15:31 +0000 (14:15 +0100)]
x86emul: 16-bit XBEGIN does not truncate rIP

SDM rev 071 points out this fact explicitly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agolibxl: fix device model timeout in libxl__dm_resume()
Juergen Gross [Fri, 15 Nov 2019 07:04:14 +0000 (08:04 +0100)]
libxl: fix device model timeout in libxl__dm_resume()

libxl__dm_resume() is using a wrong timeout for the start of the
device model. Instead of 60 seconds the timeout is set to 60
milliseconds.
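
The bug class is a plain unit mix-up. Assuming the timer interface takes
its relative timeout in milliseconds (as the description above implies),
the conversion can be sketched as follows; the macro and helper names are
hypothetical.

```c
#include <stdint.h>

/* Hypothetical constant: how long to wait for the device model. */
#define DM_START_TIMEOUT_SECONDS 60

/* Convert whole seconds to the milliseconds the timer is assumed to take. */
static int64_t seconds_to_ms(int64_t seconds)
{
    return seconds * 1000;
}
```

Passing seconds_to_ms(DM_START_TIMEOUT_SECONDS) rather than the bare
constant is what separates a 60 second wait from a 60 millisecond one.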

Reported-by: Roman Shaposhnik <roman@zededa.com>
Fixes: 6298f0eb8f4437 ("libxl: Re-introduce libxl__domain_resume")
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agogolang/xenlight: Fix libxl_domain_shutdown and libxl_domain_reboot as well
George Dunlap [Fri, 15 Nov 2019 09:36:58 +0000 (09:36 +0000)]
golang/xenlight: Fix libxl_domain_shutdown and libxl_domain_reboot as well

Both are now potentially asynchronous; pass in 'nil' to retain
synchronous behavior.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoxen/sched: Render sibling/core masks with %pbl to improve 'r' debugkey
Andrew Cooper [Wed, 13 Nov 2019 18:11:17 +0000 (18:11 +0000)]
xen/sched: Render sibling/core masks with %pbl to improve 'r' debugkey

For systems with large numbers of CPUs, the 'r' debugkey is unwieldy.  Sibling
and core masks are a single block of adjacent bits, so are vastly shorter to
render with %pbl.

Before:
  (XEN) CPU[00] nr_run=0, sort=157, sibling=00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003, core=00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff,ffffffff
  (XEN) CPU[01] nr_run=0, sort=13750, sibling=00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003, core=00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff,ffffffff
  (XEN) CPU[02] nr_run=0, sort=188, sibling=00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000000c, core=00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff,ffffffff
  (XEN) CPU[03] nr_run=0, sort=13730, sibling=00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000000c, core=00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff,ffffffff

After:
  (XEN) CPU[00] nr_run=0, sort=1169, sibling={0-1}, core={0-127}
  (XEN) CPU[01] nr_run=0, sort=2488, sibling={0-1}, core={0-127}
  (XEN) CPU[02] nr_run=0, sort=1210, sibling={2-3}, core={0-127}
  (XEN) CPU[03] nr_run=0, sort=2476, sibling={2-3}, core={0-127}
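
For readers unfamiliar with range-list rendering, the following sketch
shows how a mask collapses into the short form seen in the After output.
It is an illustration only: it handles a single 64-bit word rather than an
arbitrary-length bitmap, the function name is hypothetical, and the
surrounding braces are assumed to come from the caller.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Render the set bits of mask as a comma-separated list of ranges,
 * e.g. 0x3 -> "0-1".  Returns the string length, or -1 if buf is too small.
 */
static int format_range_list(char *buf, size_t len, uint64_t mask)
{
    size_t used = 0;
    int bit = 0;

    if (len)
        buf[0] = '\0';

    while (bit < 64) {
        int start, n;
        const char *sep = used ? "," : "";

        if (!(mask & (UINT64_C(1) << bit))) {
            bit++;
            continue;
        }

        /* Extend the run for as long as the next bit is also set. */
        start = bit;
        while (bit + 1 < 64 && (mask & (UINT64_C(1) << (bit + 1))))
            bit++;

        if (start == bit)
            n = snprintf(buf + used, len - used, "%s%d", sep, start);
        else
            n = snprintf(buf + used, len - used, "%s%d-%d", sep, start, bit);

        if (n < 0 || (size_t)n >= len - used)
            return -1;

        used += (size_t)n;
        bit++;
    }

    return (int)used;
}
```

A sibling mask of 0x3 thus renders as "0-1" and 0xc as "2-3", matching
the sibling columns above.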

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoAMD/IOMMU: Fix crash in 'V' debugkey
Andrew Cooper [Wed, 13 Nov 2019 13:19:36 +0000 (13:19 +0000)]
AMD/IOMMU: Fix crash in 'V' debugkey

c/s bb038f31168 "AMD/IOMMU: replace INTREMAP_ENTRIES" introduces a call to
intremap_table_entries() in dump_intremap_table() before tbl.ptr is checked
for NULL.

intremap_table_entries() internally uses virt_to_page() which falls over

  ASSERT(va >= XEN_VIRT_START);

in __virt_to_page().

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoxen/passthrough: Render domains with %pd in the 'Q' debug handler
Andrew Cooper [Wed, 13 Nov 2019 13:04:43 +0000 (13:04 +0000)]
xen/passthrough: Render domains with %pd in the 'Q' debug handler

IOMMUs are owned by DOM_XEN, and with XSA-302, DOM_IO is used for
quarantined devices.  Use %pd in the printk to render the system
domains more intelligently.

Before:
  (XEN) 0000:00:01.0 - dom 0   - node 0   - MSIs < >
  (XEN) 0000:00:00.0 - dom 0   - node 0   - MSIs < >
  (XEN) 0000:80:00.2 - dom 32754 - node 1   - MSIs < >
  (XEN) 0000:a0:00.2 - dom 32754 - node 1   - MSIs < >
  (XEN) 0000:c0:00.2 - dom 32754 - node 1   - MSIs < >
  (XEN) 0000:e0:00.2 - dom 32754 - node 1   - MSIs < >
  (XEN) 0000:00:00.2 - dom 32754 - node 0   - MSIs < >
  (XEN) 0000:20:00.2 - dom 32754 - node 0   - MSIs < >
  (XEN) 0000:40:00.2 - dom 32754 - node 0   - MSIs < >
  (XEN) 0000:60:00.2 - dom 32754 - node 0   - MSIs < >

After:
  (XEN) 0000:00:01.0 - d0 - node 0   - MSIs < >
  (XEN) 0000:00:00.0 - d0 - node 0   - MSIs < >
  (XEN) 0000:80:00.2 - d[XEN] - node 1   - MSIs < >
  (XEN) 0000:a0:00.2 - d[XEN] - node 1   - MSIs < >
  (XEN) 0000:c0:00.2 - d[XEN] - node 1   - MSIs < >
  (XEN) 0000:e0:00.2 - d[XEN] - node 1   - MSIs < >
  (XEN) 0000:00:00.2 - d[XEN] - node 0   - MSIs < >
  (XEN) 0000:20:00.2 - d[XEN] - node 0   - MSIs < >
  (XEN) 0000:40:00.2 - d[XEN] - node 0   - MSIs < >
  (XEN) 0000:60:00.2 - d[XEN] - node 0   - MSIs < >
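
A simplified sketch of the rendering difference follows. The numeric ID
for DOM_XEN is taken from the Before output (32754); the function name is
hypothetical, and other system domains are deliberately left out.

```c
#include <stdint.h>
#include <stdio.h>

/* Domain ID shown for Xen-owned devices in the Before output. */
#define EXAMPLE_DOMID_XEN 32754u

/*
 * Render a domain ID the way the After output does for the two cases
 * visible above: ordinary domains as "d<N>", DOM_XEN as "d[XEN]".
 * Returns the snprintf() result.
 */
static int format_domain(char *buf, size_t len, uint16_t domid)
{
    if (domid == EXAMPLE_DOMID_XEN)
        return snprintf(buf, len, "d[XEN]");

    return snprintf(buf, len, "d%u", (unsigned int)domid);
}
```

Keeping this decision in one formatter means every printk site renders
system domains consistently without repeating the special case.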

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel
Andrew Cooper [Wed, 19 Jun 2019 17:16:03 +0000 (18:16 +0100)]
x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel

See patch documentation and comments.

This is part of XSA-305 / CVE-2019-11135

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>