xenbits.xensource.com Git - people/liuw/xen.git/log
5 years ago x86/viridian: drop private copy of HV_REFERENCE_TSC_PAGE in time.c [hyperv-ref-tsc-2]
Wei Liu [Tue, 17 Dec 2019 18:28:39 +0000 (18:28 +0000)]
x86/viridian: drop private copy of HV_REFERENCE_TSC_PAGE in time.c

Use the one defined in hyperv-tlfs.h instead. No functional change
intended.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
5 years ago x86/viridian: drop private copy of definitions from synic.c
Wei Liu [Tue, 17 Dec 2019 18:23:22 +0000 (18:23 +0000)]
x86/viridian: drop private copy of definitions from synic.c

Use hyperv-tlfs.h instead. No functional change intended.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
5 years ago x86/viridian: drop duplicate defines from private.h and viridian.c
Wei Liu [Tue, 17 Dec 2019 17:20:01 +0000 (17:20 +0000)]
x86/viridian: drop duplicate defines from private.h and viridian.c

No functional change intended.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
5 years ago x86: import hyperv-tlfs.h from Linux
Wei Liu [Thu, 24 Oct 2019 11:17:03 +0000 (12:17 +0100)]
x86: import hyperv-tlfs.h from Linux

Take a pristine copy from Linux commit b2d8b167e15bb5ec2691d1119c025630a247f649.

Do the following to fix it up for Xen:

1. include xen/types.h and xen/bitops.h
2. fix up invocations of BIT macro

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago xen/arm: Basic support for sunxi/sun50i h6 platform.
Yangtao Li [Mon, 2 Dec 2019 08:49:24 +0000 (08:49 +0000)]
xen/arm: Basic support for sunxi/sun50i h6 platform.

Add compatible strings for H6 SoCs, specifically orangepi3.

Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Tested-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Julien Grall <julien@xen.org>
5 years ago libxc/restore: Fix error message for unrecognised stream version
Andrew Cooper [Tue, 17 Dec 2019 13:49:56 +0000 (13:49 +0000)]
libxc/restore: Fix error message for unrecognised stream version

The Expected and Got values are rendered in the wrong order.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
5 years ago golang/xenlight: implement keyed union C to Go marshaling
Nick Rosbrook [Mon, 16 Dec 2019 18:08:10 +0000 (18:08 +0000)]
golang/xenlight: implement keyed union C to Go marshaling

Switch over union key to determine how to populate 'union' in Go struct.

Since the unions of C types cannot be directly accessed in cgo, use a
typeof trick to typedef a struct in the cgo preamble that is analogous
to each inner struct of a keyed union. For example, to define a struct
for the hvm inner struct of libxl_domain_build_info, do:

  typedef typeof(((struct libxl_domain_build_info *)NULL)->u.hvm) libxl_domain_build_info_type_union_hvm;

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
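
The "switch over union key" approach described in this commit can be sketched roughly as follows. This is a pure-Go illustration, not the generated code: cBuildInfo, its fields, and the DomainType values are hypothetical stand-ins for the C struct and its union, which real code would reach through cgo.

```go
package main

import (
	"errors"
	"fmt"
)

// DomainType is a stand-in for the union key (values are hypothetical).
type DomainType int

const (
	DomainTypeHvm DomainType = iota + 1
	DomainTypePv
)

// cBuildInfo mimics a C struct holding a key plus per-variant data.
// In real cgo code the variants would live in a C union.
type cBuildInfo struct {
	key         DomainType
	hvmFirmware string
	pvKernel    string
}

// typeUnion is the one-method interface implemented by each variant.
type typeUnion interface{ isTypeUnion() }

type TypeUnionHvm struct{ Firmware string }
type TypeUnionPv struct{ Kernel string }

func (TypeUnionHvm) isTypeUnion() {}
func (TypeUnionPv) isTypeUnion()  {}

type BuildInfo struct {
	Type      DomainType
	TypeUnion typeUnion
}

// fromC inspects the key and populates only the matching variant.
func (b *BuildInfo) fromC(c *cBuildInfo) error {
	b.Type = c.key
	switch c.key {
	case DomainTypeHvm:
		b.TypeUnion = TypeUnionHvm{Firmware: c.hvmFirmware}
	case DomainTypePv:
		b.TypeUnion = TypeUnionPv{Kernel: c.pvKernel}
	default:
		return errors.New("invalid union key")
	}
	return nil
}

func main() {
	var b BuildInfo
	if err := b.fromC(&cBuildInfo{key: DomainTypeHvm, hvmFirmware: "bios"}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", b.TypeUnion)
}
```

Returning an error for an unknown key, rather than leaving the field nil, is a choice made for this sketch only.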
5 years ago golang/xenlight: begin C to Go type marshaling
Nick Rosbrook [Mon, 16 Dec 2019 18:08:09 +0000 (18:08 +0000)]
golang/xenlight: begin C to Go type marshaling

Begin implementation of fromC marshaling functions for generated struct
types. This includes support for converting fields that are basic
primitive types such as string and integer types, nested anonymous
structs, nested libxl structs, and libxl built-in types.

This patch does not implement conversion of arrays or keyed unions.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years ago golang/xenlight: remove no-longer used type MemKB
Nick Rosbrook [Mon, 16 Dec 2019 18:08:08 +0000 (18:08 +0000)]
golang/xenlight: remove no-longer used type MemKB

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years ago golang/xenlight: generate structs from the IDL
Nick Rosbrook [Mon, 16 Dec 2019 18:08:08 +0000 (18:08 +0000)]
golang/xenlight: generate structs from the IDL

Add struct and keyed union generation to gengotypes.py. For keyed unions,
use a method similar to gRPC's oneof to interpret C unions as Go types.
Meaning, for a given struct with a union field, generate a struct for
each sub-struct defined in the union. Then, define an interface of one
method which is implemented by each of the defined sub-structs. For
example:

  type domainBuildInfoTypeUnion interface {
          isdomainBuildInfoTypeUnion()
  }

  type DomainBuildInfoTypeUnionHvm struct {
      // HVM-specific fields...
  }

  func (x DomainBuildInfoTypeUnionHvm) isdomainBuildInfoTypeUnion() {}

  type DomainBuildInfoTypeUnionPv struct {
      // PV-specific fields...
  }

  func (x DomainBuildInfoTypeUnionPv) isdomainBuildInfoTypeUnion() {}

  type DomainBuildInfoTypeUnionPvh struct {
      // PVH-specific fields...
  }

  func (x DomainBuildInfoTypeUnionPvh) isdomainBuildInfoTypeUnion() {}

Then, remove existing struct definitions in xenlight.go that conflict
with the generated types, and modify existing marshaling functions to
align with the new type definitions. Notably, drop "time" package since
fields of type time.Duration are now of type uint64.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
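
From the caller's side, a oneof-style union like the one in this commit is usually consumed with a type switch. The sketch below reuses the interface and struct names from the commit message; the Firmware and Kernel fields and the describe helper are hypothetical and exist only to make the example runnable.

```go
package main

import "fmt"

type domainBuildInfoTypeUnion interface {
	isdomainBuildInfoTypeUnion()
}

// Fields are hypothetical placeholders for the variant-specific data.
type DomainBuildInfoTypeUnionHvm struct{ Firmware string }
type DomainBuildInfoTypeUnionPv struct{ Kernel string }

func (x DomainBuildInfoTypeUnionHvm) isdomainBuildInfoTypeUnion() {}
func (x DomainBuildInfoTypeUnionPv) isdomainBuildInfoTypeUnion()  {}

// describe dispatches on the concrete variant stored in the interface.
func describe(u domainBuildInfoTypeUnion) string {
	switch v := u.(type) {
	case DomainBuildInfoTypeUnionHvm:
		return "hvm firmware=" + v.Firmware
	case DomainBuildInfoTypeUnionPv:
		return "pv kernel=" + v.Kernel
	case nil:
		return "unset"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(describe(DomainBuildInfoTypeUnionHvm{Firmware: "seabios"}))
	fmt.Println(describe(nil))
}
```

Because the marker method is unexported, only types declared in the same package can satisfy the interface, which keeps the set of variants closed.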
5 years ago golang/xenlight: re-factor Hwcap type implementation
Nick Rosbrook [Mon, 16 Dec 2019 18:08:07 +0000 (18:08 +0000)]
golang/xenlight: re-factor Hwcap type implementation

Re-define Hwcap as [8]uint32, and implement toC function. Also, re-name and
modify signature of toGo function to fromC.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years ago golang/xenlight: re-factor Uuid type implementation
Nick Rosbrook [Mon, 16 Dec 2019 18:08:06 +0000 (18:08 +0000)]
golang/xenlight: re-factor Uuid type implementation

Re-define Uuid as [16]byte and implement fromC, toC, and String functions.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
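
A minimal sketch of what a [16]byte Uuid with a String method might look like. The 8-4-4-4-12 hex grouping is the conventional UUID text form and is an assumption here; the commit message does not state which format the real String function uses, and fromC/toC are omitted because they need cgo.

```go
package main

import "fmt"

// Uuid is a fixed-size array, so values are comparable and copy by value.
type Uuid [16]byte

// String renders the UUID in the conventional 8-4-4-4-12 hex grouping.
func (u Uuid) String() string {
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	u := Uuid{0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
		0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}
	fmt.Println(u)
}
```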
5 years ago golang/xenlight: define CpuidPolicyList builtin type
Nick Rosbrook [Mon, 16 Dec 2019 18:08:05 +0000 (18:08 +0000)]
golang/xenlight: define CpuidPolicyList builtin type

Define CpuidPolicyList as a string so that libxl_cpuid_parse_config can
be used in the toC function.

For now, fromC is a no-op since libxl does not support a way to read a
policy, modify it, and then give it back to libxl.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years ago golang/xenlight: define EvLink builtin as empty struct
Nick Rosbrook [Mon, 16 Dec 2019 18:08:05 +0000 (18:08 +0000)]
golang/xenlight: define EvLink builtin as empty struct

Define EvLink as an empty struct, as there is currently no reason for the
internals of this type to be used in Go.

Implement fromC and toC functions as no-ops.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years ago golang/xenlight: define MsVmGenid builtin type
Nick Rosbrook [Mon, 16 Dec 2019 18:08:04 +0000 (18:08 +0000)]
golang/xenlight: define MsVmGenid builtin type

Define MsVmGenid as [int(C.LIBXL_MS_VM_GENID_LEN)]byte and implement fromC and toC functions.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years ago golang/xenlight: define Mac builtin type
Nick Rosbrook [Mon, 16 Dec 2019 18:08:03 +0000 (18:08 +0000)]
golang/xenlight: define Mac builtin type

Define Mac as [6]byte and implement fromC, toC, and String functions.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
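
As an illustration of the Mac type described in this commit, the sketch below defines it as [6]byte and formats it as colon-separated lowercase hex. That output format is an assumption (it is the usual MAC notation); fromC and toC are left out because they depend on cgo.

```go
package main

import (
	"fmt"
	"strings"
)

// Mac holds the six octets of an Ethernet MAC address.
type Mac [6]byte

// String formats the address as colon-separated two-digit hex octets.
func (m Mac) String() string {
	parts := make([]string, len(m))
	for i, b := range m {
		parts[i] = fmt.Sprintf("%02x", b)
	}
	return strings.Join(parts, ":")
}

func main() {
	fmt.Println(Mac{0x00, 0x16, 0x3e, 0x01, 0x02, 0xab})
}
```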
5 years ago golang/xenlight: define StringList builtin type
Nick Rosbrook [Mon, 16 Dec 2019 18:08:02 +0000 (18:08 +0000)]
golang/xenlight: define StringList builtin type

Define StringList as []string and implement fromC and toC functions.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years ago golang/xenlight: re-name Bitmap marshaling functions
Nick Rosbrook [Mon, 16 Dec 2019 18:08:01 +0000 (18:08 +0000)]
golang/xenlight: re-name Bitmap marshaling functions

Re-name and modify signature of toGo function to fromC. The reason for
using 'fromC' rather than 'toGo' is that it is not a good idea to define
methods on the C types. Also, add error return type to Bitmap's toC function.

Finally, as code-cleanup, re-organize the Bitmap type's comments as per
Go conventions.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
--
Changes in v2:
- Use consistent variable naming for slice created from
  libxl_bitmap.

5 years ago golang/xenlight: define KeyValueList as empty struct
Nick Rosbrook [Mon, 16 Dec 2019 18:08:01 +0000 (18:08 +0000)]
golang/xenlight: define KeyValueList as empty struct

Define KeyValueList as empty struct as there is currently no reason for
this type to be available in the Go package.

Implement fromC and toC functions as no-ops.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years ago golang/xenlight: define Devid type as int
Nick Rosbrook [Mon, 16 Dec 2019 18:08:00 +0000 (18:08 +0000)]
golang/xenlight: define Devid type as int

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
5 years ago golang/xenlight: define Defbool builtin type
Nick Rosbrook [Mon, 16 Dec 2019 18:07:59 +0000 (18:07 +0000)]
golang/xenlight: define Defbool builtin type

Define Defbool as a struct analogous to the C type, and define the type
'defboolVal' that represents true, false, and default defbool values.

Implement Set, Unset, SetIfDefault, IsDefault, Val, and String functions
on Defbool so that the type can be used in Go analogously to how it's
used in C.

Finally, implement fromC and toC functions.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
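
A tri-state boolean with the method set listed in this commit could look roughly like the sketch below. The numeric encoding (0 default, negative false, positive true), the error returned by Val on a default value, and the strings produced by String are assumptions made for illustration, not details taken from the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// defboolVal encodes the three states; the numeric values are assumed.
type defboolVal int

const (
	defboolDefault defboolVal = 0
	defboolFalse   defboolVal = -1
	defboolTrue    defboolVal = 1
)

// Defbool is a boolean that can also be "not set yet".
type Defbool struct{ val defboolVal }

func (d *Defbool) Set(b bool) {
	if b {
		d.val = defboolTrue
	} else {
		d.val = defboolFalse
	}
}

func (d *Defbool) Unset() { d.val = defboolDefault }

// SetIfDefault only assigns when no explicit value has been chosen.
func (d *Defbool) SetIfDefault(b bool) {
	if d.IsDefault() {
		d.Set(b)
	}
}

func (d Defbool) IsDefault() bool { return d.val == defboolDefault }

// Val reports the boolean value, or an error if still at the default.
func (d Defbool) Val() (bool, error) {
	if d.IsDefault() {
		return false, errors.New("defbool: value is default")
	}
	return d.val > 0, nil
}

func (d Defbool) String() string {
	switch {
	case d.val < 0:
		return "False"
	case d.val > 0:
		return "True"
	default:
		return "<default>"
	}
}

func main() {
	var d Defbool
	d.SetIfDefault(true)
	d.SetIfDefault(false) // no effect: already set
	v, err := d.Val()
	fmt.Println(d, v, err)
}
```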
5 years ago golang/xenlight: generate enum types from IDL
Nick Rosbrook [Mon, 16 Dec 2019 18:07:59 +0000 (18:07 +0000)]
golang/xenlight: generate enum types from IDL

Introduce gengotypes.py to generate Go code from the IDL. As a first step,
implement 'enum' type generation.

As a result of the newly-generated code, remove the existing, and now
conflicting definitions in xenlight.go. In the case of the Error type,
rename the slice 'errors' to 'libxlErrors' so that it does not conflict
with the standard library package 'errors'. Also, negate the values used
in 'libxlErrors' since the generated error values are negative.

Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
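
The negation mentioned in this commit follows from array indices having to be non-negative while the error constants are negative. A hedged sketch of the idea, with only three constants and message strings chosen for illustration:

```go
package main

import "fmt"

// Error mirrors a libxl-style negative error code.
type Error int

// A small illustrative subset; the generated code defines many more.
const (
	ErrorNonspecific Error = -1
	ErrorVersion     Error = -2
	ErrorFail        Error = -3
)

// libxlErrors is indexed by the negated (hence positive) error value.
var libxlErrors = [...]string{
	-ErrorNonspecific: "Non-specific error",
	-ErrorVersion:     "Wrong version",
	-ErrorFail:        "Failed",
}

// Error implements the error interface, falling back to the raw number
// for codes that have no table entry.
func (e Error) Error() string {
	if i := -int(e); i > 0 && i < len(libxlErrors) {
		return libxlErrors[i]
	}
	return fmt.Sprintf("libxl error: %d", int(e))
}

func main() {
	var err error = ErrorFail
	fmt.Println(err)
	fmt.Println(Error(-100))
}
```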
5 years ago x86emul: correct far branch handling for 64-bit mode
Jan Beulich [Mon, 16 Dec 2019 16:37:09 +0000 (17:37 +0100)]
x86emul: correct far branch handling for 64-bit mode

AMD and friends explicitly specify that 64-bit operands aren't possible
for these insns. Nevertheless REX.W isn't fully ignored: It still
cancels a possible operand size override (0x66). Intel otoh explicitly
provides for 64-bit operands on the respective insn page of the SDM.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago public/io/ring.h: add FRONT/BACK_RING_ATTACH macros
Paul Durrant [Mon, 16 Dec 2019 16:36:37 +0000 (17:36 +0100)]
public/io/ring.h: add FRONT/BACK_RING_ATTACH macros

The version of this header present in the Linux source tree has contained
such macros for some time. These macros, as the names imply, allow front
or back rings to be set up for existent (rather than freshly created and
zeroed) shared rings.

This patch is to update this, the canonical version of the header, to
match the latest definition of these macros in the Linux source.

NOTE: The way the new macros are defined allows the FRONT/BACK_RING_INIT
      macros to be re-defined in terms of them, thereby reducing
      duplication.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
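
The relationship between "attach" and "init" noted in this commit can be illustrated outside C. The Go sketch below is only an analogy for the front-ring case: the struct layout, field names, and the choice of starting index are assumptions, and the real macros operate on typed shared-ring structures.

```go
package main

import "fmt"

// sharedRing stands in for the shared page; only the producer indices
// matter for this sketch.
type sharedRing struct {
	reqProd uint32
	rspProd uint32
}

// frontRing holds the frontend's private view of a shared ring.
type frontRing struct {
	reqProdPvt uint32
	rspCons    uint32
	nrEnts     uint32
	sring      *sharedRing
}

// attach binds r to an existing shared ring, resuming at index idx
// instead of assuming the ring is freshly zeroed.
func (r *frontRing) attach(s *sharedRing, idx, nrEnts uint32) {
	r.reqProdPvt = idx
	r.rspCons = idx
	r.nrEnts = nrEnts
	r.sring = s
}

// initRing is the "fresh ring" case, expressed in terms of attach.
func (r *frontRing) initRing(s *sharedRing, nrEnts uint32) {
	r.attach(s, 0, nrEnts)
}

func main() {
	s := &sharedRing{reqProd: 7, rspProd: 7}
	var r frontRing
	r.attach(s, s.rspProd, 32)
	fmt.Println(r.reqProdPvt, r.rspCons, r.nrEnts)
}
```

Defining the init path as attach with index 0 is what removes the duplication the NOTE refers to.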
5 years ago x86emul: correct LFS et al handling for 64-bit mode
Jan Beulich [Mon, 16 Dec 2019 16:35:50 +0000 (17:35 +0100)]
x86emul: correct LFS et al handling for 64-bit mode

AMD and friends explicitly specify that 64-bit operands aren't possible
for these insns. Nevertheless REX.W isn't fully ignored: It still
cancels a possible operand size override (0x66). Intel otoh explicitly
provides for 64-bit operands on the respective insn page of the SDM.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86emul: correct segment override decode for 64-bit mode
Jan Beulich [Mon, 16 Dec 2019 16:34:46 +0000 (17:34 +0100)]
x86emul: correct segment override decode for 64-bit mode

The legacy / compatibility mode ES, CS, SS, and DS overrides are fully
ignored prefixes in 64-bit mode, i.e. they in particular don't cancel an
earlier FS or GS one. (They don't violate the REX-prefix-must-be-last
rule though.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/time: drop vtsc_{kern, user}count debug counters
Igor Druzhinin [Fri, 13 Dec 2019 22:48:01 +0000 (22:48 +0000)]
x86/time: drop vtsc_{kern, user}count debug counters

They either need to be transformed to atomics to work correctly
(currently they are left unprotected for HVM domains) or dropped entirely,
as taking a per-domain spinlock is too expensive for high-vCPU count
domains, even for debug builds, given how often this lock would be taken.

Choose the latter as they are not extremely important anyway.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/pv: Fix `global-pages` to match the documentation
Andrew Cooper [Mon, 16 Dec 2019 13:58:45 +0000 (13:58 +0000)]
x86/pv: Fix `global-pages` to match the documentation

c/s 5de961d9c09 "x86: do not enable global pages when virtualized on AMD or
Hygon hardware" in fact does enable them.  Fix the calculation in pge_init().

While fixing this, adjust the command line documentation, first to use the
newer style, and second to expand the description to discuss cases where the
option might be useful, but which Xen can't account for by default.

Fixes: 5de961d9c09 ('x86: do not enable global pages when virtualized on AMD or Hygon hardware')
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mm: More descriptive names for page de/validation functions
George Dunlap [Thu, 12 Dec 2019 15:57:51 +0000 (15:57 +0000)]
x86/mm: More descriptive names for page de/validation functions

The functions alloc_page_type(), alloc_lN_table(), free_page_type()
and free_lN_table() are confusingly named: nothing is being allocated
or freed.  Rather, the page being passed in is being either validated
or devalidated for use as the specific type; in the specific case of
pagetables, these may be promoted or demoted (i.e., grab appropriate
references for PTEs).

Rename alloc_page_type() and free_page_type() to validate_page() and
devalidate_page().  Also rename alloc_segdesc_page() to
validate_segdesc_page(), since this is what it's doing.

Rename alloc_lN_table() and free_lN_table() to promote_lN_table() and
demote_lN_table(), respectively.

After this change:
- get / put type consistently refers to increasing or decreasing the count
- validate / devalidate consistently refers to actions done when a
  type count goes 0 -> 1 or 1 -> 0
- promote / demote consistently refers to acquiring or freeing
  resources (in the form of type refs and general references) in order
  to allow a page to be used as a pagetable.

No functional change.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/mm: Use mfn_t in type get / put call tree
George Dunlap [Fri, 13 Dec 2019 14:09:46 +0000 (14:09 +0000)]
x86/mm: Use mfn_t in type get / put call tree

Replace `unsigned long` with `mfn_t` as appropriate throughout
alloc/free_lN_table, get/put_page_from_lNe, and
get_lN_linear_pagetable.  This obviates the need for a load of
`mfn_x()` and `_mfn()` casts.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/mm: Use a more descriptive name for pagetable mfns
George Dunlap [Fri, 13 Dec 2019 12:53:04 +0000 (12:53 +0000)]
x86/mm: Use a more descriptive name for pagetable mfns

In many places, a PTE being modified is accompanied by the pagetable
mfn which contains the PTE (primarily in order to be able to maintain
linear mapping counts).  In many cases, this mfn is stored in the
non-descript variable (or argument) "pfn".

Replace these names with lNmfn, to indicate that 1) this is a
pagetable mfn, and 2) that it is the same level as the PTE in
question.  This should be enough to remind readers that it's the mfn
containing the PTE.

No functional change.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago x86/mm: Implement common put_data_pages for put_page_from_l[23]e
George Dunlap [Fri, 13 Dec 2019 12:53:04 +0000 (12:53 +0000)]
x86/mm: Implement common put_data_pages for put_page_from_l[23]e

Both put_page_from_l2e and put_page_from_l3e handle having superpage
entries by looping over each page and "put"-ing each one individually.
As with putting page table entries, this code is functionally
identical, but for some reason different.  Moreover, there is already
a common function, put_data_page(), to handle automatically swapping
between put_page() (for read-only pages) or put_page_and_type() (for
read-write pages).

Replace this with put_data_pages() (plural), which does the entire
loop, as well as the put_page / put_page_and_type switch.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mm: Refactor put_page_from_l*e to reduce code duplication
George Dunlap [Fri, 13 Dec 2019 12:53:04 +0000 (12:53 +0000)]
x86/mm: Refactor put_page_from_l*e to reduce code duplication

put_page_from_l[234]e have identical functionality for devalidating an
entry pointing to a pagetable.  But mystifyingly, they duplicate the
code in slightly different arrangements that make it hard to tell that
it's the same.

Create a new function, put_pt_page(), which handles the common
functionality; and refactor all the functions to be symmetric,
differing only in the level of pagetable expected (and in whether they
handle superpages).

Other than put_page_from_l2e() gaining an ASSERT it probably should
have had already, no functional changes.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago public/io/netif.h: document a mechanism to advertise carrier state
Paul Durrant [Fri, 13 Dec 2019 16:39:44 +0000 (16:39 +0000)]
public/io/netif.h: document a mechanism to advertise carrier state

This patch adds a specification for a 'carrier' node in xenstore to allow
a backend to notify a frontend of its virtual carrier/link state. E.g.
a backend that is unable to forward packets from the guest because it is
not attached to a bridge may wish to advertise 'no carrier'.

While in the area also fix an erroneous backend path description.

NOTE: This is purely a documentation patch. No functional change.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
5 years ago Config.mk: Remove stray comment
Anthony PERARD [Thu, 12 Dec 2019 18:27:34 +0000 (18:27 +0000)]
Config.mk: Remove stray comment

This comment isn't about CONFIG_TESTS, but about SEABIOS_DIR that has
been removed.

Originally, the comment was added by 5f82d0858de1 ("tools: support
SeaBIOS. Use by default when upstream qemu is configured."), then
later the SEABIOS_DIR was removed by 14ee3c05f3ef ("Clone and build
Seabios by default") but that comment about the pain was left behind.
The commit that made CONFIG_TESTS painful was 85896a7c4dc7 ("build:
add autoconf to replace custom checks in tools/check").

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago Config.mk: Remove unused setvar_dir macro
Anthony PERARD [Thu, 12 Dec 2019 18:27:33 +0000 (18:27 +0000)]
Config.mk: Remove unused setvar_dir macro

And remove all mention of it in docs. It hasn't been used since
9ead9afcb935 ("Add configure --with-sysconfig-leaf-dir=SUBDIR to set
CONFIG_LEAF_DIR").

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years ago build: fix tools/configure in case only python3 exists
Juergen Gross [Wed, 11 Dec 2019 16:56:59 +0000 (17:56 +0100)]
build: fix tools/configure in case only python3 exists

Calling ./configure with python3 being there but no python,
tools/configure will fail. Fix that by defaulting to python and
falling back to python3 or python2.

While at it fix the use of non portable "type -p" by replacing it by
AC_PATH_PROG().

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>
[ wei: run autogen.sh ]
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
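
The lookup order described in this commit (prefer python, then fall back to python3 or python2) can be sketched in Go for illustration. This is not how configure implements it (that is autoconf's AC_PATH_PROG); pickPython and errNotFound are hypothetical names, and the lookup function is injected so the logic can be exercised without touching the real PATH.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// errNotFound is returned when none of the candidates can be located.
var errNotFound = errors.New("no python interpreter found")

// pickPython tries each candidate name in priority order and returns
// the path of the first one that lookPath can resolve.
func pickPython(lookPath func(string) (string, error)) (string, error) {
	for _, name := range []string{"python", "python3", "python2"} {
		if path, err := lookPath(name); err == nil {
			return path, nil
		}
	}
	return "", errNotFound
}

func main() {
	path, err := pickPython(exec.LookPath)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("using", path)
}
```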
5 years ago AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables
Andrew Cooper [Wed, 11 Dec 2019 13:55:32 +0000 (14:55 +0100)]
AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables

update_paging_mode() has multiple bugs:

 1) Booting with iommu=debug will cause it to inform you that it was called
    without the pdev_list lock held.
 2) When growing by more than a single level, it leaks the newly allocated
    table(s) in the case of a further error.

Furthermore, the choice of default level for a domain has issues:

 1) All HVM guests grow from 2 to 3 levels during construction because of the
    position of the VRAM just below the 4G boundary, so defaulting to 2 is a
    waste of effort.
 2) The limit for PV guests doesn't take memory hotplug into account, and
    isn't dynamic at runtime like HVM guests.  This means that a PV guest may
    get RAM which it can't map in the IOMMU.

The dynamic height is a property unique to AMD, and adds a substantial
quantity of complexity for what is a marginal performance improvement.  Remove
the complexity by removing the dynamic height.

PV guests now get 3 or 4 levels based on any hotplug regions in the host.
This only makes a difference for hardware which previously had all RAM below
the 512G boundary, and a hotplug region above.

HVM guests now get 4 levels (which will be sufficient until 256TB guests
become a thing), because we don't currently have the information to know when
3 would be safe to use.

The overhead of this extra level is not expected to be noticeable.  It costs
one page (4k) per domain, and one extra IO-TLB paging structure cache entry
which is very hot and less likely to be evicted.

This is XSA-311.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
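
The 4G, 512G, and 256TB figures in this commit follow from simple arithmetic if one assumes 4k pages (12 offset bits) and 512-entry tables (9 bits per level). The sketch below works that out; levelsFor is a hypothetical helper for illustration, not a Xen function.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	pageShift    = 12 // 4k pages
	bitsPerLevel = 9  // 512 entries per table (assumed)
)

// levelsFor returns the smallest pagetable height able to map every
// address up to and including maxAddr.
func levelsFor(maxAddr uint64) int {
	need := bits.Len64(maxAddr) // address bits required
	levels := 1
	for pageShift+bitsPerLevel*levels < need {
		levels++
	}
	return levels
}

func main() {
	fmt.Println("just below 4G:   ", levelsFor(4<<30-1))
	fmt.Println("just below 512G: ", levelsFor(512<<30-1))
	fmt.Println("at 512G:         ", levelsFor(512<<30))
	fmt.Println("just below 256TB:", levelsFor(256<<40-1))
}
```

Under these assumptions two levels cover 30 bits (1G), three cover 39 bits (512G), and four cover 48 bits (256TB), which matches the boundaries quoted in the commit message.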
5 years ago x86/mm: relinquish_memory: Grab an extra type ref when setting PGT_partial
George Dunlap [Mon, 28 Oct 2019 14:33:51 +0000 (14:33 +0000)]
x86/mm: relinquish_memory: Grab an extra type ref when setting PGT_partial

The PGT_partial bit in page->type_info holds both a type count and a
general ref count.  During domain tear-down, when free_page_type()
returns -ERESTART, relinquish_memory() correctly handles the general
ref count, but fails to grab an extra type count when setting
PGT_partial.  When this bit is eventually cleared, type_count underflows
and triggers the following BUG in page_alloc.c:free_domheap_pages():

    BUG_ON((pg[i].u.inuse.type_info & PGT_count_mask) != 0);

As far as we can tell, this page underflow cannot be exploited in any
other way: The page can't be used as a pagetable by the dying domain
because it's dying; it can't be used as a pagetable by any other
domain since it belongs to the dying domain; and ownership can't
transfer to any other domain without hitting the BUG_ON() in
free_domheap_pages().

(steal_page() won't work on a page in this state, since it requires
PGC_allocated to be set, and PGC_allocated will already have been
cleared.)

Fix this by grabbing an extra type ref if setting PGT_partial in
relinquish_memory.

This is part of XSA-310.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mm: alloc/free_lN_table: Retain partial_flags on -EINTR
George Dunlap [Thu, 31 Oct 2019 11:17:38 +0000 (11:17 +0000)]
x86/mm: alloc/free_lN_table: Retain partial_flags on -EINTR

When validating or de-validating pages (in alloc_lN_table and
free_lN_table respectively), the `partial_flags` local variable is
used to keep track of whether the "current" PTE started the entire
operation in a "may be partial" state.

One of the patches in XSA-299 addressed the fact that it is possible
for a previously-partially-validated entry to subsequently be found to
have invalid entries (indicated by returning -EINVAL); in which case
page->partial_flags needs to be set to indicate that the current PTE
may have the partial bit set (and thus _put_page_type() should be
called with PTF_partial_set).

Unfortunately, the patches in XSA-299 assumed that once
put_page_from_lNe() returned -ERESTART on a page, it was not possible
for it to return -EINTR.  This turns out to be true for
alloc_lN_table() and free_lN_table(), but not for _get_page_type() and
_put_page_type(): both can return -EINTR when called on pages with
PGT_partial set.  In these cases, the page's PGT_partial bit will still be
set; failing to set partial_flags appropriately may allow an attacker
to do a privilege escalation similar to those described in XSA-299.

Fix this by always copying the local partial_flags variable into
page->partial_flags when exiting early.

NB that on the "get" side, no adjustment to nr_validated_entries is
needed: whether pte[i] is partially validated or entirely
un-validated, we want nr_validated_entries = i.  On the "put" side,
however, we need to adjust nr_validated_entries appropriately: if
pte[i] is entirely validated, we want nr_validated_entries = i + 1; if
pte[i] is partially validated, we want nr_validated_entries = i.

This is part of XSA-310.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
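
The nr_validated_entries rule in the NB paragraph of this commit is small enough to write down directly. The sketch below is only a restatement of that rule with hypothetical function names; it models none of the surrounding Xen state.

```go
package main

import "fmt"

// nrValidatedOnGet: when validation is interrupted at pte[i], entry i is
// not counted, whether it is partially validated or entirely unvalidated.
func nrValidatedOnGet(i int) int {
	return i
}

// nrValidatedOnPut: when de-validation is interrupted at pte[i], entry i
// still counts if it is entirely validated, but not if it is only
// partially validated.
func nrValidatedOnPut(i int, partial bool) int {
	if partial {
		return i
	}
	return i + 1
}

func main() {
	fmt.Println(nrValidatedOnGet(5))
	fmt.Println(nrValidatedOnPut(5, false))
	fmt.Println(nrValidatedOnPut(5, true))
}
```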
5 years ago x86/mm: Set old_guest_table when destroying vcpu pagetables
George Dunlap [Tue, 19 Nov 2019 11:40:34 +0000 (11:40 +0000)]
x86/mm: Set old_guest_table when destroying vcpu pagetables

Changeset 6c4efc1eba ("x86/mm: Don't drop a type ref unless you held a
ref to begin with"), part of XSA-299, changed the calling discipline
of put_page_type() such that if put_page_type() returned -ERESTART
(indicating a partially de-validated page), subsequent calls to
put_page_type() must be called with PTF_partial_set.  If called on a
partially de-validated page but without PTF_partial_set, Xen will
BUG(), because to do otherwise would risk opening up the kind of
privilege escalation bug described in XSA-299.

One place this was missed was in vcpu_destroy_pagetables().
put_page_and_type_preemptible() is called, but on -ERESTART, the
entire operation is simply restarted, causing put_page_type() to be
called on a partially de-validated page without PTF_partial_set.  The
result was that if such an operation were interrupted, Xen would hit a
BUG().

Fix this by having vcpu_destroy_pagetables() consistently pass off
interrupted de-validations to put_old_page_type():
- Unconditionally clear references to the page, even if
  put_page_and_type failed
- Set old_guest_table and old_guest_table_partial appropriately

While here, do some refactoring:

 - Move clearing of arch.cr3 to the top of the function

 - Now that clearing is unconditional, move the unmap to the same
   conditional as the l4tab mapping.  This also allows us to reduce
   the scope of the l4tab variable.

 - Avoid code duplication by looping to drop references on
   guest_table_user

This is part of XSA-310.

Reported-by: Sarah Newman <srn@prgmr.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years ago x86/mm: Don't reset linear_pt_count on partial validation
George Dunlap [Wed, 30 Oct 2019 17:05:28 +0000 (17:05 +0000)]
x86/mm: Don't reset linear_pt_count on partial validation

"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or at another pagetable of the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).

XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both.  To implement this, we keep track of the number of
outstanding times a page points to, or is pointed to by, another page
table, to prevent both from happening at the same time.

Unfortunately, the original commit introducing this reset this count
when resuming validation of a partially-validated pagetable, dropping
some "linear_pt_entry" counts.

On debug builds on systems where guests used this feature, this might
lead to crashes that look like this:

    Assertion 'oc > 0' failed at mm.c:874

Worse, if an attacker could engineer such a situation to occur, they
might be able to make loops or other arbitrary chains of linear
pagetables, leading to the denial-of-service situation outlined in
XSA-240.

This is XSA-309.

Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
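
The "point to or be pointed to, but not both" rule from this commit can be modelled with a single signed counter. The sketch below is a simplified, single-threaded illustration with hypothetical names; the sign convention is an assumption, and real hypervisor code would need atomic updates.

```go
package main

import "fmt"

// page tracks linear pagetable relationships with one signed counter:
//   > 0: number of linear entries this page holds (it points to others)
//   < 0: number of times other pages point at this page
type page struct {
	linearPtCount int
}

// incLinearEntries records that this page points at another pagetable.
// It fails if the page is already being pointed to.
func (p *page) incLinearEntries() bool {
	if p.linearPtCount < 0 {
		return false
	}
	p.linearPtCount++
	return true
}

// decLinearEntries undoes incLinearEntries; the count must be positive.
func (p *page) decLinearEntries() {
	if p.linearPtCount <= 0 {
		panic("assertion failed: count > 0")
	}
	p.linearPtCount--
}

// incLinearUses records that another pagetable points at this page.
// It fails if the page itself already holds linear entries.
func (p *page) incLinearUses() bool {
	if p.linearPtCount > 0 {
		return false
	}
	p.linearPtCount--
	return true
}

func main() {
	var p page
	fmt.Println(p.incLinearEntries()) // page now points at another table
	fmt.Println(p.incLinearUses())    // refused: would allow a chain
	p.decLinearEntries()
	fmt.Println(p.incLinearUses()) // allowed once the entry is gone
}
```

In this model, zeroing the counter while entries are still outstanding would make a later decLinearEntries trip its check, which is the kind of failure the assertion quoted in the commit reports.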
5 years ago x86/vtx: Work around SingleStep + STI/MovSS VMEntry failures
Andrew Cooper [Wed, 11 Dec 2019 13:09:30 +0000 (14:09 +0100)]
x86/vtx: Work around SingleStep + STI/MovSS VMEntry failures

See patch comment for technical details.

Concerning the timeline, this was first discovered in the aftermath of
XSA-156 which caused #DB to be intercepted unconditionally, but only in
its SingleStep + STI form which is restricted to privileged software.

After working with Intel and identifying the problematic vmentry check,
this workaround was suggested, and the patch was posted in an RFC
series.  Outstanding work for that series (not breaking Introspection)
is still pending, and this fix from it (which wouldn't have been good
enough in its original form) wasn't committed.

A vmentry failure was reported to xen-devel, and debugging identified
this bug in its SingleStep + MovSS form by way of INT1, which does not
involve the use of any privileged instructions, and proving this to be a
security issue.

This is XSA-308.

Reported-by: Håkon Alstadheim <hakon@alstadheim.priv.no>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
5 years ago x86+Arm32: make find_next_{,zero_}bit() have well defined behavior
Jan Beulich [Wed, 11 Dec 2019 13:06:18 +0000 (14:06 +0100)]
x86+Arm32: make find_next_{,zero_}bit() have well defined behavior

These functions getting used with the 2nd and 3rd arguments being equal
wasn't well defined: Arm64 reliably returns the value of the 2nd
argument in this case, while on x86 for bitmaps up to 64 bits wide the
return value was undefined (due to the undefined behavior of a shift of
a value by the number of bits it's wide) when the incoming value was 64.
On Arm32 an actual out of bounds access would happen when the
size/offset value is a multiple of 32; if this access doesn't fault, the
return value would have been sufficiently correct afaict.

Make the functions consistently tolerate the last two arguments being
equal (and in fact the 3rd argument being greater or equal to the 2nd),
in favor of finding and fixing all the use sites that violate the
original more strict assumption.
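
A minimal sketch of the tolerated semantics, not the actual Xen code: a
portable bit search that returns the size whenever the offset is greater than
or equal to it, without a full-width shift and without touching memory past
the bitmap. The names ex_find_next_bit and EX_BITS_PER_LONG are hypothetical.

```c
#include <limits.h>
#include <stddef.h>

#define EX_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical illustration: return the index of the first set bit at or
 * after 'offset', or 'size' if there is none.  offset >= size is tolerated.
 */
static unsigned int ex_find_next_bit(const unsigned long *map,
                                     unsigned int size, unsigned int offset)
{
    /* Bail out early: no memory access and no shift by the word width. */
    if (offset >= size)
        return size;

    for (unsigned int i = offset; i < size; i++) {
        unsigned long word = map[i / EX_BITS_PER_LONG];

        if (word & (1UL << (i % EX_BITS_PER_LONG)))
            return i;
    }

    return size;
}
```

With a single word holding 0x10 and size 32, a search from offset 0 finds bit
4, while offsets 5, 32 and 40 all yield 32.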

This is XSA-307.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien@xen.org>
5 years agoConfig.mk: update seabios to 1.13.0
Wei Liu [Wed, 11 Dec 2019 12:02:26 +0000 (12:02 +0000)]
Config.mk: update seabios to 1.13.0

Signed-off-by: Wei Liu <wl@xen.org>
5 years agox86: add a comment regarding the location of hypervisor_probe
Wei Liu [Wed, 11 Dec 2019 11:33:03 +0000 (11:33 +0000)]
x86: add a comment regarding the location of hypervisor_probe

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agoSUPPORT.md: add core scheduling
Juergen Gross [Wed, 11 Dec 2019 08:45:49 +0000 (09:45 +0100)]
SUPPORT.md: add core scheduling

Add core scheduling feature to SUPPORT.md.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agodocs/sphinx: How Xen Boots on x86
Andrew Cooper [Sat, 19 Oct 2019 19:12:44 +0000 (12:12 -0700)]
docs/sphinx: How Xen Boots on x86

Begin to document how the x86 build of Xen boots.  It is by no means complete,
but is a start.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agoxen/build: Automatically locate a suitable python interpreter
Andrew Cooper [Sat, 7 Dec 2019 15:50:22 +0000 (15:50 +0000)]
xen/build: Automatically locate a suitable python interpreter

Needing to pass PYTHON=python3 into hypervisor builds is irritating and
unnecessary.  Locate a suitable interpreter automatically, defaulting to Py3
if it is available.

Reported-by: Steven Haigh <netwiz@crc.id.au>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoxen/banner: Drop the fig-to-oct.py script
Andrew Cooper [Sat, 7 Dec 2019 17:45:10 +0000 (17:45 +0000)]
xen/banner: Drop the fig-to-oct.py script

The script is 664 rather than 775, so the banner conversion doesn't actually
work if $(PYTHON) is empty:

  /bin/sh: tools/fig-to-oct.py: Permission denied
  make[3]: *** [include/xen/compile.h] Error 126
  make[3]: Leaving directory `/builds/xen-project/people/andyhhp/xen/xen'

Fixing this is easy, but using python here is wasteful.  compile.h doesn't
need XEN_BANNER rendering in octal, and text is much more simple to handle.
Replace fig-to-oct.py with a smaller sed script.  This could be a shell
one-liner, but it is much more simple to comment sensibly, and doesn't need to
include the added cognitive load of makefile and shell escaping.

While changing this logic, take the opportunity to optimise the banner
space (and time on the serial port) by dropping trailing whitespace, which is
84 characters for current staging.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years agoxen/flask: Drop the gen-policy.py script
Andrew Cooper [Sat, 7 Dec 2019 16:20:55 +0000 (16:20 +0000)]
xen/flask: Drop the gen-policy.py script

The script is Python 2 specific, and fails with string/binary issues with
Python 3:

  Traceback (most recent call last):
    File "gen-policy.py", line 14, in <module>
      for char in sys.stdin.read():
    File "/usr/lib/python3.5/codecs.py", line 321, in decode
      (result, consumed) = self._buffer_decode(data, self.errors, final)
  UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8c in position 0: invalid start byte

Fixing the script to be compatible isn't hard, but using python here is
wasteful.  Drop the script entirely, and write an equivalent flask-policy.S
instead.  This removes the need for a $(PYTHON) and $(CC) pass.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Julien Grall <julien@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoremove myself as vm_event maintainer
Razvan Cojocaru [Tue, 10 Dec 2019 10:34:33 +0000 (11:34 +0100)]
remove myself as vm_event maintainer

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
5 years agox86: do not enable global pages when virtualized on AMD or Hygon hardware
Roger Pau Monné [Tue, 10 Dec 2019 10:34:00 +0000 (11:34 +0100)]
x86: do not enable global pages when virtualized on AMD or Hygon hardware

When using global pages a full tlb flush can only be performed by
toggling the PGE bit in CR4, which is usually quite expensive in terms
of performance when running virtualized. This is especially relevant on
AMD or Hygon hardware, which doesn't have the ability to do selective
CR4 trapping, but can also be relevant on e.g. Intel if the underlying
hypervisor also traps accesses to the PGE CR4 bit.

In order to avoid this performance penalty, do not use global pages
when running virtualized on AMD or Hygon hardware. A command line option
'global-pages' is provided in order to allow the user to select whether
global pages will be enabled for PV guests.

The figures below are from a PV shim running on AMD hardware with
32 vCPUs:

PGE enabled, x2APIC mode:

(XEN) Global lock flush_lock: addr=ffff82d0804b01c0, lockval=1adb1adb, not locked
(XEN)   lock:1841883(1375128998543), block:1658716(10193054890781)

Average lock time:   746588ns
Average block time: 6145147ns

PGE disabled, x2APIC mode:

(XEN) Global lock flush_lock: addr=ffff82d0804af1c0, lockval=a8bfa8bf, not locked
(XEN)   lock:2730175(657505389886), block:2039716(2963768247738)

Average lock time:   240829ns
Average block time: 1453029ns

As seen from the above figures the lock and block time of the flush
lock is reduced to approximately 1/3 of the original value.
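
As a worked check of the averages quoted above, the sketch below assumes that
each "count(total)" pair in the lock profile output is an event count followed
by the accumulated time in nanoseconds; that reading is an assumption, but it
reproduces the quoted numbers. The helper name ex_avg_ns is hypothetical.

```c
#include <stdint.h>

/* Average time per event in ns: accumulated time divided by event count. */
static uint64_t ex_avg_ns(uint64_t total_ns, uint64_t count)
{
    return count ? total_ns / count : 0;
}
```

For example, 1375128998543 / 1841883 gives 746588 and 657505389886 / 2730175
gives 240829, so the lock time ratio is roughly 0.32 and the block time ratio
(1453029 / 6145147) roughly 0.24.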

Note that XEN_MINIMAL_CR4 and mmu_cr4_features are not modified, and
thus global pages are left enabled for the hypervisor. This is not an
issue because the code to switch the control registers (cr3 and cr4)
already takes into account such situation and performs the necessary
flushes. The same already happens when using XPTI or PCIDE, as the
guest cr4 doesn't have global pages enabled in that case either.

Also note that the suspend and resume code is correct in writing
mmu_cr4_features into cr4 on resume, since that's the cr4 used by the
idle vCPU which is the context used by the suspend and resume routine.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/AMD: unbreak CPU hotplug on AMD systems without RstrFpErrPtrs
Igor Druzhinin [Tue, 10 Dec 2019 10:07:22 +0000 (11:07 +0100)]
x86/AMD: unbreak CPU hotplug on AMD systems without RstrFpErrPtrs

If the feature is not present Xen will try to force X86_BUG_FPU_PTRS
feature at CPU identification time. This is especially noticeable in
PV-shim that usually hotplugs its vCPUs. We either need to restrict this
action for boot CPU only or allow secondary CPUs to modify
forced CPU capabilities at runtime. Choose the former since modifying
forced capabilities out of boot path leaves the system in potentially
inconsistent state.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/i8259A: don't open-code LEGACY_VECTOR()
Jan Beulich [Mon, 9 Dec 2019 13:03:01 +0000 (14:03 +0100)]
x86/i8259A: don't open-code LEGACY_VECTOR()

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agolz4: fix system halt at boot kernel on x86_64
Krzysztof Kolasa [Mon, 9 Dec 2019 13:02:35 +0000 (14:02 +0100)]
lz4: fix system halt at boot kernel on x86_64

Sometimes, on x86_64, decompression fails with the following
error:

Decompressing Linux...

Decoding failed

 -- System halted

This condition is not needed for a 64bit kernel (from commit d5e7caf):

if( ... ||
    (op + COPYLENGTH) > oend)
    goto _output_error

macro LZ4_SECURE_COPY() tests op and does not copy any data
when op exceeds the value.

added by analogy to lz4_uncompress_unknownoutputsize(...)

Signed-off-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
[Linux commit 99b7e93c95c78952724a9783de6c78def8fbfc3f]

The offending commit in our case is fcc17f96c277 ("LZ4 : fix the data
abort issue").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agolz4: pull out constant tables
Rasmus Villemoes [Mon, 9 Dec 2019 13:01:56 +0000 (14:01 +0100)]
lz4: pull out constant tables

There's no reason to allocate the dec{32,64}table on the stack; it
just wastes a bunch of instructions setting them up and, of course,
also consumes quite a bit of stack. Using size_t for such small
integers is a little excessive.

$ scripts/bloat-o-meter /tmp/built-in.o lib/built-in.o
add/remove: 2/2 grow/shrink: 2/0 up/down: 1304/-1548 (-244)
function                                     old     new   delta
lz4_decompress_unknownoutputsize              55     718    +663
lz4_decompress                                55     632    +577
dec64table                                     -      32     +32
dec32table                                     -      32     +32
lz4_uncompress                               747       -    -747
lz4_uncompress_unknownoutputsize             801       -    -801

The now inlined lz4_uncompress functions used to have a stack
footprint of 176 bytes (according to -fstack-usage); their inlinees
have increased their stack use from 32 bytes to 48 and 80 bytes,
respectively.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
[Linux commit bea2b592fd18eb8ffa3fc4ad380610632d03a38f]

Use {,u}int8_t instead of plain "int" for the tables.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agolz4: refine commit 9143a6c55ef7 for the 64-bit case
Jan Beulich [Mon, 9 Dec 2019 13:01:25 +0000 (14:01 +0100)]
lz4: refine commit 9143a6c55ef7 for the 64-bit case

I clearly went too far there: While the LZ4_WILDCOPY() instances indeed
need prior guarding, LZ4_SECURECOPY() needs this only in the 32-bit case
(where it simply aliases LZ4_WILDCOPY()). "cpy" can validly point
(slightly) below "op" in these cases, due to

cpy = op + length - (STEPSIZE - 4);

where length can be as low as 0 and STEPSIZE is 8. However, instead of
removing the check via "#if !LZ4_ARCH64", refine it such that it would
also properly work in the 64-bit case, aborting decompression instead
of continuing on bogus input.
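
The arithmetic behind "cpy" ending up below "op" can be sketched as follows.
This only models the offset computation quoted above, with STEPSIZE fixed at 8
for the 64-bit case; EX_STEPSIZE and ex_cpy_offset are hypothetical names, and
the refined check itself is not reproduced here.

```c
#include <stddef.h>

#define EX_STEPSIZE 8  /* 64-bit case, as stated in the text above */

/* Offset of "cpy" relative to "op": length - (STEPSIZE - 4). */
static ptrdiff_t ex_cpy_offset(size_t length)
{
    return (ptrdiff_t)length - (EX_STEPSIZE - 4);
}
```

A length of 0 yields -4, i.e. "cpy" sits four bytes below "op"; only from a
length of 4 onwards is the offset non-negative.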

Reported-by: Mark Pryor <pryorm09@gmail.com>
Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Mark Pryor <pryorm09@gmail.com>
Tested-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/CPUID: RSTR_FP_ERR_PTRS depends on FPU
Jan Beulich [Mon, 9 Dec 2019 13:00:15 +0000 (14:00 +0100)]
x86/CPUID: RSTR_FP_ERR_PTRS depends on FPU

There's nothing to restore here if there's no FPU in the first place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agodocs/sphinx: License content with CC-BY-4.0
Andrew Cooper [Wed, 11 Sep 2019 19:12:31 +0000 (20:12 +0100)]
docs/sphinx: License content with CC-BY-4.0

Creative Commons is a more common license than GPL for documentation purposes.
Switch to using CC-BY-4.0 to explicitly permit re-purposing and remixing of
the content.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Lars Kurth <lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86: support Atom Tremont
Jan Beulich [Fri, 6 Dec 2019 10:02:48 +0000 (11:02 +0100)]
x86: support Atom Tremont

Add model 0x86 to relevant switch() statements, as per SDM 069 Vol 4.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86: don't offer Hyper-V option when "PV Shim Exclusive"
Jan Beulich [Fri, 6 Dec 2019 10:02:09 +0000 (11:02 +0100)]
x86: don't offer Hyper-V option when "PV Shim Exclusive"

This only added dead code. Use "if" instead of "depends on" to make
(halfway) clear that other guest options should also go in the same
block. Move the option down such that the shim related options get
presented first, to avoid asking the question when the answer may end
up being discarded.

While in the neighborhood also bring PV_SHIM_EXCLUSIVE into more
"canonical" shape.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86/nEPT: ditch nept_sp_entry()
Jan Beulich [Fri, 6 Dec 2019 10:01:18 +0000 (11:01 +0100)]
x86/nEPT: ditch nept_sp_entry()

It's bogusly non-static. It making the call sites actually less easy to
read, and there being another open-coded use in the file - let's just
get rid of it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
5 years agox86/svm: Use named (bit)fields for task switch exit info
Andrew Cooper [Tue, 3 Dec 2019 16:57:52 +0000 (16:57 +0000)]
x86/svm: Use named (bit)fields for task switch exit info

Introduce vmcb.ei.* to provide names to fields in exitinfo{1,2}.  Implement
the task switch names for now, and clean up the TASK_SWITCH handler.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/svm: Clean up intinfo_t variables
Andrew Cooper [Mon, 25 Nov 2019 13:29:20 +0000 (13:29 +0000)]
x86/svm: Clean up intinfo_t variables

The type name is poor because the type is also used for the IDT vectoring
field, not just for the event injection field.  Rename it to intinfo_t which
is how the APM refers to the data.

Rearrange the union to drop the .fields infix, and rename bytes to the more
common raw.  Also take the opportunity to rename the fields in the VMCB to
increase legibility.

While adjusting all call sites, fix up style issues and make use of structure
assignments where applicable.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/svm: Don't shadow variables in svm_vmexit_handler()
Andrew Cooper [Mon, 25 Nov 2019 13:29:20 +0000 (13:29 +0000)]
x86/svm: Don't shadow variables in svm_vmexit_handler()

The local variable eventinj is set to the value of vmcb->exitintinfo which is
confusing considering that it isn't vmcb->eventinj.  The variable isn't
necessary to begin with, so drop it to avoid confusion.

A local rc variable is shadowed in the CPUID, #DB and #BP handlers.

There is a mix of spelling of inst_len and insn_len, all of which are
logically the same value.  Consolidate on insn_len which also matches the name
of the emulation functions for obtaining instruction lengths, and avoid
shadowing it in the CPUID and TASK_SWITCH handlers.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/svm: Clean up construct_vmcb()
Andrew Cooper [Mon, 25 Nov 2019 13:29:20 +0000 (13:29 +0000)]
x86/svm: Clean up construct_vmcb()

The vmcb is zeroed on allocate - drop all explicit writes of 0.  Move
hvm_update_guest_efer() to co-locate it with the other control register
updates.

Move the BUILD_BUG_ON() into build_assertions(), and add some offset checks
for fields after the large blocks of reserved fields (as these are the most
likely to trigger from a mis-edit).  Take the opportunity to fold 6 adjacent
res* fields into one.

Finally, drop all trailing whitespace in the file.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/svm: Fix handling of EFLAGS.RF on task switch
Andrew Cooper [Tue, 3 Dec 2019 16:59:09 +0000 (16:59 +0000)]
x86/svm: Fix handling of EFLAGS.RF on task switch

VT-x updates RF before vmexit, so eflags written into the outgoing TSS happens
to be correct.  SVM does not update RF before vmexit, and instead provides it
via a bit in exitinfo2.

In practice, needing RF set in the outgoing state occurs when a task gate is
used to handle faults.

Extend hvm_task_switch() with an extra_eflags parameter which gets fed into
the outgoing TSS, and fill it in suitably from the SVM vmexit information.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/svm: Minor cleanup to start_svm()
Andrew Cooper [Mon, 25 Nov 2019 13:29:20 +0000 (13:29 +0000)]
x86/svm: Minor cleanup to start_svm()

The function is init, so can use boot_cpu_data directly.

There is no need to write 0 to svm_feature_flags in the case of a CPUID
mismatch (not least because this is dead code on real hardware), and no need
to use locked bit operations.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agopassthrough: drop break statement following c/s cd7dedad820
Igor Druzhinin [Thu, 5 Dec 2019 12:31:03 +0000 (13:31 +0100)]
passthrough: drop break statement following c/s cd7dedad820

The locking responsibilities have changed and a premature break in
this section now causes the following assertion:

Assertion '!preempt_count()' failed at preempt.c:36

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86/svm: Correct vm_event API for descriptor accesses
Andrew Cooper [Thu, 28 Nov 2019 11:28:51 +0000 (11:28 +0000)]
x86/svm: Correct vm_event API for descriptor accesses

c/s d0a699a389f1 "x86/monitor: add support for descriptor access events"
introduced logic looking for what appeared to be exitinfo (not that this
exists in SVM - exitinfo1 or 2 do), but actually passed the exit IDT vectoring
information.  There is never any IDT vectoring involved in these intercepts so
the value passed is always zero.

In fact, SVM doesn't provide any information, even in exitinfo1 and 2.  Drop
the svm struct entirely, and bump the interface version.

In the SVM vmexit handler itself, optimise the switch statement by observing
that there is a linear transformation between the SVM exit_reason and
VM_EVENT_DESC_* values.  (Bloat-o-meter reports 6028 => 5877 for a saving of
151 bytes).
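
One way such a linear transformation could look is sketched below. The
numeric exit codes and descriptor ids are assumptions written down from
memory rather than taken from this commit, and all EX_* and ex_* names are
hypothetical; the point is only that a consecutive block of eight exit codes
can be decoded with a subtraction and two masks instead of a switch.

```c
#include <stdbool.h>

/* Assumed: eight consecutive exit codes, four reads followed by four writes. */
enum {
    EX_VMEXIT_IDTR_READ = 0x66,
    EX_VMEXIT_GDTR_READ,
    EX_VMEXIT_LDTR_READ,
    EX_VMEXIT_TR_READ,
    EX_VMEXIT_IDTR_WRITE,
    EX_VMEXIT_GDTR_WRITE,
    EX_VMEXIT_LDTR_WRITE,
    EX_VMEXIT_TR_WRITE,
};

/* Assumed descriptor ids, in the same order as the exit codes. */
enum { EX_DESC_IDTR = 1, EX_DESC_GDTR, EX_DESC_LDTR, EX_DESC_TR };

/* Low two bits of the distance from the first code select the descriptor. */
static unsigned int ex_exit_to_desc(unsigned int exit_reason)
{
    return ((exit_reason - EX_VMEXIT_IDTR_READ) & 3) + 1;
}

/* Bit 2 of the distance distinguishes writes from reads. */
static bool ex_exit_is_write(unsigned int exit_reason)
{
    return (exit_reason - EX_VMEXIT_IDTR_READ) & 4;
}
```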

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Acked-by: Adrian Pop <apop@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: introduce CONFIG_HYPERV and detection code
Wei Liu [Mon, 30 Sep 2019 13:34:50 +0000 (14:34 +0100)]
x86: introduce CONFIG_HYPERV and detection code

We use the same code structure as we did for Xen.

For starters, detect Hyper-V in the probe routine. More complex
functionalities will be added later.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: be more verbose when running on a hypervisor
Wei Liu [Sat, 30 Nov 2019 11:39:16 +0000 (11:39 +0000)]
x86: be more verbose when running on a hypervisor

Also replace reference to xen_guest.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: switch xen guest implementation to use hypervisor framework
Wei Liu [Tue, 3 Dec 2019 17:12:50 +0000 (17:12 +0000)]
x86: switch xen guest implementation to use hypervisor framework

Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: rename hypervisor_{alloc,free}_unused_page
Wei Liu [Mon, 30 Sep 2019 12:53:16 +0000 (13:53 +0100)]
x86: rename hypervisor_{alloc,free}_unused_page

They are used in Xen code only.

No functional change.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: introduce hypervisor framework
Wei Liu [Mon, 30 Sep 2019 10:06:39 +0000 (11:06 +0100)]
x86: introduce hypervisor framework

We will soon implement Hyper-V support for Xen. Add a framework for
that.

This requires moving some of the hypervisor_* functions from xen.h to
hypervisor.h.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: drop hypervisor_cpuid_base
Wei Liu [Thu, 19 Sep 2019 14:04:25 +0000 (15:04 +0100)]
x86: drop hypervisor_cpuid_base

The only user is Xen specific code in PV shim. We can therefore export
the variable directly.

Move __read_mostly to its standard place while at it.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: add missing headers in hypercall.h
Wei Liu [Thu, 19 Sep 2019 13:04:00 +0000 (14:04 +0100)]
x86: add missing headers in hypercall.h

Include asm_defns.h because ASM_CALL_CONSTRAINT is defined there.

Include xen/lib.h because we need ASSERT_UNREACHABLE.

No functional change.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: introduce CONFIG_GUEST and move code
Wei Liu [Thu, 19 Sep 2019 12:22:05 +0000 (13:22 +0100)]
x86: introduce CONFIG_GUEST and move code

Xen is able to run as a guest on Xen. We plan to make it able to run
on Hyper-V as well.

Introduce CONFIG_GUEST which is set to true if either running on Xen
or Hyper-V is desired. Restructure code hierarchy for new code to
come.

No functional change intended.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
5 years agoautomation: increase tests maximum time from 10s to 30s
Roger Pau Monne [Tue, 3 Dec 2019 10:33:52 +0000 (11:33 +0100)]
automation: increase tests maximum time from 10s to 30s

10s is too low for the clang tests, this is the output from a clang
test:

  (XEN) [    6.512748] ***************************************************
  (XEN) [    6.513323] SELFTEST FAILURE: CORRECT BEHAVIOR CANNOT BE GUARANTEED
  (XEN) [    6.513891] ***************************************************
  (XEN) [    6.514469] 3... 2... 1...
  (XEN) [    9.520011] *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
  (XEN) [    9.544319] Freed 488kB init memory
  --- Xen Test Framework ---
  Environment: HVM 32bit (PAE 3 levels)
  Hello World
  Test result: SUCCESS
  (XEN) [    9.610977] Hardware Dom0 halted: halting machine

As can be seen from the output above booting Xen and the XTF test
takes ~10s, without accounting for the time it takes for QEMU to
initialize.

Increase the timeout to 30s to be on the safe side.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agoautomation: add timestamps to Xen tests
Roger Pau Monne [Tue, 3 Dec 2019 10:33:51 +0000 (11:33 +0100)]
automation: add timestamps to Xen tests

Enable Xen timestamps in the automated Xen tests, this is helpful in
order to figure out if Xen is stuck or just slow in the automated
tests.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
5 years agox86/tlbflush: do not toggle the PGE CR4 bit unless necessary
Roger Pau Monné [Tue, 3 Dec 2019 13:15:35 +0000 (14:15 +0100)]
x86/tlbflush: do not toggle the PGE CR4 bit unless necessary

When PCID is not available Xen does a full tlbflush by toggling the
PGE bit in CR4. This is not necessary if PGE is not enabled, since a
flush can be performed by writing to CR3 in that case.

Change the code in do_tlb_flush to only toggle the PGE bit in CR4 if
it's already enabled, otherwise do the tlb flush by writing to CR3.
This is relevant when running virtualized, since hypervisors don't
usually trap accesses to CR3 when using hardware assisted paging, but
do trap accesses to CR4, especially on AMD hardware, which makes such
accesses much more expensive.
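
The decision described above (for the case where PCID is not available) can
be modelled as a pure function. This is a sketch only: it does not touch any
control register, and the enum and function names are hypothetical. The only
hardware fact relied on is that CR4.PGE is bit 7.

```c
#include <stdint.h>

#define EX_CR4_PGE (UINT64_C(1) << 7)  /* CR4.PGE is bit 7 */

enum ex_flush_method {
    EX_FLUSH_VIA_CR4_TOGGLE,  /* clear and restore PGE: flushes global entries */
    EX_FLUSH_VIA_CR3_WRITE,   /* reload CR3: enough when PGE is not enabled */
};

/* Pick the cheapest full-flush method for the current CR4 value. */
static enum ex_flush_method ex_pick_flush(uint64_t cr4)
{
    return (cr4 & EX_CR4_PGE) ? EX_FLUSH_VIA_CR4_TOGGLE
                              : EX_FLUSH_VIA_CR3_WRITE;
}
```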

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
5 years agox86: avoid HPET use on certain Intel platforms
Jan Beulich [Tue, 3 Dec 2019 13:14:44 +0000 (14:14 +0100)]
x86: avoid HPET use on certain Intel platforms

Linux commit fc5db58539b49351e76f19817ed1102bf7c712d0 says

"Some Coffee Lake platforms have a skewed HPET timer once the SoCs entered
 PC10, which in consequence marks TSC as unstable because HPET is used as
 watchdog clocksource for TSC."

Follow this for Xen as well. Looking at its patch context made me notice
they have a pre-existing quirk for Bay Trail as well. The comment there,
however, points at a Cherry Trail document. Looking at the datasheets of
both, there appear to be similar issues, so go beyond Linux'es coverage
and exclude both. Also key the disable on the PCI IDs of the actual
affected devices, rather than those of 00:00.0.

Apply the workarounds only when the use of HPET was not explicitly
requested on the command line and when use of (deep) C-states was not
disabled.

Adjust a few types in touched or nearby code at the same time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agognttab: make sure grant map operations don't skip their IOMMU part
Jan Beulich [Tue, 3 Dec 2019 13:13:40 +0000 (14:13 +0100)]
gnttab: make sure grant map operations don't skip their IOMMU part

Two almost simultaneous mapping requests need to make sure that at the
completion of the earlier one IOMMU mappings (established explicitly
here in the PV case) have been put in place. Forever since the splitting
of the grant table lock a violation of this has been possible (using
simplified pin counts, as it doesn't matter whether we talk about read
or write mappings here):

initial state: act->pin = 0

vCPU A: progress the operation past the dropping of the locks after the
        act->pin updates (act->pin = 1, old_pin = 0, act_pin = 1)

vCPU B: progress the operation past the dropping of the locks after the
        act->pin updates (act->pin = 2, old_pin = 1, act_pin = 2)

vCPU B: (re-)acquire both gt locks, mapkind() returns 0, but both
        iommu_legacy_map() invocations get skipped due to non-zero
        old_pin

vCPU B: return to caller without IOMMU mapping

vCPU A: (re-)acquire both gt locks, mapkind() returns 0,
        iommu_legacy_map() gets invoked

With the locks dropped intermediately, whether to invoke
iommu_legacy_map() must depend on only the return value of mapkind()
and of course the kind of mapping request being processed, just like
is already the case in unmap_common().
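
The interleaving above can be replayed sequentially in a toy model. This is
not the grant table code: it ignores read versus write mappings, uses a
single flag in place of the real IOMMU state, and all ex_* names are
hypothetical. It only shows why keying the decision on old_pin lets vCPU B
return without a mapping, while keying it on mapkind() alone does not.

```c
#include <stdbool.h>

/* Toy state: pin count and whether an IOMMU mapping has been established. */
struct ex_state {
    unsigned int pin;
    bool iommu_mapped;
};

/* Phase 1 (locks held): bump the pin count, return the previous value. */
static unsigned int ex_take_pin(struct ex_state *s)
{
    return s->pin++;
}

/* Stand-in for mapkind(): 0 means no mapping of this kind exists yet. */
static unsigned int ex_mapkind(const struct ex_state *s)
{
    return s->iommu_mapped ? 1 : 0;
}

/* Phase 2 (locks re-acquired): decide whether to establish the mapping. */
static void ex_finish(struct ex_state *s, unsigned int old_pin, bool fixed)
{
    bool need = fixed ? ex_mapkind(s) == 0
                      : (old_pin == 0 && ex_mapkind(s) == 0);

    if (need)
        s->iommu_mapped = true;
}

/* Replay: A and B both take a pin, then B finishes before A does. */
static bool ex_mapped_when_b_returns(bool fixed)
{
    struct ex_state s = { 0, false };
    unsigned int old_a = ex_take_pin(&s);  /* old_pin = 0 */
    unsigned int old_b = ex_take_pin(&s);  /* old_pin = 1 */
    bool mapped_for_b;

    ex_finish(&s, old_b, fixed);
    mapped_for_b = s.iommu_mapped;         /* what B's caller observes */
    ex_finish(&s, old_a, fixed);

    return mapped_for_b;
}
```

In this model the old condition leaves B without a mapping at the time it
returns, and the mapkind()-only condition does not.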

Also fix the style of the adjacent comment, and correct a nearby one
still referring to a prior name of what is now mapkind().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agoxen/arm: initialize vpl011 flag register
Jeff Kubascik [Mon, 25 Nov 2019 20:58:00 +0000 (15:58 -0500)]
xen/arm: initialize vpl011 flag register

The tx/rx fifo flags were not set when the vpl011 is initialized. This
is a problem for certain guests that are operating in polled mode, as a
guest will generally check the rx fifo empty flag to determine if there
is data before doing a read. The result is a continuous spam of the
message "vpl011: Unexpected IN ring buffer empty" before the first valid
character is received. This initializes the flag status register to the
default specified in the PL011 technical reference manual.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
5 years agoxen: arm: fix typo in the description of struct pending_irq->desc
Ian Campbell [Fri, 15 Nov 2019 20:01:06 +0000 (15:01 -0500)]
xen: arm: fix typo in the description of struct pending_irq->desc

s/it/if/ makes more sense.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien@xen.org>
5 years agoxen: arm: fix indentation of struct vtimer
Ian Campbell [Fri, 15 Nov 2019 20:01:05 +0000 (15:01 -0500)]
xen: arm: fix indentation of struct vtimer

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Julien Grall <julien@xen.org>
5 years agox86/psr: fix bug which may cause crash
Yi Sun [Mon, 2 Dec 2019 07:24:48 +0000 (15:24 +0800)]
x86/psr: fix bug which may cause crash

During testing, we found a crash in Xen with the trace below.
(XEN) Xen call trace:
(XEN)    [<ffff82d0802a065a>] R psr.c#l3_cdp_write_msr+0x1e/0x22
(XEN)    [<ffff82d0802a0858>] F psr.c#do_write_psr_msrs+0x6d/0x109
(XEN)    [<ffff82d08023e000>] F smp_call_function_interrupt+0x5a/0xac
(XEN)    [<ffff82d0802a2b89>] F call_function_interrupt+0x20/0x34
(XEN)    [<ffff82d080282c64>] F do_IRQ+0x175/0x6ae
(XEN)    [<ffff82d08038b8ba>] F common_interrupt+0x10a/0x120
(XEN)    [<ffff82d0802ec616>] F cpu_idle.c#acpi_idle_do_entry+0x9d/0xb1
(XEN)    [<ffff82d0802ecc01>] F cpu_idle.c#acpi_processor_idle+0x41d/0x626
(XEN)    [<ffff82d08027353b>] F domain.c#idle_loop+0xa5/0xa7
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 20:
(XEN) GENERAL PROTECTION FAULT
(XEN) [error_code=0000]
(XEN) ****************************************

The bug happens when CDP and MBA co-exist and MBA COS_MAX is bigger
than CDP COS_MAX. E.g. MBA has 8 COS registers but CDP only has 6.
When setting MBA throttling value for the 7th guest, the value array
would be:
    +------------------+------------------+--------------+
    | Data default val | Code default val | MBA throttle |
    +------------------+------------------+--------------+

Then, COS id 7 will be selected for writing the values. We should
avoid writing CDP data/code values to COS id 7 MSR because it
exceeds the CDP COS_MAX.
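
The guard this calls for can be sketched as a per-feature range check. The
structure and function below are hypothetical, and the sketch assumes that
cos_max denotes the highest valid COS id of a feature (so 8 registers means
cos_max = 7 and 6 registers means cos_max = 5).

```c
#include <stdbool.h>

/* Hypothetical per-feature description. */
struct ex_feature {
    const char *name;
    unsigned int cos_max;  /* assumed: highest valid COS id */
};

/* Only write a feature's MSR when the COS id is within that feature's range. */
static bool ex_should_write(const struct ex_feature *f, unsigned int cos)
{
    return cos <= f->cos_max;
}
```

With CDP at cos_max = 5 and MBA at cos_max = 7, COS id 7 is written for MBA
only, and the CDP data/code values are skipped.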

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agox86: re-order clang no integrated assembler tests
Roger Pau Monne [Mon, 2 Dec 2019 11:29:46 +0000 (12:29 +0100)]
x86: re-order clang no integrated assembler tests

The tests to check whether the integrated assembler is capable of
building Xen should be performed before testing any assembler
features, or else the feature specific tests would be stale if the
integrated assembler is disabled afterwards.

Fixes: ef286f67787a ('x86: move and fix clang .skip check')
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reported-by: Doug Goldstein <cardoe@cardoe.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 years agoFix the KDD_LOG statements to use appropriate format specifier for printing uint64_t
Julian Tuminaro [Sat, 30 Nov 2019 08:11:18 +0000 (03:11 -0500)]
Fix the KDD_LOG statements to use appropriate format specifier for printing uint64_t

A previous commit in kdd.c had a small issue which led to a warning/error when
compiling on 32-bit systems, due to a type size mismatch when casting from
uint64_t to void *.
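
For illustration, the portable way to print a uint64_t in C is via the
<inttypes.h> macros rather than a cast to a pointer type. The helper below is
a hypothetical sketch using snprintf, not the KDD_LOG macro itself.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* PRIx64 expands to the correct length modifier on 32- and 64-bit builds. */
static int ex_format_u64(char *buf, size_t len, uint64_t v)
{
    return snprintf(buf, len, "0x%016" PRIx64, v);
}
```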

Signed-off-by: Jenish Rakholiya <rjenish@cmu.edu>
Signed-off-by: Julian Tuminaro <jtuminar@andrew.cmu.edu>
Acked-by: Tim Deegan <tim@xen.org>
5 years agoRationalize max_grant_frames and max_maptrack_frames handling
George Dunlap [Fri, 29 Nov 2019 17:24:45 +0000 (17:24 +0000)]
Rationalize max_grant_frames and max_maptrack_frames handling

Xen used to have single, system-wide limits for the number of grant
frames and maptrack frames a guest was allowed to create. Increasing
or decreasing this single limit on the Xen command-line would change
the limit for all guests on the system.

Later, per-domain limits for these values were created. The system-wide
limits became strict limits: domains could not be created with higher
limits, but could be created with lower limits. However, that change
also introduced a range of different "default" values into various
places in the toolstack:

- The python libxc bindings hard-coded these values to 32 and 1024,
  respectively
- The libxl default values are 32 and 1024 respectively.
- xl will use the libxl default for maptrack, but does its own default
  calculation for grant frames: either 32 or 64, based on the max
  possible mfn.

These defaults interact poorly with the hypervisor command-line limit:

- The hypervisor command-line limit cannot be used to raise the limit
  for all guests anymore, as the default in the toolstack will
  effectively override this.
- If you use the hypervisor command-line limit to *reduce* the limit,
  then the "default" values generated by the toolstack are too high,
  and all guest creations will fail.

In other words, the toolstack defaults require any change to be
effected by having the admin explicitly specify a new value in every
guest.

In order to address this, have grant_table_init treat negative values
for max_grant_frames and max_maptrack_frames as instructions to use the
system-wide default, and have all the above toolstacks default to passing
-1 unless a different value is explicitly configured.

This restores the old behavior in that changing the hypervisor command-line
option can change the behavior for all guests, while retaining the ability
to set per-guest values.  It also removes the bug that reducing the
system-wide max will cause all domains without explicit limits to fail.
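
A minimal sketch of the resulting semantics, assuming a hypothetical helper resolve_frames (the real logic lives in grant_table_init and may differ in detail, including the error code used here):

```c
#include <errno.h>

/*
 * Hypothetical sketch; names are illustrative, not Xen's actual code.
 * requested < 0 means "use the system-wide value"; otherwise the
 * system-wide value acts as a strict upper bound.
 */
int resolve_frames(int requested, unsigned int system_max, unsigned int *out)
{
    if (requested < 0) {
        *out = system_max;      /* toolstack passed -1: inherit the default */
        return 0;
    }
    if ((unsigned int)requested > system_max)
        return -EINVAL;         /* assumed error code for an over-limit request */
    *out = (unsigned int)requested;
    return 0;
}
```

Under this model, changing the system-wide value changes the result for every guest that passes -1, while an explicit per-guest value is still honoured as long as it does not exceed the system-wide limit.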

NOTE: - The OCaml bindings require the caller to always specify a value,
        and the code to start a xenstored stubdomain hard-codes these to 4
        and 128 respectively; this behaviour will not be modified.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wl@xen.org>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
julian.tuminaro@gmail.com [Thu, 14 Nov 2019 04:55:41 +0000 (23:55 -0500)]
kdd.c: Add support for initial handshake in KD protocol for Win 7, 8 and 10 (64 bit)

The current implementation of find_os is based on hard-coded values for
different Windows versions. It uses these values to get the address at
which to start looking for the DOS header in the given range. However, this
does not scale to all versions of Windows, as it would require us to keep
adding new entries, and due to KASLR the chance of not hitting the PE
header is significant. We implement a way for 64-bit systems to use an IDT
entry to get a valid exception/interrupt handler and then move back through
memory to find a valid DOS header. Since IDT entries are protected
by PatchGuard, we think our assumption that IDT entries will not be
corrupted is valid for our purpose. Once we have the image base, we
search for the DBGKD_GET_VERSION64 structure type in the .data section to
get the information required for the handshake.
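
The backward search can be sketched as below. This is an assumption-laden illustration rather than the code in kdd.c: a flat buffer stands in for guest memory, the function name find_image_base is hypothetical, and stepping one 4 KiB page at a time is an assumption about image alignment.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/*
 * Hypothetical sketch over a flat buffer standing in for guest memory.
 * Starting from the page containing handler_off, check that page and then
 * up to max_pages earlier pages for the DOS magic "MZ" at the page start.
 * Returns the offset of the matching page, or (size_t)-1 if none is found.
 */
size_t find_image_base(const uint8_t *mem, size_t mem_size,
                       size_t handler_off, size_t max_pages)
{
    if (handler_off >= mem_size)
        return (size_t)-1;

    /* Round the handler address down to its page boundary. */
    size_t off = handler_off & ~(size_t)(SKETCH_PAGE_SIZE - 1);

    for (size_t i = 0; i <= max_pages; i++) {
        if (off + 1 < mem_size && mem[off] == 'M' && mem[off + 1] == 'Z')
            return off;
        if (off < SKETCH_PAGE_SIZE)
            break;              /* reached the start of the buffer */
        off -= SKETCH_PAGE_SIZE;
    }
    return (size_t)-1;
}
```

A real implementation would additionally validate the PE header behind the DOS header before trusting the match; that step is omitted here.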

Currently, this is a work-in-progress feature and the current patch only
supports the handshake and memory read/write on 64-bit systems.

NOTE: This is the updated version of the previously submitted patch.
NOTE: This has so far only been tested when debugging was not enabled
on the guest Windows.

Signed-off-by: Jenish Rakholiya <rjenish@cmu.edu>
Signed-off-by: Julian Tuminaro <jtuminar@andrew.cmu.edu>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Paul Durrant <paul@xen.org>
Paul Durrant [Fri, 15 Nov 2019 18:59:30 +0000 (18:59 +0000)]
passthrough: simplify locking and logging

Dropping the pcidevs lock between calling device_assigned() and
assign_device() means that the latter has to do the same check as the
former for no obvious gain. Also, since long running operations under
pcidevs lock already drop the lock and return -ERESTART periodically there
is little point in immediately failing an assignment operation with
-ERESTART just because the pcidevs lock could not be acquired (for the
second time, having already blocked on acquiring the lock in
device_assigned()).

This patch instead acquires the lock once for assignment (or test assign)
operations directly in iommu_do_pci_domctl() and thus can remove the
duplicate domain ownership check in assign_device(). Whilst in the
neighbourhood, the patch also removes some debug logging from
assign_device() and deassign_device() and replaces it with proper error
logging, which allows error logging in iommu_do_pci_domctl() to be
removed.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 26 Nov 2019 14:37:27 +0000 (14:37 +0000)]
AMD/IOMMU: Render IO_PAGE_FAULT errors in a more useful manner

Print the PCI coordinates in their common format and use d%u notation for the
domain.  As well as printing flags, decode them.  IO_PAGE_FAULT is used for
interrupt remapping errors as well as DMA remapping errors.

Before:
  (XEN) AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0xa1, fault address = 0xbf695000, flags = 0x10
  (XEN) AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0xa1, fault address = 0xbf695040, flags = 0x10
  (XEN) AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0xa1, fault address = 0xfffffff0, flags = 0x30
  (XEN) AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0xa1, fault address = 0x100000000, flags = 0x30
  (XEN) AMD-Vi: IO_PAGE_FAULT: domain = 0, device id = 0xa1, fault address = 0x100000040, flags = 0x30

After:
  (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:00:14.1 d0 addr 00000000bf5fc000 flags 0x10 PR
  (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:00:14.1 d0 addr 00000000bf5fc040 flags 0x10 PR
  (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:00:14.1 d0 addr 00000000fffffff0 flags 0x30 RW PR
  (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:00:14.1 d0 addr 0000000100000000 flags 0x30 RW PR
  (XEN) AMD-Vi: IO_PAGE_FAULT: 0000:00:14.1 d0 addr 0000000100000040 flags 0x30 RW PR

No functional change.
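
For illustration, the "After" format can be approximated as follows. This is a hedged sketch, not the hypervisor's code: format_fault is a hypothetical name, the segment is supplied by the caller, and only the two flag bits that can be inferred from the sample output above (0x20 as RW, 0x10 as PR) are decoded.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch, not Xen's code. The 16-bit device id is split into
 * bus (bits 15:8), device (bits 7:3) and function (bits 2:0). Only the
 * flag bits visible in the sample output are decoded.
 */
int format_fault(char *buf, size_t len, unsigned int seg, unsigned int bdf,
                 unsigned int domid, uint64_t addr, unsigned int flags)
{
    return snprintf(buf, len,
                    "IO_PAGE_FAULT: %04x:%02x:%02x.%u d%u addr %016" PRIx64
                    " flags %#x%s%s",
                    seg, (bdf >> 8) & 0xff, (bdf >> 3) & 0x1f, bdf & 0x7,
                    domid, addr, flags,
                    (flags & 0x20) ? " RW" : "",
                    (flags & 0x10) ? " PR" : "");
}
```

Device id 0xa1 decodes to bus 00, device 14, function 1, matching the 0000:00:14.1 shown in the "After" lines.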

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 26 Nov 2019 14:08:01 +0000 (14:08 +0000)]
AMD/IOMMU: Always print IOMMU errors

Unhandled IOMMU errors (i.e. not IO_PAGE_FAULT) should still be printed, and
not hidden behind iommu=debug.

While adjusting this, factor out the symbolic name handling to just one
location, exposing its off-by-one nature.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Wed, 27 Nov 2019 17:11:43 +0000 (17:11 +0000)]
x86 / iommu: set up a scratch page in the quarantine domain

This patch introduces a new iommu_op to facilitate a per-implementation
quarantine set up, and then further code for x86 implementations
(amd and vtd) to set up a read-only scratch page to serve as the source
for DMA reads whilst a device is assigned to dom_io. DMA writes will
continue to fault as before.

The reason for doing this is that some hardware may continue to re-try
DMA (despite FLR) in the event of an error, or even BME being cleared, and
will fail to deal with DMA read faults gracefully. Having a scratch page
mapped will allow pending DMA reads to complete and thus such buggy
hardware will eventually be quiesced.

NOTE: These modifications are restricted to x86 implementations only as
      the buggy h/w I am aware of is only used with Xen in an x86
      environment. ARM may require similar code but, since I am not
      aware of the need, this patch does not modify any ARM implementation.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Julien Grall [Thu, 28 Nov 2019 09:38:28 +0000 (09:38 +0000)]
xen/x86: vpmu: Unmap per-vCPU PMU page when the domain is destroyed

A guest will set up a shared page with the hypervisor for each vCPU via
XENPMU_init. The page will then get mapped in the hypervisor and only
released when XENPMU_finish is called.

This means that if the guest fails to invoke XENPMU_finish, e.g. if it is
destroyed rather than cleanly shut down, the page will stay mapped in the
hypervisor. One of the consequences is the domain can never be fully
destroyed as a page reference is still held.

As Xen should never rely on the guest to correctly clean up any
allocation in the hypervisor, we should also unmap such pages during
domain destruction if there are any left.

We can re-use the same logic as in pvpmu_finish(). To avoid
duplication, move the logic into a new function that can also be called
from vpmu_destroy().

NOTE: - The call to vpmu_destroy() must also be moved from
        arch_vcpu_destroy() into domain_relinquish_resources() such that
        the reference on the mapped page does not prevent domain_destroy()
        (which calls arch_vcpu_destroy()) from being called.
      - Whilst it appears that vpmu_arch_destroy() is idempotent it is
        by no means obvious. Hence make sure the VPMU_CONTEXT_ALLOCATED
        flag is cleared at the end of vpmu_arch_destroy().
      - This is not an XSA because vPMU is not security supported (see
        XSA-163).
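
The shape of the refactoring can be sketched as below. All names here (struct vcpu_pmu, pmu_unmap_page, pmu_finish, pmu_destroy) are hypothetical stand-ins, and the counter replaces the real unmap and page reference drop; the sketch only shows one shared helper called from both the guest-initiated path and the teardown path, with the "allocated" flag cleared so repeated teardown is harmless.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-vCPU state; not Xen's structures. */
struct vcpu_pmu {
    void *shared_page;    /* non-NULL while the guest page is mapped */
    bool allocated;       /* models a "context allocated" flag */
    unsigned int unmaps;  /* counts simulated unmap operations */
};

/* Shared helper: safe to call whether or not a page is currently mapped. */
void pmu_unmap_page(struct vcpu_pmu *p)
{
    if (p->shared_page == NULL)
        return;
    p->shared_page = NULL;  /* stands in for unmapping and dropping the ref */
    p->unmaps++;
}

/* Guest-initiated path (the guest asked to finish). */
void pmu_finish(struct vcpu_pmu *p)
{
    pmu_unmap_page(p);
}

/* Teardown path: must not depend on the guest having called finish. */
void pmu_destroy(struct vcpu_pmu *p)
{
    pmu_unmap_page(p);
    p->allocated = false;   /* clearing the flag keeps a second call harmless */
}
```

Calling pmu_destroy() twice, or after pmu_finish(), unmaps the page at most once in this model.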

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Fri, 29 Nov 2019 16:10:26 +0000 (17:10 +0100)]
x86: move and fix clang .skip check

.skip is only used by x86 code, so place the clang .skip with labels
check in x86/Rules.mk instead of the top level Rules.mk. While there
also fix an issue with it by removing the '\n' which triggers the
following error:

<stdin>:1:31: error: missing terminating '"' character [-Werror,-Winvalid-pp-token]
void _(void) { asm volatile ( ".L0:
                              ^
<stdin>:1:31: error: expected string literal in 'asm'
<stdin>:3:18: error: missing terminating '"' character [-Werror,-Winvalid-pp-token]
.skip (.L1 - .L0)" ); }
                 ^
<stdin>:3:24: error: expected ')'
.skip (.L1 - .L0)" ); }
                       ^
<stdin>:1:29: note: to match this '('
void _(void) { asm volatile ( ".L0:
                            ^
<stdin>:3:24: error: expected '}'
.skip (.L1 - .L0)" ); }
                       ^
<stdin>:1:14: note: to match this '{'
void _(void) { asm volatile ( ".L0:
             ^
5 errors generated.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Roger Pau Monné <roger.pau@citrix.com> [On FreeBSD and Debian 9.5]
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 Nov 2019 16:10:00 +0000 (17:10 +0100)]
x86: fix clang .macro retention check

There were two problems here: The first closing parenthesis got parsed
by make to end the $(call invocation, and the escaping of the quotes
wasn't right either, as there's nowhere they would get un-escaped.

Furthermore there appears to be a puzzling problem with \n getting
expanded to an actual newline too early in some environments. Convert
these to semicolons at the same time.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Roger Pau Monné <roger.pau@citrix.com> [On FreeBSD and Debian 9.5]
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 Nov 2019 16:09:16 +0000 (17:09 +0100)]
console: avoid buffer overrun in guest_console_write()

conring_puts() has been requiring a nul-terminated string, but the
local kbuf[] is no longer nul-terminated. Add a length parameter to the
function, just as was done for others, thus allowing embedded nul to
also be read through XEN_SYSCTL_readconsole.
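
A minimal sketch of a length-based ring write, assuming a hypothetical struct ring and ring_puts (this is not the actual conring code, and the ring size is arbitrary):

```c
#include <stddef.h>

#define RING_SIZE 16u  /* hypothetical size; must be a power of two */

struct ring {
    char buf[RING_SIZE];
    unsigned int prod;  /* free-running producer index */
};

/*
 * Copies exactly len bytes into the ring. It never scans for a terminating
 * nul, so an unterminated source buffer cannot cause an overrun and
 * embedded nul bytes are stored like any other byte.
 */
void ring_puts(struct ring *r, const char *s, size_t len)
{
    while (len--)
        r->buf[r->prod++ & (RING_SIZE - 1)] = *s++;
}
```

Because the caller states the length, the function reads no further than the bytes it was given, regardless of the buffer's contents.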

While there drop a stray cast: Both operands of - are already uint32_t.

Fixes: ea601ec9995b ("xen/console: Rework HYPERCALL_console_io interface")
Reported-by: Jürgen Groß <jgross@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <julien@xen.org>