Roger Pau Monné [Fri, 20 Dec 2019 15:26:09 +0000 (16:26 +0100)]
x86/ioapic: only use dest32 with x2apic and interrupt remapping enabled
The IO-APIC code assumes that x2apic being enabled also implies
interrupt remapping being enabled, and hence will use the 32-bit
destination field in the IO-APIC entry.
This is safe now, but there's no reason not to enable x2APIC even
without interrupt remapping, and hence the IO-APIC code needs to use
the 32-bit destination field only when both interrupt remapping and
x2APIC are enabled.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 18 Dec 2019 20:17:42 +0000 (20:17 +0000)]
libxc/restore: Fix data auditing in handle_x86_pv_info()
handle_x86_pv_info() has a subtle bug. It uses an 'else if' chain with a
clause in the middle which doesn't exit unconditionally. In practice, this
means that when restoring a 32bit PV guest, later sanity checks are skipped.
Rework the logic a little to be simpler. There are exactly two valid
combinations of fields in X86_PV_INFO, so factor this out and check them all
in one go, before making adjustments to the current domain.
Once adjustments have been completed successfully, sanity check the result
against the X86_PV_INFO settings in one go, rather than piece-wise.
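The "exactly two valid combinations" check can be sketched as follows. This is a simplified model with illustrative field names (a 64-bit PV guest uses 8-byte guest pointers with 4 pagetable levels, a 32-bit one 4 bytes with 3 levels); the real audit operates on the migration stream's X86_PV_INFO record:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the X86_PV_INFO fields being audited. */
struct x86_pv_info {
    uint32_t guest_width; /* bytes per guest pointer */
    uint32_t pt_levels;   /* pagetable levels */
};

/*
 * Exactly two combinations are valid: 64-bit guests (8-byte pointers,
 * 4 pagetable levels) and 32-bit guests (4-byte pointers, 3 levels).
 * Checking them in one go avoids an 'else if' chain whose middle
 * clause doesn't exit unconditionally.
 */
static bool valid_combination(const struct x86_pv_info *info)
{
    return (info->guest_width == 8 && info->pt_levels == 4) ||
           (info->guest_width == 4 && info->pt_levels == 3);
}
```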
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Wed, 18 Dec 2019 14:00:16 +0000 (14:00 +0000)]
tools/python: Python 3 compatibility
convert-legacy-stream is only used for incoming migration from pre-4.7 Xen,
and verify-stream-v2 appears to only be used by me during migration
development - it is little surprise that they missed the main conversion
effort in Xen 4.13.
Fix it all up.
Move open_file_or_fd() into a new util.py to avoid duplication, making it a
more generic wrapper around open() or fdopen().
In libxc.py, drop all long() conversion. Python 2 will DTRT with int => long
promotion, even on 32bit builds.
In convert-legacy-stream, don't pass empty strings to write_record(). Joining
on the empty argl will do the right thing.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Andrew Cooper [Wed, 18 Dec 2019 12:43:48 +0000 (12:43 +0000)]
tools/python: Drop test.py
This file hasn't been touched since it was introduced in 2005 (c/s 0c6f36628)
and has a wildly obsolete shebang for Python 2.3. Most important for us is
that it isn't Python 3 compatible.
Drop the file entirely. Since the 2.3 days, automatic discovery of tests has
been included in standard functionality. Rewrite the test rule to use
"$(PYTHON) -m unittest discover" which is equivelent.
Dropping test.py drops the only piece of ZPL-2.0 code in the tree. Drop the
ancillary files, and adjust COPYING to match.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Lars Kurth <lars.kurth@citrix.com>
Tamas K Lengyel [Wed, 18 Dec 2019 19:40:41 +0000 (11:40 -0800)]
x86/mem_sharing: cleanup code and comments in various locations
No functional changes.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Further cleanup] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tamas K Lengyel [Wed, 18 Dec 2019 19:40:40 +0000 (11:40 -0800)]
tools/libxc: clean up memory sharing files
No functional changes.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Acked-by: Wei Liu <wl@xen.org>
[Further cleanup] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 18 Dec 2019 13:49:59 +0000 (14:49 +0100)]
x86: provide Dom0 access to PPIN via XENPF_resource_op
It was requested that we provide a way independent of the MCE reporting
interface that Dom0 software could use to get hold of the values for
particular CPUs.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 18 Dec 2019 13:49:10 +0000 (14:49 +0100)]
x86: include the PPIN in MCE records when available
Quoting the respective Linux commit:
Intel Xeons from Ivy Bridge onwards support a processor identification
number set in the factory. To the user this is a handy unique number to
identify a particular CPU. Intel can decode this to the fab/production
run to track errors. On systems that have it, include it in the machine
check record. I'm told that this would be helpful for users that run
large data centers with multi-socket servers to keep track of which CPUs
are seeing errors.
Newer AMD CPUs support this too, at different MSR numbers.
Take the opportunity and hide __MC_NMSRS from the public interface going
forward.
Andrew Cooper [Fri, 13 Dec 2019 17:56:40 +0000 (17:56 +0000)]
x86/S3: Restore cr4 later during resume
Just like the BSP/AP paths, %cr4 is loaded with only PAE. Defer restoring all
of %cr4 (MCE in particular) until all the system structures (IDT/TSS in
particular) have been loaded.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 13 Dec 2019 17:52:21 +0000 (17:52 +0000)]
x86/S3: Don't save unnecessary GPRs
Only the callee-preserved registers need saving/restoring. Spill them to the
stack like regular functions do. %rsp is now the only GPR which gets stashed
in .data
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 13 Dec 2019 17:36:09 +0000 (17:36 +0000)]
x86/S3: Clarify and improve the behaviour of do_suspend_lowlevel()
do_suspend_lowlevel() behaves as a function call, even when the trampoline
jumps back into the middle of it. Discuss this property, while renaming the
far-too-generic __ret_point to s3_resume.
Optimise the calling logic for acpi_enter_sleep_state(). $3 doesn't require a
64bit write, and the function isn't variadic so doesn't need to specify zero
FPU registers in use.
In the case of an acpi_enter_sleep_state() error, we didn't actually lose
state so don't need to restore it. Jump straight to the end.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
The existing code assumes that the first mfn passed to the boot
allocator is mapped, which creates problems when, e.g., we do not have
a direct map, and may create other bootstrapping problems in the
future. Make it static. The size is kept the same as before (1 page).
Nick Rosbrook [Mon, 16 Dec 2019 18:08:10 +0000 (18:08 +0000)]
golang/xenlight: implement keyed union C to Go marshaling
Switch over the union key to determine how to populate the 'union' in
the Go struct.
Since the unions of C types cannot be directly accessed in cgo, use a
typeof trick to typedef a struct in the cgo preamble that is analogous
to each inner struct of a keyed union. For example, to define a struct
for the hvm inner struct of libxl_domain_build_info, do:
Nick Rosbrook [Mon, 16 Dec 2019 18:08:09 +0000 (18:08 +0000)]
golang/xenlight: begin C to Go type marshaling
Begin implementation of fromC marshaling functions for generated struct
types. This includes support for converting fields that are basic
primitive types such as string and integer types, nested anonymous
structs, nested libxl structs, and libxl built-in types.
This patch does not implement conversion of arrays or keyed unions.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Nick Rosbrook [Mon, 16 Dec 2019 18:08:08 +0000 (18:08 +0000)]
golang/xenlight: generate structs from the IDL
Add struct and keyed union generation to gengotypes.py. For keyed unions,
use a method similar to gRPC's oneof to interpret C unions as Go types.
Meaning, for a given struct with a union field, generate a struct for
each sub-struct defined in the union. Then, define an interface of one
method which is implemented by each of the defined sub-structs. For
example:
    type domainBuildInfoTypeUnion interface {
            isdomainBuildInfoTypeUnion()
    }

    type DomainBuildInfoTypeUnionHvm struct {
            // HVM-specific fields...
    }
Then, remove existing struct definitions in xenlight.go that conflict
with the generated types, and modify existing marshaling functions to
align with the new type definitions. Notably, drop "time" package since
fields of type time.Duration are now of type uint64.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Rename and modify the signature of the toGo function to fromC. The reason for
using 'fromC' rather than 'toGo' is that it is not a good idea to define
methods on the C types. Also, add an error return type to Bitmap's toC function.
Finally, as code cleanup, re-organize the Bitmap type's comments as per
Go conventions.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
--
Changes in v2:
- Use consistent variable naming for slice created from
libxl_bitmap.
Nick Rosbrook [Mon, 16 Dec 2019 18:07:59 +0000 (18:07 +0000)]
golang/xenlight: define Defbool builtin type
Define Defbool as a struct analogous to the C type, and define the type
'defboolVal' to represent true, false, and default defbool values.
Implement Set, Unset, SetIfDefault, IsDefault, Val, and String functions
on Defbool so that the type can be used in Go analogously to how it's
used in C.
Finally, implement fromC and toC functions.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Nick Rosbrook [Mon, 16 Dec 2019 18:07:59 +0000 (18:07 +0000)]
golang/xenlight: generate enum types from IDL
Introduce gengotypes.py to generate Go code from the IDL. As a first step,
implement 'enum' type generation.
As a result of the newly-generated code, remove the existing, and now
conflicting definitions in xenlight.go. In the case of the Error type,
rename the slice 'errors' to 'libxlErrors' so that it does not conflict
with the standard library package 'errors.' And, negate the values used
in 'libxlErrors' since the generated error values are negative.
Signed-off-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Mon, 16 Dec 2019 16:37:09 +0000 (17:37 +0100)]
x86emul: correct far branch handling for 64-bit mode
AMD and friends explicitly specify that 64-bit operands aren't possible
for these insns. Nevertheless REX.W isn't fully ignored: It still
cancels a possible operand size override (0x66). Intel otoh explicitly
provides for 64-bit operands on the respective insn page of the SDM.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
The version of this header present in the Linux source tree has contained
such macros for some time. These macros, as the names imply, allow front
or back rings to be set up for existent (rather than freshly created and
zeroed) shared rings.
This patch is to update this, the canonical version of the header, to
match the latest definition of these macros in the Linux source.
NOTE: The way the new macros are defined allows the FRONT/BACK_RING_INIT
macros to be re-defined in terms of them, thereby reducing
duplication.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Mon, 16 Dec 2019 16:35:50 +0000 (17:35 +0100)]
x86emul: correct LFS et al handling for 64-bit mode
AMD and friends explicitly specify that 64-bit operands aren't possible
for these insns. Nevertheless REX.W isn't fully ignored: It still
cancels a possible operand size override (0x66). Intel otoh explicitly
provides for 64-bit operands on the respective insn page of the SDM.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 16 Dec 2019 16:34:46 +0000 (17:34 +0100)]
x86emul: correct segment override decode for 64-bit mode
The legacy / compatibility mode ES, CS, SS, and DS overrides are fully
ignored prefixes in 64-bit mode, i.e. they in particular don't cancel an
earlier FS or GS one. (They don't violate the REX-prefix-must-be-last
rule though.)
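The resulting prefix handling can be sketched like this (illustrative code, not the emulator's actual decode loop; the prefix byte values are the architectural ones):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum seg_override {
    SEG_NONE, SEG_ES, SEG_CS, SEG_SS, SEG_DS, SEG_FS, SEG_GS
};

/*
 * In 64-bit mode the legacy ES/CS/SS/DS override prefixes are fully
 * ignored, and in particular must not overwrite (cancel) an earlier
 * FS or GS override.  FS/GS are honoured in all modes.
 */
static void decode_seg_prefix(uint8_t b, bool mode_64bit,
                              enum seg_override *sel)
{
    switch ( b )
    {
    case 0x26: if ( !mode_64bit ) *sel = SEG_ES; break;
    case 0x2e: if ( !mode_64bit ) *sel = SEG_CS; break;
    case 0x36: if ( !mode_64bit ) *sel = SEG_SS; break;
    case 0x3e: if ( !mode_64bit ) *sel = SEG_DS; break;
    case 0x64: *sel = SEG_FS; break;
    case 0x65: *sel = SEG_GS; break;
    }
}
```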
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Igor Druzhinin [Fri, 13 Dec 2019 22:48:01 +0000 (22:48 +0000)]
x86/time: drop vtsc_{kern, user}count debug counters
They either need to be transformed to atomics to work correctly
(currently they are left unprotected for HVM domains) or dropped entirely,
as taking a per-domain spinlock is too expensive for high-vCPU count
domains even in debug builds, given that this lock is taken so often.
Choose the latter, as they are not extremely important anyway.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 16 Dec 2019 13:58:45 +0000 (13:58 +0000)]
x86/pv: Fix `global-pages` to match the documentation
c/s 5de961d9c09 "x86: do not enable global pages when virtualized on AMD or
Hygon hardware" in fact does. Fix the calculation in pge_init().
While fixing this, adjust the command line documentation, first to use the
newer style, and second to expand the description to discuss cases where the
option might be useful, but which Xen can't account for by default.
Fixes: 5de961d9c09 ('x86: do not enable global pages when virtualized on AMD or Hygon hardware')
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Thu, 12 Dec 2019 15:57:51 +0000 (15:57 +0000)]
x86/mm: More descriptive names for page de/validation functions
The functions alloc_page_type(), alloc_lN_table(), free_page_type()
and free_lN_table() are confusingly named: nothing is being allocated
or freed. Rather, the page being passed in is being either validated
or devalidated for use as the specific type; in the specific case of
pagetables, these may be promoted or demoted (i.e., grab appropriate
references for PTEs).
Rename alloc_page_type() and free_page_type() to validate_page() and
devalidate_page(). Also rename alloc_segdesc_page() to
validate_segdesc_page(), since this is what it's doing.
Rename alloc_lN_table() and free_lN_table() to promote_lN_table() and
demote_lN_table(), respectively.
After this change:
- get / put type consistently refer to increasing or decreasing the count
- validate / devalidate consistently refer to actions done when a
type count goes 0 -> 1 or 1 -> 0
- promote / demote consistently refer to acquiring or freeing
resources (in the form of type refs and general references) in order
to allow a page to be used as a pagetable.
No functional change.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap [Fri, 13 Dec 2019 14:09:46 +0000 (14:09 +0000)]
x86/mm: Use mfn_t in type get / put call tree
Replace `unsigned long` with `mfn_t` as appropriate throughout
alloc/free_lN_table, get/put_page_from_lNe, and
get_lN_linear_pagetable. This obviates the need for a load of
`mfn_x()` and `_mfn()` casts.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap [Fri, 13 Dec 2019 12:53:04 +0000 (12:53 +0000)]
x86/mm: Use a more descriptive name for pagetable mfns
In many places, a PTE being modified is accompanied by the pagetable
mfn which contains the PTE (primarily in order to be able to maintain
linear mapping counts). In many cases, this mfn is stored in the
non-descript variable (or argument) "pfn".
Replace these names with lNmfn, to indicate that 1) this is a
pagetable mfn, and 2) that it is the same level as the PTE in
question. This should be enough to remind readers that it's the mfn
containing the PTE.
No functional change.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap [Fri, 13 Dec 2019 12:53:04 +0000 (12:53 +0000)]
x86/mm: Implement common put_data_pages for put_page_from_l[23]e
Both put_page_from_l2e and put_page_from_l3e handle having superpage
entries by looping over each page and "put"-ing each one individually.
As with putting page table entries, this code is functionally
identical, but for some reason different. Moreover, there is already
a common function, put_data_page(), to handle automatically swapping
between put_page() (for read-only pages) or put_page_and_type() (for
read-write pages).
Replace this with put_data_pages() (plural), which does the entire
loop, as well as the put_page / put_page_and_type switch.
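A sketch of the resulting helper (simplified: bare mfns stand in for struct page_info pointers, and the two put functions are stubbed out as counters for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Stub stand-ins for Xen's helpers, counting calls for the example. */
static unsigned int nr_put, nr_put_and_type;
static void put_page(unsigned long mfn) { (void)mfn; nr_put++; }
static void put_page_and_type(unsigned long mfn) { (void)mfn; nr_put_and_type++; }

/*
 * Put every page of a superpage entry, switching between put_page()
 * for read-only mappings and put_page_and_type() for writeable ones,
 * so both put_page_from_l2e() and put_page_from_l3e() can share it.
 */
static void put_data_pages(unsigned long mfn, unsigned int nr_pages,
                           bool writeable)
{
    unsigned int i;

    for ( i = 0; i < nr_pages; i++ )
    {
        if ( writeable )
            put_page_and_type(mfn + i);
        else
            put_page(mfn + i);
    }
}
```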
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Fri, 13 Dec 2019 12:53:04 +0000 (12:53 +0000)]
x86/mm: Refactor put_page_from_l*e to reduce code duplication
put_page_from_l[234]e have identical functionality for devalidating an
entry pointing to a pagetable. But mystifyingly, they duplicate the
code in slightly different arrangements that make it hard to tell that
it's the same.
Create a new function, put_pt_page(), which handles the common
functionality; and refactor all the functions to be symmetric,
differing only in the level of pagetable expected (and in whether they
handle superpages).
Other than put_page_from_l2e() gaining an ASSERT it probably should
have had already, no functional changes.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Fri, 13 Dec 2019 16:39:44 +0000 (16:39 +0000)]
public/io/netif.h: document a mechanism to advertise carrier state
This patch adds a specification for a 'carrier' node in xenstore to allow
a backend to notify a frontend of its virtual carrier/link state. E.g.
a backend that is unable to forward packets from the guest because it is
not attached to a bridge may wish to advertise 'no carrier'.
While in the area also fix an erroneous backend path description.
NOTE: This is purely a documentation patch. No functional change.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Thu, 12 Dec 2019 18:27:34 +0000 (18:27 +0000)]
Config.mk: Remove stray comment
This comment isn't about CONFIG_TESTS, but about SEABIOS_DIR, which
has been removed.
Originally, the comment was added by 5f82d0858de1 ("tools: support
SeaBIOS. Use by default when upstream qemu is configured."), then
later the SEABIOS_DIR was removed by 14ee3c05f3ef ("Clone and build
Seabios by default") but that comment about the pain was left behind.
The commit that made CONFIG_TESTS painful was 85896a7c4dc7 ("build:
add autoconf to replace custom checks in tools/check").
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Anthony PERARD [Thu, 12 Dec 2019 18:27:33 +0000 (18:27 +0000)]
Config.mk: Remove unused setvar_dir macro
And remove all mention of it in docs. It hasn't been used since
9ead9afcb935 ("Add configure --with-sysconfig-leaf-dir=SUBDIR to set
CONFIG_LEAF_DIR").
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Extend the livepatch list operation to also fetch payloads' metadata.
This is achieved by extending the sysctl list interface with 2 extra
guest handles:
* metadata - an array of arbitrary size strings
* metadata_len - an array of metadata strings' lengths (uint32_t each)
Payloads' metadata is a string of arbitrary size and does not have an
upper bound limit. It may also vary in size between payloads.
In order to let the userland allocate enough space for the incoming
data, add a metadata total size field to the list sysctl operation and
fill it with the total size of all payloads' metadata.
Extend the libxc to handle the metadata back-to-back data transfers
as well as metadata length array data transfers.
The xen-livepatch userland tool is extended to always display the
metadata for each received module. The metadata is received in the
following format: key=value\0key=value\0...key=value\0. The format is
modified to: key=value;key=value;...key=value.
The new format allows the metadata of a given module to be easily
parsed by a machine.
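The reformatting step amounts to replacing the separating NUL bytes with ';' while keeping the final terminator, along these lines (an illustrative sketch, not the tool's actual code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Convert metadata received as back-to-back NUL-terminated
 * "key=value" strings into the ';'-separated display format.
 * 'len' covers the whole buffer, including the trailing NUL of
 * the last entry, which is left in place.
 */
static void metadata_display_format(char *buf, size_t len)
{
    size_t i;

    for ( i = 0; i + 1 < len; i++ )
        if ( buf[i] == '\0' )
            buf[i] = ';';
}
```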
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com>
Reviewed-by: Martin Pohlack <mpohlack@amazon.de>
Reviewed-by: Norbert Manthey <nmanthey@amazon.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
livepatch: Handle arbitrary size names with the list operation
The payloads' name strings can be of arbitrary size (typically small
with an upper bound of XEN_LIVEPATCH_NAME_SIZE).
The current implementation of the list operation interface copies
names in XEN_LIVEPATCH_NAME_SIZE chunks regardless of their actual
size, and enforces space allocation requirements on userland tools.
To unify and simplify the interface, handle name strings of arbitrary
size by copying them back-to-back to the userland.
In order to let the userland allocate enough space for the incoming
data, add an auxiliary interface xc_livepatch_list_get_sizes() that
provides the current number of payload entries and the total size of
all name strings. This is achieved by extending the sysctl list
interface with an extra field: name_total_size.
A call to xc_livepatch_list_get_sizes() issues the livepatch sysctl
list operation with the nr field set to 0. In this mode the operation
returns the number of payload entries and calculates the total size
of all payloads' names.
When the sysctl operation is issued with a non-zero nr field (for
instance, a value obtained by a prior call to
xc_livepatch_list_get_sizes()), the new field name_total_size provides
the total size of the data actually copied.
Extend the libxc to handle the name back-to-back data transfers.
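Unpacking such a back-to-back buffer on the userland side can look like this (an illustrative helper, not the actual libxc routine):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the idx-th of 'nr' NUL-terminated names packed
 * back-to-back in 'buf' of 'total' bytes, or NULL if the index is out
 * of range or the buffer is malformed.
 */
static const char *name_at(const char *buf, size_t total,
                           unsigned int nr, unsigned int idx)
{
    size_t off = 0;
    unsigned int i;

    if ( idx >= nr )
        return NULL;

    for ( i = 0; i < idx; i++ )
    {
        const char *end = memchr(buf + off, '\0', total - off);

        if ( !end )
            return NULL;              /* unterminated name */
        off = (end - buf) + 1;
    }

    return off < total ? buf + off : NULL;
}
```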
The xen-livepatch tool is modified to start the list operation with a
call to xc_livepatch_list_get_sizes() to obtain the actual number
of payloads as well as the necessary space for names.
The tool now always requests the actual number of entries and leaves
preemption handling to the libxc routine. The libxc still returns the
'done' and 'left' parameters with the same semantics, allowing the tool
to detect anomalies and react to them. At the moment it is expected
that the tool receives the exact number of entries as requested.
The xen-livepatch tool has also been modified to handle the name
back-to-back transfers correctly.
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com>
Reviewed-by: Bjoern Doebel <doebel@amazon.de>
Reviewed-by: Martin Pohlack <mpohlack@amazon.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
livepatch: Add support for modules .modinfo section metadata
Having detailed livepatch metadata helps to properly identify a module's
origin and version. It also allows keeping track of the history of
livepatch loads in the system (at least within dmesg buffer size
limits).
The livepatch metadata are embedded in the form of a .modinfo section.
Each such section contains data in the following format:
key=value\0key=value\0...key=value\0
The .modinfo section may be generated and appended to the resulting
livepatch ELF file optionally as an extra step of a higher level
livepatch build system.
The metadata section pointer and the section length are stored in the
livepatch payload structure and are used to display the content upon
the livepatch apply operation.
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com>
Reviewed-by: Bjoern Doebel <doebel@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Martin Pohlack <mpohlack@amazon.de>
Reviewed-by: Norbert Manthey <nmanthey@amazon.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
livepatch: Add support for inline asm livepatching expectations
This is the initial implementation of the expectations enhancement
to improve inline asm livepatching.
Expectations are designed as an optional feature, since their main
use is planned for inline asm livepatching. The 'enabled' flag
controls the expectation state.
Each expectation has data and len fields that describe the data
that is expected to be found at a given patching (old_addr) location.
The len must not exceed the data array size. The data array size
follows the size of the opaque array, since the opaque array holds
the original data and therefore must match what is specified in the
expectation (if enabled).
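The check itself is essentially a guarded memory comparison; a minimal sketch (with an illustrative struct layout, not the actual ABI definition):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative model of an expectation; the real layout is part of
 * the livepatch payload ABI. */
struct livepatch_expectation {
    bool enabled;
    uint8_t len;         /* meaningful bytes in data[] */
    uint8_t data[31];    /* sized to match the opaque array */
};

/*
 * An enabled expectation passes only when its recorded bytes match
 * what is currently present at the patching location (old_addr).
 */
static bool expectation_ok(const struct livepatch_expectation *exp,
                           const void *old_addr)
{
    if ( !exp->enabled )
        return true;                          /* nothing to verify */
    if ( !exp->len || exp->len > sizeof(exp->data) )
        return false;                         /* malformed */
    return memcmp(exp->data, old_addr, exp->len) == 0;
}
```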
The payload structure is modified as each expectation structure is
part of the livepatch_func structure and hence extends the payload.
Each expectation is checked prior to the apply action (i.e. as late
as possible to check against the most current state of the code).
For the replace action a new payload's expectations are checked AFTER
all applied payloads are successfully reverted, but BEFORE new payload
is applied. That breaks the replace action's atomicity, and in case of
an expectation check failure would leave the system with all payloads
reverted. That is obviously insecure. Use it with caution and act
upon replace errors!
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com>
Reviewed-by: Martin Pohlack <mpohlack@amazon.de>
Reviewed-by: Norbert Manthey <nmanthey@amazon.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
livepatch: Add per-function applied/reverted state tracking marker
Livepatch only tracks an entire payload applied/reverted state. But,
with an option to supply the apply_payload() and/or revert_payload()
functions as optional hooks, it becomes possible to intermix the
execution of the original apply_payload()/revert_payload() functions
with their dynamically supplied counterparts.
It is important then to track the current state of every function
being patched and prevent situations of unintentional double-apply
or unapplied revert.
To support that, it is necessary to extend the public interface of
livepatch. The struct livepatch_func gets an additional field holding
the applied/reverted state marker.
To reflect the livepatch payload ABI change, bump the version flag
LIVEPATCH_PAYLOAD_VERSION up to 2.
[And also update the top of the design document]
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com>
Reviewed-by: Bjoern Doebel <doebel@amazon.de>
Reviewed-by: Martin Pohlack <mpohlack@amazon.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
livepatch: Do not enforce ELF_LIVEPATCH_FUNC section presence
With the default implementation, the ELF_LIVEPATCH_FUNC section,
containing all functions to be replaced or added, must be part of the
livepatch payload, otherwise the payload is rejected (with -EINVAL).
However, with the extended hooks implementation, a livepatch may be
constructed of only hooks to perform certain actions without any code
to be added or replaced.
Therefore, do not always expect the functions section and allow it to
be missing, provided there is at least one section containing hooks
present. The functions section, when present in a payload, must be a
single, non-empty section.
Also check that each of the extended hooks sections present is a
single, non-empty section.
At least one of the functions or hooks section must be present in a
valid payload.
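The acceptance rule can be summarised as follows (a sketch with an illustrative section model, counting occurrences and bytes per section name):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model: occurrences of a section name, and its size. */
struct sec { unsigned int count; size_t size; };

/* A section may be absent, but if present must be single and non-empty. */
static bool sec_ok(const struct sec *s)
{
    return s->count == 0 || (s->count == 1 && s->size > 0);
}

/*
 * A payload is valid when the functions section and every hooks
 * section individually pass sec_ok(), and at least one of them is
 * actually present.
 */
static bool payload_sections_ok(const struct sec *funcs,
                                const struct sec *hooks,
                                unsigned int nr_hooks)
{
    unsigned int i;
    bool any_hooks = false;

    if ( !sec_ok(funcs) )
        return false;

    for ( i = 0; i < nr_hooks; i++ )
    {
        if ( !sec_ok(&hooks[i]) )
            return false;
        if ( hooks[i].count )
            any_hooks = true;
    }

    return funcs->count == 1 || any_hooks;
}
```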
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com>
Reviewed-by: Bjoern Doebel <doebel@amazon.de>
Reviewed-by: Martin Pohlack <mpohlack@amazon.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
livepatch: Add support for apply|revert action replacement hooks
By default, in the quiescing zone, a livepatch payload is applied with
apply_payload() and reverted with revert_payload() functions. Both of
the functions receive the payload struct pointer as a parameter. The
functions are also a place where standard 'load' and 'unload' module
hooks are executed.
To increase the livepatching system's agility and provide a more
flexible long-term livepatch solution, allow the default apply and
revert action functions to be overridden by hook-like supplied
alternatives.
The alternative functions are optional; the default functions are used
when no alternatives are supplied.
Since the alternative functions have direct access to the livepatch
payload structure, they can better control the context of the 'load'
and 'unload' hooks' execution, as well as the exact instruction
replacement workflows. They can also easily be extended to support
extra features in the future.
To simplify the alternative function generation, move the code
responsible for payload and livepatch region registration outside of
the function. That way it is guaranteed that the registration step
occurs even for newly supplied functions.
This is an implementation of 4 new livepatch module vetoing hooks,
that can be optionally supplied along with modules.
The hooks that currently exist in the livepatch mechanism aren't agile
enough and have various limitations:
* run only from within a quiescing zone
* cannot conditionally prevent applying or reverting
* do not have access to the module context
To address these limitations the following has been implemented:
1) pre-apply hook
runs before the apply action is scheduled for execution. Its main
purpose is to prevent applying a livepatch when certain expected
conditions aren't met, or when mutating actions implemented in the
hook fail or cannot be executed.
2) post-apply hook
runs after the apply action has been executed and the quiescing zone
exited. Its main purpose is to provide an ability to follow up on
actions performed by the pre- hook when module application was
successful, or to undo certain preparation steps of the pre- hook in
case of a failure. The success/failure error code is provided to
the post- hooks via the rc field of the payload structure.
3) pre-revert hook
runs before the revert action is scheduled for execution. Its main
purpose is to prevent reverting a livepatch when certain expected
conditions aren't met, or when mutating actions implemented in the
hook fail or cannot be executed.
4) post-revert hook
runs after the revert action has been executed and the quiescing zone
exited. Its main purpose is to perform cleanup of all previously
executed mutating actions in order to restore the original system
state from before the current module application.
The success/failure error code is provided to the post- hooks via
the rc field of the payload structure.
The replace action performs atomically the following actions:
- revert all applied modules
- apply a single replacement module.
With the vetoing hooks in place, various inter-hook dependencies may
arise. Also, during the revert part of the operation, certain vetoing
hooks may detect failing conditions that previously were satisfied.
That could in turn lead to a situation where the revert part must be
rolled back with all the pre- and post- hooks re-applied, which again
can't be guaranteed to always succeed.
The simplest response to this complication is to disallow the replace
action completely on modules with vetoing hooks.
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com> Reviewed-by: Petre Eftime <epetre@amazon.com> Reviewed-by: Martin Pohlack <mpohlack@amazon.de> Reviewed-by: Norbert Manthey <nmanthey@amazon.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
livepatch: Export payload structure via livepatch_payload.h
The payload structure will be used by the new hooks implementation and
therefore its definition has to be exported via the livepatch_payload
header.
The new hooks will make use of the payload structure fields and the
hooks' pointers will also be defined in the payload structure, so
the structure along with all field definitions needs to be available
to the code being patched in.
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com> Reviewed-by: Eslam Elnikety <elnikety@amazon.de> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Martin Pohlack <mpohlack@amazon.de> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
livepatch: Allow to override inter-modules buildid dependency
By default Livepatch enforces the following buildid-based dependency
chain between livepatch modules:
1) first module depends on given hypervisor buildid
2) every consecutive module depends on previous module's buildid
This way proper livepatch stack order is maintained and enforced.
While it is important for production livepatches it limits agility and
blocks usage of testing or debug livepatches. These kinds of livepatch
modules are typically expected to be loaded at any time irrespective
of current state of the modules stack.
To enable testing and debug livepatches, allow the user to dynamically
ignore the inter-module dependency. In this case only the hypervisor
buildid match is verified and enforced.
To allow userland to pass additional parameters for livepatch actions,
add support for action flags.
Each of the apply, revert and unload actions gets an additional
32-bit parameter 'flags' where extra flags can be applied in mask
form.
Initially only one flag '--nodeps' is added for the apply action.
This flag modifies the default buildid dependency check as described
above.
The global sysctl interface input flag parameter is defined with a
single corresponding flag macro:
LIVEPATCH_ACTION_APPLY_NODEPS (1 << 0)
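The effect of the flag on the dependency decision can be sketched as follows; the macro value comes from the text, while the helper function name is illustrative rather than the actual Xen code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* The single flag defined by the sysctl interface (from the text). */
#define LIVEPATCH_ACTION_APPLY_NODEPS (1u << 0)

/* Sketch of the decision: with --nodeps only the hypervisor build ID is
 * checked; otherwise the previous module's build ID must match too. */
static bool check_module_dep(uint32_t flags)
{
    return !(flags & LIVEPATCH_ACTION_APPLY_NODEPS);
}
```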
The userland xen-livepatch tool is modified to support the '--nodeps'
flag for the apply and load commands. A general mechanism for specifying
more flags in the future, for apply and other actions, is however added.
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com> Reviewed-by: Eslam Elnikety <elnikety@amazon.de> Reviewed-by: Petre Eftime <epetre@amazon.com> Reviewed-by: Leonard Foerster <foersleo@amazon.de> Reviewed-by: Martin Pohlack <mpohlack@amazon.de> Reviewed-by: Norbert Manthey <nmanthey@amazon.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
livepatch: Always check hypervisor build ID upon livepatch upload
This change is part of an independent stacked livepatch modules
feature. This feature allows bypassing dependencies between modules
upon loading, while still verifying that the Xen build ID matches.
In order to prevent (up)loading any livepatch built for a different
hypervisor version, as indicated by the Xen build ID, add a check that
the payload's and Xen's build IDs match.
To achieve that, embed into every livepatch another section with a
dedicated hypervisor build ID in it. After the payload is loaded and
the .livepatch.xen_depends section becomes available, perform the
check and reject the payload if there is no match.
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Andra-Irina Paraschiv <andraprs@amazon.com> Reviewed-by: Bjoern Doebel <doebel@amazon.de> Reviewed-by: Eslam Elnikety <elnikety@amazon.de> Reviewed-by: Martin Pohlack <mpohlack@amazon.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Juergen Gross [Wed, 11 Dec 2019 16:56:59 +0000 (17:56 +0100)]
build: fix tools/configure in case only python3 exists
Calling ./configure with python3 present but no python makes
tools/configure fail. Fix that by defaulting to python and
falling back to python3 or python2.
While at it, fix the use of the non-portable "type -p" by replacing it
with AC_PATH_PROG().
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wl@xen.org>
[ wei: run autogen.sh ] Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Wed, 11 Dec 2019 13:55:32 +0000 (14:55 +0100)]
AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables
update_paging_mode() has multiple bugs:
1) Booting with iommu=debug will cause it to inform you that it was called
without the pdev_list lock held.
2) When growing by more than a single level, it leaks the newly allocated
table(s) in the case of a further error.
Furthermore, the choice of default level for a domain has issues:
1) All HVM guests grow from 2 to 3 levels during construction because of the
position of the VRAM just below the 4G boundary, so defaulting to 2 is a
waste of effort.
2) The limit for PV guests doesn't take memory hotplug into account, and
isn't dynamic at runtime like HVM guests. This means that a PV guest may
get RAM which it can't map in the IOMMU.
The dynamic height is a property unique to AMD, and adds a substantial
quantity of complexity for what is a marginal performance improvement. Remove
the complexity by removing the dynamic height.
PV guests now get 3 or 4 levels based on any hotplug regions in the host.
This only makes a difference for hardware which previously had all RAM below
the 512G boundary, and a hotplug region above.
HVM guests now get 4 levels (which will be sufficient until 256TB guests
become a thing), because we don't currently have the information to know when
3 would be safe to use.
The overhead of this extra level is not expected to be noticeable. It costs
one page (4k) per domain, and one extra IO-TLB paging structure cache entry
which is very hot and less likely to be evicted.
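The level arithmetic behind the 512G and 256TB boundaries mentioned above is simple to check: each page-table level resolves 9 bits of address on top of the 12-bit page offset, so 3 levels reach 2^39 (512G) and 4 levels reach 2^48 (256TB). A small sketch (the function is illustrative, not Xen code):

```c
#include <assert.h>
#include <stdint.h>

/* Levels needed to map all addresses below max_addr: each level resolves
 * 9 bits on top of the 12-bit page offset, so 3 levels reach 2^39
 * (the 512G boundary) and 4 levels reach 2^48 (256TB). */
static unsigned int levels_needed(uint64_t max_addr)
{
    unsigned int levels = 1;
    uint64_t span = 1ULL << (12 + 9);   /* one level covers 2MB */

    while (max_addr > span && levels < 6) {  /* 6 levels is the format's max */
        levels++;
        span <<= 9;
    }
    return levels;
}
```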
This is XSA-311.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Mon, 28 Oct 2019 14:33:51 +0000 (14:33 +0000)]
x86/mm: relinquish_memory: Grab an extra type ref when setting PGT_partial
The PGT_partial bit in page->type_info holds both a type count and a
general ref count. During domain tear-down, when free_page_type()
returns -ERESTART, relinquish_memory() correctly handles the general
ref count, but fails to grab an extra type count when setting
PGT_partial. When this bit is eventually cleared, type_count underflows
and triggers the following BUG in page_alloc.c:free_domheap_pages():
As far as we can tell, this page underflow cannot be exploited in any
other way: The page can't be used as a pagetable by the dying domain
because it's dying; it can't be used as a pagetable by any other
domain since it belongs to the dying domain; and ownership can't
transfer to any other domain without hitting the BUG_ON() in
free_domheap_pages().
(steal_page() won't work on a page in this state, since it requires
PGC_allocated to be set, and PGC_allocated will already have been
cleared.)
Fix this by grabbing an extra type ref if setting PGT_partial in
relinquish_memory.
This is part of XSA-310.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Thu, 31 Oct 2019 11:17:38 +0000 (11:17 +0000)]
x86/mm: alloc/free_lN_table: Retain partial_flags on -EINTR
When validating or de-validating pages (in alloc_lN_table and
free_lN_table respectively), the `partial_flags` local variable is
used to keep track of whether the "current" PTE started the entire
operation in a "may be partial" state.
One of the patches in XSA-299 addressed the fact that it is possible
for a previously-partially-validated entry to subsequently be found to
have invalid entries (indicated by returning -EINVAL); in which case
page->partial_flags needs to be set to indicate that the current PTE
may have the partial bit set (and thus _put_page_type() should be
called with PTF_partial_set).
Unfortunately, the patches in XSA-299 assumed that once
put_page_from_lNe() returned -ERESTART on a page, it was not possible
for it to return -EINTR. This turns out to be true for
alloc_lN_table() and free_lN_table, but not for _get_page_type() and
_put_page_type(): both can return -EINTR when called on pages with
PGT_partial set. In these cases, the page's PGT_partial bit will still
be set; failing to set partial_flags appropriately may allow an attacker
to perform a privilege escalation similar to those described in XSA-299.
Fix this by always copying the local partial_flags variable into
page->partial_flags when exiting early.
NB that on the "get" side, no adjustment to nr_validated_entries is
needed: whether pte[i] is partially validated or entirely
un-validated, we want nr_validated_entries = i. On the "put" side,
however, we need to adjust nr_validated_entries appropriately: if
pte[i] is entirely validated, we want nr_validated_entries = i + 1; if
pte[i] is partially validated, we want nr_validated_entries = i.
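The get/put asymmetry in the NB above reduces to a small piece of arithmetic, sketched here with an illustrative helper (not the actual Xen code):

```c
#include <assert.h>
#include <stdbool.h>

/* nr_validated_entries to record when exiting early while processing
 * pte[i], per the rules in the text: on the "get" side it is always i;
 * on the "put" side it is i + 1 only if pte[i] was entirely validated. */
static unsigned int nr_validated(unsigned int i, bool put_side,
                                 bool entirely_validated)
{
    if (put_side && entirely_validated)
        return i + 1;
    return i;
}
```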
This is part of XSA-310.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Tue, 19 Nov 2019 11:40:34 +0000 (11:40 +0000)]
x86/mm: Set old_guest_table when destroying vcpu pagetables
Changeset 6c4efc1eba ("x86/mm: Don't drop a type ref unless you held a
ref to begin with"), part of XSA-299, changed the calling discipline
of put_page_type() such that if put_page_type() returned -ERESTART
(indicating a partially de-validated page), subsequent calls to
put_page_type() must be called with PTF_partial_set. If called on a
partially de-validated page but without PTF_partial_set, Xen will
BUG(), because to do otherwise would risk opening up the kind of
privilege escalation bug described in XSA-299.
One place this was missed was in vcpu_destroy_pagetables().
put_page_and_type_preemptible() is called, but on -ERESTART, the
entire operation is simply restarted, causing put_page_type() to be
called on a partially de-validated page without PTF_partial_set. The
result was that if such an operation were interrupted, Xen would hit a
BUG().
Fix this by having vcpu_destroy_pagetables() consistently pass off
interrupted de-validations to put_old_page_type():
- Unconditionally clear references to the page, even if
put_page_and_type failed
- Set old_guest_table and old_guest_table_partial appropriately
While here, do some refactoring:
- Move clearing of arch.cr3 to the top of the function
- Now that clearing is unconditional, move the unmap to the same
conditional as the l4tab mapping. This also allows us to reduce
the scope of the l4tab variable.
- Avoid code duplication by looping to drop references on
guest_table_user
This is part of XSA-310.
Reported-by: Sarah Newman <srn@prgmr.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Wed, 30 Oct 2019 17:05:28 +0000 (17:05 +0000)]
x86/mm: Don't reset linear_pt_count on partial validation
"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).
XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to, or is pointed to by, another page
table, to prevent both from happening at the same time.
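The restriction can be modelled with a single signed per-page counter: positive while the page points to other same-level pagetables, negative while other same-level pagetables point to it, with mixing refused. This is a sketch of the idea only; the structure and function names are invented and this is not Xen's actual linear_pt_count code.

```c
#include <assert.h>
#include <stdbool.h>

struct pt_page {
    int linear_pt_count;    /* >0: points to others; <0: pointed to by others */
};

/* pg gains a link pointing at another same-level pagetable. */
static bool add_outgoing_link(struct pt_page *pg)
{
    if (pg->linear_pt_count < 0)
        return false;       /* already pointed to: would deepen the chain */
    pg->linear_pt_count++;
    return true;
}

/* Another same-level pagetable gains a link pointing at pg. */
static bool add_incoming_link(struct pt_page *pg)
{
    if (pg->linear_pt_count > 0)
        return false;       /* already points to others: refuse */
    pg->linear_pt_count--;
    return true;
}
```

Resetting such a counter mid-validation, as the bug did, silently drops outstanding links and lets both directions coexist.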
Unfortunately, the original commit introducing this reset the count
when resuming validation of a partially-validated pagetable, dropping
some "linear_pt_entry" counts.
On debug builds on systems where guests used this feature, this might
lead to crashes that look like this:
Assertion 'oc > 0' failed at mm.c:874
Worse, if an attacker could engineer such a situation to occur, they
might be able to make loops or other arbitrary chains of linear
pagetables, leading to the denial-of-service situation outlined in
XSA-240.
This is XSA-309.
Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Dec 2019 13:09:30 +0000 (14:09 +0100)]
x86/vtx: Work around SingleStep + STI/MovSS VMEntry failures
See patch comment for technical details.
Concerning the timeline, this was first discovered in the aftermath of
XSA-156 which caused #DB to be intercepted unconditionally, but only in
its SingleStep + STI form which is restricted to privileged software.
After working with Intel and identifying the problematic vmentry check,
this workaround was suggested, and the patch was posted in an RFC
series. Outstanding work for that series (not breaking Introspection)
is still pending, and this fix from it (which wouldn't have been good
enough in its original form) wasn't committed.
A vmentry failure was reported to xen-devel, and debugging identified
this bug in its SingleStep + MovSS form by way of INT1, which does not
involve the use of any privileged instructions, proving this to be a
security issue.
This is XSA-308.
Reported-by: Håkon Alstadheim <hakon@alstadheim.priv.no> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Wed, 11 Dec 2019 13:06:18 +0000 (14:06 +0100)]
x86+Arm32: make find_next_{,zero_}bit() have well defined behavior
These functions getting used with the 2nd and 3rd arguments being equal
wasn't well defined: Arm64 reliably returns the value of the 2nd
argument in this case, while on x86 for bitmaps up to 64 bits wide the
return value was undefined (due to the undefined behavior of a shift of
a value by the number of bits it's wide) when the incoming value was 64.
On Arm32 an actual out of bounds access would happen when the
size/offset value is a multiple of 32; if this access doesn't fault, the
return value would have been sufficiently correct afaict.
Make the functions consistently tolerate the last two arguments being
equal (and in fact the 3rd argument being greater or equal to the 2nd),
in favor of finding and fixing all the use sites that violate the
original more strict assumption.
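The undefined behaviour and its fix are easy to demonstrate over a single 64-bit word; the function below is an illustrative sketch of the tolerant behaviour, not the actual Xen implementation:

```c
#include <assert.h>
#include <stdint.h>

/* find_next_bit over a single 64-bit word, tolerating offset >= size by
 * returning size, as the fixed functions now do.  The explicit guard
 * matters: "word >> offset" with offset == 64 is undefined behaviour in
 * C (shift by the width of the type), which is the x86 problem above. */
static unsigned int find_next_bit64(uint64_t word, unsigned int size,
                                    unsigned int offset)
{
    if (offset >= size)
        return size;        /* well defined even for offset == size == 64 */
    word >>= offset;        /* safe: offset < size <= 64 implies offset < 64 */
    for ( ; offset < size; offset++, word >>= 1 )
        if (word & 1)
            return offset;
    return size;
}
```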
This is XSA-307.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien@xen.org>
Andrew Cooper [Sat, 7 Dec 2019 15:50:22 +0000 (15:50 +0000)]
xen/build: Automatically locate a suitable python interpreter
Needing to pass PYTHON=python3 into hypervisor builds is irritating and
unnecessary. Locate a suitable interpreter automatically, defaulting to Py3
if it is available.
Reported-by: Steven Haigh <netwiz@crc.id.au> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Fixing this is easy, but using python here is wasteful. compile.h doesn't
need XEN_BANNER rendered in octal, and text is much simpler to handle.
Replace fig-to-oct.py with a smaller sed script. This could be a shell
one-liner, but a script is simpler to comment sensibly, and doesn't
include the added cognitive load of makefile and shell escaping.
While changing this logic, take the opportunity to optimise the banner
space (and time on the serial port) by dropping trailing whitespace, which is
84 characters for current staging.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Sat, 7 Dec 2019 16:20:55 +0000 (16:20 +0000)]
xen/flask: Drop the gen-policy.py script
The script is Python 2 specific, and fails with string/binary issues with
Python 3:
Traceback (most recent call last):
File "gen-policy.py", line 14, in <module>
for char in sys.stdin.read():
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8c in position 0: invalid start byte
Fixing the script to be compatible isn't hard, but using python here is
wasteful. Drop the script entirely, and write an equivalent flask-policy.S
instead. This removes the need for a $(PYTHON) and $(CC) pass.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Julien Grall <julien@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monné [Tue, 10 Dec 2019 10:34:00 +0000 (11:34 +0100)]
x86: do not enable global pages when virtualized on AMD or Hygon hardware
When using global pages a full tlb flush can only be performed by
toggling the PGE bit in CR4, which is usually quite expensive in terms
of performance when running virtualized. This is especially relevant on
AMD or Hygon hardware, which doesn't have the ability to do selective
CR4 trapping, but can also be relevant on e.g. Intel if the underlying
hypervisor also traps accesses to the PGE CR4 bit.
In order to avoid this performance penalty, do not use global pages
when running virtualized on AMD or Hygon hardware. A command line option
'global-pages' is provided in order to allow the user to select whether
global pages will be enabled for PV guests.
The following figures are from a PV shim running on AMD hardware with
32 vCPUs:
Average lock time: 240829ns
Average block time: 1453029ns
As seen from the above figures the lock and block time of the flush
lock is reduced to approximately 1/3 of the original value.
Note that XEN_MINIMAL_CR4 and mmu_cr4_features are not modified, and
thus global pages are left enabled for the hypervisor. This is not an
issue because the code to switch the control registers (cr3 and cr4)
already takes into account such situation and performs the necessary
flushes. The same already happens when using XPTI or PCIDE, as the
guest cr4 doesn't have global pages enabled in that case either.
Also note that the suspend and resume code is correct in writing
mmu_cr4_features into cr4 on resume, since that's the cr4 used by the
idle vCPU which is the context used by the suspend and resume routine.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Igor Druzhinin [Tue, 10 Dec 2019 10:07:22 +0000 (11:07 +0100)]
x86/AMD: unbreak CPU hotplug on AMD systems without RstrFpErrPtrs
If the feature is not present Xen will try to force X86_BUG_FPU_PTRS
feature at CPU identification time. This is especially noticeable in
PV-shim that usually hotplugs its vCPUs. We either need to restrict this
action for boot CPU only or allow secondary CPUs to modify
forced CPU capabilities at runtime. Choose the former since modifying
forced capabilities out of boot path leaves the system in potentially
inconsistent state.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
There's no reason to allocate the dec{32,64}table on the stack; it
just wastes a bunch of instructions setting them up and, of course,
also consumes quite a bit of stack. Using size_t for such small
integers is a little excessive.
The now inlined lz4_uncompress functions used to have a stack
footprint of 176 bytes (according to -fstack-usage); their inlinees
have increased their stack use from 32 bytes to 48 and 80 bytes,
respectively.
Jan Beulich [Mon, 9 Dec 2019 13:01:25 +0000 (14:01 +0100)]
lz4: refine commit 9143a6c55ef7 for the 64-bit case
I clearly went too far there: While the LZ4_WILDCOPY() instances indeed
need prior guarding, LZ4_SECURECOPY() needs this only in the 32-bit case
(where it simply aliases LZ4_WILDCOPY()). "cpy" can validly point
(slightly) below "op" in these cases, due to
cpy = op + length - (STEPSIZE - 4);
where length can be as low as 0 and STEPSIZE is 8. However, instead of
removing the check via "#if !LZ4_ARCH64", refine it such that it would
also properly work in the 64-bit case, aborting decompression instead
of continuing on bogus input.
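The claim that "cpy" can validly point slightly below "op" is plain arithmetic on the quoted expression; a minimal sketch:

```c
#include <assert.h>

#define STEPSIZE 8   /* the 64-bit case */

/* Signed distance of "cpy" from "op" for a given length, per
 * cpy = op + length - (STEPSIZE - 4).  For length < 4 the result is
 * negative, i.e. cpy validly points (slightly) below op, which is why
 * the unconditional "cpy < op means bogus input" check was too strict. */
static int cpy_delta(int length)
{
    return length - (STEPSIZE - 4);
}
```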
Reported-by: Mark Pryor <pryorm09@gmail.com> Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Mark Pryor <pryorm09@gmail.com> Tested-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 11 Sep 2019 19:12:31 +0000 (20:12 +0100)]
docs/sphinx: License content with CC-BY-4.0
Creative Commons is a more common license than GPL for documentation purposes.
Switch to using CC-BY-4.0 to explicitly permit re-purposing and remixing of
the content.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Lars Kurth <lars.kurth@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Fri, 6 Dec 2019 10:02:09 +0000 (11:02 +0100)]
x86: don't offer Hyper-V option when "PV Shim Exclusive"
This only added dead code. Use "if" instead of "depends on" to make
(halfway) clear that other guest options should also go in the same
block. Move the option down such that the shim related options get
presented first, avoiding to ask the question when the answer may end
up being discarded.
While in the neighborhood also bring PV_SHIM_EXCLUSIVE into more
"canonical" shape.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 6 Dec 2019 10:01:18 +0000 (11:01 +0100)]
x86/nEPT: ditch nept_sp_entry()
It's bogusly non-static, it makes the call sites less easy to read,
and there is another open-coded use in the file - let's just get rid
of it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Mon, 25 Nov 2019 13:29:20 +0000 (13:29 +0000)]
x86/svm: Clean up intinfo_t variables
The type name is poor because the type is also used for the IDT vectoring
field, not just for the event injection field. Rename it to intinfo_t which
is how the APM refers to the data.
Rearrange the union to drop the .fields infix, and rename bytes to the more
common raw. Also take the opportunity to rename the fields in the VMCB to
increase legibility.
While adjusting all call sites, fix up style issues and make use of structure
assignments where applicable.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 25 Nov 2019 13:29:20 +0000 (13:29 +0000)]
x86/svm: Don't shadow variables in svm_vmexit_handler()
The local variable eventinj is set to the value of vmcb->exitintinfo which is
confusing considering that it isn't vmcb->eventinj. The variable isn't
necessary to begin with, so drop it to avoid confusion.
A local rc variable is shadowed in the CPUID, #DB and #BP handlers.
There is a mix of spelling of inst_len and insn_len, all of which are
logically the same value. Consolidate on insn_len which also matches the name
of the emulation functions for obtaining instruction lengths, and avoid
shadowing it in the CPUID and TASK_SWITCH handlers.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 25 Nov 2019 13:29:20 +0000 (13:29 +0000)]
x86/svm: Clean up construct_vmcb()
The vmcb is zeroed on allocate - drop all explicit writes of 0. Move
hvm_update_guest_efer() to co-locate it with the other control register
updates.
Move the BUILD_BUG_ON() into build_assertions(), and add some offset checks
for fields after the large blocks of reserved fields (as these are the most
likely to trigger from a mis-edit). Take the opportunity to fold 6 adjacent
res* fields into one.
Finally, drop all trailing whitespace in the file.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 3 Dec 2019 16:59:09 +0000 (16:59 +0000)]
x86/svm: Fix handling of EFLAGS.RF on task switch
VT-x updates RF before vmexit, so the eflags value written into the outgoing
TSS happens to be correct. SVM does not update RF before vmexit, and instead
provides it via a bit in exitinfo2.
In practice, needing RF set in the outgoing state occurs when a task gate is
used to handle faults.
Extend hvm_task_switch() with an extra_eflags parameter which gets fed into
the outgoing TSS, and fill it in suitably from the SVM vmexit information.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 25 Nov 2019 13:29:20 +0000 (13:29 +0000)]
x86/svm: Minor cleanup to start_svm()
The function is init, so can use boot_cpu_data directly.
There is no need to write 0 to svm_feature_flags in the case of a CPUID
mismatch (not least because this is dead code on real hardware), and no need
to use locked bit operations.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Igor Druzhinin [Thu, 5 Dec 2019 12:31:03 +0000 (13:31 +0100)]
passthrough: drop break statement following c/s cd7dedad820
The locking responsibilities have changed and a premature break in
this section now causes the following assertion:
Assertion '!preempt_count()' failed at preempt.c:36
Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 28 Nov 2019 11:28:51 +0000 (11:28 +0000)]
x86/svm: Correct vm_event API for descriptor accesses
c/s d0a699a389f1 "x86/monitor: add support for descriptor access events"
introduced logic looking for what appeared to be exitinfo (not that this
exists in SVM - exitinfo1 or 2 do), but actually passed the exit IDT vectoring
information. There is never any IDT vectoring involved in these intercepts so
the value passed is always zero.
In fact, SVM doesn't provide any information, even in exitinfo1 and 2. Drop
the svm struct entirely, and bump the interface version.
In the SVM vmexit handler itself, optimise the switch statement by observing
that there is a linear transformation between the SVM exit_reason and
VM_EVENT_DESC_* values. (Bloat-o-meter reports 6028 => 5877 for a saving of
151 bytes).
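"Linear transformation" here means the switch statement collapses into arithmetic when the exit codes are laid out consecutively. The sketch below uses an assumed base value and an assumed ordering (reads for IDTR, GDTR, LDTR, TR, then the corresponding writes); the real SVM exit codes and VM_EVENT_DESC_* values may differ, so treat every constant as hypothetical.

```c
#include <assert.h>

/* Hypothetical, consecutively numbered exit codes for descriptor-access
 * intercepts: four reads starting at the base, then the four writes. */
enum { EXIT_DESC_BASE = 0x66 };                     /* assumed base value */
enum { DESC_IDTR, DESC_GDTR, DESC_LDTR, DESC_TR };  /* assumed ordering */

/* Which descriptor was accessed: low two bits of the offset. */
static unsigned int desc_of(unsigned int exit_reason)
{
    return (exit_reason - EXIT_DESC_BASE) & 3;
}

/* Read vs write: the writes occupy the second group of four codes. */
static int desc_is_write(unsigned int exit_reason)
{
    return (exit_reason - EXIT_DESC_BASE) >= 4;
}
```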
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Adrian Pop <apop@bitdefender.com> Acked-by: Jan Beulich <jbeulich@suse.com>