Roger Pau Monne [Fri, 9 Nov 2018 17:09:20 +0000 (18:09 +0100)]
guest/pvh: special case the low 1MB
When running as a PVH guest Xen only special cases the trampoline
code in the low 1MB, without also reserving the space used by the
relocated metadata or the trampoline stack.
Fix this by always reserving the low 1MB regardless of whether Xen is
running as a guest or natively.
Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monne [Fri, 9 Nov 2018 17:09:19 +0000 (18:09 +0100)]
guest/pvh: fix handling of multiboot info and module list
When booting Xen as a PVH guest the data in the PVH start info
structure is copied over to a multiboot structure and a module list
array that resides in the .init section of the Xen image. The
resulting multiboot structures are then handled to the generic boot
process using their physical address.
This works fine as long as the Xen image doesn't relocate itself, if
there's such a relocation the physical addresses of the multiboot
structure and the module array are no longer valid.
Fix this by handling the virtual address of the multiboot structure
and module array to the generic boot process instead of it's physical
address.
Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com>
x86/hvm: clean up the rest of bool_t from vm_event
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 9 Nov 2018 12:05:28 +0000 (13:05 +0100)]
pass-through: adjust pIRQ migration
For one it is quite pointless to iterate over all pIRQ-s the domain has
when just one is being adjusted. Introduce hvm_migrate_pirq() as an
externally accessible function.
Additionally it is bogus to migrate the pIRQ to a vCPU different from
the one the event is supposed to be posted to - if anything, it might be
worth considering not to migrate the pIRQ at all in the posting case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Thu, 1 Nov 2018 10:16:58 +0000 (10:16 +0000)]
xen/grant_table: Remove stale comment on top of map_grant_ref
Remove the 2 part comment on top of map_grant_ref:
- The first part mention the return value which has been void since
2006!
- The second part mention a local variable 'addr' which does not
exist anymore.
Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 9 Nov 2018 10:42:10 +0000 (11:42 +0100)]
cpufreq: convert to a single post-init driver (hooks) instance
This reduces the post-init memory footprint, eliminates a pointless
level of indirection at the use sites, and allows for subsequent
alternatives call patching.
Take the opportunity and also add a name to the PowerNow! instance.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Xin Li [Fri, 9 Nov 2018 10:41:30 +0000 (11:41 +0100)]
xsm: remove printing from set_to_dummy_if_null()
Filling dummy module's hook to null value of xsm_operations structure
will generate debug message. This becomes boot time spew for module
like silo, which only sets a few hooks of itself. So remove the printing
to avoid boot time spew.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Xin Li <xin.li@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Paul Durrant [Fri, 9 Nov 2018 10:40:12 +0000 (11:40 +0100)]
viridian: introduce struct viridian_page
The 'vp_assist' page is currently an example of a guest page which needs to
be kept mapped throughout the life-time of a guest, but there are other
such examples in the specifiction [1]. This patch therefore introduces a
generic 'viridian_page' type and converts the current vp_assist/apic_assist
related code to use it. Subsequent patches implementing other enlightments
can then also make use of it.
This patch also renames the 'vp_assist_pending' field in struct
hvm_viridian_vcpu_context to 'apic_assist_pending' to more accurately
reflect its meaning. The term 'vp_assist' applies to the whole page rather
than just the EOI-avoidance enlightenment. New versons of the specification
have defined data structures for other enlightenments within the same page.
Paul Durrant [Fri, 9 Nov 2018 10:39:27 +0000 (11:39 +0100)]
viridian: define type for the 'virtual VP assist page'
The specification [1] defines a type so we should use it, rather than just
OR-ing and AND-ing magic bits.
No functional change.
NOTE: The type defined in the specification does include an anonymous
sub-struct in the page type but, as we currently use only the first
element, the struct declaration has been omitted.
Paul Durrant [Fri, 9 Nov 2018 10:38:03 +0000 (11:38 +0100)]
viridian: separate time related enlightenment implementations...
...into new 'time' module.
This patch reduces the size of the main viridian source module by
moving time related enlightenments into their own source module. This is
done in anticipation of implementation of more such enightenments and
a desire to not further lengthen the main source module when this work
is done.
While moving the code:
- Move the declaration of HV_REFERENCE_TSC_PAGE from the header file into
the new source module, since it is only used there.
- Clean up a bool_t.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Paul Durrant [Fri, 9 Nov 2018 10:36:52 +0000 (11:36 +0100)]
viridian: separate interrupt related enlightenment implementations...
...into new 'synic' module.
The SynIC (synthetic interrupt controller) is specified [1] to be a super-
set of a virtualized LAPIC, and its definition encompasses all
enlightenments related to virtual interrupt control.
This patch reduces the size of the main viridian source module by giving
these enlightenments their own module. This is done in anticipation of
implementation of more such enlightenments and a desire not to further
lengthen then main source module when this work is done.
Whilst moving the code:
- Fix various style issues.
- Move the MSR definitions into the header (since they are now needed in
more than one source module).
Roger Pau Monne [Thu, 8 Nov 2018 14:23:58 +0000 (15:23 +0100)]
amd/pvh: enable ACPI C1E disable quirk on PVH Dom0
PV Dom0 has a quirk for some AMD processors, where enabling ACPI can
also enable C1E mode. Apply the same workaround as done on PV for a
PVH Dom0, which consist on trapping accesses to the SMI command IO
port and disabling C1E if ACPI is enabled.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
This patch adds a couple of regs to the vm_event that are used by
the introspection. The base, limit and ar
bits are compressed into a uint64_t union so as not to enlarge the
vm_event.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Jan Beulich [Thu, 8 Nov 2018 14:59:14 +0000 (15:59 +0100)]
x86/genapic: remove indirection from genapic hook accesses
Instead of loading a pointer at each use site, have a single runtime
instance of struct genapic, copying into it from the individual
instances. The individual instances can this way also be moved to .init
(also adjust apic_probe[] at this occasion).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 7 Nov 2018 08:35:14 +0000 (09:35 +0100)]
p2m: move p2m-common.h inclusion point
The header is (hence its name) supposed to be a helper for the per-arch
p2m.h files. It was never supposed to be included directly, and for the
purpose of putting common function declarations into the common header
it is more helpful if things like p2m_t are already available at the
inclusion point.
This also undoes parts of 02ede7dc03 ("memory: add
check_get_page_from_gfn() as a wrapper..."), which had been there just
because of the unhelpful original way of including p2m-common.h.
Take the opportunity and also ditch a duplicate public/memory.h from the
ARM header.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Sergey Dyasli [Wed, 7 Nov 2018 08:34:17 +0000 (09:34 +0100)]
mm/page_alloc: make bootscrub happen in idle-loop
Scrubbing RAM during boot may take a long time on machines with lots
of RAM. Add 'idle' option to bootscrub which marks all pages dirty
initially so they will eventually be scrubbed in idle-loop on every
online CPU.
It's guaranteed that the allocator will return scrubbed pages by doing
eager scrubbing during allocation (unless MEMF_no_scrub was provided).
Use the new 'idle' option as the default one.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 7 Nov 2018 08:33:24 +0000 (09:33 +0100)]
x86: work around HLE host lockup erratum
XACQUIRE prefixed accesses to the 4Mb range of memory starting at 1Gb
are liable to lock up the processor. Disallow use of this memory range.
Unfortunately the available Core Gen7 and Gen8 spec updates are pretty
old, so I can only guess that they're similarly affected when Core Gen6
is and the Xeon counterparts are, too.
This is part of XSA-282.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Sergey Dyasli [Thu, 21 Jun 2018 14:35:50 +0000 (16:35 +0200)]
x86/domctl: Implement XEN_DOMCTL_get_cpu_policy
This finally (after literally years of work!) marks the point where the
toolstack can ask the hypervisor for the current CPUID configuration of a
specific domain.
Introduce a new flask access vector and update the default policies.
Also extend xen-cpuid's --policy mode to be able to take a domid and dump a
specific domains CPUID and MSR policy.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 2 Jul 2018 16:05:33 +0000 (16:05 +0000)]
x86: Introduce struct cpu_policy to refer to a group of individual policies
This is prep work for the following patch - please refer to it as well.
When auditing and manipulating policies, it is necessary to do so with a
complete set of policies, due to the interdependences of the contents. A
containing structure like this will allow for clearer APIs and code.
As a first user, this structure is convenient for the mapping used by
XEN_SYSCTL_get_cpu_policy (implemented in the next patch), and for auditing
(later when XEN_DOMCTL_set_cpu_policy is implemented).
At this point, the distinction between *_max and *_default is introduced into
the ABI. For now, *_default is mapped to *_max, but future development work
will result in *_default being a logical subset of *_max.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Thu, 21 Jun 2018 14:35:50 +0000 (16:35 +0200)]
libx86: Introduce a helper to serialise msr_policy objects
As with CPUID, an architectural form is used for representing the MSR data.
It is expected not to change moving forwards, but does have a 32 bit field
(currently reserved) which can be used compatibly if needs be.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 21 Jun 2018 14:35:49 +0000 (16:35 +0200)]
libx86: Introduce a helper to serialise cpuid_policy objects
The serialised form is made up of the leaf, subleaf and data tuple. As this
is the architectural form, it is expected not to change going forwards.
The serialisation of the Xen/Viridian leaves isn't fully implemented yet. It
is just enough to be bug-compatible with the current DOMCTL_set_cpuid
behaviour, but needs further hypervisor work before the toolstack can sensibly
control these values.
x86_cpuid_copy_to_buffer() is implemented using Xen's regular copy_to_guest
primitives, with an API-compatible memcpy() is used for the libxc half of the
build.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Tue, 6 Nov 2018 15:41:25 +0000 (15:41 +0000)]
tools/dm_depriv: Add first cut RLIMITs
Limit the ability of a potentially compromised QEMU to consume system
resources. Key limits:
- RLIMIT_FSIZE (file size): 256KiB
- RLIMIT_NPROC (after uid changes to a unique uid)
NB that we do not yet set RLIMIT_AS (total virtual memory) or
RLIMIT_NOFILES (number of open files), since these require more care
and/or more coordination with QEMU to implement.
Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Put global headers before local headers (sugg by Paul)
- Move #undif inside the braces (sugg by Paul)
Changes since v3:
- Align RLIMIT_ENTRY list for easier reading
- Fix wrong format string specifier
- Get rid of some trailing whitespace
Changes since v2:
- Use a macro to define rlimit entries
- Use RLIMIT_NLIMITS as an end-of-list marker, rather than -1
- Various style clean-ups
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Anthony Perard <anthony.perard@citrix.com>
George Dunlap [Tue, 6 Nov 2018 15:41:24 +0000 (15:41 +0000)]
tools/dm_restrict: Unshare mount and IPC namespaces on Linux
QEMU running under Xen doesn't need mount or IPC functionality.
Create and enter separate namespaces for each of these before
executing QEMU, so that in the event that other restrictions fail, the
process won't be able to even name system mount points or exsting
non-file-based IPC descriptors to attempt to attack them.
Unsharing is something a process can only do to itself (it would
seem); so add an os-specific "dm_preexec_restrict()" hook just before
we exec() the device model.
Also add checks to depriv-process-checker.sh to verify that dm is
running in a new namespace (or at least, a different one than the
caller).
Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Fix function prototype for netbsd code
Changes since v3:
- Fix some more style issues
Changes since v2:
- Return an error rather than calling exit()
- Use LOGE() and print to the current stderr fd, rather than
printing to the new stderr fd via write()
- Use r for external return values rather than rc.
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Anthony Perard <anthony.perard@citrix.com>
George Dunlap [Tue, 6 Nov 2018 15:41:23 +0000 (15:41 +0000)]
tools/dm_restrict: Ask QEMU to chroot
When dm_restrict is enabled, ask QEMU to chroot into an empty directory.
* Create $XEN_RUN_DIR/qemu-root-<domid> (deleting the old one if it's there)
* Pass the -chroot option to QEMU
Rather than running `rm -rf` on the directory before creating it
(since there is no library function to do this), simply rmdir the
directory, relying on the fact that the previous QEMU instance, if
properly restricted, shouldn't have been able to write anything
anyway.
Suggested-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Minor change to comment
- Update stale directory name in commit message
Changes since v2:
- Style fixes
- Testing moved to a different patch
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Anthony Perard <anthony.perard@citrix.com>
George Dunlap [Tue, 6 Nov 2018 15:41:22 +0000 (15:41 +0000)]
SUPPORT.md: Add qemu-depriv section
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Fix some grammar (s/attack/attacking/;)
Changes since v3:
- Moved from the qemu-depriv doc patches.
- Reword to include the possibility of having a non-dom0 "devicemodel"
domain which may want to be protected
- Specify `Linux dom0` as the currently-tech-supported window
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Andrew Cooper <andrew.cooper3@citrix.com> CC: Jan Beulich <jbeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Konrad Wilk <konrad.wilk@oracle.com> CC: Stefano Stabellini <sstabellini@kernel.org> CC: Julien Grall <julien.grall@arm.com> CC: Anthony Perard <anthony.perard@citrix.com> CC: Ross Lagerwall <ross.lagerwall@citrix.com>
George Dunlap [Tue, 6 Nov 2018 15:41:22 +0000 (15:41 +0000)]
docs/qemu-deprivilege: Revise and update with status and future plans
docs/qemu-deprivilege.txt had some basic instructions for using
dm_restrict, but it was incomplete, misleading, and stale.
Update the docs in a number of ways.
First, separate user-facing documentation and technical description
into docs/features and docs/design, respectively.
In the feature doc:
* Introduce a section mentioning minimim versions of Linux, Xen, and
qemu required (TBD)
* Fix the discussion of qemu userid. Mention xen-qemuuser-range-base,
and provide example shell code that actually has some hope of working
(instead of failing out after creating 900 userids).
* Describe how to enable restrictions, as well as features which
probably don't or definitely don't work.
In the design doc, introduce a "Technical Details" section which
describes specifically what restrictions are currently done, and also
what restrictions we are looking at doing in the future.
The idea here is that as we implement the various items for the
future, we move them from "Restrictions still to do" to "Restrictions
done". This can also act as a design document -- a place for public
discussion of what can or should be done and how.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v4:
- Remove unnecessary FIXME
- Remove stale "Add SUPPORT.md"
Changes since v3:
- Fix typo (32->16)
- Use an example value not close to the `nobody` uids, but still a
multiple of 2^16.
- Mention that using a multiple of 2^16 may have advantages.
- Have the example create a group as well
- Reorganize two comments on the "range-base" method for clarity
Changes since v2:
- Extraneous privcmd / evtchn instances aren't closed
- Expand description of how to test fd deprivileging
- Rework and clarify two namespace sections, give reference for QEMU NAK
- Add more information about migration technical challenges
- In UID section, mention possibility of container ID collisions.
- Fix name of design document.
- Add SUPPORT.md statement. Specify Linux, to make sure that FreeBSD is
evaluated separately.
- Mention that `-sandbox` is a blacklist and why
Changes since v1:
- Break into two, and move into appropriate directories (rather than 'misc')
- Updated version requirements
- Distinguish between features which "don't yet work" and features which we never expect to work
- Update description of xen-restrict functionality
- Reorder and expand further restrictions
- Make it more clear which restrictions are available on Linux only
- Include detailed description of how to kill a process
- Add RLIMIT_NPROC as something we can do without further changes to qemu
- Document the need to check for the sandbox feature before using it
Thank you to Ross Lagerwall, whose description of what XenServer is
doing formed much of the basis for the text here.
CC: Ian Jackson <ian.jackson@citrix.com> CC: Wei Liu <wei.liu2@citrix.com> CC: Andrew Cooper <andrew.cooper3@citrix.com> CC: Jan Beulich <jbeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Konrad Wilk <konrad.wilk@oracle.com> CC: Stefano Stabellini <sstabellini@kernel.org> CC: Julien Grall <julien.grall@arm.com> CC: Anthony Perard <anthony.perard@citrix.com> CC: Ross Lagerwall <ross.lagerwall@citrix.com>
Ian Jackson [Mon, 5 Nov 2018 18:40:49 +0000 (18:40 +0000)]
tools: ipxe: Correct download error handling
This shell fragment lacked set -e. So, eg if the download failed a
broken ipxe.tar.gz would be left behind.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Tested-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Mon, 5 Nov 2018 18:37:05 +0000 (18:37 +0000)]
tools: Once again honour, but no longer advertise GIT_HTTP env var
In "build: add autoconf to replace custom checks in tools/check"
--enable-githttp was introduced. But we missed this comment where it
was advertised.
Also, that commit had the effect of uncondtionally setting GIT_HTTP
from the configure variable. But the env var has been advertised in
some places as the way to specify this behaviour, and overriding it is
just unfriendly.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> CC: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Dario Faggioli [Fri, 19 Oct 2018 15:54:41 +0000 (17:54 +0200)]
tools: libxl/xl: run NUMA placement even when an hard-affinity is set
Right now, if either an hard or soft-affinity are explicitly specified
in a domain's config file, automatic NUMA placement is skipped. However,
automatic NUMA placement affects only the soft-affinity of the domain
which is being created.
Therefore, it is ok to let it run if an hard-affinity is specified. The
semantics will be that the best placement candidate would be found,
respecting the specified hard-affinity, i.e., using only the nodes that
contain the pcpus in the hard-affinity mask.
This is particularly helpful if global xl pinning masks are defined, as
made possible by commit aa67b97ed34279c43 ("xl.conf: Add global affinity
masks"). In fact, without this commit, defining a global affinity mask
would also mean disabling automatic placement, but that does not
necessarily have to be the case (especially in large systems).
Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Fri, 26 Oct 2018 13:13:44 +0000 (15:13 +0200)]
Release: add release note link to SUPPORT.md
In order to have a link to the release notes in the feature list
generated from SUPPORT.md add that link in the "Release Support"
section of that file.
The real link needs to be adapted when the version is being released.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 2 Nov 2018 17:46:38 +0000 (17:46 +0000)]
x86/vcpu: Remove struct vcpu allocation restriction when possible
There is no need for struct vcpu to live below the 4G boundary for PV guests,
or for HVM vcpus using HAP.
Plumb struct domain into alloc_vcpu_struct() so the x86 version can query the
domain's type and paging settings.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
Wei Liu [Fri, 2 Nov 2018 15:55:42 +0000 (15:55 +0000)]
x86: rearrange x86_64/entry.S
Split the file into two halves. The first half pertains to PV guest
code while the second half is mostly used by the hypervisor itself to
handle interrupts and exceptions.
No functional change intended.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Fri, 2 Nov 2018 15:55:39 +0000 (15:55 +0000)]
x86: make traps.c build with !CONFIG_PV
Provide a stub for pv_inject_event. Put code that accesses PV fields
and GDT / LDT fault handling code under CONFIG_PV. Move set_debugreg
to pv/misc-hypercalls.c.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Nov 2018 10:13:59 +0000 (11:13 +0100)]
x86emul: VME and PVI modes require a #GP(0) check first thing
As explicitly spelled out by the SDM, EFLAGS.VIF and EFLAGS.VIP both set
at the start of an instruction trigger #GP(0) independent of actual
instruction.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Nov 2018 10:13:09 +0000 (11:13 +0100)]
x86: deal with firmware setting bogus TSC_ADJUST values
The system Intel have handed me for AVX512 emulator work ("Gigabyte
Technology Co., Ltd. X299 AORUS Gaming 3 Pro/X299 AORUS Gaming 3
Pro-CF, BIOS F3 12/28/2017") would not come up under Xen - it hung in
the middle of Dom0 PCI initialization. As it turned out, Xen's time
management did not work because of the firmware setting (only) the boot
CPU's TSC_ADJUST MSR to a large negative value (on the order of -2^50).
Follow Linux (also shamelessly stealing their comments) in
- clearing the register for the boot CPU (we don't have a need for
exceptions here yet, as the only exception in Linux is a class of
systems Xen doesn't work on anyway as far as I'm aware),
- forcing non-negative values uniformly (commit 855615eee9 ["x86/tsc:
Remove the TSC_ADJUST clamp"] dropped this, but without this my
Haswell box won't boot anymore),
- syncing the registers within sockets.
Linux, prior to aforementioned commit, capped at 0x7fffffff as well, but as the
description there says this issue has been addressed with a microcode
update. Hence until someone runs into such a system without being able
to update its microcode, I think we should leave out that specific part.
In order to avoid making init_percpu_time() depend on running _before_
set_cpu_sibling_map() (and hence the booting CPU _not_ being accounted
in socket_cpumask[] yet), move that call slightly earlier in
start_secondary().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Nov 2018 10:12:39 +0000 (11:12 +0100)]
x86/TSC: don't allow deadline timer to be used with unfixed errata
In preparation of writes to the TSC_ADJUST MSR, avoid the bad
interaction of writes to it and the TSC_DEADLINE one. Presumably the
original Linux commit bd9240a18e ("x86/apic: Add TSC_DEADLINE quirk due
to errata") refers to e.g. KBW092. (Of course this is an issue also
without us writing the TSC_ADJUST MSR, if instead firmware did already.
The errata checking can't be put in init_apic_mappings() as Linux does,
as that runs before we update microcode on the boot CPU. It needs to
happen before consumers of tdt_enabled, i.e.
- __setup_APIC_LVTT() <- setup_APIC_timer() <- setup_boot_APIC_clock()
- <- calibrate_APIC_clock() <- setup_boot_APIC_clock()
- setup_boot_APIC_clock()
setup_boot_APIC_clock() gets called from smp_prepare_cpus(), which sits
after microcode loading (note that calibrate_APIC_clock() gets called
before setting tdt_enabled).
Also add an MFENCE as per Linux commit 5d7c631d92 ("x86/apic: Serialize
LVTT and TSC_DEADLINE writes"), but I see no reason to put a conditional
around it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Mon, 5 Nov 2018 10:11:39 +0000 (11:11 +0100)]
viridian: remove duplicate union types
The 'viridian_vp_assist', 'viridian_hypercall_gpa' and
'viridian_reference_tsc' union types are identical in layout. The layout
is also common throughout the specification [1].
This patch declares a common 'viridian_page_msr' type and converts the rest
of the code to use that type for both the hypercall and VP assist pages.
Also, rename 'viridian_guest_os_id' to 'viridian_guest_os_id_msr' since it
also is a union representing an MSR value.
Paul Durrant [Mon, 5 Nov 2018 10:10:55 +0000 (11:10 +0100)]
viridian: remove comments referencing section number in the spec
Microsoft has a habit of re-numbering sections in the spec. so avoid
referring to section numbers in comments. Also remove the URL for the
spec. from the boilerplate... Again, Microsoft has a habit of changing
these too.
This patch also cleans up some > 80 character lines.
Purely cosmetic. No functional change.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Roger Pau Monne <roger.pau@citrix.com>
Wei Liu [Fri, 2 Nov 2018 12:34:12 +0000 (12:34 +0000)]
libxl/arm: fix guest type conversion
Commit 359970fd8b ("tools/libxl: Switch Arm guest type to PVH") missed
changing the type field in c_info. This issue didn't surface until ef72c93df9 which made creating PV guest on Arm unusable.
Create libxl__arch_domain_create_info_setdefault and switch the type
there.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
The may_defer var was left with the older bool_t type. This patch
changes the type to bool.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Fri, 2 Nov 2018 11:15:33 +0000 (12:15 +0100)]
VMX: fix vmx_handle_eoi()
In commit 303066fdb1e ("VMX: fix interaction of APIC-V and Viridian
emulation") I screwed up: Instead of clearing SVI, other ISR bits
should be taken into account.
Introduce a new helper set_svi(), split out of vmx_process_isr(), and
use it also from vmx_handle_eoi().
Following the problems in vmx_intr_assist() (see the still present big
block of debugging code there) also warn (once) if EOI'd vector and
original SVI don't match.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Commit 81946a73dc975a7dafe9017a8e61d1e64fdbedbf removed
Xenctrl.with_intf based on its undesirable behaviour of opening and
closing a Xenctrl connection with every invocation. This commit
re-introduces with_intf but with an updated behaviour: it maintains a
global Xenctrl connection which is opened upon first usage and kept
open. This handle can be obtained by clients using new functions
get_handle() and close_handle().
The main motivation of re-introducing with_intf is that otherwise
clients will have to implement this functionality individually.
Signed-off-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
tools/misc/xenpm: fix getting info when some CPUs are offline
Use physinfo.max_cpu_id instead of physinfo.nr_cpus to get max CPU id.
This fixes for example 'xenpm get-cpufreq-para' with smt=off, which
otherwise would miss half of the cores.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Since udev is no longer used to call hotplug scripts (neither in dom0
nor driver domain), this scripts is no longer referenced anywhere. libxl
(xl devd or else) has own cleanup code.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Wed, 31 Oct 2018 16:57:19 +0000 (17:57 +0100)]
use consistent values when consuming runtime-changeable parameters
There's no guarantee that e.g. a switch() control expression's memory
operand(s) get(s) read just once. Guard against the compiler producing
"unexpected" code by sprinkling around some ACCESS_ONCE().
I'm leaving alone opt_conswitch[]: It gets accessed in quite a few
places anyway, and an intermediate change won't have any severe effect
afaict.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen/arm: make platform specific code dependent on ALL32_PLAT or ALL64_PLAT
Compile platform code that doesn't have its own specific kconfig option
only if ALL32_PLAT or ALL64_PLAT depending on the architecture. The
benefit is that choosing one of the platforms available as a menu
option allows the user not to build other unnecessary platform code.
Andrew Cooper [Tue, 30 Oct 2018 11:17:00 +0000 (11:17 +0000)]
x86/pv: Fix crash when using `xl set-parameter pcid=...`
"pcid=" is registered as a runtime parameter, which means that parse_pcid()
must not reside in .init, or the following happens when parse_params() tries
to call an unmapped function pointer.
Andrew Cooper [Thu, 25 Oct 2018 13:11:58 +0000 (14:11 +0100)]
x86/vvmx: Don't handle unknown nested vmexit reasons at L0
This is very dangerous from a security point of view, because a missing entry
will cause L2's action to be interpreted as L1's action.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 25 Oct 2018 14:17:50 +0000 (15:17 +0100)]
x86/vvmx: Drop the now-obsolete vmx_inst_check_privilege()
Now that nvmx_handle_vmx_insn() performs all VT-x instruction checks, there is
no need for redundant checking in vmx_inst_check_privilege(). Remove it, and
take out the vmxon_check boolean which was plumbed through decode_vmx_inst().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 25 Oct 2018 13:40:11 +0000 (14:40 +0100)]
x86/vvmx: Unconditionally initialise vmxon_region_pa during vcpu construction
This is a stopgap solution until the toolstack side of initialisation can be
sorted out, but it does result in the nvmx_vcpu_in_vmx() predicate working
correctly even when nested virt hasn't been enabled for the domain.
Update nvmx_handle_vmx_insn() to include the in-vmx mode check (for all
instructions other than VMXON) to complete the set of #UD checks.
In addition, sanity check that the nested vmexit handler has worked correctly,
and that we are only providing emulation of the VT-x instructions to L1
guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 25 Oct 2018 13:08:33 +0000 (14:08 +0100)]
x86/vvmx: Let L1 handle all the unconditional vmexit instructions
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 28 May 2018 14:22:49 +0000 (15:22 +0100)]
x86: Reorganise and rename debug register fields in struct vcpu
Reusing debugreg[5] for the PV emulated IO breakpoint information is confusing
to read. Instead, introduce a dr7_emul field in pv_vcpu for the purpose.
With the PV emulation out of the way, debugreg[4,5] are entirely unused and
don't need to be stored.
Rename debugreg[0..3] to dr[0..3] to reduce code volume, but keep them as an
array because their behaviour is identical and this helps simplfy some of the
PV handling. Introduce dr6 and dr7 fields to replace debugreg[6,7] which
removes the storage for debugreg[4,5].
In arch_get_info_guest(), handle the merging of emulated dr7 state alongside
all other dr handling, rather than much later.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Mon, 29 Oct 2018 11:29:54 +0000 (11:29 +0000)]
x86/domain: Fix build with GCC 4.3.x
GCC 4.3.x can't initialise the user_regs structure like this.
Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
arm,smmu: backport "Disable stalling faults for all endpoints"
Backport commit 3714ce1d6655098ee69ede632883e5874d67e4ab
"iommu/arm-smmu: Disable stalling faults for all endpoints" from the
Linux kernel. This works-around Erratum #842869.
Original commit message:
Enabling stalling faults can result in hardware deadlock on poorly
designed systems, particularly those with a PCI root complex upstream of
the SMMU.
Although it's not really Linux's job to save hardware integrators from
their own misfortune, it *is* our job to stop userspace (e.g. VFIO
clients) from hosing the system for everybody else, even if they might
already be required to have elevated privileges.
Given that the fault handling code currently executes entirely in IRQ
context, there is nothing that can sensibly be done to recover from
things like page faults anyway, so let's rip this code out for now and
avoid the potential for deadlock.
Cc: <stable@vger.kernel.org> Fixes: 48ec83bcbcf5 ("iommu/arm-smmu: Add initial driver support for ARM SMMUv3 devices") Reported-by: Matt Evans <matt.evans@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Acked-by: Julien Grall <julien.grall@arm.com>
George Dunlap [Mon, 29 Oct 2018 14:51:51 +0000 (14:51 +0000)]
Make credit2 the default scheduler
Credit2 was declared "supported" in 4.8, and as of 4.10 had two other
critical features implemented (soft affinity / NUMA and caps).
Why change the default?
The code is better: more predictable, less jitter, easier to determine
how modifications will affect overall behavior, easier in the future
to make load-balancing behavior more subtle (e.g., taking into account
the cost of powering up extra cores, &c).
Overall performance compared to Credit1 is somewhat of a mixed bag.
Unfortunately most of what I have are tests using XenServer's internal
perf testing system, so I can't share the raw data (via links anyway).
Here is a summary of data from an internal e-mail Dario sent in the
past:
* DVDbench: On underloaded systems, credit2 outperformed credit1 by
about 4%. On overloaded systems, credit2 underperformed by about 3%.
* On a range of tests (unixbench, lmbench, &c), credit and credit2
perform within 5% of each other (up and down).
* Credit2 fairly consistently beats credit for TCP-style workloads.
* Credit2 is sometimes equal to, sometimes 5-15% worse than, credit for
synthetic CPU workloads (e.g., Dhrystone).
* On LoginVSI, credit2 fairly consistently outperforms credit by about 10%.
Credit2, like credit, has a number of workloads / setups for which
performance could be improved. Personally I think networking and
partially-loaded systems is going to be more representative of what
Xen is actually used for; so I think credit2 is on the whole the
better scheduler to use by default. And in any case, making those
improvements on credit2 will be easier than on credit.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Dario Faggioli <dfaggioli@suse.com>
Jan Beulich [Mon, 29 Oct 2018 12:40:56 +0000 (13:40 +0100)]
x86emul: generalize vector length handling for AVX512/EVEX
To allow for some code sharing where possible, copy VEX.L into EVEX.LR
even for VEX (or XOP) encoded insns. Make operand size determination
use this right away, at the same time adding consistency checks for the
EVEX scalar insn cases (the non-scalar ones aren't uniform enough for
the checking to be done in a central place like this).
Note that the broadcast case is not handled here, but will be taken care
of elsewhere (in just a single place rather than at least two).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Fri, 19 Oct 2018 14:28:28 +0000 (15:28 +0100)]
x86: put some code in arch_set_info_guest under CONFIG_PV
This function is called by both PV and HVM. Unfortunately the code is
very convoluted. We can reason that code between the call to
hvm_set_info_guest and out label is PV only. Put that portion under
CONFIG_PV.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Fri, 19 Oct 2018 14:28:27 +0000 (15:28 +0100)]
x86: make mm.c build with !CONFIG_PV
Start by putting hypercall handlers which are supposed to be PV only
under CONFIG_PV. Shuffle some code around to avoid introducing
excessive numbers of CONFIG_PV.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 26 Oct 2018 15:50:01 +0000 (17:50 +0200)]
x86emul: correct EVEX decoding
Fix an inverted pair of checks, drop an incorrect instance of #UD
raising for non-64-bit mode, and add further generic checks.
Note: Despite what SDM Vol 2 rev 067 states, EVEX.V' is _not_ ignored
outside of 64-bit mode when the field does not encode a register.
Just like EVEX.VVVV is required to be 0b1111 in that case, EVEX.V'
is required to be 1 there.
Also rename the bcst field to br, as #UD generation for individual insns
will need to consider both of its possible meanings.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 26 Oct 2018 13:21:20 +0000 (15:21 +0200)]
x86emul/test: introduce eq()
In preparation for sensible to-boolean conversion on AVX512, wrap
another abstraction function around the present to_bool(<x> == <y>), to
get rid of the open-coded == (which will get in the way of using
built-in functions instead). For the future AVX512 use scalar operands
can't be used then anymore: Use (vec_t){} when the operand is zero,
and broadcast (if available) otherwise (assume pre-AVX512 when broadcast
is not available, in which case a plain scalar is still fine).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 26 Oct 2018 13:20:37 +0000 (15:20 +0200)]
x86emul: support AVX512 opmask insns
These are all VEX encoded, so the EVEX decoding logic continues to
remain unused at this point.
The new testcase is deliberately coded in assembly, as a C one would
have become almost unreadable due to the overwhelming amount of
__builtin_...() that would need to be used. After all the compiler has
no underlying type (yet) that could be operated on without builtins,
other than the vector types used for "normal" SIMD insns.
Note that outside of 64-bit mode and despite the SDM not currently
saying so, VEX.W is ignored for the KMOV{D,Q} encodings to/from GPRs,
just like e.g. for the similar VMOV{D,Q}.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 26 Oct 2018 13:18:52 +0000 (15:18 +0200)]
x86: restrict HVMOP_pagetable_dying to current
This is not used (and probably was never meant to be) by the tool stack.
Limiting it to the current domain in particular allows to eliminate a
bogus use of vCPU 0 in pagetable_dying().
Remove the now unnecessary domain/vCPU parameters from the wrapper/hook
functions at the same time.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 26 Oct 2018 13:16:23 +0000 (15:16 +0200)]
x86: don't build guest-walk code without HVM and SHADOW_PAGING
It's dead code in that case.
We could go further, as we don't really need the 2- and 3-level walk
code in PV mode, but to drop their compilation requires quite a bit of
disentangling of shadow mode code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>