Move p2m_{get/set}_suppress_ve() to p2m.c, replace incorrect
ASSERT() in p2m-pt.c (since a guest can run in shadow mode even on
a system with virt exceptions, which would trigger the ASSERT()),
move the VMX-isms (cpu_has_vmx_virt_exceptions checks) to
p2m_ept_{get/set}_entry(), and fix locking code in
p2m_get_suppress_ve().
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
fuzz, test x86_emulator: disable sse before including always_inline fns
Workaround for compiler rejection of SSE-using always_inlines defined before
SSE is disabled.
Compiling with _FORTIFY_SOURCE or higher levels of optimization enabled
will always_inline several library fns (memset, memcpy, ...)
(with gcc 8.2.0 and glibc 2.28).
In fuzz and x86_emulator test, the compiler is instructed not
to generate SSE instructions via: #pragma GCC target("no-sse")
because those registers are needed for use by the workload.
The combination above causes compilation failure as the inline functions
use those instructions. This is resolved by reordering the inclusion of
<stdio.h> and <string.h> to after the pragma disabling SSE generation.
It would be preferable to locate the no-sse pragma within x86-emulate.h at the
top of the file, prior to including any other headers; unfortunately doing so
before <stdlib.h> causes compilation failure due to declaration of 'atof' with:
"SSE register return with SSE disabled".
Fortunately there is no (known) current dependency on any always_inline
SSE-inclined function declared in <stdlib.h> or any of its dependencies, so the
pragma is therefore issued immediately after inclusion of <stdlib.h> with a
comment introduced to explain its location there.
Add compile-time checks for unwanted prior inclusion of <string.h> and
<stdio.h>, which are the two headers that provide the library functions that
are handled with wrappers and listed within "x86-emulate.h" as ones "we think
might access any of the FPU state".
* Use standard-defined "EOF" macro to detect prior <stdio.h> inclusion.
* Use "_STRING_H" (non-standardized guard macro) as best-effort
for detection of prior <string.h> inclusion. This is non-universally
viable but will provide error output on common GLIBC systems, so
provides some defensive coverage.
Adds conditional #include <stdio.h> to x86-emulate.h because fwrite, printf,
etc. are referenced when WRAP has been defined.
Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 17 Sep 2018 14:49:14 +0000 (15:49 +0100)]
xen: Disallow variable length arrays
Variable length arrays result in excess stack utilisation, with a risk
of stack overflow if the length is too large. They also result in fairly
poor asm generation, because a divide is required as part of the space
calculation.
Xen no longer has any variable length arrays, so take the opportunity to
formally disallow them.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 17 Sep 2018 15:32:32 +0000 (16:32 +0100)]
x86/hvm: Adjust hvmemul_rep_stos() to compile with -Wvla
When using -Wvla, the typecast of buf triggers a Variable Length Array
warning. This is less than ideal, as this typecast doesn't occupy any stack
space, but we don't have a finer grain option to use.
Alter the asm expression to avoid the typecast, which necessitates the
introduction of a memory clobber as the compiler can no longer identify
the total quantity of written memory.
Despite the memory clobber, there is no change to the generated asm.
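As a rough illustration of the shape of such an asm statement (not the actual hvmemul_rep_stos() code), the sketch below fills a buffer with "rep stosb" and relies on a "memory" clobber instead of an output operand whose type is an array sized by a runtime count. fill_bytes() and fill_bytes_demo() are made-up names, and the non-x86 fallback exists only so the sketch builds elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill count bytes at buf with val.  Hypothetical helper for illustration. */
void fill_bytes(void *buf, uint8_t val, size_t count)
{
#if defined(__x86_64__) || defined(__i386__)
    /*
     * There is no "=m" output describing the written region (which would
     * need a typecast to an array of runtime length).  The "memory"
     * clobber tells the compiler that an unknown amount of memory has
     * been written.
     */
    __asm__ volatile ( "rep stosb"
                       : "+D" (buf), "+c" (count)
                       : "a" (val)
                       : "memory" );
#else
    /* Portable fallback so the sketch also builds on non-x86 targets. */
    unsigned char *p = buf;

    while ( count-- )
        *p++ = val;
#endif
}

/* Returns 1 if exactly the first five bytes of the buffer were filled. */
int fill_bytes_demo(void)
{
    unsigned char b[8] = { 0 };

    fill_bytes(b, 0xab, 5);

    return b[0] == 0xab && b[4] == 0xab && b[5] == 0;
}
```

The clobber is coarser than a sized output operand, since the compiler must assume any memory may have changed across the statement.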
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 17 Sep 2018 15:30:53 +0000 (16:30 +0100)]
x86/PoD: Avoid using variable length arrays in p2m_pod_zero_check()
Callers of p2m_pod_zero_check() pass a count of up to POD_SWEEP_STRIDE.
Move the definition of POD_SWEEP_STRIDE and give the arrays a fixed
bound.
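A minimal sketch of the pattern, assuming a caller-side limit like the one described: the scratch array gets a compile-time bound and the count is checked against it. SWEEP_STRIDE, count_zero() and count_zero_demo() are hypothetical names used only for this example, and the value 16 is arbitrary.

```c
#define SWEEP_STRIDE 16   /* hypothetical stand-in for POD_SWEEP_STRIDE */

/*
 * Count how many of the first count values are zero, using a scratch
 * array with a fixed bound rather than "unsigned char is_zero[count]".
 */
unsigned int count_zero(const unsigned long *vals, unsigned int count)
{
    unsigned char is_zero[SWEEP_STRIDE];
    unsigned int i, zeros = 0;

    /* Callers are expected never to exceed the bound; reject if they do. */
    if ( count > SWEEP_STRIDE )
        return 0;

    for ( i = 0; i < count; i++ )
        is_zero[i] = (vals[i] == 0);

    for ( i = 0; i < count; i++ )
        zeros += is_zero[i];

    return zeros;
}

/* Two of the four sample values are zero, so this returns 2. */
unsigned int count_zero_demo(void)
{
    const unsigned long vals[] = { 0, 5, 0, 7 };

    return count_zero(vals, 4);
}
```

The stack usage is now known at compile time, at the cost of always reserving space for the largest permitted count.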
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Mon, 17 Sep 2018 15:21:53 +0000 (16:21 +0100)]
x86/PoD: Simplify handling of the quick check
There is no need to duplicate the contents of the skip block.
While cleaning up this function, change 4 ints to be unsigned.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
George Dunlap [Tue, 25 Sep 2018 09:47:10 +0000 (10:47 +0100)]
Make credit2 the default scheduler
Credit2 was declared "supported" in 4.8, and as of 4.10 had two other
critical features implemented (soft affinity / NUMA and caps).
Why change the default?
The code is better: more predictable, less jitter, easier to determine
how modifications will affect overall behavior, easier in the future
to make load-balancing behavior more subtle (e.g., taking into account
the cost of powering up extra cores, &c).
Overall performance compared to Credit1 is somewhat of a mixed bag.
Unfortunately most of what I have are tests using XenServer's internal
perf testing system, so I can't share the raw data (via links anyway).
Here is a summary of data from an internal e-mail Dario sent in the
past:
* DVDbench: On underloaded systems, credit2 outperformed credit1 by
about 4%. On overloaded systems, credit2 underperformed by about 3%.
* On a range of tests (unixbench, lmbench, &c), credit and credit2
perform within 5% of each other (up and down).
* Credit2 fairly consistently beats credit for TCP-style workloads.
* Credit2 is sometimes equal to, sometimes 5-15% worse than, credit for
synthetic CPU workloads (e.g., Dhrystone).
* On LoginVSI, credit2 fairly consistently outperforms credit by about 10%.
Credit2, like credit, has a number of workloads / setups for which
performance could be improved. Personally I think networking and
partially-loaded systems is going to be more representative of what
Xen is actually used for; so I think credit2 is on the whole the
better scheduler to use by default. And in any case, making those
improvements on credit2 will be easier than on credit.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Dario Faggioli <dfaggioli@suse.com>
Wei Liu [Fri, 21 Sep 2018 15:54:51 +0000 (16:54 +0100)]
x86/mm: put HVM only code under CONFIG_HVM
Going through the code, HAP, EPT, PoD and ALTP2M depend on HVM code.
Put these components under CONFIG_HVM. This further requires putting
one of the vm events under CONFIG_HVM.
Altp2m requires a bit more attention because its code is embedded in
generic x86 p2m code.
Also make hap_enabled evaluate to false when !CONFIG_HVM. Make sure it
evaluates its parameter to avoid unused variable warnings in its users.
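One way to get a macro that is constant false yet still evaluates its argument is the comma operator, sketched below. This is an assumption about the general technique rather than a copy of the Xen definition; struct domain here is a minimal hypothetical stand-in and hap_demo() exists only for the example.

```c
#include <stdbool.h>

struct domain {
    bool hap;   /* hypothetical, minimal stand-in for the real structure */
};

#ifdef CONFIG_HVM
# define hap_enabled(d) ((d)->hap)
#else
/* Evaluate d, so callers' variables count as used, but always yield false. */
# define hap_enabled(d) ((void)(d), false)
#endif

int hap_demo(void)
{
    struct domain dom = { .hap = true };
    struct domain *d = &dom;   /* only referenced through hap_enabled() */

    return hap_enabled(d) ? 1 : 0;
}
```

Built without CONFIG_HVM defined, hap_demo() returns 0 and the compiler raises no unused-variable warning for d.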
Also sort items in Makefile while at it.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Wei Liu [Fri, 21 Sep 2018 15:54:49 +0000 (16:54 +0100)]
x86/p2m/pod: make it build with !CONFIG_HVM
Populate-on-demand is HVM only.
Provide a bunch of stubs for common p2m code and guard one invocation
of guest_physmap_mark_populate_on_demand with is_hvm_domain.
Put relevant fields in p2m_domain and code which touches those fields
under CONFIG_HVM.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Wed, 21 Feb 2018 17:54:13 +0000 (17:54 +0000)]
x86: Clean up the Xen MSR infrastructure
Rename them to guest_{rd,wr}msr_xen() for consistency, and because the _regs
suffix isn't very appropriate.
Update them to take a vcpu pointer rather than presuming that they act on
current, and switch to using X86EMUL_* return values.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 20 Sep 2017 17:33:59 +0000 (17:33 +0000)]
x86/viridian: Clean up Viridian MSR infrastructure
Rename the functions to guest_{rd,wr}msr_viridian() for consistency, and
because the _regs() suffix isn't very appropriate.
Update them to take a vcpu pointer rather than presuming that they act on
current, which is safe for all implemented operations, and switch their return
ABI to use X86EMUL_*.
The default cases no longer need to deal with MSRs out of the Viridian range,
but drop the printks to debug builds only and identify the value attempting to
be written.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 20 Sep 2017 17:33:59 +0000 (18:33 +0100)]
x86/msr: Dispatch Xen and Viridian MSRs from guest_{wr,rd}msr()
Despite the complicated diff in {svm,vmx}_msr_write_intercept(), it is just
the 0 case losing one level of indentation, as part of removing the call to
wrmsr_hypervisor_regs().
The case blocks in guest_{wr,rd}msr() use raw numbers, partly for consistency
with the CPUID side of things, but mainly because this is clearer code to
follow. In particular, the Xen block may overlap with the Viridian block if
Viridian is not enabled for the domain, and trying to express this with named
literals caused more confusion than it solved.
Future changes will clean up the individual APIs, including allowing these
MSRs to be usable for vcpus other than current (no callers exist with v !=
current).
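The overlap described above can be sketched with raw case ranges as follows. The numeric ranges, the EMUL_* values and all function names here are assumptions chosen for illustration; they are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the X86EMUL_* return values. */
enum { EMUL_OKAY, EMUL_EXCEPTION };

/* Stub handlers; real ones would consult per-domain and per-vcpu state. */
static int rdmsr_viridian(uint32_t msr, uint64_t *val)
{
    (void)msr;
    *val = 1;
    return EMUL_OKAY;
}

static int rdmsr_xen(uint32_t msr, uint64_t *val)
{
    (void)msr;
    *val = 2;
    return EMUL_OKAY;
}

int guest_rdmsr_sketch(bool viridian_enabled, uint32_t msr, uint64_t *val)
{
    switch ( msr )
    {
        /* Case ranges are a GNU C extension; the ranges are assumed. */
    case 0x40000000 ... 0x400001ff:
        if ( viridian_enabled )
            return rdmsr_viridian(msr, val);
        /* Fallthrough. */
    case 0x40000200 ... 0x400002ff:
        /* Without Viridian, the Xen block also covers the range above. */
        return rdmsr_xen(msr, val);

    default:
        return EMUL_EXCEPTION;
    }
}

/* Returns the value read, or -1 if the MSR was not handled. */
int rdmsr_demo(int viridian_enabled, uint32_t msr)
{
    uint64_t val = 0;

    if ( guest_rdmsr_sketch(viridian_enabled, msr, &val) != EMUL_OKAY )
        return -1;

    return (int)val;
}
```

With raw numbers the fallthrough is visible at a glance, which is the readability argument made above.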
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
This patch adds image size and flags to the XEN image header. It uses
those fields according to the updated Linux kernel image definition.
With this patch, the bootloader can place the XEN image anywhere in system
RAM at a 2MB-aligned address without worrying about relocation.
For instance, it fixes the XEN boot on Amlogic SoC, where the bootloader
(U-Boot) always relocates the XEN image to an address range reserved for
firmware data.
Signed-off-by: Amit Singh Tomar <amittomer25@gmail.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Julien Grall <julien.grall@arm.com>
libxl: keep assigned pci devices across domain reboots
Fill the from_xenstore libxl_device_type hook for PCI devices so that
libxl_retrieve_domain_configuration can properly retrieve PCI devices
from xenstore.
This fixes disappearing pci devices across domain reboots.
Reported-by: Andreas Kinzler <hfp@posteo.de> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
x86/pvh: copy data from low 1MB to Dom0 physmap instead of mapping it
Identity mapping RAM regions on the low 1MB for Dom0 is not ideal,
since there's data there that could be used by Xen during runtime
(like the AP trampoline), so instead of identity mapping the low 1MB
into the Dom0 physmap populate those RAM regions and copy the data.
Note that this allows removing unshare_xen_page_with_guest since the
only caller was the PVH Dom0 builder.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 21 Sep 2018 10:21:32 +0000 (12:21 +0200)]
x86/mm: re-indent after "re-arrange get_page_from_l<N>e() vs pv_l1tf_check_l<N>e()"
That earlier change introduced two "else switch ()" constructs which now
get converted back to "normal" style (indentation). To limit indentation
depth, a conditional gets inverted in ptwr_emulated_update().
No functional change intended.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Adrian Pop [Tue, 4 Sep 2018 04:59:22 +0000 (07:59 +0300)]
x86/altp2m: Allow setting the #VE info page for an arbitrary VCPU
In a classic HVI + Xen setup, the introspection engine would monitor
legacy guest page-tables by marking them read-only inside the EPT; this
way any modification explicitly made by the guest or implicitly made by
the CPU page walker would trigger an EPT violation, which would be
forwarded by Xen to the SVA and thus the HVI agent. The HVI agent would
analyse the modification, and act upon it - for example, a virtual page
may be remapped (its guest physical address changed inside the
page-table), in which case the introspection logic would update the
protection accordingly (remove EPT hook on the old gpa, and place a new
EPT hook on the new gpa). In other cases, the modification may be of no
interest to the introspection engine - for example, the accessed/dirty
bits may be cleared by the operating system or the accessed/dirty bits
may be set by the CPU page walker.
In our tests we discovered that the vast majority of guest page-table
modifications fall in the second category (especially on Windows 10 RS4
x64 - more than 95% of ALL the page-table modifications are irrelevant to
us) - they are of no interest to the introspection logic, but they
trigger a very costly EPT violation nonetheless. Therefore, we decided
to make use of the new #VE & VMFUNC features in recent Intel CPUs to
accelerate the guest page-tables monitoring in the following way:
1. Each monitored page-table would be flagged as being convertible
inside the EPT, thus enabling the CPU to deliver a virtualization
exception to the guest instead of generating a traditional EPT
violation.
2. We inject a small filtering driver inside the protected guest VM,
which would intercept the virtualization exception in order to handle
guest page-table modifications.
3. We create a dedicated EPT view (altp2m) for the in-guest agent, which
would isolate the agent from the rest of the operating system; the
agent will switch in and out of the protected EPT view via the VMFUNC
instruction placed inside a trampoline page, thus making the agent
immune to malicious code inside the guest.
This way, all the page-table accesses would generate a
virtualization-exception inside the guest instead of a costly EPT
violation; the #VE agent would emulate and analyse the modification, and
decide whether it is relevant for the main introspection logic; if it is
relevant, it would do a VMCALL and notify the introspection engine
about the modification; otherwise, it would resume normal instruction
execution, thus avoiding a very costly VM exit.
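The filtering decision can be sketched as a comparison that ignores the accessed and dirty bits (bits 5 and 6 of an x86 page-table entry). This is only an illustration of the idea; pte_write_is_relevant() is a hypothetical name and the real agent's logic is not shown in this message.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_ACCESSED (UINT64_C(1) << 5)   /* x86 PTE accessed bit */
#define PTE_DIRTY    (UINT64_C(1) << 6)   /* x86 PTE dirty bit */

/*
 * Hypothetical filter: a page-table write is worth reporting only if it
 * changes something other than the accessed/dirty bits.
 */
bool pte_write_is_relevant(uint64_t old_pte, uint64_t new_pte)
{
    uint64_t changed = old_pte ^ new_pte;

    return (changed & ~(PTE_ACCESSED | PTE_DIRTY)) != 0;
}
```

A write that only sets the accessed bit (0x1003 to 0x1023) is filtered out, while a remap to a different frame (0x1003 to 0x2003) is reported.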
Signed-off-by: Adrian Pop <apop@bitdefender.com> Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Paul Durrant [Tue, 11 Sep 2018 15:01:08 +0000 (16:01 +0100)]
tools: add option to explicitly enable VirtFS in QEMU build
9pfs support has been a documented feature since Xen 4.9, but QEMU will
not be built with backend support unless VirtFS is enabled, which is
predicated on the libcap and libattr dev packages being installed. This is
not obvious to anyone intending to use 9pfs.
This patch adds an 'enable-9pfs' option to configure which, if specified,
will cause '--enable-virtfs' to be passed to QEMU's configure. This will
cause the dependency on libcap and libattr to be called out if the packages
are not installed.
For completeness, specifying 'disable-9pfs' will cause '--disable-virtfs' to
be passed to QEMU's configure and not specifying an option will keep the
previous behaviour of predicating VirtFS on whether the libcap and libattr
packages are installed.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Wei Liu <wei.liu2@citrix.com>
xen: sched/Credit2: fix bug when moving CPUs between two Credit2 cpupools
Whether or not a CPU is assigned to a runqueue (and, if yes, to which
one) within a Credit2 scheduler instance must be both a per-cpu and
per-scheduler instance one.
In fact, when we move a CPU between cpupools, we first set up its per-cpu
data in the new pool, and then clean up its per-cpu data from the old
pool. In Credit2, where there currently is no per-scheduler, per-cpu
data (as the cpu-to-runqueue map is stored on a per-cpu basis only),
this means that the cleanup of the old per-cpu data can mess with the
new per-cpu data, leading to crashes like this:
Basically, when csched2_deinit_pdata() is called for CPU 13, for fully
removing the CPU from Pool-0, per_cpu(13,runq_map) already contains the
id of the runqueue to which the CPU has been assigned in the scheduler
of Pool-1, which means wrong runqueue manipulations happen in Pool-0's
scheduler. Furthermore, at the end of such call, that same runq_map is
updated with -1, which is what causes the BUG_ON in csched2_schedule(),
on CPU 13, to trigger.
So, instead of reverting a2c4e5ab59d "xen: credit2: make the cpu to
runqueue map per-cpu" (as we don't want to go back to having the huge
array in struct csched2_private) add a per-cpu scheduler specific data
structure, like, for instance, Credit1 has already. That (for now) only
contains one field: the id of the runqueue the CPU is assigned to.
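A minimal sketch of why per-scheduler, per-cpu data avoids the problem: each scheduler instance owns its own record for the CPU, so cleaning up the old pool cannot touch the new pool's runqueue id. All structure and function names below are hypothetical and much simpler than the scheduler code.

```c
#include <stdlib.h>

#define NR_CPUS 16   /* hypothetical, small value for the sketch */

/* Per-CPU data owned by one scheduler instance. */
struct sched_pcpu {
    int runq_id;   /* id of the runqueue the CPU is assigned to */
};

struct sched_instance {
    struct sched_pcpu *pcpu[NR_CPUS];
};

int sched_init_pdata(struct sched_instance *s, unsigned int cpu, int runq_id)
{
    struct sched_pcpu *p = malloc(sizeof(*p));

    if ( !p )
        return -1;

    p->runq_id = runq_id;
    s->pcpu[cpu] = p;

    return 0;
}

void sched_deinit_pdata(struct sched_instance *s, unsigned int cpu)
{
    free(s->pcpu[cpu]);
    s->pcpu[cpu] = NULL;
}

/*
 * Move CPU 13 between pools in the order described: set up in the new
 * pool first, then clean up in the old one.  Returns the runqueue id the
 * new pool sees afterwards, or -1 on allocation failure.
 */
int move_cpu_demo(void)
{
    struct sched_instance pool0 = { { NULL } }, pool1 = { { NULL } };
    int id;

    if ( sched_init_pdata(&pool0, 13, 0) )
        return -1;

    if ( sched_init_pdata(&pool1, 13, 1) )
    {
        sched_deinit_pdata(&pool0, 13);
        return -1;
    }

    sched_deinit_pdata(&pool0, 13);   /* does not disturb pool1's data */
    id = pool1.pcpu[13]->runq_id;
    sched_deinit_pdata(&pool1, 13);

    return id;
}
```

With a single per-cpu map shared by both instances, the deinit step would instead overwrite the value the new pool had just stored.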
Andrew Cooper [Wed, 5 Sep 2018 17:32:52 +0000 (17:32 +0000)]
xen/vcpu: Introduce vcpu_destroy()
Like _domain_destroy(), this will eventually idempotently free all parts of a
struct vcpu.
While breaking apart the failure path of vcpu_create(), rework the codeflow to
be in a line at the end of the function for clarity.
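The idempotent-destroy idea and the single failure path at the end of the create function might look roughly like this. The structure and its two fields are invented for the sketch and bear no relation to the real struct vcpu layout.

```c
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for struct vcpu. */
struct vcpu_sketch {
    void *sched_priv;
    void *arch_state;
};

/* Safe to call on a partially constructed object, and more than once. */
void vcpu_sketch_destroy(struct vcpu_sketch *v)
{
    free(v->sched_priv);
    v->sched_priv = NULL;

    free(v->arch_state);
    v->arch_state = NULL;
}

/* fail_arch simulates a failure of the second construction step. */
int vcpu_sketch_create(struct vcpu_sketch *v, int fail_arch)
{
    v->sched_priv = NULL;
    v->arch_state = NULL;

    v->sched_priv = malloc(16);
    if ( !v->sched_priv )
        goto fail;

    v->arch_state = fail_arch ? NULL : malloc(16);
    if ( !v->arch_state )
        goto fail;

    return 0;

 fail:
    /* One failure path, relying on destroy coping with partial state. */
    vcpu_sketch_destroy(v);
    return -1;
}

/* Returns 1 if create behaved as expected and repeated destroy is safe. */
int vcpu_destroy_demo(int fail_arch)
{
    struct vcpu_sketch v;
    int rc = vcpu_sketch_create(&v, fail_arch);

    vcpu_sketch_destroy(&v);
    vcpu_sketch_destroy(&v);

    return rc == (fail_arch ? -1 : 0) && !v.sched_priv && !v.arch_state;
}
```

Clearing each pointer after freeing it is what makes the second call harmless.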
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Wed, 5 Sep 2018 16:48:02 +0000 (16:48 +0000)]
xen/vcpu: Rename the common interfaces for consistency
The vcpu functions are far less consistent than the domain side of things, and
in particular, has vcpu_destroy() for architecture specific functionality.
which makes the vcpu hierarchy consistent with the domain hierarchy.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
tools/tests/depriv/Makefile directly builds the target program from
its C-source. This is problematic when an incremental build is needed
after a header the program is depending on has been modified: in this
case all headers are added into the gcc call and the build will fail.
Correct that by adding a rule for building the program from its .o
file.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Fri, 31 Aug 2018 07:02:42 +0000 (01:02 -0600)]
tools/tests: allow depriv-fd-checker to build with really old Linux headers
Assuming it was intentional for this test utility, other than most other
ones, to always be built, I think it would be nice if it didn't fail to
build on really old distros just because of the lack of a TUNGETIFF
definition.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 24 Aug 2018 20:01:40 +0000 (21:01 +0100)]
xen: decouple HVM and IOMMU capabilities
HVM and IOMMU are two distinct hardware features, yet they were
bundled together in sysctl and xl's output.
Decouple them at the sysctl level. At the toolstack level we still need to
maintain sensible semantics for `xl info`. Massage the information
according to the following table:
Alexandru Isaila [Mon, 10 Sep 2018 14:27:00 +0000 (16:27 +0200)]
x86/domctl: don't pause the whole domain if only getting vcpu state
This patch changes hvm_save_one() to save one typecode from one vcpu.
Now that the save functions get data from a single vcpu, we can pause
the specific vcpu instead of the whole domain.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Alexandru Isaila [Mon, 10 Sep 2018 14:27:00 +0000 (16:27 +0200)]
x86/hvm: remove redundant save functions
This patch removes the redundant save functions and renames the
save_one* to save. It then changes the domain param to vcpu in the
save funcs and adapts print messages in order to match the format of the
other save related messages.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Alexandru Isaila [Mon, 10 Sep 2018 14:26:00 +0000 (16:26 +0200)]
x86/hvm: introduce hvm_save_cpu_msrs_one()
This is used to save data from a single instance.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/mm: change default value for suppress #VE in set_mem_access()
The default value for the "suppress #VE" bit set by set_mem_access()
currently depends on whether the call is made from the same domain (the
bit is set when called from another domain and cleared if called from
the same domain). This patch changes that behavior to inherit the old
suppress #VE bit value if it is already set and to set it to 1
otherwise, which is safer and more reliable.
Signed-off-by: Vlad Ioan Topan <itopan@bitdefender.com> Signed-off-by: Adrian Pop <apop@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
x86/iommu: add map-reserved dom0-iommu option to map reserved memory ranges
Several people have reported hardware issues (malfunctioning USB
controllers) due to iommu page faults on Intel hardware. Those faults
are caused by missing RMRR (VTd) entries in the ACPI tables. Those can
be worked around on VTd hardware by manually adding RMRR entries on
the command line, this is however limited to Intel hardware and quite
cumbersome to do.
In order to solve those issues add a new dom0-iommu=map-reserved
option that identity maps all regions marked as reserved in the memory
map. Note that regions used by devices emulated by Xen (LAPIC, IO-APIC
or PCIe MCFG regions) are specifically avoided. Note that this option
is available to all Dom0 modes (as opposed to the inclusive option
which only works for PV Dom0).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
iommu: make iommu_inclusive_mapping a suboption of dom0-iommu
Introduce a new dom0-iommu=map-inclusive generic option that
supersedes iommu_inclusive_mapping. The previous behavior is preserved
and the option should only be enabled by default on Intel hardware.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Andrew Cooper [Thu, 6 Sep 2018 13:40:56 +0000 (14:40 +0100)]
xen/sched: Re-position the domain_update_node_affinity() call during vcpu construction
alloc_vcpu()'s call to domain_update_node_affinity() has existed for a decade,
but its effort is mostly wasted.
alloc_vcpu() is called in a loop for each vcpu, bringing them into existence.
The values of the affinity masks are still default, which is allcpus in
general, or a processor singleton for pinned domains.
Furthermore, domain_update_node_affinity() itself loops over all vcpus
accumulating the masks, making it quadratic with the number of vcpus.
Move it to be called once after all vcpus are constructed, which has the same
net effect, but with fewer intermediate memory allocations and less cpumask
arithmetic.
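The cost difference can be sketched by counting how many per-vcpu masks get visited in each arrangement. The functions below are a toy model with hypothetical names; they only count visits and do no real cpumask work.

```c
/*
 * Toy model: an affinity update visits the mask of every vcpu that
 * exists at the time of the call.
 */
static unsigned long update_affinity_cost(unsigned int nr_existing)
{
    return nr_existing;
}

/* Update after each vcpu is created: 1 + 2 + ... + n mask visits. */
unsigned long cost_update_each_time(unsigned int nr_vcpus)
{
    unsigned long visits = 0;
    unsigned int i;

    for ( i = 0; i < nr_vcpus; i++ )
        visits += update_affinity_cost(i + 1);

    return visits;
}

/* Update once after all vcpus are constructed: n mask visits. */
unsigned long cost_update_once(unsigned int nr_vcpus)
{
    return update_affinity_cost(nr_vcpus);
}
```

For 8 vcpus the first variant visits 36 masks and the second visits 8, and the gap grows quadratically with the vcpu count.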
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Jan Beulich [Tue, 11 Sep 2018 13:06:23 +0000 (15:06 +0200)]
x86/HVM: don't #GP/#SS on wrapping virt->linear translations
Real hardware wraps silently in most cases, so we should behave the
same. Also split real and VM86 mode handling, as the latter really
ought to have limit checks applied.
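As a simplified illustration of silent wrapping for a 32-bit address space, the linear address is the segment base plus the offset, truncated to 32 bits. The helper name is hypothetical, and the sketch deliberately ignores limit checks, address-size overrides and 64-bit mode.

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit virtual to linear translation. */
uint32_t virt_to_linear32(uint32_t seg_base, uint32_t offset)
{
    /* The result is truncated to 32 bits, so it wraps modulo 2^32. */
    return seg_base + offset;
}
```

For example, a base of 0xfffff000 with an offset of 0x2000 yields 0x00001000 rather than raising a fault.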
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 11 Sep 2018 13:05:09 +0000 (15:05 +0200)]
x86/shadow: a little bit of style cleanup
Correct indentation of a piece of code, adjusting comment style at the
same time. Constify gl3e pointers and drop a bogus (and useless once
corrected) cast.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Andrew Cooper [Wed, 29 Aug 2018 16:39:10 +0000 (16:39 +0000)]
xen: Fix inconsistent callers of panic()
Callers are inconsistent with whether they pass a newline to panic(),
including adjacent calls in the same function using different styles.
panic() not expecting a newline is inconsistent with most other printing
functions, which is most likely why we've gained so many inconsistencies.
Switch panic() to expect a newline, and update all callers which currently
lack a newline to include one.
This actually reduces the size of .rodata (0x07e3e8 down to 0x07e3a8) because
a number of strings are passed to both panic() and printk(). As they
previously differed by \n alone, they couldn't be merged.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Tue, 11 Sep 2018 09:06:41 +0000 (11:06 +0200)]
SVM: limit GIF=0 region
Use EFLAGS.IF for most ordinary purposes; there's in particular no need
to unduly defer NMI/#MC. Clear GIF only immediately before VMRUN itself.
This has the additional advantage that svm_stgi_label now indeed marks
the only place where GIF gets set.
Note regarding the main STI placement: Quite counterintuitively the
host's EFLAGS.IF continues to have a meaning while the guest runs; see
PM Vol 2 section "Physical (INTR) Interrupt Masking in EFLAGS". Hence we
need to set the flag for the duration of time being in guest context.
However, SPEC_CTRL_ENTRY_FROM_HVM wants to be carried out with EFLAGS.IF
clear.
Note regarding the main STGI placement: It could be moved further up,
but at present SPEC_CTRL_EXIT_TO_HVM is not NMI/#MC-safe.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Jan Beulich [Tue, 11 Sep 2018 09:03:46 +0000 (11:03 +0200)]
x86/HVM: split page straddling emulated accesses in more cases
Assuming consecutive linear addresses map to all RAM or all MMIO is not
correct. Nor is assuming that a page straddling MMIO access will access
the same emulating component for both parts of the access. If a guest
RAM read fails with HVMTRANS_bad_gfn_to_mfn and if the access straddles
a page boundary, issue accesses separately for both parts.
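Splitting an access at the page boundary amounts to the arithmetic below. This is a sketch under the assumptions that pages are 4096 bytes and that the access is no larger than one page; the names are hypothetical and none of the actual emulation calls are shown.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

struct split {
    unsigned int first;    /* bytes up to the end of the first page */
    unsigned int second;   /* bytes landing in the following page */
};

/* Assumes bytes <= SKETCH_PAGE_SIZE, so at most two pages are touched. */
struct split split_access(uint64_t addr, unsigned int bytes)
{
    unsigned int offset = addr & (SKETCH_PAGE_SIZE - 1);
    struct split s = { bytes, 0 };

    if ( offset + bytes > SKETCH_PAGE_SIZE )
    {
        s.first = SKETCH_PAGE_SIZE - offset;
        s.second = bytes - s.first;
    }

    return s;
}
```

An 8-byte access at 0xffc is split into 4 + 4 bytes, while the same access at 0x1000 stays in one piece.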
The extra call to known_gla() from hvmemul_write() is just to preserve
original behavior; for consistency the check also gets added to
hvmemul_rmw() (albeit I continue to be unsure whether we wouldn't better
drop both).
Note that the correctness of this depends on the MMIO caching used
elsewhere in the emulation code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Tue, 11 Sep 2018 09:02:37 +0000 (11:02 +0200)]
x86/HVM: drop hvm_fetch_from_guest_linear()
It can easily be expressed through hvm_copy_from_guest_linear(), and in
two cases this even simplifies callers.
Suggested-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
ebitmap.c:244:32: error: invalid conversion specifier 'Z' [-Werror,-Wformat-invalid-specifier]
"match my size %Zd (high bit was %d)\n", mapunit,
~^
ebitmap.c:245:16: error: format specifies type 'int' but the argument has type 'unsigned long'
[-Werror,-Wformat]
sizeof(u64) * 8, e->highbit);
^~~~~~~~~~~~~~~
ebitmap.c:245:33: error: data argument not used by format string [-Werror,-Wformat-extra-args]
sizeof(u64) * 8, e->highbit);
Use %zd instead of %Zd, which is compliant with C99.
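For reference, a small sketch of the C99 'z' length modifier with a sizeof expression. It uses %zu because size_t is unsigned, whereas the patch itself uses %zd as stated above; the function names are made up for the example.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format the width of a 64-bit map unit in bits using the 'z' modifier. */
int format_mapsize(char *buf, size_t len)
{
    return snprintf(buf, len, "%zu", sizeof(uint64_t) * 8);
}

/* Returns 1 if the formatted text is "64". */
int format_demo(void)
{
    char buf[16];

    format_mapsize(buf, sizeof(buf));

    return strcmp(buf, "64") == 0;
}
```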
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Jan Beulich [Tue, 11 Sep 2018 09:00:01 +0000 (11:00 +0200)]
x86/HVM: meet xentrace's expectations on emulation event data
According to the logic in hvm_mmio_assist_process(), 64 bits of data are
expected with 64-bit addresses, and 32 bits of data with 32-bit ones. I
don't think this is very reasonable, but I'm also not going to touch the
consumer side, the more that it is anyway not very helpful for the code
here to only ever supply 32 bits of data (despite the field being 64
bits wide, and having been even in the 32-bit days of Xen).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Wei Liu [Fri, 7 Sep 2018 10:41:31 +0000 (11:41 +0100)]
mkdeb: use compression level 0
This requires calling dpkg-deb directly and passing it -z0.
It reduces the time to run the mkdeb script from 14 seconds to 3
seconds on my workstation with SSD, from 87s to 15s on a machine
with HDD. The deb file grows from 49M to 58M.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Olaf Hering [Thu, 30 Aug 2018 10:05:11 +0000 (12:05 +0200)]
tools/mkrpm: switch payload to gzip to reduce turnaround time
rpmbuild -bb spends a lot of time compressing the binaries. Reduce the
turnaround time of 'make rpmball' by using gzip as compression tool.
This reduces the buildtime from 'w9.xzdio'/138 seconds to 'w1.gzdio'/88
seconds in my environment.
The downside is an increased filesize of xen.rpm, 19MB vs. 37MB.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wei.liu2@citrix.com>
In order to build a tailored pvshim-only binary from Xen, switch the
PV shim build in the tools firmware to use the new defconfig.
A diff of the .config generated for the pvshim firmware build before
and after this change shows no differences.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
x86/dmar: zap DMAR signature for dom0 once in TBOOT case
Commit 6c298ecc1f ("vtd: Reinstate ACPI DMAR on system shutdown or
S3/S4/S5") did everything for acpi_dmar_zap() call to be unnecessary,
except for invoking the function from acpi_parse_dmar(), which 123c779379 ("VTd/dmar: Tweak how the DMAR table is clobbered")
added several years later.
Some stale comments are also removed. No functional change.
Andrew Cooper [Wed, 29 Aug 2018 16:27:44 +0000 (16:27 +0000)]
xen/ARM+sched: Don't opencode %pv in printk()'s
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Tue, 27 Feb 2018 17:22:40 +0000 (17:22 +0000)]
xen/domctl: Drop vcpu_alloc_lock
Since its introduction in c/s 8cbb5278e "x86/AMD: Add support for AMD's OSVW
feature in guests", the OSVW data has been corrected to be per-domain rather
than per-vcpu, and is initialised during XEN_DOMCTL_createdomain.
Furthermore, because XENPF_microcode_update uses hypercall continuations to
move between CPUs, it drops the vcpu_alloc_lock mid update, meaning that it
didn't provide the interlock guarantee that the OSVW patch was looking for in
the first place.
This interlock serves no purpose, so take the opportunity to drop it and
remove a global spinlock from the hypervisor.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 6 Sep 2018 14:05:52 +0000 (16:05 +0200)]
x86emul: fix test harness dependencies
The generated header files are what needs to spell out dependencies on
other (real) headers in the main Makefile here, not the intermediate
(helper) .o files produced through testcase.mk.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Thu, 6 Sep 2018 14:04:51 +0000 (16:04 +0200)]
x86/hvm: remove default ioreq server (again)
My recent patch [1] to qemu-xen-traditional removes the last use of the
'default' ioreq server in Xen. (This is a catch-all ioreq server that is
used if no explicitly registered I/O range is targeted).
This patch can be applied once that patch is committed, to remove the
(>100 lines of) redundant code in Xen.
The previous version of this patch caused a QEMU build failure. This has
been fixed by extending the #ifdef around deprecated HVM_PARAM declarations
to __XEN_TOOLS__ as well as __XEN__.
NOTE: The removal of the special case for HVM_PARAM_DM_DOMAIN in
hvm_allow_set_param() is not directly related to removal of
default ioreq servers. It could have been cleaned up at any time
after commit 9a422c03 "x86/hvm: stop passing explicit domid to
hvm_create_ioreq_server()". It is now added to the new
deprecated sets introduced by this patch.
Olaf Hering [Thu, 6 Sep 2018 14:02:58 +0000 (16:02 +0200)]
xen: add DEBUG_INFO Kconfig symbol
Creating debug info during build is not strictly required at runtime.
Make it optional by introducing a new Kconfig knob "DEBUG_INFO".
Disabling it slightly reduces build time and disk usage.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 31 Aug 2018 15:22:05 +0000 (17:22 +0200)]
xen: fill topology info for all present cpus
The topology information obtainable via XEN_SYSCTL_cputopoinfo is
filled rather weirdly: the size of the array is derived from the highest
online cpu number, so in case there are trailing offline cpus they
will not be included.
On a dual core system with 4 threads booted with smt=0 without this
patch xl info -n will print:
Juergen Gross [Fri, 31 Aug 2018 15:22:04 +0000 (17:22 +0200)]
tools/libxl: correct vcpu affinity output with sparse physical cpu map
With not all physical cpus online (e.g. with smt=0) the output of the
vcpu affinities is wrong, as the affinity bitmaps are capped after
nr_cpus bits, instead of using max_cpu_id.
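The capping problem can be illustrated with a minimal sketch. It assumes a
hypothetical 4-thread host booted with smt=0 on which CPUs 0 and 2 stay
online, so the online count (nr_cpus) is 2 while max_cpu_id is 3.
visible_cpus() is an illustrative helper, not a libxl function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative helper (not part of libxl): count the CPUs set in an
 * affinity bitmap when only the first nr_bits bits are examined.
 */
static unsigned int visible_cpus(const uint8_t *map, unsigned int nr_bits)
{
    unsigned int i, count = 0;

    assert(map != NULL);

    for (i = 0; i < nr_bits; i++) {
        /* Bit i lives in byte i / 8, at position i % 8. */
        if (map[i / 8] & (1u << (i % 8)))
            count++;
    }

    return count;
}
```

With an affinity map of 0x05 (CPUs 0 and 2), scanning nr_cpus = 2 bits
misses CPU 2, while scanning max_cpu_id + 1 = 4 bits sees both CPUs.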
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
The 'xl sysrq' command doesn't work with modern Linux guests, leaving the
following message in the guest's log:
xen:manage: sysrq_handler: Error -13 writing sysrq in control/sysrq
xenstore trace confirms:
IN 0x24bd9a0 20180904 04:36:32 WRITE (control/sysrq )
OUT 0x24bd9a0 20180904 04:36:32 ERROR (EACCES )
The problem seems to be that we don't pre-create the control/sysrq
xenstore node, so libxl_send_sysrq(), via libxl__xs_printf(), creates it as
read-only. As we want to allow guests to clear 'control/sysrq' after the
requested action is performed, we need to make this node writable.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
tools/xl: fix output of xl vcpu-pin dry run with smt=0
Fix another smt=0 fallout: xl -N vcpu-pin prints only parts of the
affinities as it is using the number of online cpus instead of the
maximum cpu number.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Tue, 4 Sep 2018 16:15:18 +0000 (17:15 +0100)]
x86: change name of parameter for various invlpg functions
They all incorrectly named a parameter as a virtual address when it should
have been a linear address.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Mon, 3 Sep 2018 11:48:13 +0000 (12:48 +0100)]
xen/domain: Fold xsm_free_security_domain() paths together
xsm_free_security_domain() is idempotent (both the dummy handler, and the
flask handler). Move it into the shared __domain_destroy() path, and drop the
INIT_xsm flag from domain_create().
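Why idempotency makes a single shared teardown path safe can be sketched
as follows. All names here (demo_domain, demo_free_security(),
demo_domain_destroy()) are hypothetical stand-ins, not the Xen functions
themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical domain structure holding an opaque security blob. */
struct demo_domain {
    void *ssid;
};

/* Idempotent: a repeated call finds NULL and has nothing left to free. */
static void demo_free_security(struct demo_domain *d)
{
    assert(d != NULL);

    free(d->ssid);   /* free(NULL) is defined to be a no-op */
    d->ssid = NULL;  /* make any later call harmless */
}

/*
 * Shared teardown, usable from both a failed-create path and the normal
 * destroy path, without a flag recording whether setup ever happened.
 */
static void demo_domain_destroy(struct demo_domain *d)
{
    demo_free_security(d);
}
```

Because the free is safe to repeat and safe on a never-initialised
pointer, the caller does not need to track an initialisation flag.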
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 3 Sep 2018 11:10:48 +0000 (12:10 +0100)]
xen/domain: Call lock_profile_deregister_struct() from common code
lock_profile_register_struct() is called from common code, but the matching
deregister was previously only called from x86 code.
The practical upshot of this is that, when using CONFIG_LOCK_PROFILE,
destroyed domains on ARM (and in particular, the freed page behind struct
domain) remain on the lockprofile linked list, which will become corrupt
when the page is reused.
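The register/deregister pairing can be sketched with a simple intrusive
list. This is an illustration under stated assumptions: prof_node,
prof_register(), prof_deregister() and prof_count() are hypothetical names
and do not reflect the actual lock profiling implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-structure lock profiling record. */
struct prof_node {
    struct prof_node *next;
    const char *name;
};

/* Global list head, playing the role of the lockprofile list. */
static struct prof_node *prof_head;

/* Link a record at the front of the global list. */
static void prof_register(struct prof_node *n)
{
    assert(n != NULL);
    n->next = prof_head;
    prof_head = n;
}

/*
 * Unlink a record. This must happen before the memory behind the record
 * is freed, otherwise the list keeps a pointer into a reusable page.
 */
static void prof_deregister(struct prof_node *n)
{
    struct prof_node **pp;

    for (pp = &prof_head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == n) {
            *pp = n->next;
            n->next = NULL;
            return;
        }
    }
}

/* Count the records currently on the list. */
static unsigned int prof_count(void)
{
    unsigned int count = 0;
    const struct prof_node *n;

    for (n = prof_head; n != NULL; n = n->next)
        count++;

    return count;
}
```

Calling the deregister step from the same common code that registers the
record keeps the pairing intact on every architecture.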
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>