Adrian Pop [Tue, 4 Sep 2018 04:59:22 +0000 (07:59 +0300)]
x86/altp2m: Allow setting the #VE info page for an arbitrary VCPU
In a classic HVI + Xen setup, the introspection engine would monitor
legacy guest page-tables by marking them read-only inside the EPT; this
way any modification explicitly made by the guest or implicitly made by
the CPU page walker would trigger an EPT violation, which would be
forwarded by Xen to the SVA and thus the HVI agent. The HVI agent would
analyse the modification, and act upon it - for example, a virtual page
may be remapped (its guest physical address changed inside the
page-table), in which case the introspection logic would update the
protection accordingly (remove EPT hook on the old gpa, and place a new
EPT hook on the new gpa). In other cases, the modification may be of no
interest to the introspection engine - for example, the accessed/dirty
bits may be cleared by the operating system or the accessed/dirty bits
may be set by the CPU page walker.
In our tests we discovered that the vast majority of guest page-table
modifications fall in the second category (especially on Windows 10 RS4
x64 - more than 95% of ALL the page-table modifications are irrelevant to
us) - they are of no interest to the introspection logic, but they
trigger a very costly EPT violation nonetheless. Therefore, we decided
to make use of the new #VE & VMFUNC features in recent Intel CPUs to
accelerate the guest page-tables monitoring in the following way:
1. Each monitored page-table would be flagged as being convertible
inside the EPT, thus enabling the CPU to deliver a virtualization
exception to the guest instead of generating a traditional EPT
violation.
2. We inject a small filtering driver inside the protected guest VM,
which would intercept the virtualization exception in order to handle
guest page-table modifications.
3. We create a dedicated EPT view (altp2m) for the in-guest agent, which
would isolate the agent from the rest of the operating system; the
agent will switch in and out of the protected EPT view via the VMFUNC
instruction placed inside a trampoline page, thus making the agent
immune to malicious code inside the guest.
This way, all the page-table accesses would generate a
virtualization-exception inside the guest instead of a costly EPT
violation; the #VE agent would emulate and analyse the modification, and
decide whether it is relevant for the main introspection logic; if it is
relevant, it would do a VMCALL and notify the introspection engine
about the modification; otherwise, it would resume normal instruction
execution, thus avoiding a very costly VM exit.
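The filtering decision made by the #VE agent can be sketched in C. This is only an illustration under stated assumptions, not the HVI agent's code: it relies on the architectural x86 page-table entry layout (bit 5 is Accessed, bit 6 is Dirty), and the name pte_write_is_relevant() is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-table entry flag bits (architectural positions). */
#define PTE_ACCESSED (UINT64_C(1) << 5)
#define PTE_DIRTY    (UINT64_C(1) << 6)

/*
 * Hypothetical filter: a write to a monitored page-table entry is only
 * worth reporting to the introspection engine if something other than
 * the accessed/dirty bits changed.
 */
bool pte_write_is_relevant(uint64_t old_pte, uint64_t new_pte)
{
    const uint64_t ignored = PTE_ACCESSED | PTE_DIRTY;

    return ((old_pte ^ new_pte) & ~ignored) != 0;
}
```

In this sketch a write that only sets the accessed bit is filtered out in the guest, while a write that changes the mapped frame would be forwarded (via VMCALL in the scheme described above).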
Signed-off-by: Adrian Pop <apop@bitdefender.com> Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Paul Durrant [Tue, 11 Sep 2018 15:01:08 +0000 (16:01 +0100)]
tools: add option to explicitly enable VirtFS in QEMU build
9pfs support has been a documented feature since Xen 4.9, but QEMU will
not be built with backend support unless VirtFS is enabled, which is
predicated on the libcap and libattr dev packages being installed. This is
not obvious to anyone intending to use 9pfs.
This patch adds an 'enable-9pfs' option to configure which, if specified,
will cause '--enable-virtfs' to be passed to QEMU's configure. This will
cause the dependency on libcap and libattr to be called out if the packages
are not installed.
For completeness, specifying 'disable-9pfs' will cause '--disable-virtfs' to
be passed to QEMU's configure and not specifying an option will keep the
previous behaviour of predicating VirtFS on whether the libcap and libattr
packages are installed.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Wei Liu <wei.liu2@citrix.com>
xen: sched/Credit2: fix bug when moving CPUs between two Credit2 cpupools
Whether or not a CPU is assigned to a runqueue (and, if yes, to which
one) within a Credit2 scheduler instance is information that must be
tracked both per-cpu and per-scheduler instance.
In fact, when we move a CPU between cpupools, we first setup its per-cpu
data in the new pool, and then cleanup its per-cpu data from the old
pool. In Credit2, where there currently is no per-scheduler, per-cpu
data (as the cpu-to-runqueue map is stored on a per-cpu basis only),
this means that the cleanup of the old per-cpu data can mess with the
new per-cpu data, leading to crashes like this:
Basically, when csched2_deinit_pdata() is called for CPU 13, for fully
removing the CPU from Pool-0, per_cpu(13,runq_map) already contains the
id of the runqueue to which the CPU has been assigned in the scheduler
of Pool-1, which means wrong runqueue manipulations happen in Pool-0's
scheduler. Furthermore, at the end of such call, that same runq_map is
updated with -1, which is what causes the BUG_ON in csched2_schedule(),
on CPU 13, to trigger.
So, instead of reverting a2c4e5ab59d "xen: credit2: make the cpu to
runqueue map per-cpu" (as we don't want to go back to having the huge
array in struct csched2_private) add a per-cpu scheduler specific data
structure, like, for instance, Credit1 has already. That (for now) only
contains one field: the id of the runqueue the CPU is assigned to.
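The idea of keeping the runqueue id in data that is both per-cpu and per-scheduler can be sketched as follows. All names here (sched_instance, sched_pcpu, sched_init_pdata(), ...) are hypothetical, and the fixed-size array is a simplification chosen for the example; it is not how Credit2 lays out its data.

```c
#include <assert.h>

#define NR_CPUS 16

/* Hypothetical per-CPU data owned by ONE scheduler instance. */
struct sched_pcpu {
    int runq_id;               /* -1 means "not assigned to any runqueue" */
};

/* Hypothetical scheduler instance (one per cpupool). */
struct sched_instance {
    struct sched_pcpu pcpu[NR_CPUS];
};

void sched_init(struct sched_instance *s)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        s->pcpu[cpu].runq_id = -1;
}

void sched_init_pdata(struct sched_instance *s, int cpu, int runq_id)
{
    assert(s->pcpu[cpu].runq_id == -1);
    s->pcpu[cpu].runq_id = runq_id;
}

void sched_deinit_pdata(struct sched_instance *s, int cpu)
{
    /* Only this instance's view of the CPU is reset. */
    s->pcpu[cpu].runq_id = -1;
}
```

With this shape, setting up CPU 13 in the new pool and then tearing it down in the old pool touches two different records, so the cleanup can no longer overwrite the new assignment.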
Andrew Cooper [Wed, 5 Sep 2018 17:32:52 +0000 (17:32 +0000)]
xen/vcpu: Introduce vcpu_destroy()
Like _domain_destroy(), this will eventually idempotently free all parts of a
struct vcpu.
While breaking apart the failure path of vcpu_create(), rework the codeflow to
be in a line at the end of the function for clarity.
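As an illustration of what "idempotently free" means in practice, here is a minimal sketch using a toy structure. The toy_vcpu type and both functions are hypothetical and unrelated to the real struct vcpu; only the pattern is the point.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a structure with separately allocated parts. */
struct toy_vcpu {
    int *stats;
    char *name;
};

/*
 * Idempotent teardown: free(NULL) is a no-op and every pointer is cleared
 * after being freed, so calling this on a partially constructed object,
 * or calling it twice, is harmless.
 */
void toy_vcpu_destroy(struct toy_vcpu *v)
{
    free(v->stats);
    v->stats = NULL;
    free(v->name);
    v->name = NULL;
}

/* Construction funnels every failure into the same teardown path. */
int toy_vcpu_create(struct toy_vcpu *v)
{
    v->stats = NULL;
    v->name = NULL;

    v->stats = calloc(4, sizeof(*v->stats));
    if (!v->stats)
        goto fail;

    v->name = calloc(16, 1);
    if (!v->name)
        goto fail;

    return 0;

 fail:
    toy_vcpu_destroy(v);
    return -1;
}
```

Because the teardown tolerates any intermediate state, the failure path needs a single label at the end of the function rather than one unwind step per allocation.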
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Wed, 5 Sep 2018 16:48:02 +0000 (16:48 +0000)]
xen/vcpu: Rename the common interfaces for consistency
The vcpu functions are far less consistent than the domain side of things, and
in particular, have vcpu_destroy() for architecture specific functionality.
which makes the vcpu hierarchy consistent with the domain hierarchy.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
tools/tests/depriv/Makefile directly builds the target program from
its C-source. This is problematic when an incremental build is needed
after a header the program is depending on has been modified: in this
case all headers are added into the gcc call and the build will fail.
Correct that by adding a rule for building the program from its .o
file.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Fri, 31 Aug 2018 07:02:42 +0000 (01:02 -0600)]
tools/tests: allow depriv-fd-checker to build with really old Linux headers
Assuming it was intentional for this test utility, other than most other
ones, to always be built, I think it would be nice if it didn't fail to
build on really old distros just because of the lack of a TUNGETIFF
definition.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 24 Aug 2018 20:01:40 +0000 (21:01 +0100)]
xen: decouple HVM and IOMMU capabilities
HVM and IOMMU are two distinct hardware features, yet they were
bundled together in sysctl and xl's output.
Decouple them on sysctl level. On toolstack level we still need to
maintain a sensible semantics for `xl info`. Massage the information
according to the following table:
Alexandru Isaila [Mon, 10 Sep 2018 14:27:00 +0000 (16:27 +0200)]
x86/domctl: don't pause the whole domain if only getting vcpu state
This patch is focused on changing hvm_save_one() to save one
typecode from one vcpu and now that the save functions get data from a
single vcpu we can pause the specific vcpu instead of the domain.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Alexandru Isaila [Mon, 10 Sep 2018 14:27:00 +0000 (16:27 +0200)]
x86/hvm: remove redundant save functions
This patch removes the redundant save functions and renames the
save_one* to save. It then changes the domain param to vcpu in the
save funcs and adapts print messages in order to match the format of the
other save related messages.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Alexandru Isaila [Mon, 10 Sep 2018 14:26:00 +0000 (16:26 +0200)]
x86/hvm: introduce hvm_save_cpu_msrs_one()
This is used to save data from a single instance.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/mm: change default value for suppress #VE in set_mem_access()
The default value for the "suppress #VE" bit set by set_mem_access()
currently depends on whether the call is made from the same domain (the
bit is set when called from another domain and cleared if called from
the same domain). This patch changes that behavior to inherit the old
suppress #VE bit value if it is already set and to set it to 1
otherwise, which is safer and more reliable.
Signed-off-by: Vlad Ioan Topan <itopan@bitdefender.com> Signed-off-by: Adrian Pop <apop@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
x86/iommu: add map-reserved dom0-iommu option to map reserved memory ranges
Several people have reported hardware issues (malfunctioning USB
controllers) due to iommu page faults on Intel hardware. Those faults
are caused by missing RMRR (VTd) entries in the ACPI tables. Those can
be worked around on VTd hardware by manually adding RMRR entries on
the command line, this is however limited to Intel hardware and quite
cumbersome to do.
In order to solve those issues add a new dom0-iommu=map-reserved
option that identity maps all regions marked as reserved in the memory
map. Note that regions used by devices emulated by Xen (LAPIC, IO-APIC
or PCIe MCFG regions) are specifically avoided. Note that this option
is available to all Dom0 modes (as opposed to the inclusive option
which only works for PV Dom0).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
iommu: make iommu_inclusive_mapping a suboption of dom0-iommu
Introduce a new dom0-iommu=map-inclusive generic option that
supersedes iommu_inclusive_mapping. The previous behavior is preserved
and the option should only be enabled by default on Intel hardware.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Andrew Cooper [Thu, 6 Sep 2018 13:40:56 +0000 (14:40 +0100)]
xen/sched: Re-position the domain_update_node_affinity() call during vcpu construction
alloc_vcpu()'s call to domain_update_node_affinity() has existed for a decade,
but its effort is mostly wasted.
alloc_vcpu() is called in a loop for each vcpu, bringing them into existence.
The values of the affinity masks are still default, which is allcpus in
general, or a processor singleton for pinned domains.
Furthermore, domain_update_node_affinity() itself loops over all vcpus
accumulating the masks, making it quadratic with the number of vcpus.
Move it to be called once after all vcpus are constructed, which has the same
net effect, but with fewer intermediate memory allocations and less cpumask
arithmetic.
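The cost difference can be made concrete with a toy cost model. This is an illustrative sketch only: the functions are hypothetical and simply count one unit of work per vcpu visited by an update.

```c
/* Hypothetical cost model: the update walks every existing vcpu once. */
unsigned long update_affinity_cost(unsigned int nr_existing_vcpus)
{
    return nr_existing_vcpus;
}

/* Old shape: run the update after each vcpu is allocated. */
unsigned long cost_update_in_loop(unsigned int nr_vcpus)
{
    unsigned long cost = 0;

    for (unsigned int i = 1; i <= nr_vcpus; i++)
        cost += update_affinity_cost(i);

    return cost;                /* 1 + 2 + ... + n = n(n+1)/2 */
}

/* New shape: allocate everything, then run the update once. */
unsigned long cost_update_once(unsigned int nr_vcpus)
{
    return update_affinity_cost(nr_vcpus);
}
```

Under this model a 128-vcpu domain goes from 8256 vcpu visits to 128.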
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Jan Beulich [Tue, 11 Sep 2018 13:06:23 +0000 (15:06 +0200)]
x86/HVM: don't #GP/#SS on wrapping virt->linear translations
Real hardware wraps silently in most cases, so we should behave the
same. Also split real and VM86 mode handling, as the latter really
ought to have limit checks applied.
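A rough sketch of the wrapping behaviour, assuming the wrap in question is the truncation of segment base plus offset to 32 bits outside of 64-bit mode. The helper name is hypothetical and this is not the emulator's code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: form a 32-bit linear address. Unsigned arithmetic
 * wraps modulo 2^32, mirroring hardware that wraps silently instead of
 * raising a fault.
 */
uint32_t virt_to_linear32(uint32_t seg_base, uint32_t offset)
{
    return (uint32_t)(seg_base + offset);
}
```

For example, a base of 0xFFFF0000 with an offset of 0x00020000 yields 0x00010000 rather than an error.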
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 11 Sep 2018 13:05:09 +0000 (15:05 +0200)]
x86/shadow: a little bit of style cleanup
Correct indentation of a piece of code, adjusting comment style at the
same time. Constify gl3e pointers and drop a bogus (and useless once
corrected) cast.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Andrew Cooper [Wed, 29 Aug 2018 16:39:10 +0000 (16:39 +0000)]
xen: Fix inconsistent callers of panic()
Callers are inconsistent with whether they pass a newline to panic(),
including adjacent calls in the same function using different styles.
panic() not expecting a newline is inconsistent with most other printing
functions, which is most likely why we've gained so many inconsistencies.
Switch panic() to expect a newline, and update all callers which currently
lack a newline to include one.
This actually reduces the size of .rodata (0x07e3e8 down to 0x07e3a8) because
a number of strings are passed to both panic() and printk(). As they
previously differed by \n alone, they couldn't be merged.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Tue, 11 Sep 2018 09:06:41 +0000 (11:06 +0200)]
SVM: limit GIF=0 region
Use EFLAGS.IF for most ordinary purposes; there's in particular no need
to unduly defer NMI/#MC. Clear GIF only immediately before VMRUN itself.
This has the additional advantage that svm_stgi_label now indeed marks
the only place where GIF gets set.
Note regarding the main STI placement: Quite counterintuitively the
host's EFLAGS.IF continues to have a meaning while the guest runs; see
PM Vol 2 section "Physical (INTR) Interrupt Masking in EFLAGS". Hence we
need to set the flag for the duration of time being in guest context.
However, SPEC_CTRL_ENTRY_FROM_HVM wants to be carried out with EFLAGS.IF
clear.
Note regarding the main STGI placement: It could be moved further up,
but at present SPEC_CTRL_EXIT_TO_HVM is not NMI/#MC-safe.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Jan Beulich [Tue, 11 Sep 2018 09:03:46 +0000 (11:03 +0200)]
x86/HVM: split page straddling emulated accesses in more cases
Assuming consecutive linear addresses map to all RAM or all MMIO is not
correct. Nor is assuming that a page straddling MMIO access will access
the same emulating component for both parts of the access. If a guest
RAM read fails with HVMTRANS_bad_gfn_to_mfn and if the access straddles
a page boundary, issue accesses separately for both parts.
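Splitting a straddling access comes down to working out how many bytes fit in the first page. A minimal sketch, assuming 4K pages; the helper name is hypothetical and this is not the hvmemul code.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/*
 * Hypothetical helper: number of bytes of an access starting at 'addr'
 * that fall within the first page. The remainder (bytes - result), if
 * any, belongs to the following page and is issued separately.
 */
size_t bytes_in_first_page(uint64_t addr, size_t bytes)
{
    size_t room = PAGE_SIZE - (size_t)(addr & (PAGE_SIZE - 1));

    return bytes < room ? bytes : room;
}
```

An 8-byte access at 0x1ffc would thus be issued as 4 bytes ending at the page boundary followed by 4 bytes at 0x2000, each of which may be handled by a different component.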
The extra call to known_gla() from hvmemul_write() is just to preserve
original behavior; for consistency the check also gets added to
hvmemul_rmw() (albeit I continue to be unsure whether we wouldn't better
drop both).
Note that the correctness of this depends on the MMIO caching used
elsewhere in the emulation code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Tue, 11 Sep 2018 09:02:37 +0000 (11:02 +0200)]
x86/HVM: drop hvm_fetch_from_guest_linear()
It can easily be expressed through hvm_copy_from_guest_linear(), and in
two cases this even simplifies callers.
Suggested-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
ebitmap.c:244:32: error: invalid conversion specifier 'Z' [-Werror,-Wformat-invalid-specifier]
"match my size %Zd (high bit was %d)\n", mapunit,
~^
ebitmap.c:245:16: error: format specifies type 'int' but the argument has type 'unsigned long'
[-Werror,-Wformat]
sizeof(u64) * 8, e->highbit);
^~~~~~~~~~~~~~~
ebitmap.c:245:33: error: data argument not used by format string [-Werror,-Wformat-extra-args]
sizeof(u64) * 8, e->highbit);
Use %zd instead of %Zd, which is compliant with C99.
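For reference, a small sketch of the C99 'z' length modifier in use. The helper is hypothetical; it uses %zu because the argument in this example is an unsigned size_t, while %zd is the signed counterpart sharing the same length modifier.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper formatting a size_t value the C99 way. */
int format_mapunit(char *buf, size_t len)
{
    /* sizeof yields size_t, so the matching unsigned conversion is %zu. */
    return snprintf(buf, len, "%zu", sizeof(uint64_t) * 8);
}
```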
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Jan Beulich [Tue, 11 Sep 2018 09:00:01 +0000 (11:00 +0200)]
x86/HVM: meet xentrace's expectations on emulation event data
According to the logic in hvm_mmio_assist_process(), 64 bits of data are
expected with 64-bit addresses, and 32 bits of data with 32-bit ones. I
don't think this is very reasonable, but I'm also not going to touch the
consumer side, the more that it is anyway not very helpful for the code
here to only ever supply 32 bits of data (despite the field being 64
bits wide, and having been even in the 32-bit days of Xen).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Wei Liu [Fri, 7 Sep 2018 10:41:31 +0000 (11:41 +0100)]
mkdeb: use compression level 0
This requires calling dpkg-deb directly and passing it -z0.
It reduces the time to run the mkdeb script from 14 seconds to 3
seconds on my workstation with SSD, from 87s to 15s on a machine
with HDD. The deb file grows from 49M to 58M.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Olaf Hering [Thu, 30 Aug 2018 10:05:11 +0000 (12:05 +0200)]
tools/mkrpm: switch payload to gzip to reduce turnaround time
rpmbuild -bb spends a lot of time compressing the binaries. Reduce the
turnaround time of 'make rpmball' by using gzip as compression tool.
This reduces the buildtime from 'w9.xzdio'/138 seconds to 'w1.gzdio'/88
seconds in my environment.
The downside is an increased filesize of xen.rpm, 19MB vs. 37MB.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wei.liu2@citrix.com>
In order to build a tailored pvshim-only binary from Xen. Switch the
PV shim build from the tools firmware into using the new defconfig.
A diff of the .config generated for the pvshim firmware build before
and after this change shows no differences.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
x86/dmar: zap DMAR signature for dom0 once in TBOOT case
Commit 6c298ecc1f ("vtd: Reinstate ACPI DMAR on system shutdown or
S3/S4/S5") did everything for acpi_dmar_zap() call to be unnecessary,
except for invoking the function from acpi_parse_dmar(), which 123c779379 ("VTd/dmar: Tweak how the DMAR table is clobbered")
added several years later.
Some stale comments are also removed. No functional change.
Andrew Cooper [Wed, 29 Aug 2018 16:27:44 +0000 (16:27 +0000)]
xen/ARM+sched: Don't opencode %pv in printk()'s
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Tue, 27 Feb 2018 17:22:40 +0000 (17:22 +0000)]
xen/domctl: Drop vcpu_alloc_lock
Since its introduction in c/s 8cbb5278e "x86/AMD: Add support for AMD's OSVW
feature in guests", the OSVW data has been corrected to be per-domain rather
than per-vcpu, and is initialised during XEN_DOMCTL_createdomain.
Furthermore, because XENPF_microcode_update uses hypercall continuations to
move between CPUs, it drops the vcpu_alloc_lock mid update, meaning that it
didn't provide the interlock guarantee that the OSVW patch was looking for in
the first place.
This interlock serves no purpose, so take the opportunity to drop it and
remove a global spinlock from the hypervisor.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 6 Sep 2018 14:05:52 +0000 (16:05 +0200)]
x86emul: fix test harness dependencies
The generated header files are what needs to spell out dependencies on
other (real) headers in the main Makefile here, not the intermediate
(helper) .o files produced through testcase.mk.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Thu, 6 Sep 2018 14:04:51 +0000 (16:04 +0200)]
x86/hvm: remove default ioreq server (again)
My recent patch [1] to qemu-xen-traditional removes the last use of the
'default' ioreq server in Xen. (This is a catch-all ioreq server that is
used if no explicitly registered I/O range is targeted).
This patch can be applied once that patch is committed, to remove the
(>100 lines of) redundant code in Xen.
The previous version of this patch caused a QEMU build failure. This has
been fixed by extending the #ifdef around deprecated HVM_PARAM declarations
to __XEN_TOOLS__ as well as __XEN__.
NOTE: The removal of the special case for HVM_PARAM_DM_DOMAIN in
hvm_allow_set_param() is not directly related to removal of
default ioreq servers. It could have been cleaned up at any time
after commit 9a422c03 "x86/hvm: stop passing explicit domid to
hvm_create_ioreq_server()". It is now added to the new
deprecated sets introduced by this patch.
Olaf Hering [Thu, 6 Sep 2018 14:02:58 +0000 (16:02 +0200)]
xen: add DEBUG_INFO Kconfig symbol
Creating debug info during build is not strictly required at runtime.
Make it optional by introducing a new Kconfig knob "DEBUG_INFO".
This slightly reduces build time and disk usage, if disabled.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 31 Aug 2018 15:22:05 +0000 (17:22 +0200)]
xen: fill topology info for all present cpus
The topology information obtainable via XEN_SYSCTL_cputopoinfo is
filled rather weirdly: the size of the array is derived from the highest
online cpu number, so in case there are trailing offline cpus they
will not be included.
On a dual core system with 4 threads booted with smt=0 without this
patch xl info -n will print:
Juergen Gross [Fri, 31 Aug 2018 15:22:04 +0000 (17:22 +0200)]
tools/libxl: correct vcpu affinity output with sparse physical cpu map
With not all physical cpus online (e.g. with smt=0) the output of the
vcpu affinities is wrong, as the affinity bitmaps are capped after
nr_cpus bits, instead of using max_cpu_id.
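The effect of capping a sparse bitmap at the wrong width can be shown with a small sketch. The helper is hypothetical, and the example map below assumes that on a 4-thread system booted with smt=0 the online CPUs are 0 and 2.

```c
#include <stdint.h>

/*
 * Hypothetical helper: count the set bits that remain visible when a CPU
 * bitmap is truncated to its lowest 'nr_bits' bits (nr_bits must be < 64).
 */
unsigned int visible_cpus(uint64_t map, unsigned int nr_bits)
{
    uint64_t capped = map & ((UINT64_C(1) << nr_bits) - 1);
    unsigned int count = 0;

    /* Clear the lowest set bit on each iteration. */
    for (; capped; capped &= capped - 1)
        count++;

    return count;
}
```

With the map 0b0101, capping at nr_cpus = 2 bits shows only CPU 0, whereas capping at max_cpu_id + 1 = 4 bits shows both online CPUs.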
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
'xl sysrq' command doesn't work with modern Linux guests with the following
message in guest's log:
xen:manage: sysrq_handler: Error -13 writing sysrq in control/sysrq
xenstore trace confirms:
IN 0x24bd9a0 20180904 04:36:32 WRITE (control/sysrq )
OUT 0x24bd9a0 20180904 04:36:32 ERROR (EACCES )
The problem seems to be in the fact that we don't pre-create control/sysrq
xenstore node and libxl_send_sysrq() doing libxl__xs_printf() creates it as
read-only. As we want to allow guests to clean 'control/sysrq' after the
requested action is performed, we need to make this node writable.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
tools/xl: fix output of xl vcpu-pin dry run with smt=0
Fix another smt=0 fallout: xl -N vcpu-pin prints only parts of the
affinities as it is using the number of online cpus instead of the
maximum cpu number.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Tue, 4 Sep 2018 16:15:18 +0000 (17:15 +0100)]
x86: change name of parameter for various invlpg functions
They all incorrectly named a parameter virtual address while it should
have been linear address.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Mon, 3 Sep 2018 11:48:13 +0000 (12:48 +0100)]
xen/domain: Fold xsm_free_security_domain() paths together
xsm_free_security_domain() is idempotent (both the dummy handler, and the
flask handler). Move it into the shared __domain_destroy() path, and drop the
INIT_xsm flag from domain_create().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 3 Sep 2018 11:10:48 +0000 (12:10 +0100)]
xen/domain: Call lock_profile_deregister_struct() from common code
lock_profile_register_struct() is called from common code, but the matching
deregister was previously only called from x86 code.
The practical upshot of this is that, when using CONFIG_LOCK_PROFILE, destroyed domains
on ARM (and in particular, the freed page behind struct domain) remain on the
lockprofile linked list, which will become corrupt when the page is reused.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 3 Sep 2018 10:52:17 +0000 (11:52 +0100)]
xen/domain: Break _domain_destroy() out of domain_create() and complete_domain_destroy()
This is the first step in making the destroy path idempotent, and using it in
place of the ad-hoc cleanup paths in the create path.
To begin with, the trivial free operations are broken out. The rest of the
cleanup code will be moved as it is demonstrated (or made) to be idempotent.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 3 Sep 2018 13:22:16 +0000 (14:22 +0100)]
xen/domain: Prepare data for is_{pv,hvm}_domain() as early as possible
Given two subtle failures from getting this wrong before, and more cleanup on
the way, move the setting of d->guest_type as early as possible.
Note that despite moving the assignment of d->guest_type outside of the
is_idle_domain(d) check, it still behaves the same. Previously, system
domains had no direct assignment of d->guest_type and behaved as PV guests
because guest_type_pv has the value 0.
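Why zero-initialised system domains behaved as PV guests can be seen in a reduced sketch. The toy_domain type and predicate are hypothetical, and the enumerator name guest_type_hvm is an assumption; only guest_type_pv having the value 0 comes from the text above.

```c
#include <stdbool.h>

/* Hypothetical mirror of the enum: the PV type has the value 0. */
enum guest_type { guest_type_pv, guest_type_hvm };

struct toy_domain {
    enum guest_type guest_type;
};

bool toy_is_pv_domain(const struct toy_domain *d)
{
    return d->guest_type == guest_type_pv;
}
```

A structure that is zeroed on allocation and never assigned a guest type therefore satisfies the PV predicate implicitly.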
While tidying up the predicate, leave a comment referring to
is_system_domain(), and move the associated ASSERT() to be beside the
assignment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 4 Sep 2018 09:30:29 +0000 (11:30 +0200)]
x86emul: clean up AVX2 insn use in test harness
Drop the pretty pointless conditionals from code testing AVX insns and
properly use AVX2 mnemonics in code testing AVX2 insns (the test harness
is already requiring sufficiently new a compiler/assembler).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 4 Sep 2018 09:29:22 +0000 (11:29 +0200)]
x86emul: extend MASKMOV{Q,DQU} tests
While deriving the first AVX512 pieces from existing code I've got the
(in the end wrong) impression that the emulation of these insns would be
broken. Besides testing that the instructions act as no-ops when the
controlling mask bits are all zero, add ones to also check that the data
merging actually works.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 4 Sep 2018 09:28:30 +0000 (11:28 +0200)]
x86emul: fix FMA scalar operand sizes
FMA insns, unlike the earlier AVX additions, don't use the low opcode
bit to distinguish between single and double vector elements. While the
difference is benign for packed flavors, the scalar ones need to use
VEX.W here. Oddly enough the table entries didn't even use
simd_scalar_fp, but uniformly used simd_packed_fp (implying the
distinction was by [VEX-encoded] opcode prefix).
Split simd_scalar_fp into simd_scalar_opc and simd_scalar_vexw, and
correct FMA scalar table entries to use the latter.
Also correct the scalar insn comments (they only ever use XMM registers
as operands).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Or else it defaults to using 0x100000 as the entry point, which might
or might not point to _start. This is a fix for 09b3907f93.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 1 Aug 2018 13:48:33 +0000 (13:48 +0000)]
x86/hvm: Fix mapping corner case during task switching
hvm_map_entry() can fail for a number of reasons, including for a misaligned
LDT/GDT access which crosses a 4K boundary. Architecturally speaking, this
should be fixed, but Long Mode doesn't support task switches, and no 32bit OS
is going to misalign its LDT/GDT base, which is why this task isn't very high
on the TODO list.
However, the hvm_map_fail error label returns failure without raising an
exception, which interferes with hvm_task_switch()'s exception tracking, and
can cause it to finish and return to guest context as if the task switch had
completed successfully.
Resolve this corner case by folding all the failure paths together, which
causes an hvm_map_entry() failure to result in #TS[SEL]. hvm_unmap_entry()
copes fine with a NULL pointer so can be called unconditionally.
In practice, this is just a latent corner case as all hvm_map_entry() failures
crash the domain, but it should be fixed nevertheless.
Finally, rename hvm_load_segment_selector() to task_switch_load_seg() to avoid
giving the impression that it is usable for general segment loading.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 24 Jan 2018 16:43:55 +0000 (16:43 +0000)]
x86/mm: Drop {HAP,SHADOW}_ERROR() wrappers
Unlike the PRINTK/DEBUG wrappers, these go straight out to the console, rather
than ending up in the debugtrace buffer.
A number of these users are followed by domain_crash(), and future changes
will want to combine the printk() into the domain_crash() call. Expand these
wrappers in place, using XENLOG_ERR before a BUG(), and XENLOG_G_ERR before a
domain_crash().
Perform some %pv/PRI_mfn/etc cleanup while modifying the invocations, and
explicitly drop some calls which are unnecessary (bad shadow op, and the empty
stubs for incorrect sh_map_and_validate_gl?e() calls).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
The hvmloader binary generated when using LLVM LD doesn't work
properly and seems to get stuck while trying to generate and load the
ACPI tables. This is caused by the layout of the binary when linked
with LLVM LD.
LLVM LD has a different default linker script than GNU LD, and the
resulting hvmloader binary is slightly different:
There's however the PHDR which is not present when using GNU LD.
Fix this by using a very simple linker script that generates the same
binary regardless of whether LLVM or GNU LD is used. By using a linker
script the usage of -Ttext can also be avoided by placing the desired
.text load address directly in the linker script.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 3 Sep 2018 15:51:40 +0000 (17:51 +0200)]
x86/boot: silence MADT table entry logging
Logging disabled LAPIC / x2APIC entries with invalid local APIC IDs
(ones having "broadcast" meaning when used) isn't very useful, and can
be quite noisy on larger systems. Suppress their logging unless
opt_cpu_info is true.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 3 Sep 2018 15:50:10 +0000 (17:50 +0200)]
x86: assorted array_index_nospec() insertions
Don't chance having Spectre v1 (including BCBS) gadgets. In some of the
cases the insertions are more of precautionary nature rather than there
provably being a gadget, but I think we should err on the safe (secure)
side here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
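
To illustrate the kind of clamping that array_index_nospec() provides, here is a
minimal sketch of a branchless index mask. It is not Xen's implementation:
index_mask() and lookup() are hypothetical names, the mask arithmetic assumes
both values are below half the range of unsigned long, and it relies on
right-shifting a negative long being an arithmetic shift (true for GCC and
Clang). Plain C also gives no guarantee about what the compiler emits, so treat
this as a picture of the idea only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: all ones when idx < size, zero otherwise, computed
 * without a conditional branch. When idx < size, neither idx nor
 * (size - 1 - idx) has its top bit set, so the inverted value is negative
 * and the arithmetic shift smears the sign bit across the whole word.
 */
static unsigned long index_mask(unsigned long idx, unsigned long size)
{
    return ~(long)(idx | (size - 1 - idx)) >> (sizeof(long) * CHAR_BIT - 1);
}

/*
 * Hypothetical table lookup. The architectural bounds check stays in place;
 * the mask additionally forces idx to 0 if the CPU speculates past the check
 * with an out-of-range value.
 */
static int lookup(const int *table, unsigned long size, unsigned long idx)
{
    assert(table != NULL);

    if (idx >= size)
        return -1;

    idx &= index_mask(idx, size);
    return table[idx];
}
```

The design point is that the clamp depends on data flow rather than control
flow, so a mispredicted bounds check cannot feed an attacker-chosen index into
the dependent load.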
c/s 580c45869 "Call arch_domain_create() as early as possible in
domain_create()" overlooked the fact that ARM uses is_hardware_domain() in at
least two places during arch_domain_create().
when dom0 tries to use the vuart. Judging by other uses of
is_hardware_domain(), I expect the x86 PVH dom0 boot is similarly broken.
Reposition the code which sets up hardware_domain so that the
is_hardware_domain() predicate works correctly all the way through domain
creation.
While moving it, leave a related comment explaining the positioning of the
is_priv assignment, which in hindsight should have been part of c/s ef765ec98
when exactly the same problem was discovered for the is_control_domain()
predicate.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Tested-by: Julien Grall <julien.grall@arm.com>
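
The ordering problem described above can be shown with a toy model. Everything
here is a simplified assumption rather than Xen code: the structure layout, the
saw_hardware_domain field and the two domain_create_*() variants are
hypothetical, and the predicate is modelled as a plain pointer comparison
against a global.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy domain: records what the arch hook observed during construction. */
struct domain {
    int domain_id;
    bool saw_hardware_domain;
};

/* Global consulted by the predicate, as in the commit message. */
static struct domain *hardware_domain;

static bool is_hardware_domain(const struct domain *d)
{
    return d == hardware_domain;
}

/* Stand-in for the arch hook, which queries the predicate while running. */
static void arch_domain_create(struct domain *d)
{
    d->saw_hardware_domain = is_hardware_domain(d);
}

/* Broken ordering: the arch hook runs before hardware_domain is set. */
static void domain_create_late(struct domain *d)
{
    arch_domain_create(d);
    hardware_domain = d;
}

/* Fixed ordering: hardware_domain is set before the arch hook runs. */
static void domain_create_early(struct domain *d)
{
    hardware_domain = d;
    arch_domain_create(d);
}
```

With the late variant the hook sees false for the very domain that is about to
become the hardware domain; with the early variant it sees true.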
Andrew Cooper [Tue, 28 Aug 2018 16:00:36 +0000 (16:00 +0000)]
x86/hvm: Drop hvm_{vmx,svm} shorthands
By making {vmx,svm} in hvm_vcpu into an anonymous union (consistent with the
domain side of things), the hvm_{vmx,svm} defines can be dropped, and all code
can refer to the correctly-named fields. This means that the data hierarchy is no
longer obscured from grep/cscope/tags/etc.
Reformat one comment and switch one bool_t to bool while making changes.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
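
A small sketch of why the anonymous union helps searchability. The structure
contents and the exact form of the old shorthand are simplified assumptions
(the member names exec_control and vmcb_pa, the _old/_new suffixes and the two
read helpers are hypothetical); only the named-union-plus-define versus
anonymous-union contrast is taken from the description above. Anonymous unions
require C11 or the GNU extension.

```c
#include <assert.h>

/* Placeholder per-vendor state; real structures hold far more. */
struct vmx_vcpu { unsigned long exec_control; };
struct svm_vcpu { unsigned long vmcb_pa; };

/* Before: a named union, with a define hiding the extra path component. */
struct hvm_vcpu_old {
    union {
        struct vmx_vcpu vmx;
        struct svm_vcpu svm;
    } u;
};
#define hvm_vmx u.vmx

/* After: an anonymous union, so the members are reachable by their own names. */
struct hvm_vcpu_new {
    union {
        struct vmx_vcpu vmx;
        struct svm_vcpu svm;
    };
};

/* The shorthand spelling: grep for ".vmx" does not find this access. */
static unsigned long old_read(const struct hvm_vcpu_old *h)
{
    return h->hvm_vmx.exec_control;
}

/* The direct spelling: the field path in the source matches the data layout. */
static unsigned long new_read(const struct hvm_vcpu_new *h)
{
    return h->vmx.exec_control;
}
```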
Andrew Cooper [Tue, 28 Aug 2018 15:59:28 +0000 (15:59 +0000)]
x86/svm: Rename arch_svm_struct to svm_vcpu
The suffix and prefix are redundant, and the name is curiously odd. Rename it
to svm_vcpu to be consistent with all the other similar structures. In
addition, rename local arch_svm variables to svm for further
consistency.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Tue, 28 Aug 2018 15:53:06 +0000 (15:53 +0000)]
x86/vmx: Rename arch_vmx_struct to vmx_vcpu
The suffix and prefix are redundant, and the name is curiously odd. Rename it
to vmx_vcpu to be consistent with all the other similar structures. In
addition, rename local arch_vmx variables to vmx for further
consistency.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com>
--- CC: Roger Pau Monné <roger.pau@citrix.com>
Some of the local pointers are named arch_vmx. I'm open to renaming them to
just vmx (like all the other local pointers) if people are happy with the
additional patch delta.
Andrew Cooper [Tue, 28 Aug 2018 15:52:34 +0000 (15:52 +0000)]
x86/hvm: Rename v->arch.hvm_vcpu to v->arch.hvm
The trailing _vcpu suffix is redundant, but adds to code volume. Drop it.
Reflow lines as appropriate. No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Tue, 28 Aug 2018 15:50:41 +0000 (15:50 +0000)]
xen/hvm: Rename d->arch.hvm_domain to d->arch.hvm
The trailing _domain suffix is redundant, but adds to code volume. Drop it.
Reflow lines as appropriate, and switch to using the new XFREE/etc wrappers
where applicable.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Mon, 19 Mar 2018 17:07:50 +0000 (17:07 +0000)]
xen/domain: Allocate d->vcpu[] in domain_create()
For ARM, the call to arch_domain_create() needs to have completed before
domain_max_vcpus() will return the correct upper bound.
For each architecture's dom0, drop the temporary max_vcpus parameter and the
allocation of dom0->vcpu.
With d->max_vcpus now correctly configured before evtchn_init(), the poll mask
can be constructed suitably for the domain, rather than for the worst-case
setting.
Due to the evtchn_init() fixes, it no longer calls domain_max_vcpus(), and
ARM's two implementations of vgic_max_vcpus() no longer need to work around the
out-of-order call.
From this point on, d->max_vcpus and d->vcpu[] are valid for any domain which
can be looked up by domid.
The XEN_DOMCTL_max_vcpus hypercall is modified to reject any call attempt with
max != d->max_vcpus, which does match the older semantics (not that it is
obvious from the code). The logic to allocate d->vcpu[] is dropped, but at
this point the hypercall still needs making to allocate each vcpu.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com>
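
The resulting contract can be sketched with a toy model: the vcpu array is
sized when the domain is created, and the later max_vcpus call only accepts
the value already chosen. All names prefixed toy_ are hypothetical, the
structure is heavily reduced, and the error value is an assumption, since the
message above only says the call is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct vcpu;                      /* opaque in this sketch */

struct domain {
    unsigned int max_vcpus;
    struct vcpu **vcpu;           /* array of max_vcpus pointers */
};

/* Allocate the vcpu pointer array up front, as part of domain creation. */
static struct domain *toy_domain_create(unsigned int max_vcpus)
{
    struct domain *d;

    if (max_vcpus == 0)
        return NULL;

    d = calloc(1, sizeof(*d));
    if (d == NULL)
        return NULL;

    d->vcpu = calloc(max_vcpus, sizeof(*d->vcpu));
    if (d->vcpu == NULL) {
        free(d);
        return NULL;
    }

    d->max_vcpus = max_vcpus;
    return d;
}

/* The later call may not change the limit; it only confirms it. */
static int toy_domctl_max_vcpus(struct domain *d, unsigned int max)
{
    if (max != d->max_vcpus)
        return -EINVAL;           /* assumed error code */

    /* At this point the real hypercall would still allocate each vcpu. */
    return 0;
}

static void toy_domain_destroy(struct domain *d)
{
    free(d->vcpu);
    free(d);
}
```

Because the array exists as soon as the domain does, code that looks the domain
up never observes a NULL vcpu array.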
Andrew Cooper [Mon, 19 Mar 2018 17:28:50 +0000 (17:28 +0000)]
xen/dom0: Arrange for dom0_cfg to contain the real max_vcpus value
Make dom0_max_vcpus() a common interface, and implement it on ARM by splitting
the existing alloc_dom0_vcpu0() function in half.
As domain_create() doesn't yet set up the vcpu array, the max value is also
passed into alloc_dom0_vcpu0(). This is temporary for bisectability and is
removed in the following patch.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 27 Feb 2018 17:39:37 +0000 (17:39 +0000)]
tools: Pass max_vcpus to XEN_DOMCTL_createdomain
XEN_DOMCTL_max_vcpus is a mandatory hypercall, but nothing actually prevents a
toolstack from unpausing a domain with no vcpus.
Originally, d->vcpu[] was an embedded array in struct domain, but c/s
fb442e217 "x86_64: allow more vCPU-s per guest" in Xen 4.0 altered it to being
dynamically allocated. A side effect of this is that d->vcpu[] is NULL until
XEN_DOMCTL_max_vcpus has completed, but a lot of hypercalls blindly
dereference it.
Even today, the behaviour of XEN_DOMCTL_max_vcpus is a mandatory singleton
call which can't change the number of vcpus once a value has been chosen.
In preparation to remove the hypercall, extend xen_domctl_createdomain with
a max_vcpus field and arrange for all callers to pass the appropriate
value. There is no change in construction behaviour yet, but later patches
will rearrange the hypervisor internals.
For the python stubs, extend the domain_create keyword list to take a
max_vcpus parameter, in lieu of deleting the pyxc_domain_max_vcpus function.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Christian Lindig <christian.lindig@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 19 Mar 2018 16:50:46 +0000 (16:50 +0000)]
xen/domain: Call arch_domain_create() as early as possible in domain_create()
This is in preparation to set up d->max_vcpus and d->vcpu[] in domain_create(),
and allow later parts of domain construction to have access to the values.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 19 Mar 2018 16:06:24 +0000 (16:06 +0000)]
xen/gnttab: Fold grant_table_{create,set_limits}() into grant_table_init()
Now that the max_{grant,maptrack}_frames are specified from the very beginning
of grant table construction, the various initialisation functions can be
folded together and simplified as a result.
Leave grant_table_init() as the public interface, which is more consistent
with other subsystems.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 27 Feb 2018 17:39:37 +0000 (17:39 +0000)]
xen/domctl: Remove XEN_DOMCTL_set_gnttab_limits
Now that XEN_DOMCTL_createdomain handles the grant table limits, remove
XEN_DOMCTL_set_gnttab_limits (including XSM hooks and libxc wrappers).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Mon, 19 Mar 2018 11:19:52 +0000 (11:19 +0000)]
xen/gnttab: Pass max_{grant,maptrack}_frames into grant_table_create()
... rather than setting the limits up after domain_create() has completed.
This removes the common gnttab infrastructure for calculating the number of
dom0 grant frames (as the common grant table code is not an appropriate place
for it to live), opting instead to require the dom0 construction code to pass
a sane value in via the configuration.
In practice, this now means that there is never a partially constructed grant
table for a reference-able domain.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Tue, 27 Feb 2018 17:39:37 +0000 (17:39 +0000)]
tools: Pass grant table limits to XEN_DOMCTL_set_gnttab_limits
XEN_DOMCTL_set_gnttab_limits is a fairly new hypercall, and is strictly
mandatory. As it pertains to domain limits, it should be provided at
createdomain time.
In preparation to remove the hypercall, extend xen_domctl_createdomain with
the fields and arrange for all callers to pass appropriate details. There is
no change in construction behaviour yet, but later patches will rearrange the
hypervisor internals, then delete the hypercall.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Christian Lindig <christian.lindig@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Tue, 3 Oct 2017 10:18:37 +0000 (11:18 +0100)]
x86/pv: Deprecate support for paging out the LDT
This code is believed to be a vestigial remnant of the PV Windows XP port. It
is not used by Linux, NetBSD, Solaris or MiniOS. Furthermore the
implementation is incomplete; it only functions for a present => not-present
transition, rather than a present => read/write transition.
The for_each_vcpu() is one scalability limitation for PV guests, which can't
reasonably be altered to be continuable. Most importantly, however, this is
the only codepath which plays with descriptor frames of a remote vcpu.
A side effect of dropping support for paging the LDT out is that the LDT no
longer automatically cleans itself up on domain destruction. Cover this by
explicitly releasing the LDT frames at the same time as the GDT frames.
Finally, leave some asserts around to confirm the expected behaviour of all
the functions playing with PGT_seg_desc_page references.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>