Anthony PERARD [Tue, 3 May 2016 15:59:49 +0000 (16:59 +0100)]
configure: Fix when no libsystemd compat lib are available
From systemd change log, since version 209, libsystemd.so contain
everything, including libsystemd-daemon.so. Distro may, or may not provide
the compatibility libraries which libsystemd-daemon is part of.
So, if libsystemd-daemon is not available, check for the presence of
a recent enough libsystemd.
George Dunlap [Fri, 5 Aug 2016 12:07:57 +0000 (14:07 +0200)]
xen: Remove buggy initial placement algorithm
The initial placement algorithm sometimes picks cpus outside of the
mask it's given, does a lot of unnecessary bitmasking, does its own
separate load calculation, and completely ignores vcpu hard and soft
affinities. Just get rid of it and rely on the schedulers to do
initial placement.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5438accceecc8172db2d37d98b695eb8bc43afc
master date: 2016-07-26 10:44:06 +0100
George Dunlap [Fri, 5 Aug 2016 12:07:27 +0000 (14:07 +0200)]
xen: Have schedulers revise initial placement
The generic domain creation logic in
xen/common/domctl.c:default_vcpu0_location() attempts to try to do
initial placement load-balancing by placing vcpu 0 on the least-busy
non-primary hyperthread available. Unfortunately, the logic can end
up picking a pcpu that's not in the online mask. When this is passed
to a scheduler such which assumes that the initial assignment is
valid, it causes a null pointer dereference looking up the runqueue.
Furthermore, this initial placement doesn't take into account hard or
soft affinity, or any scheduler-specific knowledge (such as historic
runqueue load, as in credit2).
To solve this, when inserting a vcpu, always call the per-scheduler
"pick" function to revise the initial placement. This will
automatically take all knowledge the scheduler has into account.
csched2_cpu_pick ASSERTs that the vcpu's pcpu scheduler lock has been
taken. Grab and release the lock to minimize time spend with irqs
disabled.
Euan Harris [Fri, 5 Aug 2016 11:50:47 +0000 (13:50 +0200)]
nested vmx: Validate host VMX MSRs before accessing them
Some VMX MSRs may not exist on certain processor models, or may
be disabled because of configuration settings. It is only safe to
access these MSRs if configuration flags in other MSRs are set. These
prerequisites are listed in the Intel 64 and IA-32 Architectures
Software Developer’s Manual, Vol 3, Appendix A.
nvmx_msr_read_intercept() does not check the prerequisites before
accessing MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
MSR_IA32_VMX_VMFUNC on the host. Accessing these MSRs from a nested
VMX guest running on a host which does not support them will cause
Xen to crash with a GPF.
Signed-off-by: Euan Harris <euan.harris@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5e02972646132ad98c365ebfcfcb43b40a0dde36
master date: 2016-06-13 12:44:32 +0100
Andrew Cooper [Wed, 15 Jun 2016 17:32:14 +0000 (18:32 +0100)]
x86/entry: Avoid SMAP violation in compat_create_bounce_frame()
A 32bit guest kernel might be running on user mappings.
compat_create_bounce_frame() must whitelist its guest accesses to avoid
risking a SMAP violation.
For both variants of create_bounce_frame(), re-blacklist user accesses if
execution exits via an exception table redirection.
This is XSA-183 / CVE-2016-6259
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 11 Jul 2016 13:32:03 +0000 (14:32 +0100)]
x86/pv: Remove unsafe bits from the mod_l?_entry() fastpath
All changes in writeability and cacheability must go through full
re-validation.
Rework the logic as a whitelist, to make it clearer to follow.
This is XSA-182
Reported-by: Jérémie Boutoille <jboutoille@ext.quarkslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
This is because grant_table.h contains this (note the
apostrophe): "granter\92s memory", and `grep -v', in version
2.23, stops processing the file (while, for instance,
until 2.22, this was not happening).
Although the above behavior is likely an issue in grep,
(https://debbugs.gnu.org/cgi/bugreport.cgi?bug=22461)
I think we better switch to using " ' " in that line
anyway, as we do basically everywhere else (even in
the same file).
Wei Liu [Thu, 26 May 2016 15:11:42 +0000 (16:11 +0100)]
libxl: set XEN_QEMU_CONSOLE_LIMIT for QEMU
XSA-180 provides a patch to QEMU to bodge QEMU logging issue. We
explicitly set the limit in libxl for 4.7.
Introduce a function for setting the environment variable and call it in
the right places.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit b0d409d9c4944ed29d29457fea4ad6b41d698eca)
This version of libxl does not pass a dm_envs to the
*build_device_model_args* functions. Instead, call
libxl__set_qemu_env_for_xsa_180 in libxl__spawn_local_dm.
The other call ultimate site of *build_device_model_args* (ie of
libxl__build_device_model_args) is in libxl__spawn_stub_dm, where we
don't need to set the env var.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 8 Jun 2016 14:42:19 +0000 (15:42 +0100)]
libxl: Fix NULL pointer due to XSA-178 fix wrong XS nodename
In "libxl: Do not trust backend for disk eject vdev" (c69871a2fb26 on
xen.git#staging) we changed libxl_evenable_disk_eject to read the
device vdev out of xenstore from the /libxl path, rather than the
backend path, and to read it during setup rather than on each event.
However, the patch has a mistake:
- GCSPRINTF("%s/dev", backend), NULL);
+ GCSPRINTF("%s/vdev", libxl_path), &configured_vdev);
^
Spot the extra "v". This causes configured_vdev always to be NULL.
configured_vdev is passed to [libxl__]strdup.
In Xen 4.6 and later libxl__strdup is used and tolerates NULL.
evg->vdev is set to NULL. This propagates to the `vdev' field in the
generated event. This may or may not cause further trouble, depending
on the calling application. In our osstest test cases it does not
cause any trouble, so the bug goes undetected.
In Xen 4.5 and earlier, the strdup does not tolerate NULL, and libxl
crashes immediately. This has been detected by osstest as a
regression in Xen 4.5.
IMO this patch should be applied immediately to
xen.git#staging-4.5 (to check that it fixes the osstest regression)
xen.git#staging (to check that it does not break master
Subject to passes, it should then be propagated to all supported
stable trees and also be mentioned in an update to XSA-178.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> CC: security@xenproject.org CC: Jan Beulich <jbeulich@suse.com> CC: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 62b4d4769ca39fd5263da20d786a7b9a80a22d9a)
(cherry picked from commit 8b7a356409023f60f80e9f4b00bba16ad56cd77b)
libxl: keep PoD target adjustment by memory fudge after reload_domain_config()
Commit 56fb5fd623 ("libxl: adjust PoD target by memory fudge, too")
introduced target_memkb adjustment for HVM PoD domains on create,
wherein the value it wrote to target is always 1MiB lower than the
actual target_memkb. Unfortunately, on reboot, it is this value which
is read *unmodified* to feed into the next domain creation; from which
1MiB is subtracted *again*. This means that any guest which reboots
with memory < maxmem will have its memory target decreased by 1MiB on
every boot.
This patch makes it so that when reading target on reboot, we adjust the
value we read *up* by 1MiB, so that the domain will be build with the
appropriate amount of memory and the target will remain the same after
reboot.
This is still not quite a complete fix, as the 1MiB offset is only
subtracted when creating or rebooting; it is not subtracted when 'xl
set-memory' is called. But it will prevent any situations where memory
is continually increased or decreased. A better fix will have to wait
until after the release.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 7e81d992b183c47a74eea5ecd613d27950b5cdc3)
(cherry picked from commit 7c5c20d129b1a333a3f85a2d04ecde48b5fd7589)
Ian Jackson [Wed, 4 May 2016 14:17:45 +0000 (15:17 +0100)]
libxl: Document ~/serial/ correctly
xenstore-paths.markdown talked about ~/device/serial/, but that's not
used.
(It is very wrong for this value, which contains a driver domain
filesystem path, to be in the guest's area of xenstore. However, it
is only ever created by libxl and ready by xenconsoled. When it is
created, it inherits the read-only permissions of /local/domain/DOMID.
So there is no security bug.)
This is a followup to XSA-175.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 14:25:19 +0000 (15:25 +0100)]
libxl: Cleanup: Have libxl__alloc_vdev use /libxl
When allocating a vdev for a new disk, look in /libxl/device, rather
than the frontends directory in xenstore.
This is more in line with the other parts of libxl, which ought not to
trust frontends. In this case, though, there is no security bug prior
to this patch because the frontend is the toolstack domain itself.
If libxl__alloc_vdev were ever changed to take a frontend domain
argument, this patch will fix a latent security bug.
This is a followup to XSA-175.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Wed, 4 May 2016 15:59:38 +0000 (16:59 +0100)]
libxl: Do not trust backend in channel list
Read the name from /libxl/device. Pass the /libxl path to
libxl__device_channel_from_xenstore.
This removes the final route by which READ_LIBXLDEV might receive a
backend path.
This is part of XSA-178.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
v2: Remove be_path variable which is now no longer used.
Ian Jackson [Wed, 4 May 2016 15:23:57 +0000 (16:23 +0100)]
libxl: Do not trust backend for nic in list
libxl_device_nic_list should use the /libxl path to search for
devices, and for obtaining the device information.
The "type" parameter was always "vif". Abolish it. (In any case,
paths in /libxl/device are named after the frontend type which is
constant, not the backend type which might in future vary.)
Abolish a redundant store to pnic->backend_domid. Before this commit,
that store was not needed because libxl_device_nic_init (called by
libxl__device_nic_from_xenstore) would zero it. Now it overwrites the
correct backend domid with zero; so remove it.
This is part of XSA-178.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 14:40:18 +0000 (15:40 +0100)]
libxl: Have READ_LIBXLDEV use libxl_path rather than be_path
Fix the just-introduced bug in this macro: now it reads the
trustworthy libxl_path. Change the variable name in the two functions
(nic and channel) which use it.
Shuffling the bump in the carpet along, we now introduce three new
bugs: the three call sites pass a backend path where a frontend path
is expected.
No functional change.
This is part of XSA-178.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Wed, 4 May 2016 15:07:02 +0000 (16:07 +0100)]
libxl: Rename READ_BACKEND to READ_LIBXLDEV
We are going to want to change all the functions that use READ_BACKEND
to get untrustworthy information from the backend, to use trustworthy
information from /libxl.
This will involve replacing READ_BACKEND, which reads from be_path,
with a similar macro READ_LIBXLDEV, which reads from libxl_path.
The macro name change generates a lot of clutter in the diff. So we
break it out into this separate patch. Here, we rename the macro, but
the implementation does not really match the new name.
So, another way to look at this, is that we have transformed the bug:
* All of the backends use READ_BACKEND, which is unsafe
into the new bug:
* READ_LIBXLDEV actually reads be_path, which is unsafe.
There is no functional change as yet.
This is part of XSA-178.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Wed, 4 May 2016 15:18:36 +0000 (16:18 +0100)]
libxl: Rename libxl__device_{nic,channel}_from_xs_be to _from_xenstore
We are going to change these functions to expect, and be passed, a
/libxl path. So it is wrong that they are called _from_xs_be.
Neither function reads anything which isn't found in both places, so
we can and will change the call sites later.
The only remaining function in libxl called *_from_xs_be relates to
PCI devices, for which the backend domain is hardcoded to 0 throughout
the libxl_pci.c.
No functional change.
This is part of XSA-178.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Fri, 29 Apr 2016 17:29:45 +0000 (18:29 +0100)]
libxl: Do not trust backend for disk; fix driver domain disks list
Rework libxl__device_disk_from_xs_be (which takes a backend path) into
to libxl__device_disk_from_xenstore (which takes a libxl path).
libxl__device_disk_from_xenstore now finds the backend path itself,
although it doesn't use it any more for most of its functions. We
rename the variable from be_path to backend_path to make sure we
didn't miss any cases.
All the data collection is now done by reading from the copy in
/libxl.
libxl_device_disk_list and its helper libxl__append_disk_list (which
used to be libxl__append_disk_list_of_type) need extensive rework,
because they now need to specify the /libxl path rather than the
backend path.
To do that they enumerate disks by looking in the appropriate area in
/libxl. Previously they scanned various of the backend directories in
dom0 (which was broken for driver domains). It is no longer necessary
to enumerate the various disk backends, because they all use the same
paths in /devices. libxl__device_disk_from_xenstore will parse the
type out of the backend path, for itself. (Indeed, it did so before -
the now-gone type parameter to libxl__append_disk_list_of_type wasn't
used other than to construct the directory to list.)
Finally, remove a redundant store to pdisk->backend_domid in
libxl__append_disk_list[_of_type]. Even before this commit, that
store was not needed because libxl_device_disk_init (called by
libxl__device_disk_from_xenstore) would zero it. Now it overwrites
the correct backend domid with zero; so remove it.
This is part of XSA-178.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
v2: Also fix up COLO reads, following rebase
Ian Jackson [Fri, 29 Apr 2016 15:23:35 +0000 (16:23 +0100)]
libxl: Do not trust backend for disk eject vdev
For disk eject, use configured vdev from /libxl, not backend.
The backend directory is writeable by driver domains. This means that
a malicious driver domain could cause libxl to see a wrong vdev,
confusing the user or the toolstack.
Use the vdev from the /libxl space, rather than the backend.
For convenience, we read the vdev from the /libxl space into the evg
during setup and copy it on each event, rather than reading it afresh
each time (which would in any case involve generating or saving a copy
of the relevant /libxl path).
This is part of XSA-178.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Fri, 29 Apr 2016 15:57:14 +0000 (16:57 +0100)]
libxl: Do not trust backend for vtpm in getinfo (uuid)
Use uuid from /libxl, rather than from backend. I think the backend
is not supposed to change the uuid, since it seems to be set by libxl
during setup.
If in fact the backend is supposed to be able to change the uuid, this
patch needs to be dropped and replaced by a patch which makes the vtpm
uuid lookup tolerate bad or missing data.
This is part of XSA-178.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Fri, 29 Apr 2016 16:18:44 +0000 (17:18 +0100)]
libxl: Do not trust backend for vtpm in getinfo (except uuid)
* Do not check the backend for existence. We have already read the
/libxl path so know that the vtpm exists (or is supposed to); if the
backend doesn't exist then that must be the backend's doing.
* Get the frontend path from the /libxl directory.
* The frontend domid is the guest domid, and does not need to be read
from xenstore (!)
We still attempt to read the uuid from the backend. This will be
fixed in the next patch.
This is part of XSA-178.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Fri, 29 Apr 2016 15:19:28 +0000 (16:19 +0100)]
libxl: Make copy of every xs backend in /libxl in _generic_add
We want to stop libxl trustingly reading information from the backend
directory (since this is, of course, writeable by the backend, which
might be a semi-trusted driver domain).
In principle it is wrong in current libxl for anything to try to
divine virtual device configuration from xenstore: the JSON domain
config ought to supply that, and xenstore should only tell us which
devices actually exist.
However:
Firstly, there are several existing places where configuration
information is retrieved from xenstore rather than JSON. We do not
want to reen gineer this in a security patch.
Secondly, we want to make a security patch which can be backported to
versions of libxl without the JSON configuration machinery.
So we take the expedient approach of keeping a copy of the
configuration somewhere we trust, namely /libxl. This is obviously
fairly low-risk, although it does write significantly more keys in
xenstore.
In this patch we make this change in libxl__device_generic_add. This
is responsible for actually writing the vast majority of device
information to xenstore. There are a few loose ends which will be
dealt with in a moment.
Likewise, changes to readers to use the new location will appear in
further patches.
This is part of XSA-178.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 16:24:32 +0000 (17:24 +0100)]
libxl: Do not trust frontend for channel in getinfo
libxl_device_channel_getinfo needs to examine devices without trusting
frontend-controlled data. So:
* Use /libxl to find the backend path.
* Parse the backend path to find the backend domid, rather than
reading it from the frontend.
* Tolerate FRONTEND/tty vanishing.
Note that there is a strange off-by-one error in the computation of
both fe_path and libxl_path in libxl_device_channel_getinfo: the
incoming channel->devid, which is copied to channelinfo->devid, has +1
applied to calculate the frontend path (and, after this patch, the
libxl path). I.e., the devid passed to libxl_device_channel_getinfo
must be one less than the actual devid for the device being asked
about.
This is actually a bug which mirrors a bug in
libxl__append_channel_list, which fills in the devids of the channel
devices it finds with sequentially increasing numbers starting at 0.
In the usual case channels have real devids starting at 1 (because
there is the console, which is devid 0, but not a channel). So these
bugs usually cancel out.
We do not address this problem at this time. This bug does not have
any security implications.
This patch is part of XSA-175.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 16:01:56 +0000 (17:01 +0100)]
libxl: Do not trust frontend for channel in list
libxl_device_channel_list should not trust frontend-provided data.
So it needs to iterate using the /libxl paths, and read the backend
path out of /libxl.
However, it also filters out pure "consoles", which are channels
without a "name". But the name was stored only in the frontend
directory, which the frontend can delete.
So store the name in the backend too. (Ideally we would store it in
/libxl, where the backend can't write to it either, but
libxl__device_console_add not currently have access to the xenstore
transaction used by libxl__device_generic_add. Protection against the
backend will come later, in XSA-178.)
Because the libxl paths are defined to be in terms of the frontend
device types, not the backend device types, it is no longer correct
for libxl__append_channel_list to take a type argument. Abolish this
(with no functional effect).
This is part of XSA-175.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 14:58:32 +0000 (15:58 +0100)]
libxl: Do not trust frontend for vtpm list
libxl_device_vtpm_list needs to enumerate and identify devices without
trusting frontend-controlled data. So
* Use the /libxl path to enumerate vtpms.
* Use the /libxl path to find the corresponding backends.
* Parse the backend path to find the backend domid.
This is part of XSA-175.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Fri, 29 Apr 2016 18:21:51 +0000 (19:21 +0100)]
libxl: Do not trust frontend for disk in getinfo
* Rename the frontend variable to `fe_path' to check we caught them all
* Read the backend path from /libxl, rather than from the frontend
* Parse the backend domid from the backend path, rather than reading it
from the frontend (and add the appropriate error path and initialisation)
This is part of XSA-175.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Wed, 27 Apr 2016 15:08:49 +0000 (16:08 +0100)]
libxl: Do not trust frontend for disk eject event
Use the /libxl path for interpreting disk eject watch events: do not
read the backend path out of the frontend. Instead, use the version
in /libxl. That avoids us relying on the guest-modifiable
$frontend/backend pointer.
To implement this we store the path
/libxl/$guest/device/vbd/$devid/backend
in the evgen structure.
This is part of XSA-175.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 17:39:36 +0000 (18:39 +0100)]
libxl: Do not trust frontend in libxl__devices_destroy
We need to enumerate the devices we have provided to a domain, without
trusting the guest-writeable (or, at least, guest-deletable) frontend
paths.
Instead, enumerate via, and read the backend path from, /libxl.
The console /libxl path is regular, so the special case for console 0
is not relevant any more: /libxl/GUEST/device/console/0 will be found,
and then libxl__device_destroy will DTRT to the right frontend path.
This is part of XSA-175.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Wed, 27 Apr 2016 15:34:19 +0000 (16:34 +0100)]
libxl: Provide libxl__backendpath_parse_domid
Multiple places in libxl need to figure out the backend domid of a
device. This can be discovered easily by looking at the backend path,
which always starts /local/domain/$backend_domid/.
There are no call sites yet.
This is part of XSA-175.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Mon, 16 May 2016 13:56:57 +0000 (14:56 +0100)]
libxl: Record backend/frontend paths in /libxl/$DOMID
This gives us a record of all the backends we have set up for a
domain, which is separate from the frontends in
/local/domain/$DOMID/device.
In particular:
1. A guest has write permission for the frontend path:
/local/domain/$DOMID/device/$KIND/$DEVID
which means that the guest can completely delete the frontend.
(They can't recreate it because they don't have write permission
on the containing directory.)
2. A guest has write permission for the backend path recorded in the
frontend, ie, it can write to
/local/domain/$DOMID/device/$KIND/$DEVID/backend
which means that the guest can break the association between
frontend and backend.
So we can't rely on iterating over the frontends to find all the
backends, or examining a frontend to discover how a device is
configured.
So, have libxl__device_generic_add record the frontend and backend
paths in /libxl/$DOMID/device, and have libxl__device_destroy remove
them again.
Create the containing directory /libxl/GUEST/device in
libxl__domain_make. The already existing xs_rm in devices_destroy_cb
will take care of removing it.
This is part of XSA-175.
Backport note: Backported over 7472ced, which fixes a bug in driver
domain teardown.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Thu, 2 Jun 2016 13:19:00 +0000 (14:19 +0100)]
xen/arm: Don't free p2m->root in p2m_teardown() before it has been allocated
If p2m_init() didn't complete successfully, (e.g. due to VMID
exhaustion), p2m_teardown() is called and unconditionally tries to free
p2m->root before it has been allocated. free_domheap_pages() doesn't
tolerate NULL pointers.
This is XSA-181
Reported-by: Aaron Cornelius <Aaron.Cornelius@dornerworks.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <julien.grall@arm.com>
Dario Faggioli [Fri, 27 May 2016 12:50:19 +0000 (14:50 +0200)]
sched: avoid races on time values read from NOW()
or (even in cases where there is no race, e.g., outside
of Credit2) avoid using a time sample which may be rather
old, and hence stale.
In fact, we should only sample NOW() from _inside_
the critical region within which the value we read is
used. If we don't, in case we have to spin for a while
before entering the region, when actually using it:
1) we will use something that, at the veryy least, is
not really "now", because of the spinning,
2) if someone else sampled NOW() during a critical
region protected by the lock we are spinning on,
and if we compare the two samples when we get
inside our region, our one will be 'earlier',
even if we actually arrived later, which is a
race.
In Credit2, we see an instance of 2), in runq_tickle(),
when it is called by csched2_context_saved() as it samples
NOW() before acquiring the runq lock. This makes things
look like the time went backwards, and it confuses the
algorithm (there's even a d2printk() about it, which would
trigger all the time, if enabled).
In RTDS, something similar happens in repl_timer_handler(),
and there's another instance in schedule() (in generic code),
so fix these cases too.
While there, improve csched2_vcpu_wake() and and rt_vcpu_wake()
a little as well (removing a pointless initialization, and
moving the sampling a bit closer to its use). These two hunks
entail no further functional changes.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
RTDS: fix another instance of the 'read NOW()' race
which was overlooked in 779511f4bf5ae ("sched: avoid
races on time values read from NOW()").
Andrew Cooper [Fri, 27 May 2016 12:49:28 +0000 (14:49 +0200)]
xen/nested_p2m: Don't walk EPT tables with a regular PT walker
hostmode->p2m_ga_to_gfn() is a plain PT walker, and is not appropriate for a
general L1 p2m walk. It is fine for AMD as NPT share the same format as
normal pagetables. For Intel EPT however, it is wrong.
The translation ends up correct (as the formats are sufficiently similar), but
the control bits in lower 12 bits differ in meaning. A plain PT walker sets
A/D bits (bits 5 and 6) as it walks, but in EPT tables, these are the IPAT and
top bit of EMT (caching type). This in turn causes problem when the EPT
tables are subsequently used.
Replace hostmode->p2m_ga_to_gfn() with nestedhap_walk_L1_p2m() in
paging_gva_to_gfn(), which is the correct function for the task. This
involves making nestedhap_walk_L1_p2m() non-static, and adding
vmx_vmcs_enter/exit() pairs to nvmx_hap_walk_L1_p2m() as it is now reachable
from contexts other than v == current.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: bab2bd8e222de9e596699ac080ea985af828c4c4
master date: 2016-05-18 18:22:06 +0100
Jan Beulich [Fri, 27 May 2016 12:48:58 +0000 (14:48 +0200)]
x86/PoD: skip eager reclaim when possible
Reclaiming pages is pointless when the cache can already satisfy all
outstanding PoD entries, and doing reclaims in that case can be very
harmful to performance when that memory gets used by the guest, but
only to store zeroes there.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 556c69f4efb09dd06e6bce4cbb0455287f19d02e
master date: 2016-05-12 18:02:21 +0200
Jan Beulich [Fri, 27 May 2016 12:48:00 +0000 (14:48 +0200)]
IOMMU/x86: per-domain control structure is not HVM-specific
... and hence should not live in the HVM part of the PV/HVM union. In
fact it's not even architecture specific (there already is a per-arch
extension type to it), so it gets moved out right to common struct
domain.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: af07377007d595b5d6422291bb1c932c16d1036f
master date: 2016-05-04 09:44:32 +0200
Jan Beulich [Fri, 27 May 2016 12:47:08 +0000 (14:47 +0200)]
x86: use optimal NOPs to fill the SMEP/SMAP placeholders
Alternatives patching code picks the most suitable NOPs for the
running system, so simply use it to replace the pre-populated ones.
Use an arbitrary, always available feature to key off from, but
hide this behind the new X86_FEATURE_ALWAYS.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/compat: correct SMEP/SMAP NOPs patching
Correct the number of single byte NOPs we want to be replaced in case
neither SMEP nor SMAP are available.
Also simplify the expression adding these NOPs - at that location .
equals .Lcr4_orig, and removing that part of the expression fixes a
bogus ".space or fill with negative value, ignored" warning by very old
gas (which actually is what made me look at those constructs again).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 01a0bd0a7d72be638a359db3f8cf551123467d29
master date: 2016-05-13 18:15:55 +0100
master commit: f5610009529628314c9d1d52b00715fe855fcf06
master date: 2016-05-26 17:26:24 +0100
Jan Beulich [Fri, 27 May 2016 12:46:31 +0000 (14:46 +0200)]
x86: suppress SMEP and SMAP while running 32-bit PV guest code
Since such guests' kernel code runs in ring 1, their memory accesses,
at the paging layer, are supervisor mode ones, and hence subject to
SMAP/SMEP checks. Such guests cannot be expected to be aware of those
two features though (and so far we also don't expose the respective
feature flags), and hence may suffer page faults they cannot deal with.
While the placement of the re-enabling slightly weakens the intended
protection, it was selected such that 64-bit paths would remain
unaffected where possible. At the expense of a further performance hit
the re-enabling could be put right next to the CLACs.
Note that this introduces a number of extra TLB flushes - CR4.SMEP
transitioning from 0 to 1 always causes a flush, and it transitioning
from 1 to 0 may also do.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/compat: Cleanup and further debugging of SMAP/SMEP fixup
* Abstract (X86_CR4_SMEP | X86_CR4_SMAP) behind XEN_CR4_PV32_BITS to avoid
opencoding the invidial bits which are fixed up behind a 32bit PV guests
back.
* Show cr4_pv32_mask in the BUG register dump
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
x86: refine debugging of SMEP/SMAP fix
Instead of just latching cr4_pv32_mask into %rdx, correct the found
wrong value in %cr4 (to avoid triggering another BUG).
Also there is one more place for XEN_CR4_PV32_BITS to be used.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: make SMEP/SMAP suppression tolerate NMI/MCE at the "wrong" time
There is one instruction boundary where any kind of interruption would
break the assumptions cr4_pv32_restore's debug mode checking makes on
the correlation between the CR4 register value and its in-memory cache.
Correct this (see the code comment) even in non-debug mode, or else
a subsequent cr4_pv32_restore would also be misguided into thinking the
features are enabled when they really aren't.
Jan Beulich [Fri, 27 May 2016 12:44:09 +0000 (14:44 +0200)]
x86/P2M: consolidate handling of types not requiring a valid MFN
As noted regarding the mixture of checks in p2m_pt_set_entry(),
introduce a new P2M type group allowing to be used everywhere we
just care about accepting operations with either a valid MFN or a type
permitting to be used without (valid) MFN.
Note that p2m_mmio_dm is not included in P2M_NO_MFN_TYPES, as for the
intended purpose that one ought to be treated similar to p2m_invalid
(perhaps the two should ultimately get folded anyway).
Note further that PoD superpages now get INVALID_MFN used when creating
page table entries (was _mfn(0) before).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: c35eefded2992fc9b979f99190422527650872fd
master date: 2015-11-20 12:38:33 +0100
Julien Grall [Fri, 20 May 2016 13:37:42 +0000 (14:37 +0100)]
xen/arm: p2m: Release the p2m lock before undoing the mappings
Since commit 4b25423a "arch/arm: unmap partially-mapped memory regions",
Xen has been undoing the P2M mappings when an error occurred during
insertion or memory allocation.
This is done by calling recursively apply_p2m_changes, however the
second call is done with the p2m lock taken which will result in a
deadlock for the current processor.
The p2m lock is here to protect 2 threads modifying concurrently the
page tables. However, it does not guarantee the ordering of the
changes. I.e if 2 threads request change on regions that overlaps,
then the result is undefined.
Therefore it is fine to move the recursive call to undo the changes
after the lock is released.
Julien Grall [Fri, 20 May 2016 13:37:41 +0000 (14:37 +0100)]
xen/arm: p2m: apply_p2m_changes: Do not undo more than necessary
Since commit 4b25423a "arch/arm: unmap partially-mapped memory regions",
Xen has been undoing the P2M mappings when an error occurred during
insertion or memory allocation.
The function apply_p2m_changes can work with region not-aligned to a
block size (2MB, 1G) or page size (4K). The mapping will be done by
splitting the region in a set of regions aligned to the size supported
by the page table.
The mapping of a region could fail when it is not possible to allocate
memory for an intermediate table (i.e a new or when shattering a block).
When the mapping is undone, the end of the region is computed using the
base address of the current region and the size of the failing level.
However the failing level may not be the leaf one, therefore unrelated
entries will be removed.
Fix it by removing the mapping from the start address up to the last
region that has been successfully mapped.
Wei Liu [Tue, 17 May 2016 14:40:32 +0000 (16:40 +0200)]
libxl: fix old style declarations
Fix errors like:
/local/work/xen.git/dist/install/usr/local/include/libxl_uuid.h:59:1: error: 'static' is not at beginning of declaration [-Werror=old-style-declaration]
void static inline libxl_uuid_copy_0x040400(libxl_uuid *dst,
^
/local/work/xen.git/dist/install/usr/local/include/libxl_uuid.h:59:1: error: 'inline' is not at beginning of declaration [-Werror=old-style-declaration]
/local/work/xen.git/dist/install/usr/local/include/libxl.h:1233:1: error: 'static' is not at beginning of declaration [-Werror=old-style-declaration]
int static inline libxl_domain_create_restore_0x040200(
^
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: d5b6844942f7b21b24e92bccd85c1249592315c8
master date: 2016-04-20 14:34:04 +0100
Jan Beulich [Tue, 17 May 2016 12:55:32 +0000 (14:55 +0200)]
x86/mm: fully honor PS bits in guest page table walks
In L4 entries it is currently unconditionally reserved (and hence
should, when set, always result in a reserved bit page fault), and is
reserved on hardware not supporting 1Gb pages (and hence should, when
set, similarly cause a reserved bit page fault on such hardware).
This is CVE-2016-4480 / XSA-176.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 46699c7393bd991234b5642763c5c24b6b39a6c4
master date: 2016-05-17 14:41:14 +0200
xen/arm64: ensure that the correct SP is used for exceptions
The ARMv8 architecture has a SPSel ("stack pointer selection") machine
register that allows us to determine which exception level's stack
pointer is loaded when an exception occurs. As we don't want to
use the non-privileged SP_EL0 stack pointer -- or even assume that SP_EL0
points to a valid address in the hypervisor context-- we'll need to ensure
that our EL2 code sets the SPSel to SP_ELn mode, so exceptions that trap
to EL2 use the EL2 stack pointer.
This corrects an issue that can manifest as a hang-on-IRQ on some
arm64 cores if the firmware/bootloader has previously initialized SPSel
to 0; in which case Xen's exceptions will incorrectly use an invalid SP_EL0,
and will endlessly spin on the synchronous abort handler.
Vikram Sethi [Tue, 29 Mar 2016 04:46:12 +0000 (23:46 -0500)]
arm: Fix asynchronous aborts (SError exceptions) due to bogus PTEs
ARMv8 architecture allows performing prefetch data/instructions
from memory locations marked as normal memory. Prefetch does not
mean that the data/instruction has to be used/executed in code
flow. All PTEs that appear to be valid to MMU must contain valid
physical address with proper attributes otherwise MMU table walk
might cause imprecise asynchronous aborts.
The way current XEN code is preparing page tables for frametable
and xenheap memory can create bogus PTEs. This patch fixes the
issue by clearing page table memory before populating EL2 L0/L1
PTEs. Without this patch XEN crashes on Qualcomm Technologies
server chips due to asynchronous aborts.
The speculative/prefetch feature explanation is scattered everywhere
in ARM specification but below two sections have useful information.
E2.8 Memory types and attributes (ver DDI0487A_h)
G4.12.6 External abort on a translation table walk (ver DDI0487A_h)
xen/arm: Force broadcast of TLB and instruction cache maintenance instructions
UP guest may use TLB instructions to flush only on the local CPU.
Therefore, TLB flush will not be broadcasted across all the CPUs within
the same innershareable domain.
When the vCPU is migrated between different CPUs, it may be rescheduled
to a previous CPU where the TLB has not been flushed. The TLB may
contain stale entries which will result to translate incorrectly a VA to
IPA or even cause TLB conflicts.
To avoid a such situation, it is possible to set HCR_EL2.FB, which will
force the broadcast of TLB and instruction cache maintenance instructions.
The performance impact of setting HCR_EL2.FB will depend on how often
a guest makes use of local flush instructions.
ARM64 Linux kernel is SMP-aware (no possibility to build only for UP).
Most of the flush instructions are innershareable. The local flushes are
limited to the boot (1 per CPU) and when a task is getting a new ASIC.
Therefore the impact of setting HCR.FB for those guests is very limited.
ARM32 Linux kernel offers the possibility to be built either for SMP or
UP. The number of local flush is very limited in the former kernel
whilst the latter will only issue local flushes. Therefore there will be
an impact to set HCR.FB for guest kernel only built for UP.
Note that the SMP kernel can run in a domain using 1 vCPU and it
will still make use of innershareable flush instruction.
Looking at other OSes, such as FreeBSD, they are very similar to ARM32
Linux kernel (i.e offering two configuration: SMP and UP).
However, nothing prevents an SMP-aware kernel to make more often use of
local flush instrutions.
In the case that HCR_EL2.FB is not set, Xen would need to:
* Flush all the TLBs for the VMID associated to this domain
* Invalidate all the entries from the branch predictor
* Invalidate all the entries from the instruction cache
Those actions would only be needed when the vCPU is migrating between 2
physical CPUs.
Whilst this solution would have a negative performance impact on kernels
which do not heavily use local flush instructions, this may improve
performance for kernels only built for UP system.
For now implement the easiest solution (i.e setting HCR_EL2.FB). We can
revisit it if the performance impact is too high for UP kernel.
Jan Beulich [Mon, 9 May 2016 11:16:10 +0000 (13:16 +0200)]
x86/shadow: account for ioreq server pages before complaining about not found mapping
prepare_ring_for_helper(), just like share_xen_page_with_guest(),
takes a write reference on the page, and hence should similarly be
accounted for when determining whether to log a complaint.
This requires using recursive locking for the ioreq server lock, as the
offending invocation of sh_remove_all_mappings() is down the call stack
from hvm_set_ioreq_server_state(). (While not strictly needed to be
done in all other instances too, convert all of them for consistency.)
At once improve the usefulness of the shadow error message: Log all
values involved in triggering it as well as the GFN (to aid
understanding which guest page it is that there is a problem with - in
cases like the one here the GFN is invariant across invocations, while
the MFN obviously can [and will] vary).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: 77eb5dbeff78bbe549793325520f59ab46a187f8
master date: 2016-05-02 09:20:17 +0200
Jan Beulich [Mon, 9 May 2016 11:15:14 +0000 (13:15 +0200)]
x86/time: fix gtime_to_gtsc for vtsc=1 PV guests
For vtsc=1 PV guests, rdtsc is trapped and calculated from get_s_time()
using gtime_to_gtsc. Similarly the tsc_timestamp, part of struct
vcpu_time_info, is calculated from stime_local_stamp using
gtime_to_gtsc.
However gtime_to_gtsc can return 0, if time < vtsc_offset, which can
actually happen when gtime_to_gtsc is called passing stime_local_stamp
(the caller function is __update_vcpu_system_time).
In that case the pvclock protocol doesn't work properly and the guest is
unable to calculate the system time correctly. As a consequence when the
guest tries to set a timer event (for example calling the
VCPUOP_set_singleshot_timer hypercall), the event will be in the past
causing Linux to hang.
The purpose of the pvclock protocol is to allow the guest to calculate
the system_time in nanosec correctly. The guest calculates as follow:
Given that with vtsc=1:
rdtsc = to_vtsc_scale(NOW() - vtsc_offset)
vcpu_time_info.tsc_timestamp = to_vtsc_scale(vcpu_time_info.system_time - vtsc_offset)
The expression evaluates to NOW(), which is what we want. However when
stime_local_stamp < vtsc_offset, vcpu_time_info.tsc_timestamp is
actually 0. As a consequence the calculated overall system_time is not
correct.
This patch fixes the issue by letting gtime_to_gtsc return a negative
integer in the form of a wrapped around unsigned integer, thus when the
guest subtracts vcpu_time_info.tsc_timestamp from rdtsc will calculate
the right value.
Signed-off-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d22c9bf7c3067b17cbd9cdfd8b81941dd6fb8d77
master date: 2016-04-28 15:06:56 +0200
During the investigation of very slow dump times of guest images in
Amazon EC2 instance, it was discovered that the
register_oldmem_pfn_is_ram() API implemented by the upstream kernel
commit 997c136f518c5debd63847e78e2a8694f56dcf90:
fs/proc/vmcore.c: add hook to read_from_oldmem() to check
for non-ram pages
was not being called. This was due to the PV driver with the call
to register_oldmem_pfn_is_ram() API was not including the
kernel header file that is used to communicate support of the API in the
kernel. Fix the issue by including the required header file.
Signed-off-by: Mike Meyer <mike.meyer@teradata.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Olaf Hering <olaf@aepfle.de>
Jan Beulich [Mon, 9 May 2016 11:05:42 +0000 (13:05 +0200)]
x86/HVM: fix forwarding of internally cached requests
Forwarding entire batches to the device model when an individual
iteration of them got rejected by internal device emulation handlers
with X86EMUL_UNHANDLEABLE is wrong: The device model would then handle
all iterations, without the internal handler getting to see any past
the one it returned failure for. This causes misbehavior in at least
the MSI-X and VGA code, which want to see all such requests for
internal tracking/caching purposes. But note that this does not apply
to buffered I/O requests.
This in turn means that the condition in hvm_process_io_intercept() of
when to crash the domain was wrong: Since X86EMUL_UNHANDLEABLE can
validly be returned by the individual device handlers, we mustn't
blindly crash the domain if such occurs on other than the initial
iteration. Instead we need to distinguish hvm_copy_*_guest_phys()
failures from device specific ones, and then the former need to always
be fatal to the domain (i.e. also on the first iteration), since
otherwise we again would end up forwarding a request to qemu which the
internal handler didn't get to see.
The adjustment should be okay even for stdvga's MMIO handling:
- if it is not caching then the accept function would have failed so we
won't get into hvm_process_io_intercept(),
- if it issued the buffered ioreq then we only get to the p->count
reduction if hvm_send_ioreq() actually encountered an error (in which
we don't care about the request getting split up).
Also commit 4faffc41d ("x86/hvm: limit reps to avoid the need to handle
retry") went too far in removing code from hvm_process_io_intercept():
When there were successfully handled iterations, the function should
continue to return success with a clipped repeat count.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
x86/HVM: fix forwarding of internally cached requests (part 2)
Commit 96ae556569 ("x86/HVM: fix forwarding of internally cached
requests") wasn't quite complete: hvmemul_do_io() also needs to
propagate up the clipped count. (I really should have re-tested the
forward port resulting in the earlier change, instead of relying on the
testing done on the older version of Xen which the fix was first needed
for.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 96ae556569b8eaedc0bb242932842c3277b515d8
master date: 2016-03-31 14:52:04 +0200
master commit: 670ee15ac1e3de7c15381fdaab0e531489b48939
master date: 2016-04-28 15:09:26 +0200
David Vrabel [Mon, 9 May 2016 11:05:13 +0000 (13:05 +0200)]
x86/fpu: improve check for XSAVE* not writing FIP/FDP fields
The hardware may not write the FIP/FDP fields with a XSAVE*
instruction. e.g., with XSAVEOPT/XSAVES if the state hasn't changed
or on AMD CPUs when a floating point exception is not pending. We
need to identify this case so we can correctly apply the check for
whether to save/restore FCS/FDS.
By poisoning FIP in the saved state we can check if the hardware
writes to this field. The poison value is both: a) non-canonical; and
b) random with a vanishingly small probability of matching a value
written by the hardware (1 / (2^63) = 10^-19).
The poison value is fixed and thus knowable by a guest (or guest
userspace). This could allow the guest to cause Xen to incorrectly
detect that the field has not been written. But: a) this requires the
FIP register to be a full 64 bits internally which is not the case for
all current AMD and Intel CPUs; and b) this only allows the guest (or
a guest userspace process) to corrupt its own state (i.e., it cannot
affect the state of another guest or another user space process).
This results in smaller code with fewer branches and is more
understandable.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Intel confirmed that 64-bit {F,}XRSTOR sign-extend FIP from bit 47.
While leaving the description above intact, modify the code comment
accordingly.
David Vrabel [Mon, 9 May 2016 11:04:26 +0000 (13:04 +0200)]
x86/hvm: add HVM_PARAM_X87_FIP_WIDTH
The HVM parameter HVM_PARAM_X87_FIP_WIDTH to allow tools and the guest
to adjust the width of the FIP/FDP registers to be saved/restored by
the hypervisor. This is in case the hypervisor hueristics do not do
the right thing.
Add this parameter to the set saved during domain save/migrate.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 5d768fb1f3f7b011e7b6e75909c7f4841730de60
master date: 2016-02-26 12:30:11 +0100
David Vrabel [Mon, 9 May 2016 11:03:15 +0000 (13:03 +0200)]
x86/fpu: add a per-domain field to set the width of FIP/FDP
The x86 architecture allows either: a) the 64-bit FIP/FDP registers to
be restored (clearing FCS and FDS); or b) the 32-bit FIP/FDP and
FCS/FDS registers to be restored (clearing the upper 32-bits).
Add a per-domain field to indicate which of these options a guest
needs. The options are: 8, 4 or 0. Where 0 indicates that the
hypervisor should automatically guess the FIP width by checking the
value of FIP/FDP when saving the state (this is the existing
behaviour).
The FIP width is initially automatic but is set explicitly in the
following cases:
- 32-bit PV guest: 4
- Newer CPUs that do not save FCS/FDS: 8
The x87_fip_width field is placed into an existing 1 byte hole in
struct arch_domain.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Fix build.
Tim Deegan [Wed, 16 Mar 2016 16:56:04 +0000 (16:56 +0000)]
x86: limit GFNs to 32 bits for shadowed superpages.
Superpage shadows store the shadowed GFN in the backpointer field,
which for non-BIGMEM builds is 32 bits wide. Shadowing a superpage
mapping of a guest-physical address above 2^44 would lead to the GFN
being truncated there, and a crash when we come to remove the shadow
from the hash table.
Track the valid width of a GFN for each guest, including reporting it
through CPUID, and enforce it in the shadow pagetables. Set the
maximum witth to 32 for guests where this truncation could occur.
This is XSA-173.
Reported-by: Ling Liu <liuling-it@360.cn> Signed-off-by: Tim Deegan <tim@xen.org> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 29 Mar 2016 13:19:51 +0000 (15:19 +0200)]
x86: fix information leak on AMD CPUs
The fix for XSA-52 was wrong, and so was the change synchronizing that
new behavior to the FXRSTOR logic: AMD's manuals explictly state that
writes to the ES bit are ignored, and it instead gets calculated from
the exception and mask bits (it gets set whenever there is an unmasked
exception, and cleared otherwise). Hence we need to follow that model
in our workaround.
This is CVE-2016-3158 / CVE-2016-3159 / XSA-172.
[xen/arch/x86/xstate.c:xrstor: CVE-2016-3158]
[xen/arch/x86/i387.c:fpu_fxrstor: CVE-2016-3159]
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7bd9dc3adfbb014c55f0928ebb3b20950ca9c019
master date: 2016-03-29 14:24:26 +0200
Ross Lagerwall [Fri, 18 Mar 2016 07:09:54 +0000 (08:09 +0100)]
vmx: restore debug registers when injecting #DB traps
Commit a929bee0e652 ("x86/vmx: Fix injection of #DB traps following
XSA-156") prevents an infinite loop in certain #DB traps. However, it
changed the behavior to not call hvm_hw_inject_trap() for #DB and #AC
traps which which means that the debug registers are not restored
correctly and nullified commit b56ae5b48c38 ("VMX: fix/adjust trap
injection").
To fix this, restore the original code path through hvm_inject_trap(),
but ensure that the struct hvm_trap is populated with all the required
data.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: ba22f1f4732acb4d5aebd779122e91753a0e374d
master date: 2016-03-15 12:19:13 +0100
David Vrabel [Fri, 18 Mar 2016 07:09:10 +0000 (08:09 +0100)]
x86: don't flush the whole cache when changing cachability
Introduce the FLUSH_VA_VALID flag to flush_area_mask() and friends to
say that it is safe to use CLFLUSH (i.e., the virtual address is still
valid).
Use this when changing the cachability of the Xen direct mappings (in
response to the guest changing the cachability of its mappings). This
significantly improves performance by avoiding an expensive WBINVD.
This fixes a performance regression introduced by c61a6f74f80eb36ed83a82f713db3143159b9009 (x86: enforce consistent
cachability of MMIO mappings), the fix for XSA-154.
e.g., A set_memory_wc() call in Linux:
before: 4097 us
after: 47 us
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: dff593c7b6eb1cfd4591b662a880a0c9325cce40
master date: 2016-03-10 16:51:03 +0100
We must ensure that the prod/cons are only read once and that
the compiler won't try to optimize the reads. That is split
the read of these in multiple instructions influencing later
branch code. As such insert barriers when fetching the cons
and prod index.
Jan Beulich [Fri, 4 Mar 2016 12:16:07 +0000 (13:16 +0100)]
x86emul: limit-check branch targets
All branches need to #GP when their target violates the segment limit
(in 16- and 32-bit modes) or is non-canonical (in 64-bit mode). For
near branches facilitate this via a zero-byte instruction fetch from
the target address (resulting in address translation and validation
without an actual read from memory), while far branches get dealt with
by breaking up the segment register loading into a read-and-validate
part and a write one. The latter at once allows correcting some
ordering issues in how the individual emulation steps get carried out:
Before updating machine state, all exceptions unrelated to that state
updating should have got raised (i.e. the only ones possibly resulting
in partly updated state are faulting memory writes [pushes]).
Note that while not immediately needed here, write and distinct read
emulation routines get updated to deal with zero byte accesses too, for
overall consistency.
Reported-by: 刘令 <liuling-it@360.cn> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: 81d3a0b26c1672c60b2a54dd8780e6f6472d2328
master date: 2016-02-26 12:14:39 +0100
Jan Beulich [Fri, 4 Mar 2016 12:14:39 +0000 (13:14 +0100)]
x86emul: fix rIP handling
Deal with rIP just like with any other register: Truncate to designated
width upon entry, write back the zero-extended 32-bit value when
emulating 32-bit code, and leave the upper 48 bits unchanged for 16-bit
code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0640ffb67fb92e2561c63b9308c27b71281fdd72
master date: 2016-02-18 15:05:34 +0100
Each ITARGETSR register are 4-byte wide and the offset is in byte.
The current implementation is computing the end of the range wrongly
resulting to emulate only ITARGETSR{0,1} read-only. The rest will be
treated as read-write.
As 8 registers should be read-only, the end of the range should be
ITARGETSR + (4 * 8) - 1.
For convenience introduce ITARGETSR7 and ITARGETSR8.
Julien Grall [Fri, 4 Mar 2016 12:11:20 +0000 (13:11 +0100)]
xen/arm: vgic-v2: Report the correct GICC size to the guest
The GICv2 DT node is usually used by the guest to know the address/size
of the regions (GICD, GICC...) to map into their virtual memory.
While the GICv2 spec requires the size of the GICC to be 8KB, we
correctly do an 8KB stage-2 mapping but erroneously report 256 in the
device tree (based on GUEST_GICC_SIZE).
I bet we didn't see any issue so far because all the registers except
GICC_DIR lives in the first 256 bytes of the GICC region and all the
guests I have seen so far are driving the GIC with GICC_CTLR.EIOmode =
0.
Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- fixed some typos in commit message ]
Ian Campbell [Thu, 5 Nov 2015 14:46:12 +0000 (14:46 +0000)]
tools: pygrub: if partition table is empty, try treating as a whole disk
pygrub (in identify_disk_image()) detects a DOS style partition table
via the presence of the 0xaa55 signature at the end of the first
sector of the disk.
However this signature is also present in whole-disk configurations
when there is an MBR on the disk. Many filesystems (e.g. ext[234])
include leading padding in their on disk format specifically to enable
this.
So if we think we have a DOS partition table but do not find any
actual partition table entries we may as well try looking at it as a
whole disk image. Worst case is we probe and find there isn't anything
there.
This was reported by Sjors Gielen in Debian bug #745419. The fix was
inspired by a patch by Adi Kriegisch in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=745419#27
Tested by genext2fs'ing my /boot into a new raw image (works) and
then:
dd if=/usr/lib/grub/i386-pc/g2ldr.mbr of=img conv=notrunc bs=512 count=1
to add an MBR (with 0xaa55 signature) to it, which after this patch
also works.
Dirk Behme [Thu, 18 Feb 2016 14:25:43 +0000 (15:25 +0100)]
xen/arm64: Make sure we get all debug output
Starting in the wrong ELx mode I get the following debug output:
...
- Current EL 00000004 -
- Xen must be entered in NS EL2 mode -
- Boot failed -
The output of "Please update the bootloader" is missing here, because
string concatenation in gas, unlike in C, keeps the \0 between each
individual string.
Make sure this is output, too. With this, we get
...
- Current EL 00000004 -
- Xen must be entered in NS EL2 mode -
- Please update the bootloader -
- Boot failed -
as intended.
Signed-off-by: Dirk Behme <dirk.behme@de.bosch.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- added same change to arm32 case ]
master commit: c31d34082555566eb27d0d1eb42962f72fa886d3
master date: 2016-02-18 10:13:42 +0000
Anthony PERARD [Wed, 17 Feb 2016 15:49:49 +0000 (16:49 +0100)]
hvmloader: fix scratch_alloc to avoid overlaps
scratch_alloc() set scratch_start to the last byte of the current
allocation. The value of scratch_start is then reused as is (if it is
already aligned) in the next allocation. This result in a potential reuse
of the last byte of the previous allocation.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4ab3ac074cb1f101f42e02103fa263a1f4f422b5
master date: 2016-02-10 14:46:45 +0100
Jan Beulich [Wed, 17 Feb 2016 15:49:18 +0000 (16:49 +0100)]
x86/nHVM: avoid NULL deref during INVLPG intercept handling
When intercepting (or emulating) L1 guest INVLPG, the nested P2M
pointer may be (is?) NULL, and hence there's no point in calling
p2m_flush(). In fact doing so would cause a dereference of that NULL
pointer at least in the ASSERT() right at the beginning of the
function.
While so far nothing supports hap_invlpg() being reachable from the
INVLPG intercept paths (only INVLPG insn emulation would lead there),
and hence the code in question (added by dd6de3ab99 ["Implement
Nested-on-Nested"]) appears to be dead, this seems to be the change
which can be agreed on as an immediate fix. Ideally, however, the
problematic code would go away altogether. See thread at
lists.xenproject.org/archives/html/xen-devel/2016-01/msg03762.html.
Reported-by: 刘令 <liuling-it@360.cn> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 86c59615f4e7f38df24182f20d9dbdec3299c514
master date: 2016-02-09 13:22:13 +0100
Juergen Gross [Wed, 17 Feb 2016 15:48:37 +0000 (16:48 +0100)]
credit: recalculate per-cpupool credits when updating timeslice
When modifying the timeslice of the credit scheduler in a cpupool the
cpupool global credit value (n_cpus * credits_per_tslice) isn't
recalculated. This will lead to wrong scheduling decisions later.
Juergen Gross [Wed, 17 Feb 2016 15:48:16 +0000 (16:48 +0100)]
credit: update timeslice under lock
When updating the timeslice of the credit scheduler protect the
scheduler's private data by it's lock. Today a possible race could
result only in some weird scheduling decisions during one timeslice,
but further adjustments will need the lock anyway.
Andrew Cooper [Wed, 17 Feb 2016 15:47:52 +0000 (16:47 +0100)]
x86/vmx: don't clobber exception_bitmap when entering/leaving emulated real mode
Most updates to the exception bitmaps set or clear an individual bits.
However, entering or exiting emulated real mode unilaterally clobbers it,
leaving the exit code to recalculate what it should have been. This is error
prone, and indeed currently fails to recalculate the TRAP_no_device intercept
appropriately.
Instead of overwriting exception_bitmap when entering emulated real mode, move
the override into vmx_update_exception_bitmap() and leave exception_bitmap
unmodified.
This means that recalculation is unnecessary, and that the use of
vmx_fpu_leave() and vmx_update_debug_state() while in emulated real mode
doesn't result in TRAP_no_device and TRAP_int3 being un-intercepted.
This is only a functional change on hardware lacking unrestricted guest
support.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 78c93adf0a7f6a7abe249a63e7398ca1221a6d25
master date: 2016-02-02 14:00:52 +0100