George Dunlap [Fri, 21 Dec 2018 15:41:11 +0000 (15:41 +0000)]
libxl: Introduce specific username to be used as a reaper
Untrusted device models must be killed by uid rather than by pid for
safety. To do this reliably, we need another uid, not used for any
other purpose, from which to make the kill system call.
When using xen-qemuuser-range-base, we can repurpose
xen-qemuuser-range-base itself as a UID from which to kill other
devicemodel uids (since domain ID 0 should never have a device model
associated with it).
However, we'd like people to be able to use the device_model_user
feature without also defining xen-qemuuser-range-base (which requires
the ability to 'reserve' 32k+ user IDs).
To that end, introduce the xen-qemuuser-reaper id. When killing by
UID, first look for and use that ID if available; then fall back to
xen-qemuuser-range-base.
Document the new call in docs/features/qemu-deprivilege.pandoc.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:10 +0000 (15:41 +0000)]
libxl: Kill QEMU with "reaper" ruid
Using kill(-1) to killing an untrusted dm process with the real uid
equal to the dm_uid isn't guaranteed to succeed: the process in
question may be able to kill the reaper process after the setresuid()
and before the kill().
Instead, set the real uid to the QEMU user for domain 0
(QEMU_USER_RANGE_BASE + 0). The reaper process will still be able to
kill the dm process, but not vice versa.
This, in turn, requires locking to make sure that only one reaper
process is using that uid at a time; otherwise one reaper process may
kill the other reaper process.
Create a lockfile in RUNDIR/dm-reaper-lock, and grab the lock before
executing kill.
In the event that we can't get the lock for some reason, go ahead with
the kill using dm_uid for both real and effective UIDs. This isn't
guaranteed to work, but it's no worse than not trying to kill the
process at all.
NB that this effectively requires admins using device_model_user to
also define xen_qemuuser_range_base; this will be addressed in
subsequent patches.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:09 +0000 (15:41 +0000)]
libxl: Kill QEMU by uid when possible
The privcmd fd that a dm_restrict'ed QEMU has gives it permission to
one specific domain ID. This domain ID will probably eventually be
used again. It is therefore necessary to make absolutely sure that a
rogue QEMU process cannot hang around after its domain has exited.
Killing QEMU by pid is insufficient in this situation, because QEMU
may be able to fork() to escape killing. It is surprisingly tricky to
kill a process which can call fork() without races; the only reliable
way is to use kill(-1) to kill all processes with a given uid.
We can use this method only when we're sure that there's only one QEMU
instance per uid. Add a dm_uid into the domain_build_state struct,
and set it in libxl__domain_get_device_model_uid() when it's safe to
kill by UID. Store this in xenstore next to device-model-pid.
On domain destroy, check to see if device-model-uid is present in
xenstore. If so, fork off a reaper process, setuid to that uid, and
do kill(-9) to kill all uids of that type. Otherwise, carry on
destroying by pid.
While we're here, make libxl__destroy_device_model() consistently:
1. Return an error when anything fails
2. But continue to do as much clean-up as possible
NOTE that this is not yet completely safe: with ruid == dm_uid, the
device model may be able to kill(-9) the 'reaper' process before the
reaper process can kill it. Further patches will address this.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:08 +0000 (15:41 +0000)]
libxl: Make killing of device model asynchronous
Or at least, give it an asynchronous interface so that we can make it
actually asynchronous in subsequent patches.
Create state structures and callback function signatures. Add the
state structure to libxl__destroy_domid_state. Break
libxl__destroy_domid down into two functions.
No functional change intended.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:07 +0000 (15:41 +0000)]
libxl: Do root checks once in libxl__domain_get_device_model_uid
At the moment, we check for equivalence to literal "root" before
deciding whether to add the `runas` command-line option to QEMU. This
is unsatisfactory for several reasons.
First, just because the string doesn't match "root" doesn't mean the
final uid won't end up being zero; in particular, the range_base
calculations may end up producing "0:NNN", which would be root in any
case.
Secondly, it's almost certainly a configuration error if the resulting
uid ends up to be zero; rather than silently do what was specified but
probably not intended, throw an error.
To fix this, check for root once in
libxl__domain_get_device_model_uid. If the result is root, return an
error; if appropriate, set `runas`.
After that, assume that the presence of state->dm_runas implies that a
`runas` argument should be constructed.
One side effect of this is to check whether device_model_user exists
before passing it to qemu, resulting in better error reporting.
While we're here:
- Refactor the function to use the "goto out" idiom
- Use 'rc' rather than 'ret', in line with CODING_STYLE
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:06 +0000 (15:41 +0000)]
dm_depriv: Describe expected usage of device_model_user parameter
A number of subsequent patches rely on as-yet undefined behavior for
what the `device_model_user` parameter does. Rather than implement it
incorrectly (or randomly), or remove the feature, describe an expected
usage for the feature. Further patches will make decisions based on
this expected usage.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:06 +0000 (15:41 +0000)]
libxl: Clean up userlookup_helper_getpw* helper
Bring conventions more in line with libxl__xs_read_checked():
- If found, return 0 and set pointer to non-NULL
- If not found, return 0 and set pointer to NULL
- On error, return libxl-style error number.
Update documentation to match.
Use CODING_STYLE compliant `r` rather than `ret`.
On error, log the error code before returning instead of discarding
it.
Now that it only returns 0 or errno, update caller error checks to be
`if (ret)` rather than `if (ret < 0)`.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:05 +0000 (15:41 +0000)]
libxl: Get rid of support for QEMU_USER_BASE (xen-qemuuser-domidNN)
QEMU_USER_BASE allows a user to specify the UID to use when running
the devicemodel for a specific domain number. Unfortunately, this is
not really practical: It requires nearly 32,000 entries in
/etc/passwd. QEMU_USER_RANGE_BASE is much more practical.
Remove support for QEMU_USER_BASE.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Fri, 21 Dec 2018 15:41:04 +0000 (15:41 +0000)]
libxl: Move dm user determination logic into a helper function
To reliably kill an untrusted devicemodel, we need to know not only
its pid, but its uid. In preparation for this, move the userid
determination logic into a helper function.
Create a new field, `dm_runas`, in libxl__domain_build_state to store
the value during domain creation.
This change also removes unnecessary duplication of the argument
construction code.
While here, clean up some minor CODING_STYLE infractions (space
between * and variable name).
No functional change intended.
While here, delete some trailing whitespace.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Benjamin Sanda [Fri, 21 Dec 2018 16:26:49 +0000 (16:26 +0000)]
xenalyze: Build for Both ARM and x86 Platforms
Modified to provide building of the xenalyze binary for both ARM and
x86 platforms. The xenalyze binary is now built as part of the BIN
list for both platforms.
Signed-off-by: Benjamin Sanda <ben.sanda@dornerworks.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Julien Grall [Fri, 21 Dec 2018 16:26:47 +0000 (16:26 +0000)]
xen/arm: Allow a privileged domain to map foreign page from DOMID_XEN
For auto-translated domain, the only way to map a page to itself is the
using the foreign map API. The current code does not allow mapping page from
special page (such as DOMID_XEN).
As xentrace buffers are shared using DOMID_XEN, it is not possible to use
tracing for Arm.
This could be solved by using the helper get_pg_owner(). This helper will
be able to get a reference on DOMID_XEN and therefore allow mapping for
privileged domain.
This patch replace the call to rcu_lock_domain_by_any_id() with
get_pg_owner(). For consistency, all the call to rcu_unlock_domain are
replaced by put_pg_owner().
Julien Grall [Fri, 21 Dec 2018 16:26:46 +0000 (16:26 +0000)]
xen/arm: Make get_page_from_gfn working with DOMID_XEN
DOMID_XEN is used to share pages beloging to the hypervisor
(e.g trace buffers). Unlike other domains, DOMID_XEN is a non-auto
translated domain and therefore does not have a P2M.
This patch adds a special case for DOMID_XEN in get_page_from_gfn. We
may want to provide "non-auto translated helpers" in the future if we
see more case.
Julien Grall [Fri, 21 Dec 2018 16:26:45 +0000 (16:26 +0000)]
xen/arm: Add support for read-only foreign mappings
Currently, foreign mappings can only be read-write. A follow-up patch will
extend foreign mapping for Xen backend memory (via XEN_DOMID), some of
that memory should only be read accessible for the mapping domain.
Introduce a new p2m_type to cater read-only foreign mappings. For now,
the decision between the two foreign mapping type is based on the type
of the guest page mapped.
Julien Grall [Fri, 21 Dec 2018 16:26:43 +0000 (16:26 +0000)]
xen/arm: p2m: Introduce p2m_get_page_from_gfn
In a follow-up patch, we will need to handle get_page_from_gfn
differently for DOMID_XEN. To keep the code simple move the current
content in a new separate helper p2m_get_page_from_gfn.
Note the new helper is not anymore a static inline function as the helper
is quite complex.
Finally, take the opportunity to use typesafe gfn as the change is
minor.
Benjamin Sanda [Tue, 23 Oct 2018 15:21:35 +0000 (16:21 +0100)]
xen/page_alloc: Move get_pg_owner()/put_pg_owner() from x86 to common code
get_pg_owner() and put_pg_owner() will be necessary in a follow-up
commit to support xentrace on Arm. So move the helper to common code.
Take the opportunity to clean-up a bit the code moved.
Signed-off-by: Benjamin Sanda <ben.sanda@dornerworks.com>
[julien: Rework commit title / turn put_pg_owner to a static inline] Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 21 Dec 2018 07:57:31 +0000 (08:57 +0100)]
x86emul: support AVX512{F,BW} shift/rotate insns
Note that simd_packed_fp for the opcode space 0f38 major opcodes 14 and
15 is not really correct, but sufficient for the purposes here. Further
adjustments may later be needed for the down conversion unsigned
saturating VPMOV* insns, first and foremost for the different Disp8
scaling those ones use.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 21 Dec 2018 07:56:35 +0000 (08:56 +0100)]
x86emul: rename evex.br to evex.brs
This is to better reflect that it's an abbreviation for "broadcast,
rounding, or SAE" rather than just "broadcast".
Take the opportunity and also add SDM naming comments to both union vex
and union evex.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
xen/arm: zynqmp: introduce zynqmp specific defines
Introduce zynqmp specific defines for the firmware calls.
See EEMI:
https://www.xilinx.com/support/documentation/user_guides/ug1200-eemi-api.pdf
The error codes are described, under XIlPM Error Codes:
https://www.xilinx.com/support/documentation/user_guides/ug1137-zynq-ultrascale-mpsoc-swdev.pdf
- pm_api_id
These are the EEMI function IDs. Unavoidable.
- pm_ret_status
These are the EEMI return statuses. Unavoidable.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
xen/arm: zynqmp: Forward plaform specific firmware calls
Introduce zynqmp_eemi: a function responsible for implementing access
controls over the firmware calls. Only calls that are allowed are
forwarded to the firmware.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Acked-by: Julien Grall <julien.grall@arm.com>
Introduce platform_smc as a way to handle firmware calls that Xen does
not know about in a platform specific way. This is particularly useful
for implementing the SiP (SoC implementation specific) service calls.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Acked-by: Julien Grall <julien.grall@arm.com>
Julien Grall [Tue, 18 Dec 2018 13:07:39 +0000 (13:07 +0000)]
xen/arm: Stop relocating Xen
At the moment, Xen is relocated towards the end of the memory. While
this has the advantage to free space in low memory, the code is not
compliant with the break-before-make because it requires to switch
between two sets of page-table. This is not entirely trivial to fix as
it would require us to go through an identity mapping and disabling MMU.
Furthermore, it looks like that some platform (such as the Hikey960)
may not be able to bring-up secondary CPUs if the entry is too high.
While Xen should be quite tiny (< 2MB), the current algorithm to
allocate Dom0 memory will allocate memory chunks of at least 128MB.
Those memory chunks will always be 128MB. This means that depending on
where the modules are loaded, an extra 128MB may disappear.
As there are up to 4 modules (initramfs, XSM, kernel, DTB) loaded in
low memory. The problem is not entirely new as you could already waste
512MB of low-memory. The right solution would be to fix the allocation
algorithm. But this is independent from this patch.
For user in control of the memory (such as in U-boot), all modules
should be loaded as much as possible together or outside low-memory (i.e
above 4GB). For other users (i.e Grub/UEFI), I believe the bootloader is
already keeping everything together.
Based on the above, it would be fine to stop relocating Xen. This has
the advantage to simplify the code and should speed-up the boot as
relocation is not necessary anymore.
Note that the break-before-make issue is not fixed by this patch.
Julien Grall [Tue, 18 Dec 2018 18:04:17 +0000 (18:04 +0000)]
xen/arm: Track page accessed between batch of Set/Way operations
At the moment, the implementation of Set/Way operations will go through
all the entries of the guest P2M and flush them. However, this is very
expensive and may render unusable a guest OS using them.
For instance, Linux 32-bit will use Set/Way operations during secondary
CPU bring-up. As the implementation is really expensive, it may be possible
to hit the CPU bring-up timeout.
To limit the Set/Way impact, we track what pages has been of the guest
has been accessed between batch of Set/Way operations. This is done
using bit[0] (aka valid bit) of the P2M entry.
This patch adds a new per-arch helper is introduced to perform actions just
before the guest is first unpaused. This will be used to invalidate the
P2M to track access from the start of the guest.
Julien Grall [Tue, 18 Dec 2018 18:04:16 +0000 (18:04 +0000)]
xen/arm: Implement Set/Way operations
Set/Way operations are used to perform maintenance on a given cache.
At the moment, Set/Way operations are not trapped and therefore a guest
OS will directly act on the local cache. However, a vCPU may migrate to
another pCPU in the middle of the processor. This will result to have
cache with stale data (Set/Way are not propagated) potentially causing
crash. This may be the cause of heisenbug noticed in Osstest [1].
Furthermore, Set/Way operations are not available on system cache. This
means that OS, such as Linux 32-bit, relying on those operations to
fully clean the cache before disabling MMU may break because data may
sits in system caches and not in RAM.
For more details about Set/Way, see the talk "The Art of Virtualizing
Cache Maintenance" given at Xen Summit 2018 [2].
In the context of Xen, we need to trap Set/Way operations and emulate
them. From the Arm Arm (B1.14.4 in DDI 046C.c), Set/Way operations are
difficult to virtualized. So we can assume that a guest OS using them will
suffer the consequence (i.e slowness) until developer removes all the usage
of Set/Way.
As the software is not allowed to infer the Set/Way to Physical Address
mapping, Xen will need to go through the guest P2M and clean &
invalidate all the entries mapped.
Because Set/Way happen in batch (a loop on all Set/Way of a cache), Xen
would need to go through the P2M for every instructions. This is quite
expensive and would severely impact the guest OS. The implementation is
re-using the KVM policy to limit the number of flush:
- If we trap a Set/Way operations, we enable VM trapping (i.e
HVC_EL2.TVM) to detect cache being turned on/off, and do a full
clean.
- We clean the caches when turning on and off
- Once the caches are enabled, we stop trapping VM instructions
Julien Grall [Tue, 18 Dec 2018 18:04:15 +0000 (18:04 +0000)]
xen/arm: vsysreg: Add wrapper to handle sysreg access trapped by HCR_EL2.TVM
A follow-up patch will require to emulate some accesses to system
registers trapped by HCR_EL2.TVM. When set, all NS EL1 writes to the
virtual memory control registers will be trapped to the hypervisor.
This patch adds the infrastructure to passthrough the access to the host
registers.
Note that HCR_EL2.TVM will be set in a follow-up patch dynamically.
Julien Grall [Tue, 18 Dec 2018 18:04:14 +0000 (18:04 +0000)]
xen/arm: vcpreg: Add wrappers to handle co-proc access trapped by HCR_EL2.TVM
A follow-up patch will require to emulate some accesses to some
co-processors registers trapped by HCR_EL2.TVM. When set, all NS EL1 writes
to the virtual memory control registers will be trapped to the hypervisor.
This patch adds the infrastructure to passthrough the access to host
registers. For convenience a bunch of helpers have been added to
generate the different helpers.
Note that HCR_EL2.TVM will be set in a follow-up patch dynamically.
Julien Grall [Mon, 26 Nov 2018 16:12:01 +0000 (16:12 +0000)]
xen/arm: p2m: Add support for preemption in p2m_cache_flush_range
p2m_cache_flush_range does not yet support preemption, this may be an
issue as cleaning the cache can take a long time. While the current
caller (XEN_DOMCTL_cacheflush) does not stricly require preemption, this
will be necessary for new caller in a follow-up patch.
The preemption implemented is quite simple, a counter is incremented by:
- 1 on region skipped
- 10 for each page requiring a flush
When the counter reach 512 or above, we will check if preemption is
needed. If not, the counter will be reset to 0. If yes, the function
will stop, update start (to allow resuming later on) and return
-ERESTART. This allows the caller to decide how the preemption will be
done.
For now, XEN_DOMCTL_cacheflush will continue to ignore the preemption.
Andrew Cooper [Mon, 19 Feb 2018 13:35:58 +0000 (13:35 +0000)]
x86/pv: Expose RDTSCP to PV guests
The final remnanat of PVRDTSCP is that we would emulate RDTSCP even on
hardware which lacked the instruction. RDTSCP is available on almost all
64-bit x86 hardware.
Remove this emulation, drop the TSC_MODE_PVRDTSCP constant, and allow RDTSCP
in a PV guest's CPUID policy.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 15 Nov 2018 21:04:37 +0000 (21:04 +0000)]
x86/msr: Handle MSR_TSC_AUX consistently for PV and HVM guests
With PVRDTSCP mode removed, handling of MSR_TSC_AUX can move into the common
code. Move its storage into struct vcpu_msrs (dropping the HVM-specific
msr_tsc_aux), and add an RDPID feature check as this bit also enumerates the
presence of the MSR.
Introduce cpu_has_rdpid along with the synthesized cpu_has_msr_tsc_aux to
correct the context switch paths, as MSR_TSC_AUX is enumerated by either
RDTSCP or RDPID.
Drop hvm_msr_tsc_aux() entirely, and use v->arch.msrs->tsc_aux directly.
Update hvm_load_cpu_ctxt() to check that the incoming ctxt.msr_tsc_aux isn't
out of range. In practice, no previous version of Xen ever wrote an
out-of-range value. Add MSR_TSC_AUX to the list of MSRs migrated for PV
guests, but leave the HVM path using the existing space in hvm_hw_cpu.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 19 Feb 2018 14:27:04 +0000 (14:27 +0000)]
x86/pv: Remove deferred RDTSC{,P} handling in pv_emulate_privileged_op()
As noted in c/s 4999bf3e8b "x86/PV: use generic emulator for privileged
instruction handling", these hoops are jumped through to retain the older
behaviour, along with a note suggesting that we should reconsider things.
Part of the reason for retention of the old behaviour was removed by c/s 5b04262079 "x86/time: Rework pv_soft_rdtsc() to aid further cleanup" which in
particular caused it to not write regs->rcx directly.
It does not matter exactly when pv_soft_rdtsc() is called, as Xen's behaviour
is an opaque atomic action from the guests point of view.
Drop all the deferral logic, and leave TSC_AUX uniformly at 0 as PVRDTSCP mode
is being removed. Later changes will make this behave architecturally.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
PVRDTSCP was an attempt to provide Xen-aware userspace with a stable monotonic
clock, and enough information for said userspace to cope with frequency
changes across migrate. However, the PVRDTSCP infrastructure has resulted in
very tangled code, and non-architectural behaviour even in non-PVRDTSCP cases.
Seeing as the functionality has been replaced entirely by improvements in PV
clocks (including being plumbed into the VDSO nowadays), or alternatively by
hardware TSC scaling features, and no-one is aware of any users of this mode,
take the opportunity to remove it.
For now, drop TSC_MODE_PVRDTSCP from tsc_{get,set}_info(). This will catch
and cleanly reject attempts to migrate in a VM configured to use PVRDTSCP,
rather than letting it run and have the wrong timing mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 13 Dec 2018 23:51:41 +0000 (15:51 -0800)]
tools/docs: Remove PVRDTSCP support
PVRDTSCP is believed-unused, and its implementation has adverse consequences
on unrelated functionality in the hypervisor. As a result, support is being
removed.
Modify libxl to provide a slightly more helpful error message if it encounters
PVRDTSCP being selected. While adjusting TSC handling, make libxl check for
errors from the set_tsc hypercall.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Tue, 20 Nov 2018 11:37:19 +0000 (11:37 +0000)]
x86/time: Alter tsc_set_info() to return an error value
Currently, tsc_set_info() performs no parameter checking, and an invalid
tsc_mode goes largely unnoticed. Fix it to reject invalid tsc_modes with
-EINVAL, and update the callers to check the return value.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 10 Dec 2018 11:42:13 +0000 (11:42 +0000)]
x86/hvm: Disallow moving the APIC MMIO window
See the code comment for a full discussion, but in short: guests which
currently run under Xen don't move the window, because moving it has never
worked properly. Implementing support for moving the window is never going to
work architecturally unless we switch to per-vcpu P2Ms (which seems very
unlikely), and would still be a substantial quantity of work for a feature
which is unused in practice.
Take the opportunity to rename vlapic_msr_set() to be consistent with the
other MSR handling functions, and return X86EMUL_* constants. Add logic to
check for reserved bits, including refusing x2APIC mode if it has not been
offered to the guest. Move the guest_{rd,wr}msr_x2apic() declarations into
vlapic.h which is a more appropriate place for them to live.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 18 Dec 2018 14:21:17 +0000 (15:21 +0100)]
x86emul: avoid triggering assertions with VME/PVI early #GP check
In commit efe9cba66c ("x86emul: VME and PVI modes require a #GP(0) check
first thing") I neglected the fact that the retire flags get zapped only
in x86_decode(), which hasn't been invoked yet at the point of the #GP(0)
check added. Move output state initialization into a helper function,
and invoke it from the callers of x86_decode() instead of doing it
(possibly too late) in that function.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 18 Dec 2018 14:19:47 +0000 (15:19 +0100)]
x86emul: work around SandyBridge errata
There are a number of exception condition related errata on SandyBridge
CPUs, some of which are unexpected #UD (others, of no interest here, are
lack of mandated exceptions, or exceptions of unexpected type). Annotate
the one workaround we already have, and add two more.
Due to the exception recovery we have in place for stub invocations
these aren't security issues.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 18 Dec 2018 13:27:09 +0000 (14:27 +0100)]
x86emul: fix 3-operand IMUL
While commit 75066cd4ea ("x86emul: fix {,i}mul and {,i}div") indeed did
as its title says, it broke the 3-operand form by uniformly using AL/AX/
EAX/RAX as second source operand. Fix this and add tests covering both
cases.
Reported-by: Andrei Lutas <vlutas@bitdefender.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 18 Dec 2018 13:26:44 +0000 (14:26 +0100)]
x86emul/test: drop another instance of .byte
Now that we require use of the {evex} pseudo-prefix, we can also use
the q-suffixed encoding of VPCMPESTRI, which is available as of 2.29
just like {evex} is.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 30 Nov 2018 16:14:08 +0000 (16:14 +0000)]
x86/hvm: Corrections to RDTSCP intercept handling
For both VT-x and SVM, the RDTSCP intercept will trigger if the pipeline
supports the instruction, but the guest may not have RDTSCP in its featureset.
Bring the vmexit handlers in line with the main emulator behaviour by
optionally handing back #UD.
Next on the AMD side, if RDTSCP actually ends up being intercepted on a debug
build or first-gen SVM hardware which lacks NRIP, we first update regs->rcx,
then call __get_instruction_length() asking for RDTSC. As the two
instructions are different (and indeed, different lengths!),
__get_instruction_length_from_list() fails and hands back a #GP fault.
This can demonstrated by putting a guest into tsc_mode="always emulate" and
executing an RDTSCP instruction:
(d1) --- Xen Test Framework ---
(d1) Environment: HVM 64bit (Long mode 4 levels)
(d1) Test rdtscp
(d1) TSC mode 1
(XEN) emulate.c:147:d1v0 __get_instruction_length: Mismatch between expected and actual instruction:
(XEN) emulate.c:152:d1v0 insn_index 8, opcode 0xf0031 modrm 0
(XEN) emulate.c:154:d1v0 rip 0x10475f, nextrip 0x104762, len 3
(XEN) SVM insn len emulation failed (1): d1v0 64bit @ 0008:0010475f -> 0f 01 f9 0f 31 5b 31 ff 31 c0 e9 c2 db ff ff 00
(d1) ******************************
(d1) PANIC: Unhandled exception at 0008:000000000010475f
(d1) Vec 13 #GP[0000]
(d1) ******************************
First, teach __get_instruction_length() to cope with RDTSCP, and improve
svm_vmexit_do_rdtsc() to ask for the correct instruction. Move the regs->rcx
adjustment into this function to ensure it gets done after we are done
potentially raising faults.
Reported-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Juergen Gross [Mon, 10 Dec 2018 11:44:22 +0000 (12:44 +0100)]
xen: add CONFIG item for default dom0 memory size
With being able to specify a dom0_mem value depending on host memory
size on x86 make it easy for distros to specify a default dom0 size by
adding a CONFIG_DOM0_MEM item which presets the dom0_mem boot parameter
value.
It will be used only if no dom0_mem parameter was specified in the
boot parameters.
Andrii Anisov [Wed, 12 Dec 2018 18:20:55 +0000 (20:20 +0200)]
arm/irq: skip action availability check for non-debug build
Under desc->lock taken:
An IRQ with _IRQ_GUEST flag set always has an action.
An IRQ with _IRQ_DISABLED flag cleared always has an action.
Those flags checks cover all accesses to desc->action in do_IRQ,
so we can skip desc->action check in non-debug build.
Keep in place for debug build to help diagnostics potential
misconfiguration.
Andrii Anisov [Wed, 12 Dec 2018 18:20:54 +0000 (20:20 +0200)]
gic-vgic: Drop an excessive clear_lrs
This action is excessive because for an invalid LR there is no need
to write another invalid value to a register. So we can skip it here,
saving a peripheral register write.
Keep clearing the LR for the DEBUG build. This would make dumped
invalid LRs be zero. That is more obvious than picking state bits
from a non-zero value.
Paul Durrant [Thu, 13 Dec 2018 11:01:50 +0000 (12:01 +0100)]
amd-iommu: remove page merging code
The page merging logic makes use of bits 1-8 and bit 63 of a PTE, which
used to be specified as 'ignored'. However, bits 5 and 6 are now specified
as 'accessed' and 'dirty' bits and their use only remains safe as long as
the DTE 'Host Access Dirty' bits remain unused by Xen, or by hardware
before the domain starts running. (XSA-275 disabled the operation of the
code after domain creation completes).
With the page merging logic present in its current form there are no spare
ignored bits in the PTE at all, but PV-IOMMU support will require at least
one spare bit to track which PTEs are added by hypercall.
This patch removes the code, freeing up the remaining PTE ignored bits
for other use, including PV-IOMMU support, as well as significantly
simplifying and shortening the source by ~170 lines. There may be some
marginal performance cost (but none has been observed in manual testing
with a passed-through NVIDIA GPU) since higher order mappings will now be
ruled out until a mapping order parameter is passed to iommu_ops. That will
be dealt with by a subsequent patch though.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
Julien Grall [Thu, 29 Nov 2018 11:37:43 +0000 (11:37 +0000)]
xen/arm: mm: Set-up page permission for Xen mappings earlier on
Xen mapping is first create using a 2MB page and then shatterred in 4KB
page for fine-graine permission. However, it is not safe to break-down
superpage page without going to an intermediate step invalidating
the entry.
As we are changing Xen mappings, we cannot go through the intermediate
step. The only solution is to create Xen mapping using 4KB entries
directly. As the Xen should always access the mappings according with
the runtime permission, it is then possible to set-up the permissions
while create the mapping.
We are still playing with the fire as there are still some
break-before-make issue in setup_pagetables (i.e switch between 2 sets of
page-tables). But it should slightly be better than the current state.
Julien Grall [Thu, 29 Nov 2018 19:02:09 +0000 (19:02 +0000)]
xen/arm: p2m: Rework p2m_cache_flush_range
A follow-up patch will add support for preemption in p2m_cache_flush_range.
Because of the complexity for the 2 loops, it would be necessary to add
preemption in both of them.
This can be avoided by merging the 2 loops together and still keeping
the code fairly simple to read and extend.
Julien Grall [Mon, 26 Nov 2018 14:25:54 +0000 (14:25 +0000)]
xen/arm: traps: Rework leave_hypervisor_tail
The function leave_hypervisor_tail is called before each return to the
guest vCPU. It has two main purposes:
1) Process physical CPU work (e.g rescheduling) if required
2) Prepare the physical CPU to run the guest vCPU
2) will always be done once we finished to process physical CPU work. At
the moment, it is done part of the last iterations of 1) making adding
some extra indentation in the code.
This could be streamlined by moving out 2) of the loop. At the same
time, 1) is moved in a separate function making more obvious what is
happening.
All those changes will help a follow-up patch where we would want to
introduce some vCPU work before returning to the guest vCPU.
Julien Grall [Mon, 6 Aug 2018 16:47:54 +0000 (17:47 +0100)]
xen/arm: p2m: Extend p2m_get_entry to return the value of bit[0] (valid bit)
With the recent changes, a P2M entry may be populated but may not be
valid. In some situation, it would be useful to know whether the entry
has been marked available to guest in order to perform a specific
action. So extend p2m_get_entry to return the value of bit[0] (valid bit).
Julien Grall [Wed, 21 Feb 2018 14:18:44 +0000 (14:18 +0000)]
xen/arm: p2m: Allow to flush cache on any RAM region
Currently, we only allow to flush cache on regions mapped as p2m_ram_{rw,ro}.
There are no real problem in cache flushing any RAM regions such as grants
and foreign mapping. Therefore, relax the check to allow flushing the
cache on any RAM region.
xen/arm: p2m: Introduce a function to resolve translation fault
Currently a Stage-2 translation fault could happen:
1) MMIO emulation
2) Another pCPU was modifying the P2M using Break-Before-Make
3) Guest Physical address is not mapped
A follow-up patch will re-purpose the valid bit in an entry to generate
translation fault. This would be used to do an action on each entry to
track pages used for a given period.
When receiving the translation fault, we would need to walk the pages
table to find the faulting entry and then toggle valid bit. We can't use
p2m_lookup() for this purpose as it only tells us the mapping exists.
So this patch adds a new function to walk the page-tables and updates
the entry. This function will also handle 2) as it also requires walking
the page-table.
The function is able to cope with both table and block entry having the
validate bit unset. This gives flexibility to the function clearing the
valid bits. To keep the algorithm simple, the fault will be propating
one-level down. This will be repeated until a block entry has been
reached.
At the moment, there are no action done when reaching a block/page entry
but setting the valid bit to 1.
Julien Grall [Wed, 21 Feb 2018 14:18:40 +0000 (14:18 +0000)]
xen/arm: p2m: Handle translation fault in get_page_from_gva
A follow-up patch will re-purpose the valid bit of LPAE entries to
generate fault even on entry containing valid information.
This means that when translating a guest VA to guest PA (e.g IPA) will
fail if the Stage-2 entries used have the valid bit unset. Because of
that, we need to fallback to walk the page-table in software to check
whether the fault was expected.
This patch adds the software page-table walk on all the translation
fault. It would be possible in the future to avoid pointless walk when
the fault in PAR_EL1 is not a translation fault.
This function has only worked for guest RAM pages (no foreing mappings or
MMIO mappings) because we require the page to belong to the domain for
getting a reference. This means we can deny all non guest RAM pages.
Julien Grall [Wed, 21 Feb 2018 14:18:42 +0000 (14:18 +0000)]
xen/arm: p2m: Introduce p2m_is_valid and use it
The LPAE format allows to store information in an entry even with the
valid bit unset. In a follow-up patch, we will take advantage of this
feature to re-purpose the valid bit for generating a translation fault
even if an entry contains valid information.
So we need a different way to know whether an entry contains valid
information. It is possible to use the information hold in the p2m_type
to know for that purpose. Indeed all entries containing valid
information will have a valid p2m type (i.e p2m_type != p2m_invalid).
This patch introduces a new helper p2m_is_valid, which implements that
idea, and replace most of lpae_is_valid call with the new helper. The ones
remaining are for TLBs handling and entries accounting.
With the renaming there are 2 others changes required:
- Generate table entry with a valid p2m type
- Detect new mapping for proper stats accounting
Julien Grall [Wed, 21 Feb 2018 14:18:44 +0000 (14:18 +0000)]
xen/arm: Introduce helpers to clear/flags flags in HCR_EL2
A couple of places in the code will need to clear/set flags in HCR_EL2
for a given vCPU and then replicate into the hardware. Introduce
helpers and replace open-coded version.
Juergen Gross [Tue, 11 Dec 2018 08:43:00 +0000 (09:43 +0100)]
x86: add dom0 memory sizing variants
Today the memory size of dom0 can be specified only in terms of bytes
(either an absolute value or "host-mem - value"). When dom0 shouldn't
be auto-ballooned this requires nearly always a manual adaption of the
Xen boot parameters to reflect the actual host memory size.
Add more possibilities to specify memory sizes. Today we have:
dom0_mem= List of ( min:<size> | max:<size> | <size> )
with <size> being a positive or negative size value (e.g. 1G).
Modify that to:
dom0_mem= List of ( min:<sz> | max:<sz> | <sz> )
<sz>: <size> | [<size>+]<frac>%
<frac>: integer value < 100
With the following semantics:
<frac>% specifies a fraction of host memory size in percent.
<sz> is a percentage of host memory plus an offset.
So <sz> being 1G+25% on a 256G host would result in 65G.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Tue, 11 Dec 2018 08:42:20 +0000 (09:42 +0100)]
modify parse_size_and_unit() to support percentage
Modify parse_size_and_unit() to support a value followed by a '%'
character. In this case ps is required to be non-NULL to ensure the
caller can detect that case. The returned value will be the integer
value s was pointing to and *ps will point to the '%' character.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 7 Dec 2018 17:00:47 +0000 (17:00 +0000)]
x86/VT-x: Don't activate VMCS Shadowing outside of nested vmx mode
By default on capable hardware, SECONDARY_EXEC_ENABLE_VMCS_SHADOWING is
activated unilaterally. The VMCS Link pointer is initialised to ~0, but the
VMREAD/VMWRITE bitmap pointers are not.
This causes the 16bit IVT and Bios Data Area get interpreted as the read/write
permission bitmap for guests which blindly execute VMREAD/VMWRITE
instructions.
This is not a security issue because the VMCS Link pointer being ~0 causes
VMREAD/VMWRITE to complete with VMFailInvalid (rather than modifying a
potential shadow VMCS), and the contents of MFN 0 has already been determined
not to contain any interesting data because of L1TF's ability to read that 4k
frame.
Leave VMCS Shadowing disabled by default, and toggle it in
nvmx_{set,clear}_vmcs_pointer(). This isn't the most efficient course of
action, but it is the most simple way of leaving nested-virt working as it did
before.
While editing construct_vmcs(), collect all default secondary_exec_control
modifications together. The disabling of PML is latently buggy because it
happens after secondary_exec_control are written into the VMCS, although there
is an unconditional update later which writes the correct value into hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 7 Dec 2018 13:43:27 +0000 (13:43 +0000)]
docs/cmdline: Rewrite the cpuid_mask_* section
A large amount of the information here is obsolete since Xen 4.7
To being with, however, this patch marks a change in style for section
headings, due to how HTML anchors are generated. Having more than one
parameter per heading makes an awkward anchor, especially when brace globbing
is used. Furthermore, the misc suffixes such as (AMD only) get included, as
do the escaping for the underscores.
Markdown doesn't require escaped underscores in headings (I'm not entirely
sure how we ended up with that style), so remove them and fully expand the
glob syntax. Also adjust com1,com2 while at it, which is the only other
multi-parameter heading. Move the misc suffixes into an "Applicability:" note
alongside the information about defaults.
This results in the headings being unadorned, and identical to how they are
expressed on the command line and in code.
For cpuid_mask_cpu option, collapse the long line of almost identical strings
using [] globbing. The result is much shorter and clearer to read. Add a
warning that this option no longer masks all features on Fam15h and above, due
to not making use of the leaf 7 masks.
For the remainder of the cpuid_mask_* options, collapse them all together into
a single description.
Finally, leave an explicit note explaining that people should not be using
these options for migration safety.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 7 Dec 2018 13:43:23 +0000 (13:43 +0000)]
docs/cmdline: Fix markdown syntax
* vwfi needs a closing `. rmrr needs one as well, and the opening ' switched
to `
* The com1/com2 example lines are already verbatim blocks and shouldn't
escape their underscores. This ends up in the rendered output.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Thu, 6 Dec 2018 14:05:34 +0000 (14:05 +0000)]
x86/pv: Code improvements to do_update_descriptor()
* Add "uint64_t raw" to seg_desc_t to remove the opencoded uint64_t casting
in this function. Change the parameter to be of type seg_desc_t.
* Rename the 'pa' parameter to 'gaddr', because it lives in GFN space rather
than physical address space.
* Use gfn_t and mfn_t rather than unsigned longs.
* Check the alignment and proposed new descriptor before taking a page
reference.
* Use the more flexible ACCESS_ONCE() accessor in preference to
write_atomic()
No expected change in behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 6 Dec 2018 14:05:29 +0000 (14:05 +0000)]
x86: Switch "struct desc_struct" to being seg_desc_t
The struct suffix is redundant in the name, and a future change will want to
turn it into a union, rather than a structure. As this represents a segment
descriptor, give it an appropriate typedef.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Ian Jackson [Mon, 3 Dec 2018 12:05:41 +0000 (12:05 +0000)]
docs/parse-support-md: Allow definition lists for features
Now, as well as a `code block', with
| Something: some status
we tolerate a definition list which in pandoc terms looks like this
|Term
|: Definition
This ought not usually be be used for features but it will be useful
for linking to the release notes, because markup is not allowed in
code blocks but is in definitions.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Juergen Gross <jgross@suse.com>
Ian Jackson [Mon, 3 Dec 2018 12:01:55 +0000 (12:01 +0000)]
docs/parse-support-md: Correct handling of Status
In fact this was not markdown content, but just a string. We are
however going to make it be markdown content. So adjust the comments,
and the consumer.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Juergen Gross <jgross@suse.com>
Paul Durrant [Fri, 7 Dec 2018 17:50:08 +0000 (17:50 +0000)]
x86/hvm/viridian: stop open coding updates to APIC registers
The code in viridian_synic_wrmsr() duplicates logic in vlapic_reg_write()
to update the ICR, ICR2 and TASKPRI registers. Instead of doing this,
make vlapic_reg_write() non-static and call it.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Rename "offset" to "reg" for consistency with the rest of the vlapic API.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
When viridian APIC assist is active, the code in vlapic_has_pending_irq()
may end up re-calling vlapic_find_highest_isr() after emulating an EOI
whereas simply moving the call after the EOI emulation removes the need
for this duplication.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 6 Dec 2018 11:20:55 +0000 (12:20 +0100)]
console: adjust IRQ initialization
In order for a Xen internal PCI device driver to enable MSI on the
device, we need another hook which the driver can use to create the IRQ
(doing this in the init_preirq hook is too early, since IRQ code hasn't
got initialized at that time yet, and doing it in init_postirq is too
late because at least on x86 smp_intr_init() needs to know the IRQ
number).
On x86 this additionally requires a slight ordering change to IRQ
initialization, to facilitate calling the new hook between basic
initialization and the call path leading to smp_intr_init().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Thu, 6 Dec 2018 11:19:04 +0000 (12:19 +0100)]
make domain_adjust_tot_pages() __must_check
Even if unlikely, donate_page() should not ignore the possible need to
obtain a domain reference. To make people look more closely when they
add new uses of domain_adjust_tot_pages(), force its return value to be
checked. This in turn requires a benign change to assign_pages().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 26 Feb 2018 12:45:58 +0000 (12:45 +0000)]
x86/hvm: Handle x2apic MSRs via the new guest_{rd,wr}msr() infrastructure
Dispatch from the guest_{rd,wr}msr() functions. The read side should be safe
outside of current context, but the write side is definitely not. As the
toolstack has no legitimate reason to access the APIC registers via this
interface (not least because whether they are accessible at all depends on
guest settings), unilaterally reject access attempts outside of current
context.
Rename to guest_{rd,wr}msr_x2apic() for consistency, and alter the functions
to use X86EMUL_EXCEPTION rather than X86EMUL_UNHANDLEABLE. The previous
callers turned UNHANDLEABLE into EXCEPTION, but using UNHANDLEABLE will now
interfere with the fallback to legacy MSR handling.
While altering guest_rdmsr_x2apic() make a couple of minor improvements.
Reformat the initialiser for readable[] so it indents in a more natural way,
and alter high to be a 64bit integer to avoid shifting 0 by 32 in the common
path.
Observant people might notice that we now don't let PV guests read the x2apic
MSRs. They should never have been able to in the first place.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>