Julien Grall [Mon, 5 Dec 2016 17:43:23 +0000 (17:43 +0000)]
xen/arm: traps: Emulate ICC_SRE_EL1 as RAZ/WI
Recent Linux kernels (4.4 and onwards [1]) check whether it is possible
to enable sysreg access (ICC_SRE_EL1.SRE) when the ID register
(ID_AA64PFR0_EL1.GIC) reports the presence of the sysreg interface.
When the guest has been configured to use GICv2, the hypervisor will
disable sysreg access for this VM (via ICC_SRE_EL2.Enable), and therefore
accesses to system registers such as ICC_SRE_EL1 are trapped to EL2.
However, ICC_SRE_EL1 is not emulated by the hypervisor. This means that
Linux will crash as soon as it tries to access ICC_SRE_EL1.
To solve this problem, Xen can implement ICC_SRE_EL1 as read-as-zero /
write-ignore. The emulation is only used when sysreg access is disabled
for EL1.
[1] 963fcd409 "arm64: cpufeatures: Check ICC_EL1_SRE.SRE before
enabling ARM64_HAS_SYSREG_GIC_CPUIF"
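A minimal sketch of such a RAZ/WI handler in the EL2 sysreg trap path; the helper and field names approximate xen/arch/arm/traps.c and are illustrative rather than the exact code added by the patch:

    /* Sketch: read-as-zero / write-ignore emulation of ICC_SRE_EL1. */
    static void vreg_emulate_icc_sre_sketch(struct cpu_user_regs *regs,
                                            const union hsr hsr)
    {
        if ( hsr.sysreg.read )
            set_user_reg(regs, hsr.sysreg.reg, 0);  /* reads return 0 */
        /* Writes are simply ignored. */
        advance_pc(regs, hsr);                      /* skip the trapped MRS/MSR */
    }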
arm/irq: Reorder check when the IRQ is already used by someone
Call irq_get_domain for the IRQ we are interested in
only after making sure that it is a guest IRQ, to avoid
triggering the ASSERT(test_bit(_IRQ_GUEST, &desc->status)).
Jun Sun [Mon, 10 Oct 2016 19:27:56 +0000 (12:27 -0700)]
Don't clear HCR_VM bit when updating VTTBR.
Currently, p2m_restore_state() clears the HCR_VM bit, i.e. disables
stage-2 translation, before updating the VTTBR register. After
some research and talking to ARM support, I got confirmation that this
is not necessary. We are currently working on a new platform that would
need this to be removed.
The patch is tested on FVP foundation model.
Signed-off-by: Jun Sun <jsun@junsun.net> Acked-by: Steve Capper <steve.capper@linaro.org> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
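A hedged sketch of the resulting flow, assuming a precomputed per-domain VTTBR value; the register accessors follow Xen's ARM conventions, while the field layout is illustrative:

    /* Sketch: install the guest's stage-2 root without toggling HCR_EL2.VM. */
    static void p2m_restore_state_sketch(struct vcpu *n)
    {
        uint64_t vttbr = n->domain->arch.p2m.vttbr;   /* field name illustrative */

        WRITE_SYSREG64(vttbr, VTTBR_EL2);
        isb();                       /* make the new translation root effective */
    }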
Dario Faggioli [Tue, 14 Mar 2017 13:01:12 +0000 (14:01 +0100)]
xen: credit2: don't miss accounting while doing a credit reset.
A credit reset basically means going through all the
vCPUs of a runqueue and altering their credits, as a
consequence of a 'scheduling epoch' having come to an
end.
Blocked or runnable vCPUs are fine, all the credits
they've spent running so far have been accounted to
them when they were scheduled out.
But if a vCPU is running on a pCPU when a reset event
occurs (on another pCPU), its runtime does not get properly
accounted. Let's therefore begin to do so, for better
accuracy and fairness.
In fact, a trace taken after this patch shows how d1v5
actually executed for ~9.796 ms on pCPU 10 when reset_credit()
is executed on pCPU 12, because d1v6's credits went below 0.
Without this patch, these ~9.796 ms are not accounted
to anyone. With this patch, d1v5 is charged for that time,
and its credits drop from 9796548 to 201805.
And this is important, as it means that it will
begin the new epoch with 10201805 credits, instead
of 10500000 (which it would have had before this patch).
Basically, we were forgetting one round of accounting
in epoch x for the vCPUs that are running at the time
the epoch ends. And this meant favouring those same
vCPUs a little bit in epoch x+1, giving them the
chance to execute longer than their fair share.
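A hedged sketch of the extra accounting done at reset time; burn_credits() and the runqueue/field names approximate sched_credit2.c, and the existing per-epoch reset logic is elided:

    /* Sketch: charge currently-running vCPUs before topping credits up. */
    static void reset_credit_sketch(struct csched2_runqueue_data *rqd, s_time_t now)
    {
        struct csched2_vcpu *svc;

        list_for_each_entry ( svc, &rqd->svc, rqd_elem )
        {
            if ( svc->vcpu->is_running )
                burn_credits(rqd, svc, now);  /* account the in-flight runtime */

            /* ... existing per-epoch credit top-up for svc ... */
        }
    }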
Dario Faggioli [Tue, 14 Mar 2017 13:00:46 +0000 (14:00 +0100)]
xen: credit2: always mark a tickled pCPU as... tickled!
In fact, whether or not a pCPU has been tickled, and is
therefore about to re-schedule, is something we look at
and base decisions on in various places.
So, let's make sure that we do that based on accurate
information.
While there, also tweak smt_idle_mask_clear() a little bit
(used for implementing SMT support), so that it only alters
the relevant cpumask when there is an actual need for it.
(This is only to reduce overhead; behaviour remains the
same.)
Andrew Cooper [Tue, 14 Mar 2017 12:59:14 +0000 (13:59 +0100)]
x86/layout: Correct Xen's idea of its own memory layout
c/s b4cd59fe "x86: reorder .data and .init when linking" had an unintended
side effect, where xen_in_range() and the tboot S3 MAC were no longer correct.
In practice, it means that Xen's .data section is excluded from consideration,
which means:
1) Default IOMMU construction for the hardware domain could create mappings.
2) .data isn't included in the tboot MAC checked on resume from S3.
Adjust the comments and virtual address anchors used to define the regions.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c9a4a1c419cebac83a8fb60c4532ad8ccc973dc4
master date: 2017-02-28 16:18:38 +0000
Andrew Cooper [Tue, 14 Mar 2017 12:58:21 +0000 (13:58 +0100)]
x86/vmx: Don't leak host syscall MSR state into HVM guests
hvm_hw_cpu->msr_flags is in fact the VMX dirty bitmap of MSRs needing to be
restored when switching into guest context. It should never have been part of
the migration state to start with, and Xen must not make any decisions based
on the value seen during restore.
Identify it as obsolete in the header files, consistently save it as zero and
ignore it on restore.
The MSRs must be considered dirty during VMCS creation to cause the proper
defaults of 0 to be visible to the guest.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 2f1add6e1c8789d979daaafa3d80ddc1bc375783
master date: 2017-02-21 11:06:39 +0000
xen/arm: fix affected memory range by dcache clean functions
clean_dcache_va_range and clean_and_invalidate_dcache_va_range don't
calculate the range correctly when "end" is not cacheline aligned. As a
result, the last cacheline can end up being skipped. Fix the issue by
aligning the start address to the cacheline size.
In addition, make the code simpler and faster in
invalidate_dcache_va_range, by removing the modulo operation and using
bitmasks instead. Also remove the size adjustments in
invalidate_dcache_va_range, because the size variable is not used later
on.
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Tested-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
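A simplified sketch of the corrected range computation for the clean functions above, assuming a fixed 64-byte cacheline and the AArch64 "dc cvac" clean-by-VA instruction; Xen's real helpers read the line size from CTR_EL0 and are generated from macros shared with the invalidate variants:

    /* Sketch: align the start down so the partial first/last lines are covered. */
    static inline void clean_dcache_va_range_sketch(const void *p, unsigned long size)
    {
        const unsigned long cacheline_bytes = 64;   /* assumed, normally probed */
        const void *end = p + size;

        dsb(sy);
        p = (const void *)((unsigned long)p & ~(cacheline_bytes - 1));
        for ( ; p < end; p += cacheline_bytes )
            asm volatile ("dc cvac, %0" :: "r" (p) : "memory");
        dsb(sy);
    }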
Introduce a new Xen command line parameter called "vwfi", which stands for
virtual wfi. The default is "trap": Xen traps guest wfi and wfe
instructions. In the case of wfi, Xen calls vcpu_block on the guest
vcpu; in the case of guest wfe, Xen calls vcpu_yield on the guest vcpu.
The behavior can be changed by setting vwfi to "native"; in that case
Xen traps neither wfi nor wfe, running them in guest context.
The result is a strong reduction in IRQ latency (from 5000ns to 2000ns,
measured using https://github.com/edgarigl/tbm, the physical timer, and
1 pCPU dedicated to 1 vCPU). The downside is that the scheduler thinks
the guest is busy when it is actually sleeping, leading to suboptimal
scheduling decisions.
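A hedged sketch of the trap-side handling implied by vwfi=trap (the default); with vwfi=native the corresponding HCR_EL2 trap bits are simply not set, so this path is never reached. The HSR field used to distinguish WFE from WFI is illustrative:

    /* Sketch: WFI blocks the vCPU, WFE yields it; both then skip the instruction. */
    static void handle_wfi_wfe_sketch(struct cpu_user_regs *regs, const union hsr hsr)
    {
        if ( hsr.wfi_wfe.ti )                          /* trapped instruction was WFE */
            vcpu_yield();
        else                                           /* trapped instruction was WFI */
            vcpu_block_unless_event_pending(current);
        advance_pc(regs, hsr);
    }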
Dario Faggioli [Tue, 28 Feb 2017 09:47:24 +0000 (10:47 +0100)]
xen: fix a (latent) cpupool-related race during domain destroy
So, during domain destruction, we do:
cpupool_rm_domain() [ in domain_destroy() ]
sched_destroy_domain() [ in complete_domain_destroy() ]
Therefore, there's a window during which, from the
scheduler's point of view, a domain still exists outside
of any cpupool.
In fact, cpupool_rm_domain() does d->cpupool=NULL,
and we don't allow that to hold true, for anything
but the idle domain (and there are, in fact, ASSERT()s
and BUG_ON()s to that effect).
Currently, we never really check d->cpupool during the
window, but that does not mean the race is not there.
For instance, Credit2 at some point (during load balancing)
iterates on the list of domains, and if we add logic that
needs checking d->cpupool, and any one of them had
cpupool_rm_domain() called on itself already... Boom!
(In fact, calling __vcpu_has_soft_affinity() from inside
balance_load() makes `xl shutdown <domid>' reliably
crash, and this is how I discovered this.)
On the other hand, cpupool_rm_domain() "only" does
cpupool related bookkeeping, and there's no harm
postponing it a little bit.
Also, considering that, during domain initialization,
we do:
cpupool_add_domain()
sched_init_domain()
It makes sense for the destruction path to look like
the opposite of it, i.e.:
sched_destroy_domain()
cpupool_rm_domain()
And hence that's what this patch does.
Actually, for better robustness, what we really do is
moving both cpupool_add_domain() and cpupool_rm_domain()
inside sched_init_domain() and sched_destroy_domain(),
respectively (and also add a couple of ASSERT()-s).
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Juergen Gross <jgross@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: f3d47501db2b7bb8dfd6a3c9710b7aff4b1fc55b
master date: 2016-08-03 14:14:08 +0100
Jan Beulich [Mon, 20 Feb 2017 15:02:47 +0000 (16:02 +0100)]
VMX: fix VMCS race on context-switch paths
When __context_switch() is being bypassed during original context
switch handling, the vCPU "owning" the VMCS partially loses control of
it: It will appear non-running to remote CPUs, and hence their attempt
to pause the owning vCPU will have no effect on it (as it already
looks to be paused). At the same time the "owning" CPU will re-enable
interrupts eventually (at the latest when entering the idle loop) and
hence becomes subject to IPIs from other CPUs requesting access to the
VMCS. As a result, when __context_switch() finally gets run, the CPU
may no longer have the VMCS loaded, and hence any accesses to it would
fail. Hence we may need to re-load the VMCS in vmx_ctxt_switch_from().
For consistency use the new function also in vmx_do_resume(), to
avoid leaving an open-coded incarnation of it around.
Reported-by: Kevin Mayer <Kevin.Mayer@gdata.de> Reported-by: Anshul Makkar <anshul.makkar@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
master commit: 2f4d2198a9b3ba94c959330b5c94fe95917c364c
master date: 2017-02-17 15:49:56 +0100
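A hedged sketch of the reload step described above; vmx_vmcs_reload() reflects the intent of the fix, while the per-CPU "which VMCS is loaded" tracking shown here is illustrative:

    static void vmx_ctxt_switch_from_sketch(struct vcpu *v)
    {
        /*
         * A remote CPU's IPI may have taken our VMCS away while interrupts
         * were enabled, so make sure it is loaded again before saving
         * state from it.
         */
        if ( v->arch.hvm_vmx.vmcs_pa != this_cpu(current_vmcs) )  /* illustrative */
            vmx_vmcs_reload(v);

        /* ... existing state saving from the (now loaded) VMCS ... */
    }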
George Dunlap [Mon, 20 Feb 2017 15:02:12 +0000 (16:02 +0100)]
xen/p2m: Fix p2m_flush_table for non-nested cases
Commit 71bb7304e7a7a35ea6df4b0cedebc35028e4c159 added flushing of
nested p2m tables whenever the host p2m table changed. Unfortunately
in the process, it added a filter to p2m_flush_table() function so
that the p2m would only be flushed if it was being used as a nested
p2m. This meant that the p2m was not being flushed at all for altp2m
callers.
Only check np2m_base if the p2m's p2m_class indicates a nested p2m.
NB that this is not a security issue: The only time this codepath is
called is in cases where either nestedp2m or altp2m is enabled, and
neither of them is within security support.
Reported-by: Matt Leinhos <matt@starlab.io> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Tested-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 6192e6378e094094906950120470a621d5b2977c
master date: 2017-02-15 17:15:56 +0000
David Woodhouse [Mon, 20 Feb 2017 15:01:47 +0000 (16:01 +0100)]
x86/ept: allow write-combining on !mfn_valid() MMIO mappings again
For some MMIO regions, such as those high above RAM, mfn_valid() will
return false.
Since the fix for XSA-154 in commit c61a6f74f80e ("x86: enforce
consistent cachability of MMIO mappings"), guests have no longer been
able to use PAT to obtain write-combining on such regions because the
'ignore PAT' bit is set in EPT.
We probably want to err on the side of caution and preserve that
behaviour for addresses in mmio_ro_ranges, but not for normal MMIO
mappings. That necessitates a slight refactoring to check mfn_valid()
later, and let the MMIO case get through to the right code path.
Since we're not bailing out for !mfn_valid() immediately, the range
checks need to be adjusted to cope: simply by masking in the low bits
to account for 'order' instead of adding, to avoid overflow when the mfn
is INVALID_MFN (which happens on unmap, since we carefully call this
function to fill in the EMT even though the PTE won't be valid).
The range checks are also slightly refactored to put only one of them in
the fast path in the common case. If it doesn't overlap, then it
*definitely* isn't contained, so we don't need both checks. And if it
overlaps and is only one page, then it definitely *is* contained.
Finally, add a comment clarifying how that 'return -1' works: it isn't
returning an error and causing the mapping to fail; it relies on
resolve_misconfig() being able to split the mapping later. So it's
*only* sane to do it where order>0 and the 'problem' will be solved by
splitting the large page. Not for blindly returning 'error', which I was
tempted to do in my first attempt.
Signed-off-by: David Woodhouse <dwmw@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 30921dc2df3665ca1b2593595aa6725ff013d386
master date: 2017-02-07 14:30:01 +0100
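A hedged sketch of the reworked range checks: the low bits corresponding to 'order' are masked in rather than added (so INVALID_MFN cannot overflow), and the containment check only runs once the cheaper overlap check has fired. The rangeset helpers exist under these names in Xen, but the surrounding control flow of the EMT computation is only approximated:

    unsigned long first = mfn_x(mfn);
    unsigned long last  = first | ((1UL << order) - 1);  /* mask, don't add */

    if ( rangeset_overlaps_range(mmio_ro_ranges, first, last) )
    {
        if ( !order || rangeset_contains_range(mmio_ro_ranges, first, last) )
            *ipat = 1;   /* fully read-only MMIO: keep ignoring guest PAT */
        else
            return -1;   /* mixed superpage: let resolve_misconfig() split it */
    }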
Dario Faggioli [Mon, 20 Feb 2017 15:01:20 +0000 (16:01 +0100)]
xen: credit2: never consider CPUs outside of our cpupool.
In fact, relying on the mask of what pCPUs belong to
which Credit2 runqueue is not enough. If we only do that,
when Credit2 is the boot scheduler, we may ASSERT() or
panic when moving a pCPU from Pool-0 to another cpupool.
This is because pCPUs outside of any pool are considered
part of cpupool0. This puts us at risk of crash when those
same pCPUs are added to another pool and something
different than the idle domain is found to be running
on them.
Note that, even if we prevent the above from happening (which
is the purpose of this patch), this is still pretty bad.
In fact, when we remove a pCPU from Pool-0:
- in Credit1, as we do *not* update prv->ncpus and
prv->credit, which means we're considering the wrong
total credits when doing accounting;
- in Credit2, the pCPU remains part of one runqueue,
and is hence at least considered during load balancing,
even if no vCPU should really run there.
In Credit1, this "only" causes skewed accounting and
no crashes because there is a lot of `cpumask_and`ing
going on with the cpumask of the domains' cpupool
(which, BTW, comes at a price).
A quick and not too involved (and easily backportable)
solution for Credit2 is to do exactly the same.
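A hedged sketch of "doing exactly the same" in Credit2: intersect the runqueue's pCPUs with the domain's cpupool before picking a target. cpupool_domain_cpumask() is an existing helper; the remaining names are illustrative:

    /* Sketch: never consider pCPUs outside of the vCPU's own cpupool. */
    static unsigned int pick_cpu_in_pool_sketch(const struct csched2_vcpu *svc,
                                                const struct csched2_runqueue_data *rqd)
    {
        cpumask_t valid;

        cpumask_and(&valid, &rqd->active,
                    cpupool_domain_cpumask(svc->vcpu->domain));
        return cpumask_any(&valid);
    }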
Andrew Cooper [Mon, 20 Feb 2017 15:00:20 +0000 (16:00 +0100)]
x86/VT-x: Dump VMCS on VMLAUNCH/VMRESUME failure
If a VMLAUNCH/VMRESUME fails due to invalid control or host state, dump the
VMCS before crashing the domain.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d0fd9ae54491328b10dee4003656c14b3bf3d3e9
master date: 2016-07-04 10:51:48 +0100
There is a possible scenario where (d)->need_iommu remains unset
during guest domain execution, for example when no devices
were assigned to it. Taking into account that the teardown callback
is not called when (d)->need_iommu is unset, we might have unreleased
resources after destroying the domain.
So, always call teardown callback to roll back actions
that were performed in init callback.
This is XSA-207.
Signed-off-by: Oleksandr Tyshchenko <olekstysh@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Jan Beulich <jbeulich@suse.com> Tested-by: Julien Grall <julien.grall@arm.com>
George Dunlap [Thu, 9 Feb 2017 09:32:29 +0000 (10:32 +0100)]
x86/emulate: don't assume that addr_size == 32 implies protected mode
Callers of x86_emulate() generally define addr_size based on the code
segment. In vm86 mode, the code segment is set by the hardware to be
16-bits; but it is entirely possible to enable protected mode, set the
CS to 32-bits, and then disable protected mode. (This is commonly
called "unreal mode".)
But the instruction decoder only checks for protected mode when
addr_size == 16. So in unreal mode, hardware will throw a #UD for VEX
prefixes, but our instruction decoder will decode them, triggering an
ASSERT() further on in _get_fpu(). (With debug=n the emulator will
incorrectly emulate the instruction rather than throwing a #UD, but
this is only a bug, not a crash, so it's not a security issue.)
Teach the instruction decoder to check that we're in protected mode,
even if addr_size is 32.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Split real mode and VM86 mode handling, as VM86 mode is strictly 16-bit
at all times. Re-base.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 05118b1596ffe4559549edbb28bd0124a7316123
master date: 2017-01-25 15:09:55 +0100
Dario Faggioli [Thu, 9 Feb 2017 09:32:03 +0000 (10:32 +0100)]
xen: credit2: fix shutdown/suspend when playing with cpupools.
In fact, during shutdown/suspend, we temporarily move all
the vCPUs to the BSP (i.e., pCPU 0, as of now). For Credit2
domains, we call csched2_vcpu_migrate(), which expects to find the
target pCPU in the domain's pool.
Therefore, if Credit2 is the default scheduler and we have
removed pCPU 0 from cpupool0, shutdown/suspend fails like
this:
****************************************
Panic on CPU 8:
Assertion 'svc->vcpu->processor < nr_cpu_ids' failed at sched_credit2.c:1729
****************************************
On the other hand, if Credit2 is the scheduler of another
pool, when trying (still during shutdown/suspend) to move
the vCPUs of the Credit2 domains to pCPU 0, it figures
out that pCPU 0 is not a Credit2 pCPU, and fails as well.
The solution is to recognise the specific situation, inside
csched2_vcpu_migrate() and, considering it is something temporary,
which only happens during shutdown/suspend, quickly deal with it.
Then, in the resume path, in restore_vcpu_affinity(), things
are set back to normal, and a new v->processor is chosen, for
each vCPU, from the proper set of pCPUs (i.e., the ones of
the proper cpupool).
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen: credit2: non Credit2 pCPUs are ok during shutdown/suspend.
Commit 7478ebe1602e6 ("xen: credit2: fix shutdown/suspend
when playing with cpupools"), while doing the right thing
for actual code, forgot to update the ASSERT()s accordingly,
in csched2_vcpu_migrate().
In fact, as stated there already, during shutdown/suspend,
we must allow a Credit2 vCPU to temporarily migrate to a
non Credit2 BSP, without any ASSERT() triggering.
Move them down, after the check for whether or not we are
shutting down, to where the assumption that the pCPU must be
a valid Credit2 one does hold.
Dario Faggioli [Thu, 9 Feb 2017 09:31:26 +0000 (10:31 +0100)]
xen: credit2: use the correct scratch cpumask.
In fact, there is one scratch mask per CPU. When
you use the one of a CPU, it must be true that:
- the CPU belongs to your cpupool and scheduler,
- you own the runqueue lock (the one you take via
{v,p}cpu_schedule_lock()) for that CPU.
This was not the case within the following functions:
get_fallback_cpu(), csched2_cpu_pick(): as we can't be
sure we either are on, or hold the lock for, the CPU
that is in the vCPU's 'v->processor'.
migrate(): it's ok, when called from balance_load(),
because that comes from csched2_schedule(), which takes
the runqueue lock of the CPU where it executes. But it is
not ok when we come from csched2_vcpu_migrate(), which
can be called from other places.
The fix is to explicitly use the scratch space of the
CPUs for which we know we hold the runqueue lock.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reported-by: Jan Beulich <JBeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 548db8742872399936a2090cbcdfd5e1b34fcbcc
master date: 2017-01-24 17:02:07 +0000
Joao Martins [Thu, 9 Feb 2017 09:30:49 +0000 (10:30 +0100)]
x86/hvm: do not set msr_tsc_adjust on hvm_set_guest_tsc_fixed
Commit 6e03363 ("x86: Implement TSC adjust feature for HVM guest")
implemented the TSC_ADJUST MSR for HVM guests. However, while booting
an HVM guest, the boot CPU would have a value set to delta_tsc -
guest tsc, while secondary CPUs would have 0. For example, one can
observe:
$ xen-hvmctx 17 | grep tsc_adjust
TSC_ADJUST: tsc_adjust ff9377dfef47fe66
TSC_ADJUST: tsc_adjust 0
TSC_ADJUST: tsc_adjust 0
TSC_ADJUST: tsc_adjust 0
Upcoming Linux 4.10 now validates whether this MSR is correct and
adjusts it accordingly under the following conditions: values < 0
(our case for CPU 0), != 0, or > 0x7FFFFFFF. Under these conditions it
will force-set the MSR to 0 on the CPUs whose values don't match. If
this MSR is not correct, warning messages are printed, and on HVM guests
supporting TSC_ADJUST (requiring at least Intel Haswell) the guest won't
boot.
Our current vCPU 0 value is incorrect, and the Intel SDM states in
section "Time-Stamp Counter Adjustment" that "On RESET, the value
of the IA32_TSC_ADJUST MSR is 0"; hence we should set it to 0 and be
consistent across all vCPUs. Perhaps this MSR should only be
changed by the guest, which already happens through the
hvm_set_guest_tsc_adjust(..) routines (see below). After this patch
guests running Linux 4.10 will see a valid IA32_TSC_ADJUST msr of value
0 for all CPUs and are able to boot.
On the same section of the spec ("Time-Stamp Counter Adjustment") it is
also stated:
"If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR
adds (or subtracts) value X from the TSC, the logical processor also
adds (or subtracts) value X from the IA32_TSC_ADJUST MSR.
Unlike the TSC, the value of the IA32_TSC_ADJUST MSR changes only in
response to WRMSR (either to the MSR itself, or to the
IA32_TIME_STAMP_COUNTER MSR). Its value does not otherwise change as
time elapses. Software seeking to adjust the TSC can do so by using
WRMSR to write the same value to the IA32_TSC_ADJUST MSR on each logical
processor."
This suggests these MSR values should only be changed by the guest, i.e.
through the intercepted MSR writes. We keep the IA32_TSC MSR logic such
that writes accommodate adjustments to TSC_ADJUST, hence no functional
change in msr_tsc_adjust for the IA32_TSC MSR. Though, we do that in a
separate routine, namely hvm_set_guest_tsc_msr, instead of through
hvm_set_guest_tsc(...).
Jan Beulich [Thu, 9 Feb 2017 09:30:16 +0000 (10:30 +0100)]
x86: segment attribute handling adjustments
Null selector loads into SS (possible in 64-bit mode only, and only in
rings other than ring 3) must not alter SS.DPL. (This was found to be
an issue on KVM, and fixed in Linux commit 33ab91103b.)
Further arch_set_info_hvm_guest() didn't make sure that the ASSERT()s
in hvm_set_segment_register() wouldn't trigger: Add further checks, but
tolerate (adjust) clear accessed (CS, SS, DS, ES) and busy (TR) bits.
Finally the setting of the accessed bits for user segments was lost by
commit dd5c85e312 ("x86/hvm: Reposition the modification of raw segment
data from the VMCB/VMCS"), yet VMX requires them to be set for usable
segments. Add respective ASSERT()s (the only path not properly setting
them was arch_set_info_hvm_guest()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 366ff5f1b3252f9069d5aedb2ffc2567bb0a37c9
master date: 2017-01-20 14:39:12 +0100
Jan Beulich [Thu, 9 Feb 2017 09:29:44 +0000 (10:29 +0100)]
x86emul: LOCK check adjustments
BT, being encoded as DstBitBase just like BT{C,R,S}, nevertheless does
not write its (register or memory) operand and hence also doesn't allow
a LOCK prefix to be used.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f2d4f4ba80de8a03a1b0f300d271715a88a8433d
master date: 2017-01-20 14:37:33 +0100
Jan Beulich [Thu, 9 Feb 2017 09:29:14 +0000 (10:29 +0100)]
x86emul: VEX.B is ignored in compatibility mode
While VEX.R and VEX.X are guaranteed to be 1 in compatibility mode
(and hence a respective mode_64bit() check can be dropped), VEX.B can
be encoded as zero, but would be ignored by the processor. Since we
emulate instructions in 64-bit mode (except possibly in the test
harness), we need to force the bit to 1 in order to not act on the
wrong {X,Y,Z}MM register (which has no bad effect on 32-bit test
harness builds, as there the bit would again be ignored by the
hardware, and would by default be expected to be 1 anyway).
We must not, however, fiddle with the high bit of VEX.VVVV in the
decode phase, as that would undermine the checking of instructions
requiring the field to be all ones independent of mode. This is
being enforced in copy_REX_VEX() instead.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86emul: correct VEX/XOP/EVEX operand size handling for 16-bit code
Operand size defaults to 32 bits in that case, but would not have been
set that way in the absence of an operand size override.
Reported-by: Wei Liu <wei.liu2@citrix.com> (by AFL fuzzing) Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 89c76ee7f60777b81c8fd0475a6af7c84e72a791
master date: 2017-01-17 10:32:25 +0100
master commit: beb82042447c5d6e7073d816d6afc25c5a423cde
master date: 2017-01-25 15:08:59 +0100
Andrew Cooper [Thu, 9 Feb 2017 09:28:28 +0000 (10:28 +0100)]
x86/xstate: Fix array overrun on hardware with LWP
c/s da62246e4c "x86/xsaves: enable xsaves/xrstors/xsavec in xen" introduced
setup_xstate_features() to allocate and fill xstate_offsets[] and
xstate_sizes[].
However, fls() casts xfeature_mask to 32bits which truncates LWP out of the
calculation. As a result, the arrays are allocated too short, and the cpuid
infrastructure reads off the end of them when calculating xstate_size for the
guest.
On one test system, this results in 0x3fec83c0 being returned as the maximum
size of an xsave area, which surprisingly appears not to bother Windows or
Linux too much. I suspect they both use current size based on xcr0, which Xen
forwards from real hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fe0d67576e335c02becf1cea8e67005509fa90b6
master date: 2017-01-16 17:37:26 +0000
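An illustrative fragment of the truncation: fls() takes a 32-bit value, so LWP (XCR0 bit 62) falls out of the computed array length, whereas a 64-bit find-last-set keeps it. Whether the actual fix uses flsl() or an equivalent is not shown here:

    uint64_t xfeature_mask = (1ULL << 62) | 0x7;   /* LWP plus x87/SSE/AVX */

    unsigned int truncated = fls(xfeature_mask);   /* implicitly truncated: 3 */
    unsigned int full      = flsl(xfeature_mask);  /* 64-bit variant: 63 */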
Wei Liu [Thu, 29 Dec 2016 16:36:31 +0000 (16:36 +0000)]
libxl: fix libxl_set_memory_target
Commit 26dbc93a ("libxl: Remove pointless hypercall from
libxl_set_memory_target") removed the call to xc_domain_getinfolist, but
it failed to notice that "info" was actually needed later.
Put that back. While at it, make the code conform to coding style
requirements.
Roger Pau Monne [Mon, 19 Dec 2016 15:02:03 +0000 (15:02 +0000)]
init/FreeBSD: fix xencommons so it can only be launched by Dom0
At the moment the execution of xencommons is gated on the presence of the
privcmd device, but that's not correct, since privcmd is available to all Xen
domains (privileged or unprivileged). Instead of using privcmd use the
xenstored device, which will only be available to the domain that's in charge
of running xenstored, and thus xencommons.
Roger Pau Monne [Mon, 19 Dec 2016 15:02:01 +0000 (15:02 +0000)]
init/FreeBSD: set correct PATH for xl devd
FreeBSD init scripts don't have /usr/local/{bin,sbin} in their PATH, which
prevents `xl devd` from working properly since hotplug scripts require the set
of xenstore cli tools to be in PATH.
While there also fix the usage of --pidfile, which according to the xl help
doesn't use "=", and add braces around XLDEVD_PIDFILE.
Julien Grall [Wed, 18 Jan 2017 18:54:08 +0000 (18:54 +0000)]
xen/arm: gic-v3: Make sure read from ICC_IAR1_EL1 is visible on the redistributor
"The effects of reading ICC_IAR0_EL1 and ICC_IAR1_EL1 on the state of a
returned INTID are not guaranteed to be visible until after the execution
of a DSB".
Because the GIC is an external component, a dsb sy is required.
Without it the sysreg read may not have been made visible on the
redistributor.
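A minimal sketch of the ordering requirement; READ_SYSREG32 and dsb() are Xen's ARM primitives, and the function body only approximates gicv3_read_irq():

    /* Sketch: acknowledge the interrupt, then order the read's side effects. */
    static unsigned int gicv3_read_irq_sketch(void)
    {
        unsigned int irq = READ_SYSREG32(ICC_IAR1_EL1);

        dsb(sy);      /* the GIC is external: make the IAR read visible to it */
        return irq;
    }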
Andrew Cooper [Wed, 18 Jan 2017 09:11:50 +0000 (10:11 +0100)]
x86/emul: Correct the return value handling of VMFUNC
The bracketing of x86_emulate() calling the ops->vmfunc() hook is wrong with
respect to the assignment to rc, which can trip the new assertions in
x86_emulate_wrapper().
The hvmemul_vmfunc() hook should only raise #UD if X86EMUL_EXCEPTION is
returned. This is only a latent bug at the moment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3ab1876504d409689824e161a8b04e57e1e5dd46
master date: 2016-12-22 13:32:46 +0000
Jan Beulich [Wed, 18 Jan 2017 09:10:43 +0000 (10:10 +0100)]
VT-d: correct dma_msi_set_affinity()
Commit 83cd2038fe ("VT-d: use msi_compose_msg()") together with 15aa6c6748 ("amd iommu: use base platform MSI implementation"),
introducing the use of a per-CPU scratch CPU mask, went too far:
dma_msi_set_affinity() may, at least in theory, be called in
interrupt context, and hence the use of that scratch variable is not
correct.
Since the function overwrites the destination information anyway,
allow msi_compose_msg() to be called with a NULL CPU mask, avoiding
the use of that scratch variable.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7f885a1f49a75c770360b030666a5c1545156e5c
master date: 2016-12-16 14:33:43 +0100
Luwei Kang [Wed, 18 Jan 2017 09:09:32 +0000 (10:09 +0100)]
x86/VPMU: clear the overflow status of which counter happened to overflow
Just set the corresponding bits of counters which happened to overflow,
rather than setting all the available bits of IA32_PERF_GLOBAL_OVF_CTRL
when a PMU interrupt happens.
Signed-off-by: Luwei Kang <luwei.kang@intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7a0c70482580234868fcc53b8d72e31966dc7c52
master date: 2016-12-13 14:21:26 +0100
Jan Beulich [Wed, 18 Jan 2017 09:02:39 +0000 (10:02 +0100)]
x86/EFI: meet further spec requirements for runtime calls
So far we didn't guarantee 16-byte alignment of the stack: While (so
far) we don't tell the compiler to use smaller alignment, we also don't
guarantee 16-byte alignment when establishing stack pointers for new
vCPU-s. Runtime service functions using SSE instructions may end with
#GP(0) without that.
Note that making use of -mpreferred-stack-boundary=3, as mentioned in
the comment, wouldn't help to reduce the needed alignment: The compiler
would then be free to align the stack of the function with the aligned
object, but would be permitted to place an odd number of 8-byte objects
there, resulting in the callee to still run on an unaligned stack.
(The only working alternative to the approach chosen here would be to
use -mincoming-stack-boundary=3, but that would affect all functions in
runtime.c, not just the ones actually making runtime services calls.
And it would still require the manual alignment logic here to be used
with gcc 5.2 and earlier - not permitting that command line option -,
just that then the alignment amount would become conditional.)
Hence enforce the needed alignment by making efi_rs_enter() return a
suitably aligned structure, which the caller then necessarily has to
store in a suitably aligned local variable, the address of which then
gets passed to efi_rs_leave(). Also (to limit exposure) move the
function declarations to where they belong: They're local to runtime.c,
and shared only with compat.c (by the latter including the former).
Furthermore we should avoid #MF to be raised on the FLDCW we do.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f6b7fedc896250028cb81dafe9a3f6773aaf1da2
master date: 2016-11-22 13:52:53 +0100
Andrew Cooper [Wed, 18 Jan 2017 09:01:10 +0000 (10:01 +0100)]
x86/svm: Fix svm_nextrip_insn_length() when crossing the virtual boundary to 0
vmcb->nextrip can legitimately be less than vmcb->rip when execution wraps
back around to 0. Instead, complain if the reported length is greater than 15
and use x86_decode_insn() as a fallback.
While making changes here, fix two whitespace issues with the case labels.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
x86/hvm: Fix non-debug build following c/s 0745f665a5
Andrew Cooper [Wed, 18 Jan 2017 09:00:31 +0000 (10:00 +0100)]
x86/traps: Don't call hvm_hypervisor_cpuid_leaf() for PV guests
Luckily, hvm_hypervisor_cpuid_leaf() and vmx_hypervisor_cpuid_leaf() are safe
to execute in the context of a PV guest, but HVM-specific feature flags
shouldn't be visible to PV guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0f43883193da76fc928e836e319c3172f394e0f3
master date: 2016-11-16 10:33:18 +0000
Andrew Cooper [Wed, 18 Jan 2017 09:00:02 +0000 (10:00 +0100)]
x86/vmx: Correct the long mode check in vmx_cpuid_intercept()
%cs.L may be set in a legacy mode segment, or clear in a compatibility mode
segment; it is not the correct way to check for long mode being active.
Both of these situations result in incorrect visibility of the SYSCALL feature
in CPUID, and by extension, incorrect behaviour in hvm_efer_valid().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: fcb618c025f9251d7e22138f6528595037252c21
master date: 2016-11-16 10:32:54 +0000
Andrew Cooper [Wed, 18 Jan 2017 08:59:26 +0000 (09:59 +0100)]
x86/svm: Don't clobber eax and edx if an RDMSR intercept fails
The original code has a bug; eax and edx get unconditionally updated even when
hvm_msr_read_intercept() doesn't return X86EMUL_OKAY.
It is only by blind luck (vmce_rdmsr() eagerly initialising its msr_content
pointer) that this isn't an information leak into guests.
While fixing this bug, reduce the scope of msr_content and initialise it to 0.
This makes it obvious that a stack leak won't occur, even if there were to be
a buggy codepath in hvm_msr_read_intercept().
Also make some non-functional improvements. Make the insn_len calculation
common, and reduce the quantity of explicit casting by making better use of
the existing register names.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: a0b4e3c0681a11b765fe218fba0ba4ebb9fa56c5
master date: 2016-11-10 15:34:42 +0000
Andrew Cooper [Wed, 18 Jan 2017 08:57:30 +0000 (09:57 +0100)]
x86/emul: Correct the decoding of SReg3 operands
REX.R is ignored when considering segment register operands, and needs masking
out first.
While fixing this, reorder the user segments in x86_segment to match SReg3
encoding. This avoids needing a translation table between hardware ordering
and Xen's ordering.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
VMX: fix realmode emulation SReg handling
Commit 0888d36bb2 ("x86/emul: Correct the decoding of SReg3 operands")
overlooked three places where x86_seg_cs was assumed to be zero.
Jan Beulich [Wed, 21 Dec 2016 16:42:49 +0000 (17:42 +0100)]
x86: force EFLAGS.IF on when exiting to PV guests
Guest kernels modifying instructions in the process of being emulated
for another of their vCPU-s may cause EFLAGS.IF to be cleared upon
next exiting to guest context, by converting the being emulated
instruction to CLI (at the right point in time). Prevent any such bad
effects by always forcing EFLAGS.IF on. And to cover hypothetical other
similar issues, also force EFLAGS.{IOPL,NT,VM} to zero.
This is CVE-2016-10024 / XSA-202.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0e47f92b072548800223f9a21ea051a017173915
master date: 2016-12-21 16:46:13 +0100
Juergen Gross [Tue, 13 Dec 2016 13:25:52 +0000 (14:25 +0100)]
pvgrub: fix crash when booting kernel with p2m list outside kernel mapping
When trying to boot a kernel with the p2m list not mapped by the
initial kernel mapping, it can happen that pvgrub fails, as it
keeps some page tables mapped.
Unmapping the additional page tables created for the special p2m
mapping avoids this failure.
Reported-by: Sven Koehler <sven.koehler@gmail.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 9714f6b87e19b32d3a6663a20df6610265c4bfe5
master date: 2016-09-28 11:29:28 +0100
Jan Beulich [Tue, 13 Dec 2016 13:25:03 +0000 (14:25 +0100)]
x86emul: CMPXCHG8B ignores operand size prefix
Otherwise besides mis-handling the instruction, the comparison failure
case would result in uninitialized stack data being handed back to the
guest in rDX:rAX (32 bits leaked for 32-bit guests, 96 bits for 64-bit
ones).
Wei Chen [Tue, 29 Nov 2016 15:10:09 +0000 (16:10 +0100)]
arm32: handle async aborts delivered while at HYP
If a guest generates an asynchronous abort and then traps into HYP
(by HVC or IRQ) before the abort has been delivered, the hypervisor
cannot catch it, because the PSTATE.A bit is masked all the time in
the hypervisor. This asynchronous abort may then slip into the next
running guest with the PSTATE.A bit unmasked.
In order to avoid this, it is necessary to take the abort at HYP, by
clearing the PSTATE.A bit. In this patch, we unmask the PSTATE.A bit
to open a window to catch guest-generated asynchronous aborts in all
Guest -> HYP switch paths. If such an asynchronous abort is caught in
the checking window, the HYP data abort exception will be triggered
and the guest that is the source of the abort will be crashed.
Wei Chen [Tue, 29 Nov 2016 15:09:20 +0000 (16:09 +0100)]
arm64: handle async aborts delivered while at EL2
If EL1 generates an asynchronous abort and then traps into EL2
(by HVC or IRQ) before the abort has been delivered, the hypervisor
cannot catch it, because the PSTATE.A bit is masked all the time in
the hypervisor. This asynchronous abort may then slip into the next
running guest with the PSTATE.A bit unmasked.
In order to avoid this, it is necessary to take the abort at EL2, by
clearing the PSTATE.A bit. In this patch, we unmask the PSTATE.A bit
to open a window to catch guest-generated asynchronous aborts in all
EL1 -> EL2 switch paths. If such an asynchronous abort is caught in
the checking window, the hyp_error exception will be triggered and
the guest that is the source of the abort will be crashed.
In the current code, when the hypervisor receives an asynchronous abort
from a guest, the hypervisor panics and the host goes down. We have to
prevent such a security issue, so in this patch we crash the guest
when the hypervisor receives an asynchronous abort from it.
Ian Jackson [Tue, 22 Nov 2016 13:16:13 +0000 (14:16 +0100)]
pygrub: Properly quote results, when returning them to the caller:
* When the caller wants sexpr output, use `repr()'
This is what Xend expects.
The returned S-expressions are now escaped and quoted by Python,
generally using '...'. Previously kernel and ramdisk were unquoted
and args was quoted with "..." but without proper escaping. This
change may break toolstacks which do not properly dequote the
returned S-expressions.
* When the caller wants "simple" output, crash if the delimiter is
contained in the returned value.
With --output-format=simple it does not seem like this could ever
happen, because the bootloader config parsers all take line-based
input from the various bootloader config files.
With --output-format=simple0, this can happen if the bootloader
config file contains nul bytes.
This is CVE-2016-9379 and CVE-2016-9380 / XSA-198.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 27e14d346ed6ff1c3a3cfc479507e62d133e92a9
master date: 2016-11-22 13:52:09 +0100
Andrew Cooper [Tue, 22 Nov 2016 13:15:48 +0000 (14:15 +0100)]
x86/svm: fix injection of software interrupts
The non-NextRip logic in c/s 36ebf14eb "x86/emulate: support for emulating
software event injection" was based on an older version of the AMD software
manual. The manual was later corrected, following findings from that series.
I took the original wording of "not supported without NextRIP" to mean that
X86_EVENTTYPE_SW_INTERRUPT was not eligible for use. It turns out that this
is not the case, and the new wording is clearer on the matter.
Despite testing the original patch series on non-NRip hardware, the
swint-emulation XTF test case focuses on the debug vectors; it never ended up
executing an `int $n` instruction for a vector which wasn't also an exception.
During a vmentry, the use of X86_EVENTTYPE_HW_EXCEPTION comes with a vector
check to ensure that it is only used with exception vectors. Xen's use of
X86_EVENTTYPE_HW_EXCEPTION for `int $n` injection has always been buggy on AMD
hardware.
Fix this by always using X86_EVENTTYPE_SW_INTERRUPT.
Print and decode the eventinj information in svm_vmcb_dump(), as it has
several invalid combinations which cause vmentry failures.
This is CVE-2016-9378 / part of XSA-196.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 920edccd41db6cb0145545afa1850edf5e7d098e
master date: 2016-11-22 13:51:16 +0100
Andrew Cooper [Tue, 22 Nov 2016 13:15:17 +0000 (14:15 +0100)]
x86/emul: correct the IDT entry calculation in inject_swint()
The logic, as introduced in c/s 36ebf14ebe "x86/emulate: support for emulating
software event injection" is buggy. The size of an IDT entry depends on long
mode being active, not the width of the code segment currently in use.
In particular, this means that a compatibility code segment which hits
emulation for software event injection will end up using an incorrect offset
in the IDT for DPL/Presence checking. In practice, this only occurs on old
AMD hardware lacking NRip support; all newer AMD hardware, and all Intel
hardware bypass this path in the emulator.
While here, fix a minor issue with reading the IDT entry. The return value
from ops->read() wasn't checked, but in reality the only failure case is if a
pagefault occurs. This is not a realistic problem as the kernel will almost
certainly crash with a double fault if this setup actually occurred.
This is CVE-2016-9377 / part of XSA-196.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 255e8fe95f22ded5186fd75244ffcfb9d5dbc855
master date: 2016-11-22 13:50:49 +0100
Jan Beulich [Tue, 22 Nov 2016 13:14:27 +0000 (14:14 +0100)]
x86emul: fix huge bit offset handling
We must never chop off the high 32 bits.
This is CVE-2016-9383 / XSA-195.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1c6c2d60d205f71ede0fbbd9047e459112f576db
master date: 2016-11-22 13:49:06 +0100
Roger Pau Monné [Tue, 22 Nov 2016 13:13:55 +0000 (14:13 +0100)]
libelf: fix stack memory leak when loading 32 bit symbol tables
The 32 bit Elf structs are smaller than the 64 bit ones, which means that
when loading them there's some padding left uninitialized at the end of each
struct (because the size indicated in e_ehsize and e_shentsize is
smaller than the size of elf_ehdr and elf_shdr).
Fix this by introducing a new helper that is used to set
[caller_]xdest_{base/size} and that takes care of performing the appropriate
memset of the region. This newly introduced helper is then used to set and
unset xdest_{base/size} in elf_load_bsdsyms. Now that the full struct
is zeroed, there's no need to specifically zero the undefined section.
This is CVE-2016-9384 / XSA-194.
Suggested-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Also remove the open coded (and redundant with the earlier
elf_memset_unchecked()) use of caller_xdest_* from elf_init().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
master commit: fb08f7d009a64b96efa4462c9d19ed6881936859
master date: 2016-11-22 13:48:30 +0100
Jan Beulich [Tue, 22 Nov 2016 13:13:25 +0000 (14:13 +0100)]
x86/PV: writes of %fs and %gs base MSRs require canonical addresses
Commit c42494acb2 ("x86: fix FS/GS base handling when using the
fsgsbase feature") replaced the use of wrmsr_safe() on these paths
without recognizing that wr{f,g}sbase() use just wrmsrl() and that the
WR{F,G}SBASE instructions also raise #GP for non-canonical input.
Similarly arch_set_info_guest() needs to prevent non-canonical
addresses from getting stored into state later to be loaded by context
switch code. For consistency also check stack pointers and LDT base.
DR0..3, otoh, already get properly checked in set_debugreg() (albeit
we discard the error there).
The SHADOW_GS_BASE check isn't strictly necessary, but I think we
better avoid trying the WRMSR if we know it's going to fail.
This is CVE-2016-9385 / XSA-193.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f3fa3abf3e61fb1f25ce721e14ac324dda67311f
master date: 2016-11-22 13:46:28 +0100
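A hedged sketch of the kind of check added on the PV MSR-write path; is_canonical_address() and wrfsbase() are existing helpers, while the surrounding case structure and field names are illustrative:

    case MSR_FS_BASE:
        /* Reject non-canonical input: WRFSBASE/WRMSR would #GP in Xen context. */
        if ( is_pv_32bit_domain(currd) || !is_canonical_address(msr_content) )
            goto fail;
        wrfsbase(msr_content);
        v->arch.pv_vcpu.fs_base = msr_content;   /* field name illustrative */
        break;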
Jan Beulich [Tue, 22 Nov 2016 13:12:53 +0000 (14:12 +0100)]
x86/HVM: don't load LDTR with VM86 mode attrs during task switch
Just like TR, LDTR is purely a protected mode facility and hence needs
to be loaded accordingly. Also move its loading to where it
architecturally belongs.
This is CVE-2016-9382 / XSA-192.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93aa42b85ae0084ba7b749d0e990c94fbf0c17e3
master date: 2016-11-22 13:45:44 +0100
Andrew Cooper [Tue, 22 Nov 2016 13:12:18 +0000 (14:12 +0100)]
x86/hvm: Fix the handling of non-present segments
In 32bit, the data segments may be NULL to indicate that the segment is
ineligible for use. In both 32bit and 64bit, the LDT selector may be NULL to
indicate that the entire LDT is ineligible for use. However, nothing in Xen
actually checks for this condition when performing other segmentation
checks. (Note however that limit and writeability checks are correctly
performed).
Neither Intel nor AMD specify the exact behaviour of loading a NULL segment.
Experimentally, AMD zeroes all attributes but leaves the base and limit
unmodified. Intel zeroes the base, sets the limit to 0xfffffff and resets the
attributes to just .G and .D/B.
The use of the segment information in the VMCB/VMCS is equivalent to a native
pipeline interacting with the segment cache. The present bit can therefore
have a subtly different meaning, and it is now cooked to uniformly indicate
whether the segment is usable or not.
GDTR and IDTR don't have access rights like the other segments, but for
consistency, they are treated as being present so no special casing is needed
elsewhere in the segmentation logic.
AMD hardware does not consider the present bit for %cs and %tr, and will
function as if they were present. They are therefore unconditionally set to
present when reading information from the VMCB, to maintain the new meaning of
usability.
Intel hardware has a separate unusable bit in the VMCS segment attributes.
This bit is inverted and stored in the present field, so the hvm code can work
with architecturally-common state.
This is CVE-2016-9386 / XSA-191.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 04beafa8e6c66f5cd814c00e2d2b51cfbc41cb8a
master date: 2016-11-22 13:44:50 +0100
Jan Beulich [Tue, 25 Oct 2016 15:13:09 +0000 (17:13 +0200)]
x86: MISALIGNSSE feature depends on SSE
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: b4ca886ab18ddcc729c1bc3d730ab078508d7ce3
master date: 2016-10-24 17:34:17 +0200
Jan Beulich [Tue, 25 Oct 2016 15:12:05 +0000 (17:12 +0200)]
x86/Viridian: don't depend on undefined register state
The high halves of all GPRs are undefined in 32-bit and compat modes,
and the dependency is being obfuscated by our structure field names not
matching architectural register names (it was actually while putting
together a patch to correct this when I noticed the issue here).
For consistency also use the architecturally correct names on the
output side.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: a709a3a646302e95ba42beac89264f6cdacd0c64
master date: 2016-10-14 14:09:42 +0200
Jan Beulich [Tue, 25 Oct 2016 15:11:37 +0000 (17:11 +0200)]
x86emul: fix pushing of selector registers
Both explicit PUSH and far CALL currently push unrelated data (the
segment attributes word) in the high half (attributes and limit in the
64-bit case in the high 48 bits) instead of zero. To avoid having to
apply this and further changes in multiple places, also fold the two
(respectively) far call/jmp instances into one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 373923ed9c2ed36925f574387db2be2ebe5ce45a
master date: 2016-10-14 14:09:16 +0200
Changeset fbf96e6, "xentrace: correct formula to calculate
t_info_pages", broke the trace metadata page count calculation, by
mistaking t_info_first_offset as denominated in bytes, when in fact it
is denominated in words (uint32_t).
Effectively revert that change, and put a comment there to reduce the
chance that someone will make that mistake in the future.
Reviewed-by: Igor Druzhinin <igor.druzhinin@citrix.com> Tested-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
master commit: 71b8b46111219a2f83f4f9ae06ac5409744ea86e
master date: 2016-10-11 14:23:52 +0100
Jan Beulich [Tue, 25 Oct 2016 15:10:05 +0000 (17:10 +0200)]
x86: defer not-present segment checks
Following on from commits 5602e74c60 ("x86emul: correct loading of
%ss") and bdb860d01c ("x86/HVM: correct segment register loading during
task switch") the point of the non-.present checks needs to be refined:
#NP (and its #SS companion), other than suggested by the various
instruction pages in Intel's SDM, gets checked for only after all type
and permission checks. The only checks getting done even later are the
long mode specific ones for system descriptors (which we don't support
yet) and 64-bit code segments (i.e. anything touching other than the
attribute byte).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/hvm: Correct the position of the %cs L/D checks
Contrary to the description in the software manuals, in Long Mode, attempts to
load %cs check that D is not set in combination with L before the present flag
is checked.
This can be observed because the L/D check fails with #GP before the presence
check fails with #NP.
This change partially reverts c/s 78ff18c90 "x86: defer not-present segment
checks", taking it back to how it was in the v1 submission.
Dario Faggioli [Tue, 25 Oct 2016 15:08:58 +0000 (17:08 +0200)]
xen: credit1: return the 'time remaining to the limit' as next timeslice.
If vcpu x has run for 200us, and sched_ratelimit_us is
1000us, continue running x _but_ return 1000us-200us as
the next time slice. This way, next scheduling point will
happen in 800us, i.e., exactly at the point when x crosses
the threshold, and can be descheduled (if appropriate).
Right now (without this patch), we're always returning
sched_ratelimit_us (1000us, in the example above), which
means we're (potentially) allowing x to run more than
it should have been able to.
Note that, however, in order to avoid setting timers to very
short intervals, which is part of the purpose of rate limiting,
we never use a time slice smaller than a well defined threshold.
Such a threshold (CSCHED_MIN_TIMER, defined in this patch) is, in
general, independent of rate limiting, but it looks like a good idea
to set it to the minimum possible ratelimiting value.
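A hedged sketch of the timeslice computation described above, with names approximating sched_credit.c; CSCHED_MIN_TIMER is the threshold introduced by the patch:

    /* Sketch: keep the current vCPU running, but only up to the ratelimit. */
    s_time_t runtime = now - scurr->vcpu->runstate.state_entry_time;

    if ( prv->ratelimit_us && vcpu_runnable(scurr->vcpu) &&
         runtime < MICROSECS(prv->ratelimit_us) )
    {
        snext = scurr;
        tslice = MICROSECS(prv->ratelimit_us) - runtime;  /* time left to the limit */
        if ( unlikely(tslice < CSCHED_MIN_TIMER) )
            tslice = CSCHED_MIN_TIMER;   /* never arm a too-short timer */
    }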
Jan Beulich [Tue, 4 Oct 2016 13:04:46 +0000 (14:04 +0100)]
x86emul: honor guest CR0.TS and CR0.EM
We must not emulate any instructions accessing respective registers
when either of these flags is set in the guest view of the register, or
else we may do so on data not belonging to the guest's current task.
Being architecturally required behavior, the logic gets placed in the
instruction emulator instead of hvmemul_get_fpu(). It should be noted,
though, that hvmemul_get_fpu() being the only current handler for the
get_fpu() callback, we don't have an active problem with CR4: Both
CR4.OSFXSR and CR4.OSXSAVE get handled as necessary by that function.
This is XSA-190.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
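A hedged sketch of the architectural behaviour the emulator now honours before handing out FPU/SIMD state; the macro is modelled on x86_emulate.c's generate_exception_if(), and the exact conditions differ per instruction class (e.g. WAIT also honours CR0.MP):

    /* Sketch: x87 with CR0.EM set must see #NM, MMX/SSE must see #UD;
     * CR0.TS set yields #NM for all of them. */
    if ( type == X86EMUL_FPU_fpu )
        generate_exception_if(cr0 & (X86_CR0_EM | X86_CR0_TS), EXC_NM, -1);
    else /* MMX / SSE */
    {
        generate_exception_if(cr0 & X86_CR0_EM, EXC_UD, -1);
        generate_exception_if(cr0 & X86_CR0_TS, EXC_NM, -1);
    }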
AMD F12h machines have an erratum which can cause DIV/IDIV to behave
unpredictably. The workaround is to set MSRC001_1029[31] but sometimes
there is no BIOS update containing that workaround so let's do it
ourselves unconditionally. It is simple enough.
Jan Beulich [Wed, 28 Sep 2016 14:55:20 +0000 (16:55 +0200)]
x86emul: correct loading of %ss
- Instead of #NP, #SS needs to be raised for non-present descriptors.
- Loading a null selector is fine in 64-bit mode at CPL != 3, as long
as RPL == CPL.
- Don't lose the low two selector bits on null selector loads (also
applies to %ds, %es, %fs, %gs, and LDTR).
Since we need CPL earlier now, also switch to using get_cpl() (instead
of open coding it).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5602e74c60c1ec2feef4cdd75376e4b1a1d7e681
master date: 2016-09-26 17:26:21 +0200
Jan Beulich [Wed, 28 Sep 2016 14:52:04 +0000 (16:52 +0200)]
x86/Intel: hide CPUID faulting capability from guests
We don't currently emulate it, so guests should not be misguided to
believe they can (try to) use it.
For now, simply return zero to guests for platform MSR reads, and only
accept (by discarding) writes of zero. If ever there will be bits we
can safely expose to guests, let's handle them by white listing.
(As a side note - according to SDM version 059 bit 31 is reserved on
all known families.)
Reported-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: b982a5bea4273a4b9fc007d5046bed8d1669c07f
master date: 2016-09-19 11:37:09 +0200
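A hedged sketch of the white-listing approach for the platform MSR (MSR_INTEL_PLATFORM_INFO); the two fragments belong to the read and write intercepts respectively, and the surrounding switch context is illustrative:

    /* rdmsr intercept: read as zero, so no CPUID-faulting capability is shown. */
    case MSR_INTEL_PLATFORM_INFO:
        *msr_content = 0;
        break;

    /* wrmsr intercept: accept (and discard) only writes of zero. */
    case MSR_INTEL_PLATFORM_INFO:
        if ( msr_content != 0 )
            goto gp_fault;
        break;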
xen: credit2: properly schedule migration of a running vcpu.
If wanting to migrate a vcpu that is actually running,
we need to ask the scheduler to chime in as soon as
possible, to have the vcpu itself stopped and actually
moved.
Make sure this happens by, after setting all the relevant
flags, raising the scheduler softirq.
xen: credit1: fix mask to be used for tickling in Credit1
If there are idle pcpus inside the waking vcpu's
soft-affinity mask, we should really tickle one
of them (this is one of the purposes of the
__runq_tickle() function itself!), not just
any idle pcpu.
The issue has been introduced in 02ea5031825d
("credit1: properly deal with pCPUs not in any cpupool"),
where the usage of idle_mask is changed, without
updating the bottom of the function, where it
is also referenced.
Andrew Cooper [Wed, 28 Sep 2016 14:50:24 +0000 (16:50 +0200)]
x86/domctl: Fix migration of guests which are not using xsave
c/s da62246e "x86/xsaves: enable xsaves/xrstors/xsavec in xen" broke migration
of PV guests which were not using xsave.
In such a case, compress_xsave_states() gets passed a zero length buffer. The
first thing it tries to do is ASSERT() on user-provided data, if it hadn't
already wandered off the end of the buffer to do so.
Perform more verification of the input buffer before passing it to
compress_xsave_states(). This involves making xsave_area_compressed() public.
Similar problems exist on the HVM side, so make equivalent adjustments there.
This doesn't manifest in general, as hvm_save_cpu_xsave_states() elides the
entire record if xsave isn't used, but is a problem if a caller were to
construct an xsave record manually.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <JBeulich@suse.com>
master commit: 681aea049c4a83bb847918003dc2ae21c1156ddb
master date: 2016-09-13 10:44:04 +0100
Andrew Cooper [Wed, 28 Sep 2016 14:49:26 +0000 (16:49 +0200)]
x86/domctl: Fix TOCTOU race with the use of XEN_DOMCTL_getvcpuextstate
A toolstack must call XEN_DOMCTL_getvcpuextstate twice; first to find the size
of the buffer to use, and a second time to get the actual content.
The reported size was based on v->arch.xcr0_accum, but a guest which extends
its xcr0_accum between the two hypercalls will cause the toolstack to fail the
evc->size != size check, as the provided buffer is now too small. This causes
a hard error during the final phase of migration.
Instead, return a size based on xfeature_mask, which is the maximum size Xen
will ever permit. The hypercall must now tolerate a toolstack-provided buffer
which is overly large (for the case where a guest isn't using all available
xsave states), and should write back how much data was actually written into
the buffer.
As the query for size now has no dependence on vcpu state, the vcpu_pause()
can be omitted for a small performance improvement.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d4a322557ae98cccdf90a0f442a29e1f5d76378a
master date: 2016-09-13 10:43:59 +0100
libxl: do not assume Dom0 backend while getting nic info
Fill backend_domid field based on backend path.
[ Corresponding xen.git#master commit is 7539772a65b0 -iwj ]
Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Andrew Cooper [Thu, 1 Sep 2016 09:45:03 +0000 (10:45 +0100)]
tools/migrate: Prevent PTE truncation from being fatal during the live phase
It is possible, when normalising a PV pagetable that the table has been freed
and reused for something else by the guest.
In such a case, data read might no longer be a pagetable, and fail the
truncation check. However, this should only be fatal if we encounter such a
page in the paused phase.
This check is now consistent with all other checks in the same area.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit a335a58e757232249d7f33503ab5d8f849518a47)
Juergen Gross [Fri, 26 Aug 2016 11:58:55 +0000 (13:58 +0200)]
libxc: correct max_pfn calculation for saving domain
Commit 91e204d37f44913913776d0a89279721694f8b32 ("libxc: try to find
last used pfn when migrating") introduced a bug for the case of a
domain supporting the virtual mapped linear p2m list: the maximum pfn
of the domain calculated from the p2m memory allocation might be too
low.
Correct this.
Reported-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 9daed8321b44c3ca82e412eb130f84e6b6c17dc5)