Andrew Cooper [Tue, 23 Sep 2014 12:49:35 +0000 (14:49 +0200)]
x86/shadow: fix race condition sampling the dirty vram state
d->arch.hvm_domain.dirty_vram must be read with the domain's paging lock held.
If not, two concurrent hypercalls could both end up attempting to free
dirty_vram (the second of which will free a wild pointer), or both end up
allocating a new dirty_vram structure (the first of which will be leaked).
This is XSA-104.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 46a49b91f1026f64430b84dd83e845a33f06415e
master date: 2014-09-23 14:31:47 +0200
Jan Beulich [Fri, 22 Aug 2014 12:13:27 +0000 (14:13 +0200)]
x86_emulate: properly do IP updates and other side effects on success
The two MMX/SSE/AVX code blocks failed to update IP properly, and these
as well as get_reg_refix(), which "manually" updated IP so far, failed
to do the TF and RF processing needed at the end of successfully
emulated instructions.
Fix the test utility at once to check IP is properly getting updated,
and while at it macroize the respective code quite a bit, hopefully
making it easier to add further tests when the need arises.
Reported-by: Andrei LUTAS <vlutas@bitdefender.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com>
master commit: 3af450fd2d9403f208d3ac6459716f027b8597ad
master date: 2014-08-08 09:34:03 +0200
Andrew Cooper [Tue, 12 Aug 2014 14:10:01 +0000 (16:10 +0200)]
x86/cpu: undo BIOS CPUID max_leaf limit before querying for features
If IA32_MISC_ENABLE[22] is set by the BIOS, CPUID.0.EAX will be limited to 3.
Lift this limit before considering whether to query CPUID.7[ECX=0].EBX for
features.
Without this change, dom0 is able to see this feature leaf (as the limit was
subsequently lifted), and will set features appropriately in HVM domain cpuid
policies.
The specific bug XenServer observed was the advertisement of the FSGSBASE
feature, but an inability to set CR4.FSGSBASE as Xen considered the bit to be
reserved as cpu_has_fsgsbase incorrectly evaluated as false.
This is a regression introduced by c/s 44e24f8567 "x86: don't call
generic_identify() redundantly" where the redundant call actually resampled
CPUID.7[ECX=0] properly to obtain the feature flags.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a1ac4cf52e38386bac7ac3440c7da0099662ca5c
master date: 2014-07-29 17:02:25 +0200
Jan Beulich [Tue, 12 Aug 2014 14:09:12 +0000 (16:09 +0200)]
x86/paging: make log-dirty operations preemptible
Both the freeing and the inspection of the bitmap get done in (nested)
loops which - besides having a rather high iteration count in general,
albeit that would be covered by XSA-77 - have the number of non-trivial
iterations they need to perform (indirectly) controllable by both the
guest they are for and any domain controlling the guest (including the
one running qemu for it).
This is CVE-2014-5146 / XSA-97.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 95e6d82224689fdfd967a093a4d69efc24c17e91
master date: 2014-08-12 15:30:11 +0200
Robert Phillips [Tue, 12 Aug 2014 14:08:10 +0000 (16:08 +0200)]
x86/mm/hap: Adjust vram tracking to play nicely with log-dirty.
The previous code assumed the guest would be in one of three mutually exclusive
modes for bookkeeping dirty pages: (1) shadow, (2) hap utilizing the log dirty
bitmap to support functionality such as live migrate, (3) hap utilizing the
log dirty bitmap to track dirty vram pages.
Races arose when a guest attempted to track dirty vram while performing live
migrate. (The dispatch table managed by paging_log_dirty_init() might change
in the middle of a log dirty or a vram tracking function.)
This change allows hap log dirty and hap vram tracking to be concurrent.
Vram tracking no longer uses the log dirty bitmap. Instead it detects
dirty vram pages by examining their p2m type. The log dirty bitmap is only
used by the log dirty code. Because the two operations use different
mechanisms, they are no longer mutually exclusive.
Signed-Off-By: Robert Phillips <robert.phillips@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
Minor whitespace changes to conform with coding style Signed-off-by: Tim Deegan <tim@xen.org>
master commit: fd91a2a662bc59677e0f217423a7a155d5465886
master date: 2012-12-13 12:10:14 +0000
This is a patch repairing a regression in code previously functional in 4.1.x.
It appears that, during some refactoring work, call to hvm_memory_event_cr0 was lost.
This function was originally called in mov_to_cr() of vmx.c, but the commit
http://xenbits.xen.org/hg/xen-unstable.hg/rev/1276926e3795 abstracted the
original code into generic functions up a level in hvm.c, dropping the call
in the process.
The same issue affected the CR3 and CR4 events, which were fixed in patch
http://xenbits.xensource.com/hg/xen-unstable.hg/rev/7ab899e46347.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 5d570c1d0274cac3b333ef378af3325b3b69905e
master date: 2014-07-23 18:05:11 +0200
avoid crash when doing shutdown with active cpupools
When shutting down the machine while there are cpus in a cpupool other than
Pool-0 a crash is triggered due to cpupool handling rejecting offlining the
non-boot cpus in other cpupools.
It is easy to detect this case and allow offlining those cpus.
Reported-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Stefan Bader <stefan.bader@canonical.com>
master commit: 05377dede434c746e6708f055858378d20f619db
master date: 2014-07-23 18:03:19 +0200
For safety reasons, c/s 6ae2df93c27 "mem_access: Add helper API to setup
ring and enable mem_access" has to pause the domain while it performs a set of
operations.
However without properly reference counted hypercalls, xc_mem_event_enable()
now unconditionally unpauses a previously paused domain.
To prevent toolstack software running wild, there is an arbitrary limit of 255
on the toolstack pause count. This is high enough for several components of
the toolstack to safely use, but prevents over/underflow of d->pause_count.
The previous domain_{,un}pause_by_systemcontroller() functions are updated to
return an error code. domain_pause_by_systemcontroller() is modified to have
a common stub and take a pause_fn pointer, allowing for both sync and nosync
domain pauses. domain_pause_for_debugger() has a hand-rolled nosync pause
replaced with the new domain_pause_by_systemcontroller_nosync(), and has its
variables shuffled slightly to avoid rereading current multiple times.
Suggested-by: Don Slutz <dslutz@verizon.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
With a couple of formatting adjustments: Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/gdbsx: invert preconditions for XEN_DOMCTL_gdbsx_{,un}pausevcpu hypercalls
c/s 3eb1c708ab "properly reference count DOMCTL_{,un}pausedomain hypercalls"
accidentally inverted the use of d->controller_pause_count.
Revert back to how it was originally, i.e. the XEN_DOMCTL_gdbsx_{,un}pausevcpu
hypercalls are only valid for a domain already paused by the system controller.
Jan Beulich [Mon, 28 Jul 2014 13:09:47 +0000 (15:09 +0200)]
VT-d/ATS: correct and clean up dev_invalidate_iotlb()
While this was intended to only do cleanup (replace the two bogus
"ret |= " constructs, and a simple formatting correction), this now
also
- fixes the bit manipulations for size_order > 0
a) correct an off-by-one in the use of size_order for shifting (till
now double the requested size got invalidated)
b) in fact setting bit 12 and up if necessary (without which too
small a region might have got invalidated)
c) making them capable of dealing with regions of 4Gb size and up
- corrects the return value handling, such that a later iteration's
success won't clear an earlier iteration's error indication
- uses PCI_BDF2() instead of open coding it
- bail immediately on bad passed in invalidation type, rather than
repeatedly printing the same message for each ATS-capable device, at
once also no longer hiding that failure from the caller
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: fd33987ba27607c3cc7da258cf1d86d21beeb735
master date: 2014-06-30 15:57:40 +0200
Dario Faggioli [Fri, 20 Jun 2014 14:09:00 +0000 (16:09 +0200)]
blktap2: Fix two 'maybe uninitialized' variables
for which gcc 4.9.0 complains about, like this:
block-qcow.c: In function `get_cluster_offset':
block-qcow.c:431:3: error: `tmp_ptr' may be used uninitialized in this function
[-Werror=maybe-uninitialized]
memcpy(tmp_ptr, l1_ptr, 4096);
^
block-qcow.c:606:7: error: `tmp_ptr2' may be used uninitialized in this
function [-Werror=maybe-uninitialized]
if (write(s->fd, tmp_ptr2, 4096) != 4096) {
^
cc1: all warnings being treated as errors
/home/dario/Sources/xen/xen/xen.git/tools/blktap2/drivers/../../../tools/Rules.mk:89:
recipe for target 'block-qcow.o' failed
make[5]: *** [block-qcow.o] Error 1
The proper behavior is to return upon allocation failure.
About what to return, 0 seems the best option, looking
at both the function and the call sites.
Jan Beulich [Tue, 24 Jun 2014 08:24:52 +0000 (10:24 +0200)]
VT-d/qinval: make local variable used for communication with IOMMU "volatile"
Without that there is - afaict - nothing preventing the compiler from
putting the variable into a register for the duration of the wait loop.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: ceec46c02074e1b2ade0b13c3c4a2f3942ae698c
master date: 2014-06-20 10:25:33 +0200
Jan Beulich [Tue, 24 Jun 2014 08:24:20 +0000 (10:24 +0200)]
x86/EFI: allow FPU/XMM use in runtime service functions
UEFI spec update 2.4B developed a requirement to enter runtime service
functions with CR0.TS (and CR0.EM) clear, thus making feasible the
already previously stated permission for these functions to use some of
the XMM registers. Enforce this requirement (along with the connected
ones on FPU control word and MXCSR) by going through a full FPU save
cycle (if the FPU was dirty) in efi_rs_enter() (along with loading the
specified values into the other two registers).
Note that the UEFI spec mandates that extension registers other than
XMM ones (for our purposes all that get restored eagerly) are preserved
across runtime function calls, hence there's nothing we need to restore
in efi_rs_leave() (they do get saved, but just for simplicity's sake).
Malcolm Crossley [Tue, 24 Jun 2014 08:23:12 +0000 (10:23 +0200)]
IOMMU: prevent VT-d device IOTLB operations on wrong IOMMU
PCIe ATS allows for devices to contain IOTLBs, the VT-d code was iterating
around all ATS capable devices and issuing IOTLB operations for all IOMMUs,
even though each ATS device is only accessible via one particular IOMMU.
Issuing an IOMMU operation to a device not accessible via that IOMMU results
in an IOMMU timeout because the device does not reply. VT-d IOMMU timeouts
result in a Xen panic.
Therefore this bug prevents any Intel system with 2 or more ATS enabled IOMMUs,
each with an ATS device connected to them, from booting Xen.
The patch adds a IOMMU pointer to the ATS device struct so the VT-d code can
ensure it does not issue IOMMU ATS operations on the wrong IOMMU. A void
pointer has to be used because AMD and Intel IOMMU implementations do not have
a common IOMMU structure or indexing mechanism.
Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 84c340ba4c3eb99278b6ba885616bb183b88ad67
master date: 2014-06-18 15:50:02 +0200
x86/mce: don't spam the console with "CPUx: Temperature z"
If the machine has been quite busy it ends up with these messages
printed on the hypervisor console:
(XEN) CPU3: Temperature/speed normal
(XEN) CPU1: Temperature/speed normal
(XEN) CPU0: Temperature/speed normal
(XEN) CPU1: Temperature/speed normal
(XEN) CPU0: Temperature/speed normal
(XEN) CPU2: Temperature/speed normal
(XEN) CPU3: Temperature/speed normal
(XEN) CPU0: Temperature/speed normal
(XEN) CPU2: Temperature/speed normal
(XEN) CPU3: Temperature/speed normal
(XEN) CPU1: Temperature/speed normal
(XEN) CPU0: Temperature above threshold
(XEN) CPU0: Running in modulated clock mode
(XEN) CPU1: Temperature/speed normal
(XEN) CPU2: Temperature/speed normal
(XEN) CPU3: Temperature/speed normal
While the state changes are important, the non-altered state
information is not needed. As such add a latch mechanism to only print
the information if it has changed since the last update (and the
hardware doesn't properly suppress redundant notifications).
This was observed on Intel DQ67SW,
BIOS SWQ6710H.86A.0066.2012.1105.1504 11/05/2012
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christoph Egger <chegger@amazon.de>
master commit: 323338f86fb6cd6f6dba4f59a84eed71b3552d21
master date: 2014-06-16 11:59:32 +0200
Jan Beulich [Tue, 24 Jun 2014 08:17:45 +0000 (10:17 +0200)]
x86/HVM: refine SMEP test in HVM_CR4_GUEST_RESERVED_BITS()
Andrew validly points out that the use of the macro on the restore path
can't rely on the CPUID bits for the guest already being in place (as
their setting by the tool stack in turn requires the other restore
operations already having taken place). And even worse, using
hvm_cpuid() is invalid here because that function assumes to be used in
the context of the vCPU in question.
Reverting to the behavior prior to the change from checking
cpu_has_sm?p to hvm_vcpu_has_sm?p() would break the other (non-restore)
use of the macro. So let's revert to the prior behavior only for the
restore path, by adding a respective second parameter to the macro.
Obviously the two cpu_has_* uses in the macro should really also be
converted to hvm_cpuid() based checks at least for the non-restore
path.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: David Vrabel <david.vrabel@citrix.com>
master commit: 584287380baf81e5acdd9dc7dfc7ffccd1e9a856
master date: 2014-06-10 13:12:05 +0200
Juergen Gross [Tue, 24 Jun 2014 08:16:12 +0000 (10:16 +0200)]
avoid crash on HVM domain destroy with PCI passthrough
c/s bac6334b5 "move domain to cpupool0 before destroying it" introduced a
problem when destroying a HVM domain with PCI passthrough enabled. The
moving of the domain to cpupool0 includes moving the pirqs to the cpupool0
cpus, but the event channel infrastructure already is unusable for the
domain. So just avoid moving pirqs for dying domains.
Roger Pau Monné [Tue, 24 Jun 2014 08:15:29 +0000 (10:15 +0200)]
x86: fix reboot/shutdown with running HVM guests
If there's a guest using VMX/SVM when the hypervisor shuts down, it
can lead to the following crash due to VMX/SVM functions being called
after hvm_cpu_down has been called. In order to prevent that, check in
{svm/vmx}_ctxt_switch_from that the cpu virtualization extensions are
still enabled.
Andrew Cooper [Tue, 24 Jun 2014 08:11:54 +0000 (10:11 +0200)]
x86/domctl: two functional fixes to XEN_DOMCTL_[gs]etvcpuextstate
Interacting with the vcpu itself should be protected by vcpu_pause().
Buggy/naive toolstacks might encounter adverse interaction with a vcpu context
switch, or increase of xcr0_accum. There are no much problems with current
in-tree code.
Explicitly permit a NULL guest handle as being a request for size. It is the
prevailing Xen style, and without it, valgrind's ioctl handler is unable to
determine whether evc->buffer actually got written to.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/domctl: further fix to XEN_DOMCTL_[gs]etvcpuextstate
Do not clobber errors from certain codepaths. Clobbering of -EINVAL from
failing "evc->size <= PV_XSAVE_SIZE(_xcr0_accum)" was a pre-existing bug.
However, clobbering -EINVAL/-EFAULT from the get codepath was a bug
unintentionally introduced by 090ca8c1 "x86/domctl: two functional fixes to
XEN_DOMCTL_[gs]etvcpuextstate".
Jan Beulich [Tue, 24 Jun 2014 08:10:13 +0000 (10:10 +0200)]
VT-d: honor APEI firmware-first mode in XSA-59 workaround code
When firmware-first mode is being indicated by firmware, we shouldn't
be modifying AER registers - these are considered to be owned by
firmware in that case. Violating this is being reported to result in
SMI storms. While circumventing the workaround means re-exposing
affected hosts to the XSA-59 issues, this in any event seems better
than not booting at all. Respective messages are being issued to the
log, so the situation can be diagnosed.
The basic building blocks were taken from Linux 3.15-rc. Note that
this includes a block of code enclosed in #ifdef CONFIG_X86_MCE - we
don't define that symbol, and that code also wouldn't build without
suitable machine check side code added; that should happen eventually,
but isn't subject of this change.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Malcolm Crossley <malcolm.crossley@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: 1cc37ba8dbd89fb86dad3f6c78c3fba06019fe21
master date: 2014-06-05 17:49:14 +0200
Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Use < instead of <= (which I wrongly suggested), return -ENODATA
instead of -EINVAL, and make description match code.
A failure would result in log message like so-
(XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 failed
^^^^^^^^^^^^^^^^^^^^^^
The above message has the revision numbers inverted. Fix this.
Using %r11 with "cmpw" is invalid (which gas 2.25 is going to complain
about). Rather than rolling a branch specific fix, re-use the
respective hunk from master commit 4d246723 ("x86: use MOV instead of
PUSH/POP when saving/restoring register state").
Jan Beulich [Wed, 4 Jun 2014 15:20:38 +0000 (17:20 +0200)]
ACPI/ERST: fix table mapping
acpi_get_table(), when executed before reaching SYS_STATE_active, will
return a mapping valid only until the next invocation of that funciton.
Consequently storing the returned pointer for later use is incorrect.
Copy the logic used in VT-d's DMAR handling.
Jan Beulich [Tue, 3 Jun 2014 14:11:52 +0000 (16:11 +0200)]
x86/HVM: eliminate vulnerabilities from hvm_inject_msi()
- pirq_info() returns NULL for a non-allocated pIRQ, and hence we
mustn't unconditionally de-reference it, and we need to invoke it
another time after having called map_domain_emuirq_pirq()
- don't use printk(), namely without XENLOG_GUEST, for error reporting
Ross Lagerwall [Tue, 3 Jun 2014 10:23:25 +0000 (12:23 +0200)]
timers: set the deadline more accurately
Program the timer to the deadline of the closest timer if it is further
than 50us ahead, otherwise set it 50us ahead. This way a single event
fires on time rather than 50us late (as it would have previously) while
still preventing too many timer wakeups in the case of having many
timers scheduled close together.
Jan Beulich [Tue, 3 Jun 2014 10:22:49 +0000 (12:22 +0200)]
x86: don't use VA for cache flush when also flushing TLB
Doing both flushes at once is a strong indication for the address
mapping to either having got dropped (in which case the cache flush,
when done via INVLPG, would fault) or its physical address having
changed (in which case the cache flush would end up being done on the
wrong address range). There is no adverse effect (other than the
obvious performance one) using WBINVD in this case regardless of the
range's size; only map_pages_to_xen() uses combined flushes at present.
This problem was observed with the 2nd try backport of d6cb14b3 ("VT-d:
suppress UR signaling for desktop chipsets") to 4.2 (where ioremap()
needs to be replaced with set_fixmap_nocache(); the previously
commented out and now being re-enabled __set_fixmap(, 0, 0) there to
undo the mapping resulted in the first of the above two scenarios).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 50df6f7429f73364bbddb0970a3a34faa01a7790
master date: 2014-05-28 09:51:07 +0200
Jan Beulich [Tue, 3 Jun 2014 10:22:14 +0000 (12:22 +0200)]
AMD IOMMU: don't free page table prematurely
iommu_merge_pages() still wants to look at the next level page table,
the TLB flush necessary before freeing too happens in that function,
and if it fails no free should happen at all. Hence the freeing must
be done after that function returned successfully, not before it's
being called.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
master commit: 6b4d71d028f445cba7426a144751fddc8bfdd67b
master date: 2014-05-28 09:50:33 +0200
Jan Beulich [Tue, 3 Jun 2014 10:21:12 +0000 (12:21 +0200)]
VT-d: fix mask applied to DMIBAR in desktop chipset XSA-59 workaround
In commit ("VT-d: suppress UR signaling for desktop chipsets")
the mask applied to the value read from DMIBAR is to narrow, only the
comment accompanying it was correct. Fix that and tag the literal
number as "long long" at once to avoid eventual compiler warnings.
The widest possible value so far is 39 bits; all chipsets covered here
but having less than this number of bits have the remaining bits marked
reserved (zero), and hence there's no need for making the mask chipset
specific.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: f8ecf31c31906552522c2a1b0d1cada07d78876e
master date: 2014-05-26 12:28:46 +0200
Jan Beulich [Tue, 3 Jun 2014 10:19:45 +0000 (12:19 +0200)]
ACPI/ERST: fix table mapping
acpi_get_table(), when executed before reaching SYS_STATE_active, will
return a mapping valid only until the next invocation of that funciton.
Consequently storing the returned pointer for later use is incorrect.
Copy the logic used in VT-d's DMAR handling.
Juergen Gross [Fri, 23 May 2014 13:42:16 +0000 (15:42 +0200)]
move domain to cpupool0 before destroying it
Currently when a domain is destroyed it is removed from the domain_list
before all of it's resources, including the cpupool membership, are freed.
This can lead to a situation where the domain is still member of a cpupool
without for_each_domain_in_cpupool() (or even for_each_domain()) being
able to find it any more. This in turn can result in rejection of removing
the last cpu from a cpupool, because there seems to be still a domain in
the cpupool, even if it can't be found by scanning through all domains.
This situation can be avoided by moving the domain to be destroyed to
cpupool0 first and then remove it from this cpupool BEFORE deleting it from
the domain_list. As cpupool0 is always active and a domain without any cpupool
membership is implicitly regarded as belonging to cpupool0, this poses no
problem.
Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
master commit: bac6334b51d9bcfe57ecf4a4cb5288348fcf044a
master date: 2014-05-20 15:55:42 +0200
Jan Beulich [Fri, 23 May 2014 13:41:34 +0000 (15:41 +0200)]
VT-d: extend error report masking workaround to newer chipsets
Add two more PCI IDs to the set that has been taken care of with a
different workaround long before XSA-59, and (for constency with the
newer workarounds) log a message here too.
Also move the function wide comment to the cases it applies to; this
should really have been done by d061d200 ("VT-d: suppress UR signaling
for server chipsets").
This is CVE-2013-3495 / XSA-59.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: 04734664eb20c3bf239e473af182bb7ab901d779
master date: 2014-05-20 15:54:01 +0200
Jan Beulich [Fri, 23 May 2014 13:40:47 +0000 (15:40 +0200)]
VT-d: apply quirks at device setup time rather than only at boot
Accessing extended config space may not be possible at boot time, e.g.
when the memory space used by MMCFG is reserved only via ACPI tables,
but not in the E820/UEFI memory maps (which we need Dom0 to tell us
about). Consequently the change here still leaves the issue unaddressed
for systems where the extended config space remains inaccessible (due
to firmware bugs, i.e. not properly reserving the address space of
those regions).
With the respective messages now potentially getting logged more than
once, we ought to consider whether we should issue them only if we in
fact were required to do any masking (i.e. if the relevant mask bits
weren't already set).
This is CVE-2013-3495 / XSA-59.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: 5786718fbaafbe47d72cc1512cd93de79b8fc2fa
master date: 2014-05-20 15:53:20 +0200
Jan Beulich [Fri, 23 May 2014 13:35:45 +0000 (15:35 +0200)]
VT-d: suppress UR signaling for desktop chipsets
Unsupported Requests can be signaled for malformed writes to the MSI
address region, e.g. due to buggy or malicious DMA set up to that
region. These should normally result in IOMMU faults, but don't on
the desktop chipsets dealt with here.
This is CVE-2013-3495 / XSA-59.
Note that in the backported version the clearing of the fixmap entry
is commented out - it's not strictly needed, as we don't re-use the
fixmap slot (and if we did it would still caue no problems), but causes
a problem in map_pages_to_xen(), which wants to flush the cache for the
page in question, but that works only when the page is still mapped.
Fixing this will need to be a separate patch (coming through -unstable)
though.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Don Dugger <donald.d.dugger@intel.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
master commit: d6cb14b34ffc2a830022d059f1aa22bf19dcf55f
master date: 2014-04-25 12:12:38 +0200
Boris Ostrovsky [Fri, 23 May 2014 13:33:44 +0000 (15:33 +0200)]
x86: use native RDTSC(P) execution when guest and host frequencies are the same
We should be able to continue using native RDTSC(P) execution on
HVM/PVH guests after migration if host and guest frequencies are
equal (this includes the case when the frequencies are made equal
by TSC scaling feature).
This also allows us to revert main part of commit 4aab59a3 (svm: Do not
intercept RDTSC(P) when TSC scaling is supported by hardware) which
was wrong: while RDTSC intercepts were disabled domain's vtsc could
still be set, leading to inconsistent view of guest's TSC.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 82713ec8d2b65d17f13e46a131e38bfe5baf8bd6
master date: 2014-04-22 12:07:37 +0200
Andrew Cooper [Sat, 10 May 2014 01:18:33 +0000 (02:18 +0100)]
tools/pygrub: Fix error handling if no valid partitions are found
If no partitions at all are found, pygrub never creates the name 'fs',
resulting in a NameError indicating the lack of fs, rather than a
RuntimeError explaining that no partitions were found.
Set fs to None right at the start, and use the pythonic idiom "if fs is None:"
to protect against otherwise valid values for fs which compare equal to
0/False.
do_tmem_destroy_pool is checking if pools == NULL. But, pools is a fixed
array.
Clang 3.5 will fail to compile xen/common/tmem.c with the following error:
tmem.c:1848:18: error: comparison of array 'client->pools' equal to a null
pointer is always false [-Werror,-Wtautological-pointer-compare]
if ( client->pools == NULL )
Jan Beulich [Mon, 12 May 2014 15:43:00 +0000 (17:43 +0200)]
x86: fix guest CPUID handling
The way XEN_DOMCTL_set_cpuid got handled so far allowed for surprises
to the caller. With this set of operations
- set leaf A (using array index 0)
- set leaf B (using array index 1)
- clear leaf A (clearing array index 0)
- set leaf B (using array index 0)
- clear leaf B (clearing array index 0)
the entry for leaf B at array index 1 would still be in place, while
the caller would expect it to be cleared.
While looking at the use sites of d->arch.cpuid[] I also noticed that
the allocation of the array needlessly uses the zeroing form - the
relevant fields of the array elements get set in a loop immediately
following the allocation.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 4c0ff6bd54b5a67f8f820f9ed0a89a79f1a26a1c
master date: 2014-05-02 12:09:03 +0200
Paul Durrant [Mon, 12 May 2014 15:42:33 +0000 (17:42 +0200)]
hvm_set_ioreq_page() releases wrong page in error path
The function calls prepare_ring_for_helper() to acquire a mapping for the
given gmfn, then checks (under lock) to see if the ioreq page is already
set up but, if it is, the function then releases the in-use ioreq page
mapping on the error path rather than the one it just acquired. This patch
fixes this bug.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 16e2a7596e9fc86881c73cef57602b2c88155528
master date: 2014-05-02 11:46:32 +0200
Jan Beulich [Mon, 12 May 2014 15:35:18 +0000 (17:35 +0200)]
VT-d: suppress UR signaling for desktop chipsets
Unsupported Requests can be signaled for malformed writes to the MSI
address region, e.g. due to buggy or malicious DMA set up to that
region. These should normally result in IOMMU faults, but don't on
the desktop chipsets dealt with here.
This is CVE-2013-3495 / XSA-59.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Don Dugger <donald.d.dugger@intel.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
master commit: d6cb14b34ffc2a830022d059f1aa22bf19dcf55f
master date: 2014-04-25 12:12:38 +0200
Jan Beulich [Mon, 12 May 2014 15:33:41 +0000 (17:33 +0200)]
VT-d: suppress UR signaling for server chipsets
Unsupported Requests can be signaled for malformed writes to the MSI
address region, e.g. due to buggy or malicious DMA set up to that
region. These should normally result in IOMMU faults, but don't on
the server chipsets dealt with here.
IDs 0xe00, 0xe01, and 0xe04 ... 0xe0b (Ivytown) aren't needed here -
Intel confirmed the issue to be fixed in hardware there.
This is CVE-2013-3495 / XSA-59.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Don Dugger <donald.d.dugger@intel.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
master commit: d061d200eb92bcb1d86f9b55c6de73e35ce63fdf
master date: 2014-04-25 12:11:55 +0200
Jan Beulich [Tue, 29 Apr 2014 13:31:28 +0000 (15:31 +0200)]
x86/HVM: restrict HVMOP_set_mem_type
Permitting arbitrary type changes here has the potential of creating
present P2M (and hence EPT/NPT/IOMMU) entries pointing to an invalid
MFN (INVALID_MFN truncated to the respective hardware structure field's
width). This would become a problem the latest when something real sat
at the end of the physical address space; I'm suspecting though that
other things might break with such bogus entries.
Along with that drop a bogus (and otherwise becoming stale) log
message.
Afaict the similar operation in p2m_set_mem_access() is safe.
This is XSA-92.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 83bb5eb4d340acebf27b34108fb1dae062146a68
master date: 2014-04-29 15:11:31 +0200
Jan Beulich [Wed, 9 Apr 2014 09:52:21 +0000 (11:52 +0200)]
VMX: fix PAT value seen by guest
The XSA-60 fixes introduced a window during which the guest PAT gets
forced to all zeros. This shouldn't be visible to the guest. Therefore
we need to intercept PAT MSR accesses during that time period.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Liu Jinsong <jinsong.liu@intel.com>
master commit: fce79f8ce91dc45f3a4d699ee67c49e6cbeb1197
master date: 2014-04-01 16:49:18 +0200
CPUID[80000008].EAX[23:16] have been given the meaning of the guest
physical address restriction (in case it needs to be smaller than the
host's), hence we need to mirror that into vCPUID[80000008].EAX[7:0].
Enforce a lower limit at the same time, as well as a fixed value for
the virtual address bits, and zero for the guest physical address ones.
In order for the vMTRR code to see these overrides we need to make it
call hvm_cpuid() instead of domain_cpuid(), which in turn requires
special casing (and relaxing) the controlling domain.
This additionally should hide an ordering problem in the tools: Both
xend and xl appear to be restoring a guest from its image before
setting up the CPUID policy in the hypervisor, resulting in
domain_cpuid() returning all zeros and hence the check in
mtrr_var_range_msr_set() failing if the guest previously had more than
the minimum 36 physical address bits.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: ef437690af8b75e6758dce77af75a22b63982883
master date: 2014-03-28 13:33:34 +0100
Jan Beulich [Wed, 9 Apr 2014 09:39:08 +0000 (11:39 +0200)]
x86: fix determination of bit count for struct domain allocations
We can't just add in the hole shift value, as the hole may be at or
above the 44-bit boundary. Instead we need to determine the total bit
count until reaching 32 significant (not squashed out) bits in PFN
representations.
Jan Beulich [Wed, 9 Apr 2014 09:38:20 +0000 (11:38 +0200)]
x86/Intel: work around Xeon 7400 series erratum AAI65
Linux commit 40e2d7f9b5dae048789c64672bf3027fbb663ffa ("x86 idle:
Repair large-server 50-watt idle-power regression") tells us that this
applies not just to the named Xeon 7400 series, but also NHM-EX and
WSM-EX; sadly Intel's documentation is so badly searchable that I
wasn't able to locate the respective errata (and hence can't quote
their numbers here).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 96d1b237ae9b2f2718bb1c59820701f17d3d86e0
master date: 2014-03-17 16:47:22 +0100
Jan Beulich [Wed, 9 Apr 2014 09:37:15 +0000 (11:37 +0200)]
VT-d: fix RMRR handling
Removing mapped RMRR tracking structures in dma_pte_clear_one() is
wrong for two reasons: First, these regions may cover more than a
single page. And second, multiple devices (and hence multiple devices
assigned to any particular guest) may share a single RMRR (whether
assigning such devices to distinct guests is a safe thing to do is
another question).
Therefore move the removal of the tracking structures into the
counterpart function to the one doing the insertion -
intel_iommu_remove_device(), and add a reference count to the tracking
structure.
Further, for the handling of the mappings of the respective memory
regions to be correct, RMRRs must not overlap. Add a respective check
to acpi_parse_one_rmrr().
And finally, with all of this being VT-d specific, move the cleanup
of the list as well as the structure type definition where it belongs -
in VT-d specific rather than IOMMU generic code.
Note that this doesn't address yet another issue associated with RMRR
handling: The purpose of the RMRRs as well as the way the respective
IOMMU page table mappings get inserted both suggest that these regions
would need to be marked E820_RESERVED in all (HVM?) guests' memory
maps, yet nothing like this is being done in hvmloader. (For PV guests
this would also seem to be necessary, but may conflict with PV guests
possibly assuming there to be just a single E820 entry representing all
of its RAM.)
Samuel Thibault [Fri, 21 Mar 2014 01:56:56 +0000 (02:56 +0100)]
PV-GRUB: fix blk access at end of disk
GRUB usually always loads a whole disk track, even if that means going
beyond the end of the disk. We thus have to gracefully return an error,
instead of letting the blkfront go panic.
Joby Poriyath [Tue, 4 Feb 2014 18:10:35 +0000 (18:10 +0000)]
xen/pygrub: grub2/grub.cfg from RHEL 7 has new commands in menuentry
menuentry in grub2/grub.cfg uses linux16 and initrd16 commands
instead of linux and initrd. Due to this RHEL 7 (beta) guest failed to
boot after the installation.
In addition to this, RHEL 7 menu entries have two different single-quote
delimited strings on the same line, and the greedy grouping for menuentry
parsing gets both strings, and the options inbetween.
Ian Jackson [Mon, 24 Feb 2014 14:19:15 +0000 (14:19 +0000)]
libxl: Fix carefd lock leak in save callout
If libxl_pipe fails we leave the carefd locked, which translates to
the atfork lock remaining held. This would probably cause the process
to deadlock shortly afterwards.
Of course libxl_pipe is very unlikely to fail unless things are
already going very badly. This bug has not been observed anywhere as
far as we are aware.
Ian Jackson [Mon, 24 Feb 2014 14:19:14 +0000 (14:19 +0000)]
libxl: Hold the atfork lock while closing carefd
This avoids the process being forked while a carefd is recorded in the
list but the actual fd has been closed. If that happened, a
subsequent libxl_postfork_child_noexec would attempt to close the fd
again. If we are lucky that results in a harmless warning; but if we
are unlucky the fd number has been reused and we close an unrelated
fd.
This race has not been observed anywhere as far as we are aware.
Ian Campbell [Fri, 14 Sep 2012 09:02:50 +0000 (10:02 +0100)]
xl: do not leak cpupool names.
Valgrind reports:
==3076== 7 bytes in 1 blocks are definitely lost in loss record 1 of 1
==3076== at 0x402458C: malloc (vg_replace_malloc.c:270)
==3076== by 0x406F86D: libxl_cpupoolid_to_name (libxl_utils.c:102)
==3076== by 0x8058742: parse_config_data (xl_cmdimpl.c:639)
==3076== by 0x805BD56: create_domain (xl_cmdimpl.c:1838)
==3076== by 0x805DAED: main_create (xl_cmdimpl.c:3903)
==3076== by 0x804D39D: main (xl.c:285)
And indeed there are several places where xl uses
libxl_cpupoolid_to_name as a boolean to test if the pool name is
valid and leaks the name if it is. Introduce an is_valid helper and
use that instead.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Juergen Gross<juergen.gross@ts.fujitsu.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 10a194b1c57de7ddc9d4fce07e01f2cd7d0ca26a)
Jan Beulich [Fri, 14 Mar 2014 16:52:27 +0000 (17:52 +0100)]
x86/HVM: consolidate passthrough handling in epte_get_entry_emt()
It is inconsistent to depend on iommu_enabled alone: For a guest
without devices passed through to it, it is of no concern whether the
IOMMU is enabled.
There's one rather special case to take care of: VMX code marks the
LAPIC access page as MMIO. The added assertion needs to take this into
consideration, and the subsequent handling of the direct MMIO case was
inconsistent too: That page would have been WB in the absence of an
IOMMU, but UC in the presence of it, while in fact the cachabilty of
this page is entirely unrelated to an IOMMU being in use.
Jan Beulich [Fri, 14 Mar 2014 16:51:39 +0000 (17:51 +0100)]
x86/HVM: fix memory type merging in epte_get_entry_emt()
Using the minimum numeric value of guest and host specified memory
types is too simplistic - it works only correctly for a subset of
types. It is in particular the WT/WP combination that needs conversion
to UC if the two types conflict.
Dongxiao Xu [Fri, 14 Mar 2014 16:51:07 +0000 (17:51 +0100)]
x86/hvm: refine the judgment on IDENT_PT for EMT
When trying to get the EPT EMT type, the judgment on
HVM_PARAM_IDENT_PT is not correct which always returns WB type if
the parameter is not set. Remove the related code.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
We can't fully drop the dependency yet, but we should certainly avoid
overriding cases already properly handled. The reason for this is that
the guest setting up its MTRRs happens _after_ the EPT tables got
already constructed, and no code is in place to propagate this to the
EPT code. Without this check we're forcing the guest to run with all of
its memory uncachable until something happens to re-write every single
EPT entry. But of course this has to be just a temporary solution.
In the same spirit we should defer the "very early" (when the guest is
still being constructed and has no vCPU yet) override to the last
possible point.
Jan Beulich [Fri, 14 Mar 2014 16:50:03 +0000 (17:50 +0100)]
IOMMU: generalize and correct softirq processing during Dom0 device setup
c/s 21039:95f5a4ce8f24 ("VT-d: reduce default verbosity") having put a
call to process_pending_softirqs() in VT-d's domain_context_mapping()
was wrong in two ways: For one we shouldn't be doing this when setting
up a device during DomU assignment. And then - I didn't check whether
that was the case already back then - we shouldn't call that function
with the pcidevs_lock (or in fact any spin lock) held.
Move the "preemption" into generic code, at once dealing with further
actual (too much output elsewhere - particularly on systems with very
many host bridge like devices - having been observed to still cause the
watchdog to trigger when enabled) and potential (other IOMMU code may
also end up being too verbose) issues.
Do the "preemption" once per device actually being set up when in
verbose mode, and once per bus otherwise.
Note that dropping pcidevs_lock around the process_pending_softirqs()
invocation is specifically not a problem here: We're in an __init
function and aren't racing with potential additions/removals of PCI
devices. Not acquiring the lock in setup_dom0_pci_devices() otoh is not
an option, as there are too many places that assert the lock being
held.
Andrew Cooper [Fri, 14 Mar 2014 16:45:52 +0000 (17:45 +0100)]
x86/mce: Reduce boot-time logspam
When booting with "no-mce", the user does not need to be told that "MCE
support [was] disabled by bootparam" for each cpu. Furthermore, a file:line
reference is unnecessary.
Frediano Ziglio [Fri, 14 Mar 2014 16:44:19 +0000 (17:44 +0100)]
x86/MCE: Fix race condition in mctelem_reserve
These lines (in mctelem_reserve)
newhead = oldhead->mcte_next;
if (cmpxchgptr(freelp, oldhead, newhead) == oldhead) {
are racy. After you read the newhead pointer it can happen that another
flow (thread or recursive invocation) change all the list but set head
with same value. So oldhead is the same as *freelp but you are setting
a new head that could point to whatever element (even already used).
This patch use instead a bit array and atomic bit operations.
Use __func__ to ensure the name remains correct in the future.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
(cherry picked from commit 1671cdeac7da663fb2963f3e587fa279dcd0238b)
Jan Beulich [Thu, 20 Feb 2014 07:41:22 +0000 (08:41 +0100)]
x86: don't drop guest visible state updates when 64-bit PV guest is in user mode
Since 64-bit PV uses separate kernel and user mode page tables, kernel
addresses (as usually provided via VCPUOP_register_runstate_memory_area)
aren't necessarily accessible when the respective updating occurs. Add
logic for toggle_guest_mode() to take care of this (if necessary) the
next time the vCPU switches to kernel mode.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 231d7f4098c8ac9cdb78f18fcb820d8618c8b0c2
master date: 2014-01-23 10:30:08 +0100
Andrew Cooper [Thu, 20 Feb 2014 07:40:24 +0000 (08:40 +0100)]
x86-64/percpu: Force INVALID_PERCPU_AREA into the non-canonical address region
This causes accidental uses of per_cpu() on a pcpu with an INVALID_PERCPU_AREA
to result in a #GF for attempting to access the middle of the non-canonical
virtual address region.
This is preferable to the current behaviour, where incorrect use of per_cpu()
will result in an effective NULL structure dereference which has security
implication in the context of PV guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 7cfb0053629c4dd1a6f01dc43cca7c0c25b8b7bf
master date: 2013-10-04 12:24:34 +0200
Jan Beulich [Thu, 13 Feb 2014 09:21:42 +0000 (10:21 +0100)]
flask: check permissions first thing in flask_security_set_bool()
Nothing else should be done if the caller isn't permitted to set
boolean values.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: ebe867052e0f782139147015c4e91b37aa5e68f1
master date: 2014-02-11 11:14:10 +0100
Jan Beulich [Thu, 13 Feb 2014 09:21:02 +0000 (10:21 +0100)]
flask: fix error propagation from flask_security_set_bool()
The function should return an error when flask_security_make_bools()
fails as well as when the input ID is out of range.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: 31f3620be0e3158c205a3669135f9c4bfa40b1c7
master date: 2014-02-11 11:13:22 +0100
Jan Beulich [Thu, 13 Feb 2014 09:19:20 +0000 (10:19 +0100)]
flask: fix memory leaks
Plus, in the case of security_preserve_bools(), prevent double freeing
in the case of security_get_bools() failing.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: 57c9f2caf05de41913b3e4eb48c0c3ad6c18dd3f
master date: 2014-02-11 11:11:48 +0100
Jan Beulich [Thu, 13 Feb 2014 09:18:13 +0000 (10:18 +0100)]
AMD IOMMU: fail if there is no southbridge IO-APIC
... but interrupt remapping is requested (with per-device remapping
tables). Without it, the timer interrupt is usually not working.
Inspired by Linux'es "iommu/amd: Work around wrong IOAPIC device-id in
IVRS table" (commit c2ff5cf5294bcbd7fa50f7d860e90a66db7e5059) by Joerg
Roedel <joerg.roedel@amd.com>.
Reported-by: Eric Houby <ehouby@yahoo.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Eric Houby <ehouby@yahoo.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
master commit: 06bbcaf48d09c18a41c482866941ddd5d2846b44
master date: 2014-02-10 10:05:24 +0100
x86/AMD: apply workaround for AMD F16h erratum 792
Workaround for the Erratum will be in BIOSes spun only after
Jan 2014 onwards. But initial production parts shipped in 2013
itself. Since there is a coverage hole, we should carry this fix
in software in case BIOS does not do the right thing or someone
is using old BIOS.
Description:
Processor does not ensure DRAM scrub read/write sequence is atomic wrt
accesses to CC6 save state area. Therefore if a concurrent scrub
read/write access is to same address the entry may appear as if it is
not written. This quirk applies to Fam16h models 00h-0Fh
See "Revision Guide" for AMD F16h models 00h-0fh, document 51810 rev.
3.04, Nov 2013.
Equivalent Linux patch link:
http://marc.info/?l=linux-kernel&m=139066012217149&w=2
Tested the patch on Fam16h server platform and it works fine.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Corrected checking for boot CPU. Made warning message conditional.
Compacted warning message text. Moved comment to commit message.
Jan Beulich [Thu, 13 Feb 2014 09:16:13 +0000 (10:16 +0100)]
x86/domctl: don't ignore errors from vmce_restore_vcpu()
What started out as a simple cleanup patch (eliminating the redundant
check of domctl->cmd before copying back the output data) revealed a
bug in the handling of XEN_DOMCTL_set_ext_vcpucontext.
Fix this, retaining the cleanup.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: af172d655c3900822d1f710ac13ee38ee9d482d2
master date: 2014-02-04 09:22:12 +0100
Andrew Cooper [Wed, 22 Jan 2014 17:47:21 +0000 (17:47 +0000)]
libxc: Fix out-of-memory error handling in xc_cpupool_getinfo()
Avoid freeing info then returning it to the caller.
This is XSA-88.
Coverity-ID: 1056192 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d883c179a74111a6804baf8cb8224235242a88fc)
libvchan: Fix handling of invalid ring buffer indices
The remote (hostile) process can set ring buffer indices to any value
at any time. If that happens, it is possible to get "buffer space"
(either for writing data, or ready for reading) negative or greater
than buffer size. This will end up with buffer overflow in the second
memcpy inside of do_send/do_recv.
Fix this by introducing new available bytes accessor functions
raw_get_data_ready and raw_get_buffer_space which are robust against
mad ring states, and only return sanitised values.
Proof sketch of correctness:
Now {rd,wr}_{cons,prod} are only ever used in the raw available bytes
functions, and in do_send and do_recv.
The raw available bytes functions do unsigned arithmetic on the
returned values. If the result is "negative" or too big it will be
>ring_size (since we used unsigned arithmetic). Otherwise the result
is a positive in-range value representing a reasonable ring state, in
which case we can safely convert it to int (as the rest of the code
expects).
do_send and do_recv immediately mask the ring index value with the
ring size. The result is always going to be plausible. If the ring
state has become mad, the worst case is that our behaviour is
inconsistent with the peer's ring pointer. I.e. we read or write to
arguably-incorrect parts of the ring - but always parts of the ring.
And of course if a peer misoperates the ring they can achieve this
effect anyway.
So the security problem is fixed.
This is XSA-86.
(The patch is essentially Ian Jackson's work, although parts of the
commit message are by Marek.)
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 2efcb0193bf3916c8ce34882e845f5ceb1e511f7
master date: 2014-02-06 16:44:41 +0100
Jan Beulich [Thu, 6 Feb 2014 16:36:40 +0000 (17:36 +0100)]
flask: fix reading strings from guest memory
Since the string size is being specified by the guest, we must range
check it properly before doing allocations based on it. While for the
two cases that are exposed only to trusted guests (via policy
restriction) this just uses an arbitrary upper limit (PAGE_SIZE), for
the FLASK_[GS]ETBOOL case (which any guest can use) the upper limit
gets enforced based on the longest name across all boolean settings.
This is XSA-84.
Reported-by: Matthew Daley <mattd@bugfuzz.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: 6c79e0ab9ac6042e60434c02e1d99b0cf0cc3470
master date: 2014-02-06 16:33:50 +0100
Frediano Ziglio [Thu, 30 Jan 2014 08:01:01 +0000 (09:01 +0100)]
mce: fix race condition in mctelem_xchg_head
The function (mctelem_xchg_head()) used to exchange mce telemetry
list heads is racy. It may write to the head twice, with the second
write linking to an element in the wrong state.
If there are two threads, T1 inserting on committed list; and T2
trying to consume it.
1. T1 starts inserting an element (A), sets prev pointer (mcte_prev).
2. T1 is interrupted after the cmpxchg succeeded.
3. T2 gets the list and changes element A and updates the commit list
head.
4. T1 resumes, reads pointer to prev again and compare with result
from the cmpxchg which succeeded but in the meantime prev changed
in memory.
5. T1 thinks the cmpxchg failed and goes around the loop again,
linking head to A again.
To solve the race use temporary variable for prev pointer.
*linkp (which point to a field in the element) must be updated before
the cmpxchg() as after a successful cmpxchg the element might be
immediately removed and reinitialized.
The wmb() prior to the cmpchgptr() call is not necessary since it is
already a full memory barrier. This wmb() is thus removed.
Andrew Cooper [Thu, 30 Jan 2014 08:00:09 +0000 (09:00 +0100)]
dbg_rw_guest_mem: need to call put_gfn in error path
Using a 1G hvm domU (in grub) and gdbsx:
(gdb) set arch i8086
warning: A handler for the OS ABI "GNU/Linux" is not built into this configuration
of GDB. Attempting to continue with the default i8086 settings.
The target architecture is assumed to be i8086
(gdb) target remote localhost:9999
Remote debugging using localhost:9999
Remote debugging from host 127.0.0.1
0x0000d475 in ?? ()
(gdb) x/1xh 0x6ae9168b
Will reproduce this bug.
With a debug=y build you will get:
Assertion '!preempt_count()' failed at preempt.c:37
For a debug=n build you will get a dom0 VCPU hung (at some point) in: